author: alandonovan <adonovan@google.com> 2020-03-26 10:23:16 -0400
committer: GitHub <noreply@github.com> 2020-03-26 10:23:16 -0400
commit: 16e44b11d94568b6de240a245c9f85c415e69bbc
tree: c91e9af517b59d87d697303f03c001bbd5fb900b /doc
parent: 8dd3e2ee1dd5d034baada4c7b4fcf231294a1013
syntax: strict string escapes (#265)
This change causes Starlark, like Go, to reject backslashes that are not part of an escape sequence. Previously they were treated literally, so "\(" would encode a two-character string.

Many programs rely on the old behavior, especially for regular expressions and shell commands, and will be broken by this change, but the fix is simple: double each errant backslash.

Python does not yet enforce this behavior, but since 3.6 it has emitted a deprecation warning for it.

Also, document string escapes.

Related issues:
- Google issue b/34519173: "bazel: Forbid undefined escape sequences in strings"
- bazelbuild/starlark#38: Starlark spec: String escapes
- bazelbuild/buildtools#688: Bazel: Fix string escapes
- bazelbuild/bazel#8380: Bazel incompatible_restrict_string_escapes: Restrict string escapes
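
The "double each errant backslash" fix can be sketched as a small Python helper. This is only an illustration, not part of the change: the name `double_errant_backslashes` is hypothetical, it works on the text between the quotes of a non-raw literal, and it assumes the set of valid escapes is the one documented in the spec text of this commit (traditional escapes, `\\`, escaped quotation marks, escaped newline, octal, and `\x` hex). It does not check that an octal value is at most 255.

```python
import re

# What may legally follow a backslash, per the escapes documented in this commit
# (assumption: both \' and \" are accepted regardless of the enclosing quote).
_VALID_AFTER_BACKSLASH = re.compile(
    r"""[abfnrtv\\'"\n]|[0-7]{1,3}|x[0-9A-Fa-f]{2}"""
)


def double_errant_backslashes(body: str) -> str:
    """Return body with every backslash that starts no valid escape doubled."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        m = _VALID_AFTER_BACKSLASH.match(body, i + 1)
        if m:
            # A valid escape: copy the backslash and its operand unchanged.
            out.append(body[i:m.end()])
            i = m.end()
        else:
            # An errant backslash: write it as the escape for a literal backslash.
            out.append("\\\\")
            i += 1
    return "".join(out)


print(double_errant_backslashes(r"grep \d+\.txt"))  # grep \\d+\\.txt
print(double_errant_backslashes(r"a\nb"))           # a\nb (already valid)
```

For literals made up mostly of backslashes, switching to a raw string literal may be a simpler fix than doubling.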
Diffstat (limited to 'doc')
-rw-r--r-- doc/spec.md 131
1 files changed, 130 insertions, 1 deletions
diff --git a/doc/spec.md b/doc/spec.md
index 2129d37..fd8b4f8 100644
--- a/doc/spec.md
+++ b/doc/spec.md
@@ -321,7 +321,135 @@ hex_digit = '0' … '9' | 'A' … 'F' | 'a' … 'f' .
binary_digit = '0' | '1' .
```
-TODO: define string_lit, indent, outdent, semicolon, newline, eof
+### String literals
+
+A Starlark string literal denotes a string value.
+In its simplest form, it consists of the desired text
+surrounded by matching single- or double-quotation marks:
+
+```python
+"abc"
+'abc'
+```
+
+Literal occurrences of the chosen quotation mark character must be
+escaped by a preceding backslash. So, if a string contains several
+of one kind of quotation mark, it may be convenient to quote the string
+using the other kind, as in these examples:
+
+```python
+'Have you read "To Kill a Mockingbird?"'
+"Yes, it's a classic."
+
+"Have you read \"To Kill a Mockingbird?\""
+'Yes, it\'s a classic.'
+```
+
+#### String escapes
+
+Within a string literal, the backslash character `\` indicates the
+start of an _escape sequence_, a notation for expressing things that
+are impossible or awkward to write directly.
+
+The following *traditional escape sequences* represent the ASCII control
+codes 7-13:
+
+```
+\a \x07 alert or bell
+\b \x08 backspace
+\f \x0C form feed
+\n \x0A line feed
+\r \x0D carriage return
+\t \x09 horizontal tab
+\v \x0B vertical tab
+```
+
+A *literal backslash* is written using the escape `\\`.
+
+An *escaped newline*---that is, a backslash at the end of a line---is ignored,
+allowing a long string to be split across multiple lines of the source file.
+
+```python
+"abc\
+def" # "abcdef"
+```
+
+An *octal escape* encodes a single byte using its octal value.
+It consists of a backslash followed by one, two, or three octal digits [0-7].
+It is an error if the value is greater than decimal 255.
+
+```python
+'\0' # "\x00" a string containing a single NUL byte
+'\12' # "\n" octal 12 = decimal 10
+'\101-\132' # "A-Z"
+'\119' # "\t9" = "\11" + "9"
+```
+
+<b>Implementation note:</b>
+The Java implementation encodes strings using UTF-16,
+so an octal escape encodes a single UTF-16 code unit.
+Octal escapes for values above 127 are therefore not portable across implementations.
+There is little reason to use octal escapes in new code.
+
+A *hex escape* encodes a single byte using its hexadecimal value.
+It consists of `\x` followed by exactly two hexadecimal digits [0-9A-Fa-f].
+
+```python
+"\x00" # "\x00" a string containing a single NUL byte
+"(\x20)" # "( )" ASCII 0x20 = 32 = space
+
+red, reset = "\x1b[31m", "\x1b[0m" # ANSI terminal control codes for color
+"(" + red + "hello" + reset + ")" # "(hello)" with red text, if on a terminal
+```
+
+<b>Implementation note:</b>
+The Java implementation does not support hex escapes.
+
+An ordinary string literal may not contain an unescaped newline,
+but a *multiline string literal* may spread over multiple source lines.
+It is denoted using three quotation marks at start and end.
+Within it, unescaped newlines and quotation marks (or even pairs of
+quotation marks) have their literal meaning, but three quotation marks
+end the literal. This makes it easy to quote large blocks of text with
+few escapes.
+
+```python
+haiku = '''
+Yesterday it worked.
+Today it is not working.
+That's computers. Sigh.
+'''
+```
+
+Regardless of the platform's convention for text line endings---for
+example, a line feed (\n) on UNIX, or a carriage return followed by a
+line feed (\r\n) on Microsoft Windows---an unescaped line ending in a
+multiline string literal always denotes a line feed (\n).
+
+Starlark also supports *raw string literals*, which look like an
+ordinary single- or double-quoted string literal preceded by `r`. Within a raw
+string literal, there is no special processing of backslash escapes,
+other than an escaped quotation mark (which denotes a literal
+quotation mark), or an escaped newline (which denotes a backslash
+followed by a newline). This form of quotation is typically used when
+writing strings that contain many quotation marks or backslashes (such
+as regular expressions or shell commands) to reduce the burden of
+escaping:
+
+```python
+"a\nb" # "a\nb" = 'a' + '\n' + 'b'
+r"a\nb" # "a\\nb" = 'a' + '\\' + 'n' + 'b'
+
+"a\
+b" # "ab"
+r"a\
+b" # "a\\\nb"
+```
+
+It is an error for a backslash to appear within a string literal other
+than as part of one of the escapes described above.
+
+TODO: define indent, outdent, semicolon, newline, eof
## Data types
@@ -4106,6 +4234,7 @@ See [Starlark spec issue 20](https://github.com/bazelbuild/starlark/issues/20).
* `lambda` expressions are supported (option: `-lambda`).
* String elements are bytes.
* Non-ASCII strings are encoded using UTF-8.
+* Strings support octal and hex byte escapes.
* Strings have the additional methods `elem_ords`, `codepoint_ords`, and `codepoints`.
* The `chr` and `ord` built-in functions are supported.
* The `set` built-in function is provided (option: `-set`).
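
As a reviewer's aid, the octal and hex byte escapes documented in this patch can be sketched in Python. This is an illustrative model of the rules in the spec text, not the starlark-go scanner: the name `decode_byte_escapes` is hypothetical, it decodes only octal and `\x` escapes in the body of a literal, passes `\\` through untouched so that an escaped backslash followed by digits is not misread, and encodes all other text as UTF-8.

```python
import re

# Matches an escaped backslash, a \xHH hex escape, or a 1-3 digit octal escape.
_ESCAPE = re.compile(r"\\(?:(\\)|x([0-9A-Fa-f]{2})|([0-7]{1,3}))")


def decode_byte_escapes(body: str) -> bytes:
    """Decode octal and hex escapes in body; leave everything else as written."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(body):
        out += body[pos:m.start()].encode("utf-8")
        backslash, hex_digits, octal_digits = m.groups()
        if backslash is not None:
            out += b"\\\\"  # keep the two-character escape `\\` as-is
        elif hex_digits is not None:
            out.append(int(hex_digits, 16))
        else:
            value = int(octal_digits, 8)
            if value > 255:
                # The spec makes octal values above decimal 255 an error.
                raise ValueError("octal escape out of range: \\" + octal_digits)
            out.append(value)
        pos = m.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)


print(decode_byte_escapes(r"\101-\132"))  # b'A-Z'
print(decode_byte_escapes(r"\119"))       # b'\t9'
print(decode_byte_escapes(r"(\x20)"))     # b'( )'
```

The `\119` case shows that the octal match stops at the first non-octal digit, as in the spec's own example.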