syntax: strict string escapes (#265)

This change causes Starlark, like Go, to reject backslashes that are not part of an escape sequence. Previously they were treated literally, so \( would encode a two-character string. Many programs rely on this, especially for regular expressions and shell commands, and will be broken by this change, but the fix is simple: double each errant backslash.

Python does not yet enforce this behavior, but since 3.6 has emitted a deprecation warning for it.

Also, document string escapes.

Related issues:
- Google issue b/34519173: "bazel: Forbid undefined escape sequences in strings"
- bazelbuild/starlark#38: Starlark spec: String escapes
- bazelbuild/buildtools#688: Bazel: Fix string escapes
- bazelbuild/bazel#8380: Bazel incompatible_restrict_string_escapes: Restrict string escapes
author: alandonovan <adonovan@google.com> 2020-03-26 10:23:16 -0400
committer: GitHub <noreply@github.com> 2020-03-26 10:23:16 -0400
commit: 16e44b11d94568b6de240a245c9f85c415e69bbc (patch)
tree: c91e9af517b59d87d697303f03c001bbd5fb900b /doc
parent: 8dd3e2ee1dd5d034baada4c7b4fcf231294a1013 (diff)
download: starlark-go-16e44b11d94568b6de240a245c9f85c415e69bbc.tar.gz
1 files changed, 130 insertions, 1 deletions
diff --git a/doc/spec.md b/doc/spec.md
index 2129d37..fd8b4f8 100644
--- a/doc/spec.md
+++ b/doc/spec.md
@@ -321,7 +321,135 @@ hex_digit     = '0' … '9' | 'A' … 'F' | 'a' … 'f' .
 binary_digit  = '0' | '1' .
 ```
 
-TODO: define string_lit, indent, outdent, semicolon, newline, eof
+### String literals
+
+A Starlark string literal denotes a string value. 
+In its simplest form, it consists of the desired text 
+surrounded by matching single- or double-quotation marks:
+
+```python
+"abc"
+'abc'
+```
+
+Literal occurrences of the chosen quotation mark character must be
+escaped by a preceding backslash. So, if a string contains several
+of one kind of quotation mark, it may be convenient to quote the string
+using the other kind, as in these examples:
+
+```python
+'Have you read "To Kill a Mockingbird?"'
+"Yes, it's a classic."
+
+"Have you read \"To Kill a Mockingbird?\""
+'Yes, it\'s a classic.'
+```
+
+#### String escapes
+
+Within a string literal, the backslash character `\` indicates the
+start of an _escape sequence_, a notation for expressing things that
+are impossible or awkward to write directly.
+
+The following *traditional escape sequences* represent the ASCII control
+codes 7-13:
+
+```
+\a   \x07 alert or bell
+\b   \x08 backspace
+\f   \x0C form feed
+\n   \x0A line feed
+\r   \x0D carriage return
+\t   \x09 horizontal tab
+\v   \x0B vertical tab
+```
+
+A *literal backslash* is written using the escape `\\`.
+
+An *escaped newline*---that is, a backslash at the end of a line---is ignored,
+allowing a long string to be split across multiple lines of the source file.
+
+```python
+"abc\
+def"			# "abcdef"
+```
+
+An *octal escape* encodes a single byte using its octal value.
+It consists of a backslash followed by one, two, or three octal digits [0-7].
+It is an error if the value is greater than decimal 255.
+
+```python
+'\0'			# "\x00"  a string containing a single NUL byte
+'\12'			# "\n"    octal 12 = decimal 10
+'\101-\132'		# "A-Z"
+'\119'			# "\t9"   = "\11" + "9"
+```
+
+<b>Implementation note:</b>
+The Java implementation encodes strings using UTF-16,
+so an octal escape encodes a single UTF-16 code unit.
+Octal escapes for values above 127 are therefore not portable across implementations.
+There is little reason to use octal escapes in new code.
+
+A *hex escape* encodes a single byte using its hexadecimal value.
+It consists of `\x` followed by exactly two hexadecimal digits [0-9A-Fa-f].
+
+```python
+"\x00"			# "\x00"  a string containing a single NUL byte
+"(\x20)"		# "( )"   ASCII 0x20 = 32 = space
+
+red, reset = "\x1b[31m", "\x1b[0m"	# ANSI terminal control codes for color
+"(" + red + "hello" + reset + ")"	# "(hello)" with red text, if on a terminal
+```
+
+<b>Implementation note:</b>
+The Java implementation does not support hex escapes.
+
+An ordinary string literal may not contain an unescaped newline,
+but a *multiline string literal* may spread over multiple source lines.
+It is denoted using three quotation marks at start and end.
+Within it, unescaped newlines and quotation marks (or even pairs of
+quotation marks) have their literal meaning, but three quotation marks
+end the literal. This makes it easy to quote large blocks of text with
+few escapes.
+
+```python
+haiku = '''
+Yesterday it worked.
+Today it is not working.
+That's computers. Sigh.
+'''
+```
+
+Regardless of the platform's convention for text line endings---for
+example, a linefeed (\n) on UNIX, or a carriage return followed by a
+linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
+multiline string literal always denotes a line feed (\n).
+
+Starlark also supports *raw string literals*, which look like an
+ordinary single- or double-quotation preceded by `r`. Within a raw
+string literal, there is no special processing of backslash escapes,
+other than an escaped quotation mark (which denotes a literal
+quotation mark), or an escaped newline (which denotes a backslash
+followed by a newline). This form of quotation is typically used when
+writing strings that contain many quotation marks or backslashes (such
+as regular expressions or shell commands) to reduce the burden of
+escaping:
+
+```python
+"a\nb"		# "a\nb"  = 'a' + '\n' + 'b'
+r"a\nb"		# "a\\nb" = 'a' + '\\' + 'n' + 'b'
+
+"a\
+b"		# "ab"
+r"a\
+b"		# "a\\\nb"
+```
+
+It is an error for a backslash to appear within a string literal other
+than as part of one of the escapes described above.
+
+TODO: define indent, outdent, semicolon, newline, eof
 
 ## Data types
 
@@ -4106,6 +4234,7 @@ See [Starlark spec issue 20](https://github.com/bazelbuild/starlark/issues/20).
 * `lambda` expressions are supported (option: `-lambda`).
 * String elements are bytes.
 * Non-ASCII strings are encoded using UTF-8.
+* Strings support octal and hex byte escapes.
 * Strings have the additional methods `elem_ords`, `codepoint_ords`, and `codepoints`.
 * The `chr` and `ord` built-in functions are supported.
 * The `set` built-in function is provided (option: `-set`).
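
The fix named in the commit message, doubling each errant backslash, can be sketched as a small Python helper. This is an illustration only: `double_errant_backslashes` is a hypothetical name, not part of Starlark or any tool mentioned here. It assumes its input is the source text between the quotes of an ordinary (non-raw) string literal, and it treats as valid exactly the escapes listed in the spec text above. It does not check that an octal escape stays at or below 255.

```python
import re

# A backslash plus what follows it: a hex escape, an octal escape,
# or any single character (DOTALL so that backslash-newline matches).
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-7]{1,3}|.)", re.DOTALL)

# Single characters that form a valid escape after a backslash:
# traditional escapes, backslash, both quotes, newline, one octal digit.
_VALID_SINGLE = set("abfnrtv\\'\"\n01234567")


def double_errant_backslashes(body):
    """Double every backslash in `body` that does not start a valid escape.

    Hypothetical helper: `body` is assumed to be the raw source text
    inside an ordinary (non-raw) string literal, without its quotes.
    """
    def fix(match):
        esc = match.group(1)
        if len(esc) > 1 or esc in _VALID_SINGLE:
            return match.group(0)   # already a valid escape: keep as-is
        return "\\\\" + esc         # errant: emit two backslashes

    return _ESCAPE.sub(fix, body)


# The regex-style escapes \d and \. are errant; \n is valid and kept.
print(double_errant_backslashes(r"\d+\.\d+\n"))  # prints \\d+\\.\\d+\n
```

Because the scan runs left to right and consumes each escape as a unit, already-doubled backslashes are left alone, so running the helper twice gives the same result as running it once. Where a string contains many backslashes, switching to a raw string literal is an alternative to doubling.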