author | alandonovan <adonovan@google.com> | 2020-03-26 10:23:16 -0400 |
---|---|---|
committer | GitHub <noreply@github.com> | 2020-03-26 10:23:16 -0400 |
commit | 16e44b11d94568b6de240a245c9f85c415e69bbc (patch) | |
tree | c91e9af517b59d87d697303f03c001bbd5fb900b /doc | |
parent | 8dd3e2ee1dd5d034baada4c7b4fcf231294a1013 (diff) | |
download | starlark-go-16e44b11d94568b6de240a245c9f85c415e69bbc.tar.gz |
syntax: strict string escapes (#265)
This change causes Starlark, like Go, to reject backslashes that
are not part of an escape sequence. Previously they were treated
literally, so \( would encode a two-character string.
Many programs rely on this, especially for regular expressions and
shell commands, and will be broken by this change, but the fix is simple:
double each errant backslash.
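
As an illustration of that fix, here is a minimal Python sketch that doubles every backslash that does not begin an escape sequence. It is not part of this commit: the helper name `double_errant_backslashes` and the regular expression are assumptions, based on the escapes documented in the spec change (traditional, backslash, quote, escaped newline, octal, and hex). It expects the text between the quotes of a non-raw literal.

```python
import re

# Escapes accepted inside a non-raw literal, per the spec text in this change:
# \a \b \f \n \r \t \v, backslash, either quote, escaped newline,
# one to three octal digits, and \x followed by exactly two hex digits.
# Assumption: octal values above 255 are not checked here.
_VALID_ESCAPE = re.compile(r"""\\(?:[abfnrtv\\'"\n]|[0-7]{1,3}|x[0-9A-Fa-f]{2})""")


def double_errant_backslashes(body):
    """Return body with each backslash that starts no valid escape doubled."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        m = _VALID_ESCAPE.match(body, i)
        if m:
            # A recognized escape: copy it through unchanged.
            out.append(m.group(0))
            i = m.end()
        else:
            # An errant backslash: double it so it denotes a literal backslash.
            out.append("\\\\")
            i += 1
    return "".join(out)


print(double_errant_backslashes(r"\(a\.b\)\n"))  # prints \\(a\\.b\\)\n
```

For strings with many backslashes, such as regular expressions, a raw literal (`r"..."`) may be a simpler fix than doubling.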
Python does not yet enforce this behavior, but since version 3.6
it has emitted a deprecation warning for it.
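
The Python behavior can be observed with a short sketch, assuming CPython 3.6 or later. The helper name `invalid_escape_warnings` is hypothetical. The warning category has been `DeprecationWarning` and, in newer releases (3.12 and later), `SyntaxWarning`, so the code simply reports whichever category it sees.

```python
import warnings


def invalid_escape_warnings(source):
    """Compile source and return the names of any warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, regardless of the interpreter's default filters.
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# The source text  s = "\("  contains a backslash that starts no valid escape.
print(invalid_escape_warnings('s = "\\("'))

# Doubling the backslash, as this commit recommends for Starlark, is clean.
print(invalid_escape_warnings('s = "\\\\("'))  # prints []
```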
Also, document string escapes.
Related issues:
- Google issue b/34519173: "bazel: Forbid undefined escape sequences in strings"
- bazelbuild/starlark#38: Starlark spec: String escapes
- bazelbuild/buildtools#688: Bazel: Fix string escapes
- bazelbuild/bazel#8380: Bazel incompatible_restrict_string_escapes: Restrict string escapes
Diffstat (limited to 'doc')
-rw-r--r-- | doc/spec.md | 131 |
1 files changed, 130 insertions, 1 deletions
diff --git a/doc/spec.md b/doc/spec.md
index 2129d37..fd8b4f8 100644
--- a/doc/spec.md
+++ b/doc/spec.md
@@ -321,7 +321,135 @@ hex_digit = '0' … '9' | 'A' … 'F' | 'a' … 'f' .
 binary_digit = '0' | '1' .
 ```
 
-TODO: define string_lit, indent, outdent, semicolon, newline, eof
+### String literals
+
+A Starlark string literal denotes a string value.
+In its simplest form, it consists of the desired text
+surrounded by matching single- or double-quotation marks:
+
+```python
+"abc"
+'abc'
+```
+
+Literal occurrences of the chosen quotation mark character must be
+escaped by a preceding backslash. So, if a string contains several
+of one kind of quotation mark, it may be convenient to quote the string
+using the other kind, as in these examples:
+
+```python
+'Have you read "To Kill a Mockingbird?"'
+"Yes, it's a classic."
+
+"Have you read \"To Kill a Mockingbird?\""
+'Yes, it\'s a classic.'
+```
+
+#### String escapes
+
+Within a string literal, the backslash character `\` indicates the
+start of an _escape sequence_, a notation for expressing things that
+are impossible or awkward to write directly.
+
+The following *traditional escape sequences* represent the ASCII control
+codes 7-13:
+
+```
+\a   \x07   alert or bell
+\b   \x08   backspace
+\f   \x0C   form feed
+\n   \x0A   line feed
+\r   \x0D   carriage return
+\t   \x09   horizontal tab
+\v   \x0B   vertical tab
+```
+
+A *literal backslash* is written using the escape `\\`.
+
+An *escaped newline*---that is, a backslash at the end of a line---is ignored,
+allowing a long string to be split across multiple lines of the source file.
+
+```python
+"abc\
+def"                    # "abcdef"
+```
+
+An *octal escape* encodes a single byte using its octal value.
+It consists of a backslash followed by one, two, or three octal digits [0-7].
+It is error if the value is greater than decimal 255.
+
+```python
+'\0'                    # "\x00"  a string containing a single NUL byte
+'\12'                   # "\n"    octal 12 = decimal 10
+'\101-\132'             # "A-Z"
+'\119'                  # "\t9"   = "\11" + "9"
+```
+
+<b>Implementation note:</b>
+The Java implementation encodes strings using UTF-16,
+so an octal escape encodes a single UTF-16 code unit.
+Octal escapes for values above 127 are therefore not portable across implementations.
+There is little reason to use octal escapes in new code.
+
+A *hex escape* encodes a single byte using its hexadecimal value.
+It consists of `\x` followed by exactly two hexadecimal digits [0-9A-Fa-f].
+
+```python
+"\x00"                  # "\x00"  a string containing a single NUL byte
+"(\x20)"                # "( )"   ASCII 0x20 = 32 = space
+
+red, reset = "\x1b[31m", "\x1b[0m"      # ANSI terminal control codes for color
+"(" + red + "hello" + reset + ")"       # "(hello)" with red text, if on a terminal
+```
+
+<b>Implementation note:</b>
+The Java implementation does not support hex escapes.
+
+An ordinary string literal may not contain an unescaped newline,
+but a *multiline string literal* may spread over multiple source lines.
+It is denoted using three quotation marks at start and end.
+Within it, unescaped newlines and quotation marks (or even pairs of
+quotation marks) have their literal meaning, but three quotation marks
+end the literal. This makes it easy to quote large blocks of text with
+few escapes.
+
+```
+haiku = '''
+Yesterday it worked.
+Today it is not working.
+That's computers. Sigh.
+'''
+```
+
+Regardless of the platform's convention for text line endings---for
+example, a linefeed (\n) on UNIX, or a carriage return followed by a
+linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
+multiline string literal always denotes a line feed (\n).
+
+Starlark also supports *raw string literals*, which look like an
+ordinary single- or double-quotation preceded by `r`. Within a raw
+string literal, there is no special processing of backslash escapes,
+other than an escaped quotation mark (which denotes a literal
+quotation mark), or an escaped newline (which denotes a backslash
+followed by a newline). This form of quotation is typically used when
+writing strings that contain many quotation marks or backslashes (such
+as regular expressions or shell commands) to reduce the burden of
+escaping:
+
+```python
+"a\nb"          # "a\nb"  = 'a' + '\n' + 'b'
+r"a\nb"         # "a\\nb" = 'a' + '\\' + '\n' + 'b'
+
+"a\
+b"              # "ab"
+r"a\
+b"              # "a\\\nb"
+```
+
+It is an error for a backslash to appear within a string literal other
+than as part of one of the escapes described above.
+
+TODO: define indent, outdent, semicolon, newline, eof
 
 ## Data types
 
@@ -4106,6 +4234,7 @@ See [Starlark spec issue 20](https://github.com/bazelbuild/starlark/issues/20).
 * `lambda` expressions are supported (option: `-lambda`).
 * String elements are bytes.
 * Non-ASCII strings are encoded using UTF-8.
+* Strings support octal and hex byte escapes.
 * Strings have the additional methods `elem_ords`, `codepoint_ords`, and `codepoints`.
 * The `chr` and `ord` built-in functions are supported.
 * The `set` built-in function is provided (option: `-set`).
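
To make the escape rules in the spec text above concrete, here is a minimal Python sketch of a decoder for the body of a non-raw, single-line literal. It is an illustration under stated assumptions, not the scanner from this repository: the name `unescape` and the table names are hypothetical, the input is the text between the quotes, and the result is `bytes` because the Go implementation treats string elements as bytes.

```python
# Single-character escapes and the byte each one denotes.
SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
          "t": 0x09, "v": 0x0B, "\\": 0x5C, "'": 0x27, '"': 0x22}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def unescape(body):
    """Decode the text between the quotes of a non-raw literal into bytes.

    Raises ValueError for a backslash that starts no valid escape.
    Assumption: unescaped newlines and quote matching are checked elsewhere.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != "\\":
            out += c.encode("utf-8")  # non-ASCII text is encoded as UTF-8
            i += 1
            continue
        if i + 1 == len(body):
            raise ValueError("backslash at end of literal")
        e = body[i + 1]
        if e == "\n":
            i += 2  # escaped newline: ignored
        elif e in SIMPLE:
            out.append(SIMPLE[e])
            i += 2
        elif e in OCTAL_DIGITS:
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in OCTAL_DIGITS:
                j += 1  # consume one to three octal digits
            value = int(body[i + 1:j], 8)
            if value > 255:
                raise ValueError("octal escape value exceeds 255")
            out.append(value)
            i = j
        elif e == "x":
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("\\x requires exactly two hex digits")
            out.append(int(digits, 16))
            i += 4
        else:
            raise ValueError("invalid escape \\" + e)
    return bytes(out)


print(unescape(r"\101-\132"))  # prints b'A-Z'
print(unescape(r"\119"))       # prints b'\t9'
```

The octal branch stops at the first non-octal character, which is why `\119` decodes as a tab followed by `9`, matching the example in the spec text.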