diff options
Diffstat (limited to 'doc/impl.md')
-rw-r--r-- | doc/impl.md | 242 |
1 files changed, 242 insertions, 0 deletions
diff --git a/doc/impl.md b/doc/impl.md new file mode 100644 index 0000000..380e2d6 --- /dev/null +++ b/doc/impl.md @@ -0,0 +1,242 @@ + +# Starlark in Go: Implementation + +This document (a work in progress) describes some of the design +choices of the Go implementation of Starlark. + + * [Scanner](#scanner) + * [Parser](#parser) + * [Resolver](#resolver) + * [Evaluator](#evaluator) + * [Data types](#data-types) + * [Freezing](#freezing) + * [Testing](#testing) + + +## Scanner + +The scanner is derived from Russ Cox's +[buildifier](https://github.com/bazelbuild/buildtools/tree/master/buildifier) +tool, which pretty-prints Bazel BUILD files. + +Most of the work happens in `(*scanner).nextToken`. + +## Parser + +The parser is hand-written recursive-descent parser. It uses the +technique of [precedence +climbing](http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing) +to reduce the number of productions. + +In some places the parser accepts a larger set of programs than are +strictly valid, leaving the task of rejecting them to the subsequent +resolver pass. For example, in the function call `f(a, b=c)` the +parser accepts any expression for `a` and `b`, even though `b` may +legally be only an identifier. For the parser to distinguish these +cases would require additional lookahead. + +## Resolver + +The resolver reports structural errors in the program, such as the use +of `break` and `continue` outside of a loop. + +Starlark has stricter syntactic limitations than Python. For example, +it does not permit `for` loops or `if` statements at top level, nor +does it permit global variables to be bound more than once. +These limitations come from the Bazel project's desire to make it easy +to identify the sole statement that defines each global, permitting +accurate cross-reference documentation. + +In addition, the resolver validates all variable names, classifying +them as references to universal, global, local, or free variables. +Local and free variables are mapped to a small integer, allowing the +evaluator to use an efficient (flat) representation for the +environment. + +Not all features of the Go implementation are "standard" (that is, +supported by Bazel's Java implementation), at least for now, so +non-standard features such as `lambda`, `float`, and `set` +are flag-controlled. The resolver reports +any uses of dialect features that have not been enabled. + + +## Evaluator + +### Data types + +<b>Integers:</b> Integers are representing using `big.Int`, an +arbitrary precision integer. This representation was chosen because, +for many applications, Starlark must be able to handle without loss +protocol buffer values containing signed and unsigned 64-bit integers, +which requires 65 bits of precision. + +Small integers (<256) are preallocated, but all other values require +memory allocation. Integer performance is relatively poor, but it +matters little for Bazel-like workloads which depend much +more on lists of strings than on integers. (Recall that a typical loop +over a list in Starlark does not materialize the loop index as an `int`.) + +An optimization worth trying would be to represent integers using +either an `int32` or `big.Int`, with the `big.Int` used only when +`int32` does not suffice. Using `int32`, not `int64`, for "small" +numbers would make it easier to detect overflow from operations like +`int32 * int32`, which would trigger the use of `big.Int`. + +<b>Floating point</b>: +Floating point numbers are represented using Go's `float64`. +Again, `float` support is required to support protocol buffers. The +existence of floating-point NaN and its infamous comparison behavior +(`NaN != NaN`) had many ramifications for the API, since we cannot +assume the result of an ordered comparison is either less than, +greater than, or equal: it may also fail. + +<b>Strings</b>: + +TODO: discuss UTF-8 and string.bytes method. + +<b>Dictionaries and sets</b>: +Starlark dictionaries have predictable iteration order. +Furthermore, many Starlark values are hashable in Starlark even though +the Go values that represent them are not hashable in Go: big +integers, for example. +Consequently, we cannot use Go maps to implement Starlark's dictionary. + +We use a simple hash table whose buckets are linked lists, each +element of which holds up to 8 key/value pairs. In a well-distributed +table the list should rarely exceed length 1. In addition, each +key/value item is part of doubly-linked list that maintains the +insertion order of the elements for iteration. + +<b>Struct:</b> +The `starlarkstruct` Go package provides a non-standard Starlark +extension data type, `struct`, that maps field identifiers to +arbitrary values. Fields are accessed using dot notation: `y = s.f`. +This data type is extensively used in Bazel, but its specification is +currently evolving. + +Starlark has no `class` mechanism, nor equivalent of Python's +`namedtuple`, though it is likely that future versions will support +some way to define a record data type of several fields, with a +representation more efficient than a hash table. + + +### Freezing + +All mutable values created during module initialization are _frozen_ +upon its completion. It is this property that permits a Starlark module +to be referenced by two Starlark threads running concurrently (such as +the initialization threads of two other modules) without the +possibility of a data race. + +The Go implementation supports freezing by storing an additional +"frozen" Boolean variable in each mutable object. Once this flag is set, +all subsequent attempts at mutation fail. Every value defines a +Freeze method that sets its own frozen flag if not already set, and +calls Freeze for each value that it contains. +For example, when a list is frozen, it freezes each of its elements; +when a dictionary is frozen, it freezes each of its keys and values; +and when a function value is frozen, it freezes each of the free +variables and parameter default values implicitly referenced by its closure. +Application-defined types must also follow this discipline. + +The freeze mechanism in the Go implementation is finer grained than in +the Java implementation: in effect, the latter has one "frozen" flag +per module, and every value holds a reference to the frozen flag of +its module. This makes setting the frozen flag more efficient---a +simple bit flip, no need to traverse the object graph---but coarser +grained. Also, it complicates the API slightly because to construct a +list, say, requires a reference to the frozen flag it should use. + +The Go implementation would also permit the freeze operation to be +exposed to the program, for example as a built-in function. +This has proven valuable in writing tests of the freeze mechanism +itself, but is otherwise mostly a curiosity. + + +### Fail-fast iterators + +In some languages (such as Go), a program may mutate a data structure +while iterating over it; for example, a range loop over a map may +delete map elements. In other languages (such as Java), iterators do +extra bookkeeping so that modification of the underlying collection +invalidates the iterator, and the next attempt to use it fails. +This often helps to detect subtle mistakes. + +Starlark takes this a step further. Instead of mutation of the +collection invalidating the iterator, the act of iterating makes the +collection temporarily immutable, so that an attempt to, say, delete a +dict element while looping over the dict, will fail. The error is +reported against the delete operation, not the iteration. + +This is implemented by having each mutable iterable value record a +counter of active iterators. Starting a loop increments this counter, +and completing a loop decrements it. A collection with a nonzero +counter behaves as if frozen. If the collection is actually frozen, +the counter bookkeeping is unnecessary. (Consequently, iterator +bookkeeping is needed only while objects are still mutable, before +they can have been published to another thread, and thus no +synchronization is necessary.) + +A consequence of this design is that in the Go API, it is imperative +to call `Done` on each iterator once it is no longer needed. + +``` +TODO +starlark.Value interface and subinterfaces +argument passing to builtins: UnpackArgs, UnpackPositionalArgs. +``` + +<b>Evaluation strategy:</b> +The evaluator uses a simple recursive tree walk, returning a value or +an error for each expression. We have experimented with just-in-time +compilation of syntax trees to bytecode, but two limitations in the +current Go compiler prevent this strategy from outperforming the +tree-walking evaluator. + +First, the Go compiler does not generate a "computed goto" for a +switch statement ([Go issue +5496](https://github.com/golang/go/issues/5496)). A bytecode +interpreter's main loop is a for-loop around a switch statement with +dozens or hundreds of cases, and the speed with which each case can be +dispatched strongly affects overall performance. +Currently, a switch statement generates a binary tree of ordered +comparisons, requiring several branches instead of one. + +Second, the Go compiler's escape analysis assumes that the underlying +array from a `make([]Value, n)` allocation always escapes +([Go issue 20533](https://github.com/golang/go/issues/20533)). +Because the bytecode interpreter's operand stack has a non-constant +length, it must be allocated with `make`. The resulting allocation +adds to the cost of each Starlark function call; this can be tolerated +by amortizing one very large stack allocation across many calls. +More problematic appears to be the cost of the additional GC write +barriers incurred by every VM operation: every intermediate result is +saved to the VM's operand stack, which is on the heap. +By contrast, intermediate results in the tree-walking evaluator are +never stored to the heap. + +``` +TODO +frames, backtraces, errors. +threads +Print +Load +``` + +## Testing + +``` +TODO +starlarktest package +`assert` module +starlarkstruct +integration with Go testing.T +``` + + +## TODO + + +``` +Discuss practical separation of code and data. +``` |