From f0be43de02a1e07308d3d95408349c3c7f973430 Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Thu, 27 Aug 2015 13:50:00 -0700
Subject: Binary format for hyphenation patterns

In the current state, hyphenation in all languages than Sanskrit seems
to work (case-folding edge cases). Thus, we just disable Sanskrit.
Packed tries are implemented, but not the finite state machine
(space/speed tradeoff).

This commit contains a throw-away test app, which runs on the host.
I think I want to replace it with unit tests, but I'm including it in
the CL because it's useful during development.

Bug: 21562869
Bug: 21826930
Bug: 23317038
Bug: 23317904

Change-Id: I7479a565a4a062fa319651c2c14c0fa18c5ceaea
---
 doc/hyb_file_format.md | 135 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)
 create mode 100644 doc/hyb_file_format.md

(limited to 'doc')

diff --git a/doc/hyb_file_format.md b/doc/hyb_file_format.md
new file mode 100644
index 0000000..3065a6f
--- /dev/null
+++ b/doc/hyb_file_format.md
@@ -0,0 +1,135 @@
+# Hyb (hyphenation pattern binary) file format
+
+The hyb file format is how hyphenation patterns are stored in the system image.
+
+Goals include:
+
+* Concise (system image space is at a premium)
+* Usable when mmap'ed, so it doesn't take significant physical RAM
+* Fast to compute
+* Simple
+
+It is _not_ intended as an interchange format, so there is no attempt to make the format
+extensible or facilitate backward and forward compatibility.
+
+Further, at some point we will probably pack patterns for multiple languages into a single
+physical file, to reduce number of open mmap'ed files. This document doesn't cover that.
+
+## Theoretical basis
+
+At heart, the file contains packed tries with suffix compression, actually quite similar
+to the implementation in TeX.
+
+The file contains three sections. The first section represents the "alphabet," including
+case folding. It is effectively a map from Unicode code point to a small integer.
+
+The second section contains the trie in packed form. It is an array of 3-tuples, packed
+into a 32 bit integer. Each (suffix-compressed) trie node has a unique index within this
+array, and the pattern field in the tuple is the pattern for that node. Further, each edge
+in the trie has an entry in the array, and the character and link fields in the tuple
+represent the label and destination of the edge. The packing strategy is as in
+[Word Hy-phen-a-tion by Com-put-er](http://www.tug.org/docs/liang/liang-thesis.pdf) by
+Franklin Mark Liang.
+
+The trie representation is similar but not identical to the "double-array trie".
+The fundamental operation of lookup of the edge from `s` to `t` with label `c` is
+to compare `c == character[s + c]`, and if so, `t = link[s + c]`.
+
+The third section contains the pattern strings. This section is in two parts: first,
+an array with a 3-tuple for each pattern (length, number of trailing 0's, and offset
+into the string pool); and second, the string pool. Each pattern is encoded as a byte
+(packing 2 per byte would be possible but the space savings would not be signficant).
+
+As much as possible of the file is represented as 32 bit integers, as that is especially
+efficent to access. All are little-endian (this could be revised if the code ever needs
+to be ported to big-endian systems).
+
+## Header
+
+```
+uint32_t magic == 0x62ad7968
+uint32_t version = 0
+uint32_t alphabet_offset (in bytes)
+uint32_t trie_offset (in bytes)
+uint32_t pattern_offset (in bytes)
+uint32_t file_size (in bytes)
+```
+
+Offsets are from the front of the file, and in bytes.
+
+## Alphabet
+
+The alphabet table comes in two versions. The first is well suited to dense Unicode
+ranges and is limited to 256. The second is more general, but lookups will be slower.
+
+### Alphabet, direct version
+
+```
+uint32_t version = 0
+uint32_t min_codepoint
+uint32_t max_codepoint (exclusive)
+uint8_t[] data
+```
+
+The size of the data array is max_codepoint - min_codepoint. 0 represents an unmapped
+character. Note that, in the current implementation, automatic hyphenation is disabled
+for any word containing an unmapped character.
+
+In general, pad bytes follow this table, aligning the next table to a 4-byte boundary.
+
+### Alphabet, general version
+
+```
+uint32_t version = 1
+uint32_t n_entries
+uint32_t[n_entries] data
+```
+
+Each element in the data table is `(codepoint << 11) | value`. Note that this is
+restricted to 11 bits (2048 possible values). The largest known current value is 483
+(for Sanskrit).
+
+The entries are sorted by codepoint, to facilitate binary search. Another reasonable
+implementation for consumers of the data would be to build a hash table at load time.
+
+## Trie
+
+```
+uint32_t version = 0
+uint32_t char_mask
+uint32_t link_shift
+uint32_t link_mask
+uint32_t pattern_shift
+uint32_t n_entries
+uint32_t[n_entries] data
+```
+
+Each element in the data table is `(pattern << pattern_shift) | (link << link_shift) | char`.
+
+All known pattern tables fit in 32 bits total. If this is exceeded, there is a fairly
+straightforward tweak, where each node occupies a slot by itself (as opposed to sharing
+it with edge slots), which would require very minimal changes to the implementation (TODO
+present in more detail).
+
+## Pattern
+
+```
+uint32_t version = 0
+uint32_t n_entries
+uint32_t pattern_offset (in bytes)
+uint32_t pattern_size (in bytes)
+uint32_t[n_entries] data
+uint8_t[] pattern_buf
+```
+
+Each element in data table is `(len << 26) | (shift << 20) | offset`, where an offset of 0
+points to the first byte of pattern_buf.
+
+Generally pattern_offset is `16 + 4 * n_entries`.
+
+For example, 'a4m5ato' would be represented as `[4, 5, 0, 0, 0]`, then len = 2, shift = 3, and
+offset points to [4, 5] in the pattern buffer.
+
+Future extension: additional data representing nonstandard hyphenation. See
+[Automatic non-standard hyphenation in OpenOffice.org](https://www.tug.org/TUGboat/tb27-1/tb86nemeth.pdf)
+for more information about that issue.
-- 
cgit v1.2.3