diff options
author | Alexander Gutkin <agutkin@google.com> | 2014-02-28 11:33:45 +0000 |
---|---|---|
committer | Alexander Gutkin <agutkin@google.com> | 2014-02-28 11:33:45 +0000 |
commit | 439f3d1f87279a8be383ee01ef98cb9a5ca68573 (patch) | |
tree | de42c34fb1e2a4f5997782c71730ffea07667d52 /utf.7 | |
parent | 86456b0f43cde20930f39abe12c9255c3e185712 (diff) | |
download | libutf-439f3d1f87279a8be383ee01ef98cb9a5ca68573.tar.gz |
Initial revision of libutf library.
Libutf is a port of Plan 9's support library for UTF-8 and Unicode.
Downloaded from http://swtch.com/plan9port/unix/libutf.tgz. No
modifications required to compile.
Change-Id: I5646bc8709bafc14039d30e28a0c69a804e78548
Diffstat (limited to 'utf.7')
-rw-r--r-- | utf.7 | 99 |
1 files changed, 99 insertions, 0 deletions
@@ -0,0 +1,99 @@ +.deEX +.ift .ft5 +.nf +.. +.deEE +.ft1 +.fi +.. +.TH UTF 7 +.SH NAME +UTF, Unicode, ASCII, rune \- character set and format +.SH DESCRIPTION +The Plan 9 character set and representation are +based on the Unicode Standard and on the ISO multibyte +.SM UTF-8 +encoding (Universal Character +Set Transformation Format, 8 bits wide). +The Unicode Standard represents its characters in 16 +bits; +.SM UTF-8 +represents such +values in an 8-bit byte stream. +Throughout this manual, +.SM UTF-8 +is shortened to +.SM UTF. +.PP +In Plan 9, a +.I rune +is a 16-bit quantity representing a Unicode character. +Internally, programs may store characters as runes. +However, any external manifestation of textual information, +in files or at the interface between programs, uses a +machine-independent, byte-stream encoding called +.SM UTF. +.PP +.SM UTF +is designed so the 7-bit +.SM ASCII +set (values hexadecimal 00 to 7F), +appear only as themselves +in the encoding. +Runes with values above 7F appear as sequences of two or more +bytes with values only from 80 to FF. +.PP +The +.SM UTF +encoding of the Unicode Standard is backward compatible with +.SM ASCII\c +: +programs presented only with +.SM ASCII +work on Plan 9 +even if not written to deal with +.SM UTF, +as do +programs that deal with uninterpreted byte streams. +However, programs that perform semantic processing on +.SM ASCII +graphic +characters must convert from +.SM UTF +to runes +in order to work properly with non-\c +.SM ASCII +input. +See +.IR rune (3). +.PP +Letting numbers be binary, +a rune x is converted to a multibyte +.SM UTF +sequence +as follows: +.PP +01. x in [00000000.0bbbbbbb] → 0bbbbbbb +.br +10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb +.br +11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb +.br +.PP +Conversion 01 provides a one-byte sequence that spans the +.SM ASCII +character set in a compatible way. +Conversions 10 and 11 represent higher-valued characters +as sequences of two or three bytes with the high bit set. +Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open. +When there are multiple ways to encode a value, for example rune 0, +the shortest encoding is used. +.PP +In the inverse mapping, +any sequence except those described above +is incorrect and is converted to rune hexadecimal 0080. +.SH "SEE ALSO" +.IR ascii (1), +.IR tcs (1), +.IR rune (3), +.IR "The Unicode Standard" . |