author Alexander Gutkin <agutkin@google.com> 2014-02-28 11:33:45 +0000
committer Alexander Gutkin <agutkin@google.com> 2014-02-28 11:33:45 +0000
commit 439f3d1f87279a8be383ee01ef98cb9a5ca68573
tree de42c34fb1e2a4f5997782c71730ffea07667d52 /utf.7
parent 86456b0f43cde20930f39abe12c9255c3e185712
Initial revision of libutf library.
Libutf is a port of Plan 9's support library for UTF-8 and Unicode.
Downloaded from http://swtch.com/plan9port/unix/libutf.tgz.
No modifications required to compile.

Change-Id: I5646bc8709bafc14039d30e28a0c69a804e78548
Diffstat (limited to 'utf.7')
-rw-r--r-- utf.7 99
1 files changed, 99 insertions, 0 deletions
diff --git a/utf.7 b/utf.7
new file mode 100644
index 0000000..13eea25
--- /dev/null
+++ b/utf.7
@@ -0,0 +1,99 @@
+.deEX
+.ift .ft5
+.nf
+..
+.deEE
+.ft1
+.fi
+..
+.TH UTF 7
+.SH NAME
+UTF, Unicode, ASCII, rune \- character set and format
+.SH DESCRIPTION
+The Plan 9 character set and representation are
+based on the Unicode Standard and on the ISO multibyte
+.SM UTF-8
+encoding (Universal Character
+Set Transformation Format, 8 bits wide).
+The Unicode Standard represents its characters in 16
+bits;
+.SM UTF-8
+represents such
+values in an 8-bit byte stream.
+Throughout this manual,
+.SM UTF-8
+is shortened to
+.SM UTF.
+.PP
+In Plan 9, a
+.I rune
+is a 16-bit quantity representing a Unicode character.
+Internally, programs may store characters as runes.
+However, any external manifestation of textual information,
+in files or at the interface between programs, uses a
+machine-independent, byte-stream encoding called
+.SM UTF.
+.PP
+.SM UTF
+is designed so the 7-bit
+.SM ASCII
+set (values hexadecimal 00 to 7F)
+appears only as itself
+in the encoding.
+Runes with values above 7F appear as sequences of two or more
+bytes with values only from 80 to FF.
+.PP
+The
+.SM UTF
+encoding of the Unicode Standard is backward compatible with
+.SM ASCII\c
+:
+programs presented only with
+.SM ASCII
+work on Plan 9
+even if not written to deal with
+.SM UTF,
+as do
+programs that deal with uninterpreted byte streams.
+However, programs that perform semantic processing on
+.SM ASCII
+graphic
+characters must convert from
+.SM UTF
+to runes
+in order to work properly with non-\c
+.SM ASCII
+input.
+See
+.IR rune (3).
+.PP
+Letting numbers be binary,
+a rune x is converted to a multibyte
+.SM UTF
+sequence
+as follows:
+.PP
+01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
+Conversion 01 provides a one-byte sequence that spans the
+.SM ASCII
+character set in a compatible way.
+Conversions 10 and 11 represent higher-valued characters
+as sequences of two or three bytes with the high bit set.
+Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+When there are multiple ways to encode a value, for example rune 0,
+the shortest encoding is used.
+.PP
+In the inverse mapping,
+any sequence except those described above
+is incorrect and is converted to rune hexadecimal 0080.
+.SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
+.IR rune (3),
+.IR "The Unicode Standard" .
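
Note on the conversion table in the added page: the three conversions (01, 10, 11) and the inverse mapping can be sketched in C roughly as below. This is an illustration only, not the libutf source. The names sketch_encode, sketch_decode, and SKETCH_BAD_RUNE are hypothetical, and consuming exactly one byte on a malformed sequence is an assumption; the page only states that such input is converted to rune hexadecimal 0080.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; this is not the libutf API. */
enum { SKETCH_BAD_RUNE = 0x0080 };  /* rune produced for incorrect input */

/* Encode a 16-bit rune into out[0..2]; returns the number of bytes written. */
int sketch_encode(uint16_t r, unsigned char *out)
{
    assert(out != NULL);
    if (r <= 0x7F) {                /* conversion 01: 0bbbbbbb */
        out[0] = (unsigned char)r;
        return 1;
    }
    if (r <= 0x7FF) {               /* conversion 10: 110bbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xC0 | (r >> 6));
        out[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    /* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
    out[0] = (unsigned char)(0xE0 | (r >> 12));
    out[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (r & 0x3F));
    return 3;
}

/*
 * Decode one rune from in[0..n-1] into *r; returns the bytes consumed.
 * Anything that is not a shortest-form sequence yields SKETCH_BAD_RUNE.
 * Assumption: a bad sequence consumes exactly one byte.
 */
int sketch_decode(const unsigned char *in, size_t n, uint16_t *r)
{
    assert(in != NULL && r != NULL && n > 0);
    unsigned c0 = in[0];

    if (c0 < 0x80) {                /* inverse of conversion 01 */
        *r = (uint16_t)c0;
        return 1;
    }
    if ((c0 & 0xE0) == 0xC0 && n >= 2 && (in[1] & 0xC0) == 0x80) {
        unsigned v = ((c0 & 0x1Fu) << 6) | (in[1] & 0x3Fu);
        if (v > 0x7F) {             /* reject overlong two-byte forms */
            *r = (uint16_t)v;
            return 2;
        }
    } else if ((c0 & 0xF0) == 0xE0 && n >= 3 &&
               (in[1] & 0xC0) == 0x80 && (in[2] & 0xC0) == 0x80) {
        unsigned v = ((c0 & 0x0Fu) << 12) | ((in[1] & 0x3Fu) << 6)
                   | (in[2] & 0x3Fu);
        if (v > 0x7FF) {            /* reject overlong three-byte forms */
            *r = (uint16_t)v;
            return 3;
        }
    }
    *r = SKETCH_BAD_RUNE;
    return 1;
}
```

Under these assumptions, rune 20AC takes conversion 11 and yields the bytes E2 82 AC, while the overlong pair C0 80 (a two-byte spelling of rune 0) is rejected and decodes to 0080.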