author | jshin@chromium.org <jshin@chromium.org> | 2014-10-13 23:06:48 +0000 |
---|---|---|
committer | jshin@chromium.org <jshin@chromium.org> | 2014-10-13 23:06:48 +0000 |
commit | 6ea11f3e257a813015220ff23d7a105d680ebb73 (patch) | |
tree | 8c25cb51f25029565c38864f19761909d70afd70 /scripts | |
parent | 8ac906faf7b66180f2208380c35ae1e07136c5cc (diff) | |
download | icu-6ea11f3e257a813015220ff23d7a105d680ebb73.tar.gz |
Make all the single-byte encodings compliant with the encoding spec.
1. Replace the current encoding alias list (heavily patched) with our own
HTML5-specific alias list. It's mostly generated from encoding.json, which
is in turn derived from the WHATWG Encoding living standard. The most notable
difference is that UTF-32 entries are kept until bug 417850 is resolved.
Two other differences are:
a. Two aliases for iso-8859-8-i (logical and csiso88598i) are not listed.
They're dealt with in Blink.
b. Chinese (gb*, big5*) aliases are not yet aligned to the encoding
spec pending our decision on the unification of Big5 / Big5-HKSCS and
GBK / GB18030.
2. Replace all the single-byte mapping tables with tables automatically
generated by scripts/single_byte_gen.sh, which uses index-* files downloaded
from the WHATWG spec site. This fixes the decoding (ToUnicode) of
windows-874 and windows-1253 and removes many fallback/spurious
mapping entries in the encoding direction ('FromUnicode') in a number of encodings.
3. Regenerate the ICU binary data files for Linux/Mac/Android/Windows/CrOS.
4. Remove the now-obsolete noop-*ucm files that were used to make the
ISO-2022-CN* decoders return an empty string. They are no longer necessary
because ISO-2022-CN* were made 'replacement' encodings in Blink and our
version of ICU no longer has any code for ISO-2022-CN*.
This cuts down the data size by 15kB. On Android, there's virtually
no change in the data size because the previous data file on Android
accidentally had smaller locale data for nb and ms.
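Item 2 distinguishes roundtrip mappings from fallback entries in the FromUnicode direction. In ICU's .ucm format, the trailing `|0` marks a roundtrip mapping and `|1` a fallback used only when converting from Unicode. The following is a minimal sketch, not part of this change, of how one might tally those flags in a mapping table; the function name `count_ucm_flags` and the sample lines are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of this commit): tally UCM mapping lines
# read from stdin by their precision flag.
#   |0 = roundtrip mapping
#   |1 = fallback (Unicode to bytes only)
count_ucm_flags() {
  awk '/^<U[0-9A-Fa-f]+>/ {
         flag = $NF            # last field of a mapping line, e.g. "|0"
         count[flag]++
       }
       END {
         for (f in count) printf("%s %d\n", f, count[f])
       }' | sort
}

# Made-up sample lines in CHARMAP syntax (not taken from a real table).
count_ucm_flags <<'EOF'
<U0041> \x41 |0
<U20AC> \x80 |0
<UFF21> \x41 |1
EOF
```

A table produced by the generator script should report only `|0` entries, since the script emits every mapping with that flag.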
BUG=412053
TEST=browser_tests --gtest_filter="*ncoding*"
TEST=net_unittest --gtest_filter="*ilenameUtil*"
TEST=base_unittests --gtest_filter="*Conv*"
TEST=Blink: fast/encoding/*
TEST=http://www.w3.org/International/tests/repository/encoding/indexes/results-indexes
TEST=http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
TEST=http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1253_test
TEST=http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-874_test
R=jsbell@chromium.org
Review URL: https://codereview.chromium.org/598383002
git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@292447 4ff67af0-8c30-449e-8e8b-ad334ec8d88c
Diffstat (limited to 'scripts')
-rwxr-xr-x | scripts/single_byte_gen.sh | 62 |
1 files changed, 62 insertions, 0 deletions
diff --git a/scripts/single_byte_gen.sh b/scripts/single_byte_gen.sh
new file mode 100755
index 0000000..b5a5514
--- /dev/null
+++ b/scripts/single_byte_gen.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+# Copyright (c) 2014 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+
+function preamble {
+
+encoding="$1"
+cat <<PREAMBLE
+# ***************************************************************************
+# *
+# * Generated from index-$encoding.txt (
+# * https://encoding.spec.whatwg.org/index-${encoding}.txt )
+# * following the algorithm for the single byte legacy encoding
+# * described at http://encoding.spec.whatwg.org/#single-byte-decoder
+# *
+# ***************************************************************************
+<code_set_name> "${encoding}-html"
+<char_name_mask> "AXXXX"
+<mb_cur_max> 1
+<mb_cur_min> 1
+<uconv_class> "SBCS"
+<subchar> \x3F
+<icu:charsetFamily> "ASCII"
+
+CHARMAP
+PREAMBLE
+
+}
+
+# The list of html5 encodings. Note that iso-8859-8-i is not listed here
+# because its mapping table is exactly the same as iso-8859-8. The difference
+# is BiDi handling (logical vs visual).
+encodings="ibm866 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6\
+ iso-8859-7 iso-8859-8 iso-8859-10 iso-8859-13 iso-8859-14\
+ iso-8859-15 iso-8859-16 koi8-r koi8-u macintosh\
+ windows-874 windows-1250 windows-1251 windows-1252 windows-1253\
+ windows-1254 windows-1255 windows-1256 windows-1257 windows-1258\
+ x-mac-cyrillic"
+
+ENCODING_DIR="$(dirname $0)/../source/data/mappings"
+for e in ${encodings}
+do
+  output="${ENCODING_DIR}/${e}-html.ucm"
+  index="index-${e}.txt"
+  indexurl="https://encoding.spec.whatwg.org/index-${e}.txt"
+  curl -o ${index} "${indexurl}"
+  preamble ${e} > ${output}
+  awk 'BEGIN \
+       { \
+         for (i=0; i < 0x80; ++i) \
+         { \
+           printf("<U%04X> \\x%02X |0\n", i, i);} \
+       } \
+       !/^#/ && !/^$/ \
+       {
+         printf ("<U%4s> \\x%02X |0\n", substr($2, 3), $1 + 0x80); \
+       }' ${index} | sort >> ${output}
+  echo 'END CHARMAP' >> ${output}
+  rm ${index}
+done
+
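The second awk rule in the script turns each index line (a decimal pointer followed by a code point written as 0xXXXX) into a roundtrip mapping for byte pointer + 0x80. The sketch below isolates that step so it can be tried offline, without curl. The function name `index_to_ucm` is hypothetical, the two sample lines are modeled on the index file format, and the offset is written as decimal 128 on the assumption that not every awk implementation accepts hexadecimal constants in program text.

```shell
#!/bin/bash
# Hypothetical helper (not part of this commit): convert WHATWG index lines
# read from stdin into UCM roundtrip mapping lines for bytes 0x80-0xFF.
index_to_ucm() {
  awk '!/^#/ && !/^$/ {
         # $1 is the index pointer (decimal); $2 is the code point as 0xXXXX.
         # 128 is 0x80, the first byte value outside the ASCII range.
         printf("<U%4s> \\x%02X |0\n", substr($2, 3), $1 + 128)
       }'
}

# Sample input modeled on the index file format: pointer, then code point.
# Comment lines and empty lines are skipped, as in the original script.
printf '# sample index\n\n  0\t0x0402\n127\t0x044F\n' | index_to_ucm
# Prints:
# <U0402> \x80 |0
# <U044F> \xFF |0
```

The ASCII half of each table (bytes 0x00 to 0x7F) comes from the BEGIN rule in the script and is left out of this sketch.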