author jshin@chromium.org <jshin@chromium.org> 2014-10-13 23:06:48 +0000
committer jshin@chromium.org <jshin@chromium.org> 2014-10-13 23:06:48 +0000
commit 6ea11f3e257a813015220ff23d7a105d680ebb73
tree 8c25cb51f25029565c38864f19761909d70afd70 /scripts
parent 8ac906faf7b66180f2208380c35ae1e07136c5cc

Make all the single-byte encodings compliant with the encoding spec.

1. Replace the current (heavily patched) encoding alias list with our own HTML5-specific alias list. It's mostly generated from encoding.json, which is in turn derived from the WHATWG Encoding living standard. The most notable difference is that UTF-32 entries are kept until bug 417850 is resolved. Two other differences are:
   a. Two aliases for iso-8859-8-i (logical and csiso88598i) are not listed. They're dealt with in Blink.
   b. Chinese (gb*, big5*) aliases are not yet aligned to the encoding spec, pending our decision on the unification of Big5 / Big5-HKSCS and GBK / GB18030.

2. Replace all the single-byte mapping tables with tables generated automatically by scripts/single_byte_gen.sh, which uses index-* files downloaded from the WHATWG spec site. This fixes the decoding (ToUnicode) of windows-874 and windows-1253 and removes a lot of fallback/spurious mapping entries in the encoding direction ('FromUnicode') in a number of encodings.

3. Regenerate the ICU binary data files for Linux/Mac/Android/Windows/CrOS.

4. Remove the now-obsolete noop-*ucm files that were used to make the ISO-2022-CN* decoders return an empty string. They're not necessary any more because ISO-2022-CN* were made 'replacement' encodings in Blink and our version of ICU does not have any code for ISO-2022-CN* any more.

This cuts down the data size by 15kB. On Android, there's virtually no change in the data size because the previous data file on Android accidentally had smaller locale data for nb and ms.

BUG=412053
TEST=browser_tests --gtest_filter="*ncoding*"
TEST=net_unittest --gtest_filter="*ilenameUtil*"
TEST=base_unittests --gtest_filter="*Conv*"
TEST=Blink: fast/encoding/*
TEST=http://www.w3.org/International/tests/repository/encoding/indexes/results-indexes
TEST=http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
TEST=http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1253_test
TEST=http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-874_test
R=jsbell@chromium.org

Review URL: https://codereview.chromium.org/598383002

git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@292447 4ff67af0-8c30-449e-8e8b-ad334ec8d88c

Diffstat (limited to 'scripts')
-rwxr-xr-x scripts/single_byte_gen.sh 62
1 files changed, 62 insertions, 0 deletions
diff --git a/scripts/single_byte_gen.sh b/scripts/single_byte_gen.sh
new file mode 100755
index 0000000..b5a5514
--- /dev/null
+++ b/scripts/single_byte_gen.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+# Copyright (c) 2014 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+
+function preamble {
+
+encoding="$1"
+cat <<PREAMBLE
+# ***************************************************************************
+# *
+# * Generated from index-$encoding.txt (
+# * https://encoding.spec.whatwg.org/index-${encoding}.txt )
+# * following the algorithm for the single byte legacy encoding
+# * described at http://encoding.spec.whatwg.org/#single-byte-decoder
+# *
+# ***************************************************************************
+<code_set_name> "${encoding}-html"
+<char_name_mask> "AXXXX"
+<mb_cur_max> 1
+<mb_cur_min> 1
+<uconv_class> "SBCS"
+<subchar> \x3F
+<icu:charsetFamily> "ASCII"
+
+CHARMAP
+PREAMBLE
+
+}
+
+# The list of html5 encodings. Note that iso-8859-8-i is not listed here
+# because its mapping table is exactly the same as iso-8859-8. The difference
+# is BiDi handling (logical vs visual).
+encodings="ibm866 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6\
+ iso-8859-7 iso-8859-8 iso-8859-10 iso-8859-13 iso-8859-14\
+ iso-8859-15 iso-8859-16 koi8-r koi8-u macintosh\
+ windows-874 windows-1250 windows-1251 windows-1252 windows-1253\
+ windows-1254 windows-1255 windows-1256 windows-1257 windows-1258\
+ x-mac-cyrillic"
+
+ENCODING_DIR="$(dirname "$0")/../source/data/mappings"
+for e in ${encodings}
+do
+ output="${ENCODING_DIR}/${e}-html.ucm"
+ index="index-${e}.txt"
+ indexurl="https://encoding.spec.whatwg.org/index-${e}.txt"
+ curl -o "${index}" "${indexurl}"
+ preamble "${e}" > "${output}"
+ awk 'BEGIN \
+ { \
+ for (i=0; i < 0x80; ++i) \
+ { \
+ printf("<U%04X> \\x%02X |0\n", i, i);} \
+ } \
+ !/^#/ && !/^$/ \
+ {
+ printf ("<U%4s> \\x%02X |0\n", substr($2, 3), $1 + 0x80); \
+ }' "${index}" | sort >> "${output}"
+ echo 'END CHARMAP' >> "${output}"
+ rm "${index}"
+done
+
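
The core of the conversion in the script above can be tried in isolation, without downloading anything. The following is a minimal sketch, not part of the commit: `index_to_ucm` is a hypothetical helper name, and the two sample rows are only assumed to be in the index file layout that the script reads (decimal pointer, `0x`-prefixed code point, then a name). It writes the offset as decimal `128` instead of `0x80`, on the assumption that some awk implementations do not accept hexadecimal constants in program text.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): read index rows on stdin and
# print UCM mapping lines for the upper half (0x80..0xFF) of a
# single-byte encoding.
index_to_ucm() {
  awk '
    # Skip comment lines and blank lines, as the script in the diff does.
    /^#/ || /^[ \t]*$/ { next }
    {
      # $1 is the decimal pointer (0..127); $2 is the code point as 0xHHHH.
      # The byte value is the pointer plus 128 (0x80).
      printf("<U%s> \\x%02X |0\n", substr($2, 3), $1 + 128)
    }'
}

# Two sample rows in the assumed index layout: pointer, code point, name.
printf '# sample\n\n  0\t0x0402\tname\n127\t0x044F\tname\n' | index_to_ucm
```

For the two sample rows this prints `<U0402> \x80 |0` and `<U044F> \xFF |0`. The script in the diff additionally emits the 128 ASCII identity mappings in a `BEGIN` block and sorts the combined output before appending `END CHARMAP`.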