[I18n-sig] First draft of Unicode howto
"Martin v. Löwis"
martin at v.loewis.de
Sun Aug 7 17:35:25 CEST 2005
A.M. Kuchling wrote:
> The 'Tips for Writing Unicode-aware Programs' is also very sparse,
> because I couldn't come up with much of anything very helpful.
> Suggestions for this section would also be appreciated.
Some remarks as I go through:
- UTF-8 uses 4 bytes for characters at or above U+10000 (i.e. non-BMP
characters), and 3 bytes in the range U+0800...U+FFFF
- if you want to, you can further restrict the value ranges for
the UTF-8 bytes: the second, third, and fourth bytes are always
between 128 and 191; the first byte is 192..223 for two-byte,
224..239 for three-byte, and 240..247 for four-byte sequences.
Because of this property, you can resynchronize after a damaged
or truncated sequence by skipping to the next byte outside
128..191 (not that I'm aware of any application that commonly
uses resynchronization). But, for the same reason, it is unlikely
that you will encounter bytes that look like UTF-8 but aren't.
- The example for Unicode literals with encoding errors renders
incorrectly (I see a question mark)
- If you mention Unicode character categories, you should elaborate
a bit. Unicode categories are things like "Letter", "Symbol",
"Punctuation", with subcategories like "Uppercase" or "Dash".
The list of all categories is at
- reading data: you could point out that IO libraries sometimes
already input and output Unicode directly, with the most
prominent examples being GUI, XML, and databases; developers
should check whether their library supports Unicode.
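
To make the remarks on UTF-8 byte ranges and character categories concrete, here is a minimal sketch, assuming Python 3 (where iterating over bytes yields integers). The helper names utf8_sequence_length and resynchronize are made up for illustration and are not part of any library; only str.encode and unicodedata.category are standard calls.

```python
import unicodedata


def utf8_sequence_length(byte):
    """Classify one byte using the value ranges listed above.

    Returns the sequence length for a first byte, 0 for a
    continuation byte, or None for a byte that never occurs.
    (Hypothetical helper, for illustration only.)
    """
    if byte < 128:
        return 1      # plain ASCII, one-byte sequence
    if byte < 192:
        return 0      # 128..191: second, third, or fourth byte
    if byte < 224:
        return 2      # 192..223: first byte of a two-byte sequence
    if byte < 240:
        return 3      # 224..239: first byte of a three-byte sequence
    if byte < 248:
        return 4      # 240..247: first byte of a four-byte sequence
    return None       # 248..255 do not appear in UTF-8


def resynchronize(data, pos):
    """Skip continuation bytes until the start of the next sequence."""
    while pos < len(data) and 128 <= data[pos] <= 191:
        pos += 1
    return pos


# One sample character per sequence length: ASCII, U+00E9, U+20AC,
# and a non-BMP character (U+10400).
for ch in ["A", "\u00e9", "\u20ac", "\U00010400"]:
    encoded = ch.encode("utf-8")
    # The first byte alone predicts the length of the whole sequence.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    # Category codes are two letters: main category plus subcategory,
    # e.g. "Lu" is Letter/uppercase and "Sc" is Symbol/currency.
    print("U+%04X" % ord(ch), len(encoded), unicodedata.category(ch))
```

Calling resynchronize on the bytes of "a\u20acb" starting from the middle of the euro sign (position 2) moves forward to position 4, the start of "b". The classifier only looks at the ranges quoted above; it does not reject overlong or otherwise invalid sequences, which a real decoder has to do.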