A few questions about encoding
Nobody
nobody at nowhere.com
Wed Jun 12 16:30:23 EDT 2013
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> So, how many bytes does UTF-8 store for codepoints > 127?
U+0000..U+007F      1 byte
U+0080..U+07FF      2 bytes
U+0800..U+FFFF      3 bytes
U+10000..U+10FFFF   4 bytes
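So, 1 byte for ASCII; 2 bytes for most other Latin characters, Greek,
Cyrillic, Arabic, and Hebrew; 3 bytes for the rest of the Basic Multilingual
Plane, which includes most Chinese/Japanese/Korean; and 4 bytes for the
supplementary planes (dead languages, rarer CJK ideographs, and the
mathematical alphanumeric symbols).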
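You can check the table from the interpreter. This is only a quick sketch,
assuming Python 3 (where str is Unicode); the helper name utf8_length and the
sample characters are my own picks, one or two from each range:

```python
# Sample characters, written as escapes so the source stays pure ASCII.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin small e with acute
    "\u03b1",      # U+03B1, Greek small alpha
    "\u044f",      # U+044F, Cyrillic small ya
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001d518",  # U+1D518, a mathematical Fraktur capital (outside the BMP)
]

def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode the string ch."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print("U+%04X -> %d byte(s)" % (ord(ch), utf8_length(ch)))
```

The counts it prints should line up with the ranges in the table above.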
The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 cannot encode anything above U+10FFFF (slightly more
than 20 bits), so Unicode stops there and standard UTF-8 (RFC 3629) never
needs more than 4 bytes.
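The bit counts can be worked out from the byte layout. A rough sketch, again
assuming Python 3; payload_bits is a name I made up for illustration, and it
models the original 6-byte scheme rather than what any current codec accepts:

```python
def payload_bits(n):
    """Payload bits carried by an n-byte sequence in the original UTF-8 scheme."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be 1..6")
    if n == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, a zero bit, then 7 - n payload bits.
    # Each of the n - 1 continuation bytes (10xxxxxx) carries 6 payload bits.
    return (7 - n) + 6 * (n - 1)

for n in range(1, 7):
    print("%d byte(s): %2d bits" % (n, payload_bits(n)))

# Bits needed to represent the highest code point UTF-16 can reach.
print("U+10FFFF needs %d bits" % (0x10FFFF).bit_length())
```

A 4-byte sequence has room for 21 bits, which is exactly what U+10FFFF
needs, so the 5- and 6-byte forms are never required.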