recycling internationalized garbage
rridge at csclub.uwaterloo.ca
Fri Mar 17 00:11:06 CET 2006
Martin v. Löwis wrote:
> So "valid" yes; "meaningful" no. Therefore, for all practical
> purposes, 8-bit single-byte characters sets *will not* produce
> byte sequences that are valid in UTF-8 (although they could -
> it just won't happen).
> > In fact I can't think of any multi-byte encoding that can't produce
> > valid UTF-8 byte sequence.
> The same reasoning applies for them.
While your reasoning may apply to European single-byte character
sets, it doesn't apply as well to Far East multi-byte encodings. Take
ISO-2022-JP (RFC 1468) for example: it uses only 7-bit bytes, so any
string is valid UTF-8 as far as Python is concerned. About 1% of the
EUC-JP encoded words and phrases listed in EDICT, a Japanese-English
dictionary, decode as valid UTF-8 strings. I get similar results with
CEDICT, a Chinese-English dictionary: about 1% for the Big5 encoded
version of the file and about 4.5% for the GB 2312 version.
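
A rough sketch of how such a measurement could be made. This is not the
script behind the figures above; it assumes Python 3, and the helper
names decodes_as_utf8 and fraction_valid_utf8 are made up for
illustration. The word list would come from a dictionary file such as
EDICT or CEDICT, which is not shown here.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fraction_valid_utf8(words, encoding):
    """Fraction of words whose bytes in `encoding` also decode as UTF-8."""
    encoded = [word.encode(encoding) for word in words]
    if not encoded:
        return 0.0
    hits = sum(decodes_as_utf8(raw) for raw in encoded)
    return hits / len(encoded)


# ISO-2022-JP output is 7-bit (escape sequences plus ASCII-range bytes),
# so it always passes a UTF-8 validity check.
japanese = "\u65e5\u672c\u8a9e"  # three kanji, written as escapes
assert decodes_as_utf8(japanese.encode("iso2022_jp"))

# These two bytes are one GB 2312 character, but they are also a
# well-formed two-byte UTF-8 sequence, so validity alone cannot tell
# which encoding was intended.
ambiguous = b"\xc4\xa3"
ambiguous.decode("gb2312")         # decodes without error
assert decodes_as_utf8(ambiguous)  # and so does the UTF-8 interpretation
```

Calling fraction_valid_utf8(words, "euc_jp") on a list of dictionary
headwords would give the kind of percentage quoted above, though the
exact number depends on which entries are counted.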
It would be nearly impossible to find all the strings in Freedb that
decode as UTF-8 but aren't really encoded in UTF-8, but they do exist.
One example I managed to find is the GB 2312 encoded TTITLE5 and
TTITLE13 records of disc id 020f5210.