recycling internationalized garbage

Ross Ridge rridge at csclub.uwaterloo.ca
Fri Mar 17 00:11:06 CET 2006


Martin v. Löwis wrote:
> So "valid" yes; "meaningful" no. Therefore, for all practical
> purposes, 8-bit single-byte characters sets *will not* produce
> byte sequences that are valid in UTF-8 (although they could -
> it just won't happen).
>
> > In fact I can't think of any multi-byte encoding that can't produce
> > valid UTF-8 byte sequence.
>
> The same reasoning applies for them.

While your reasoning may apply to European single-byte character
sets, it doesn't apply as well to Far East multi-byte encodings.  Take
ISO-2022-JP (RFC 1468), for example: it uses only 7-bit bytes, so any
string encoded in it is valid UTF-8 as far as Python is concerned.
About 1% of the EUC-JP encoded words and phrases listed in EDICT, a
Japanese-English dictionary, decode as valid UTF-8 strings.  I get
similar results with CEDICT, a Chinese-English dictionary: about 1% for
the Big5 encoded version of the file and about 4.5% for the GB 2312
version.
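
The kind of check described above can be sketched as follows.  This is
only an illustration, assuming Python 3.7 or later; the helper names
decodes_as_utf8 and false_positive_rate are made up for this sketch and
are not the script used to produce the figures quoted above.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def false_positive_rate(entries, encoding):
    """Fraction of non-ASCII `entries` whose bytes in `encoding` pass as UTF-8.

    Pure-ASCII entries are skipped, since they are byte-for-byte the same
    in UTF-8 and in the legacy encodings discussed here.
    """
    candidates = [e for e in entries if not e.isascii()]
    if not candidates:
        return 0.0
    hits = sum(decodes_as_utf8(e.encode(encoding)) for e in candidates)
    return hits / len(candidates)


# ISO-2022-JP output is 7-bit only, so it always passes a UTF-8 check.
iso_bytes = "日本語".encode("iso2022_jp")
print(max(iso_bytes) < 0x80, decodes_as_utf8(iso_bytes))

# Illustrative byte pair (an assumption chosen for this sketch): C4 A1 is
# both a legal EUC-JP two-byte character and a legal UTF-8 two-byte
# sequence, but the two decodings give different text.
sample = b"\xc4\xa1"
print(decodes_as_utf8(sample), sample.decode("euc_jp") != sample.decode("utf-8"))
```

Calling false_positive_rate on the entries of a dictionary file, with
the file's real encoding as the second argument, gives a rough estimate
of how often a "does it decode as UTF-8?" test would guess wrong.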

It would be nearly impossible to find all the strings in Freedb that
decode as UTF-8 but aren't really encoded in UTF-8, but they do exist.
One example I managed to find is the pair of GB 2312 encoded TTITLE5
and TTITLE13 records of disc id 020f5210.

                   Ross Ridge



