Is there really a default source encoding?

Brian Quinlan brian at sweetapp.com
Fri Jan 24 08:20:14 CET 2003


> > UTF-8 is certainly not "anglo-neutral". It is often 
> > prohibitively expensive to encode Japanese and 
> > Chinese text in UTF-8 (UTF-16 is much more popular).
> 
> When did "anglo" come to mean "Japanese and Chinese"?

I thought that the OP meant that it was not biased towards English. My
mistake.
 
> UTF-16 is a horrible hack to work around Unicode failings...i.e., 
> that it started out as a 16 bit system and ended up morphing into
> ISO-10646...which UTF-16 doesn't actually solve, anyway, besides 
> being more prohibitively expensive for non-CJK users than UTF-8 is 
> for them.

UTF-16 is a compromise encoding. It is equally crappy for almost
everyone. It is also a lot easier to process than UTF-8 for most CJK
applications; for example, it can often be processed as UCS-2.

> [Why UTF-16, rather than UCS-2, though?  Is there something in the
> UTF-16 accessible-only-by-surrogate region that CJK users should care
> about?]

Some of the new Japanese characters (i.e. dentistry symbols) are only
available through surrogates.
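
To make the UCS-2 versus UTF-16 distinction concrete, here is a small
Python sketch. The helper names are my own, and U+20BB7 (a CJK
Extension B ideograph) is picked purely as an example of a code point
outside the Basic Multilingual Plane. Anything in the BMP is a single
16-bit code unit, so UCS-2 style processing works; anything above it
needs a surrogate pair.

```python
def utf16_code_units(ch):
    """Number of 16-bit code units needed to encode one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def surrogate_pair(ch):
    """Return the (high, low) surrogates for a character above U+FFFF."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is in the BMP; no surrogates needed")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "\u65e5"          # U+65E5, inside the BMP
astral_char = "\U00020BB7"   # U+20BB7, outside the BMP (illustrative choice)

print(utf16_code_units(bmp_char))     # 1
print(utf16_code_units(astral_char))  # 2
print([hex(u) for u in surrogate_pair(astral_char)])  # ['0xd842', '0xdfb7']
```

Code that treats UTF-16 data as UCS-2 would see the second character
as two separate 16-bit values.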

> Is there such a thing as UTF-32?  You mean UCS-4?
> And you said UTF-8 was "prohibitively expensive" ???

No, UTF-32 exists. For Japanese, UTF-8 requires (at minimum) 50% more
space per character than UTF-16. I was being facetious with my UTF-32
comment. But UTF-32 may become more efficient than UTF-16, for some
languages (e.g. Sanskrit), in the future.
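
The size difference between UTF-8 and UTF-16 for Japanese is easy to
check. This is only a sketch: the function name and the sample strings
are mine, and the BOM-less "-le" codecs are used so that only the
character data is counted.

```python
def encoded_sizes(text):
    """Byte length of text in UTF-8, UTF-16 and UTF-32, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


japanese = "日本語のテキスト"  # 8 characters, all in the BMP
english = "plain text"        # 10 ASCII characters

print(encoded_sizes(japanese))  # {'utf-8': 24, 'utf-16': 16, 'utf-32': 32}
print(encoded_sizes(english))   # {'utf-8': 10, 'utf-16': 20, 'utf-32': 40}
```

Kana and kanji in the BMP take 3 bytes each in UTF-8 against 2 in
UTF-16, which is where the 50% figure comes from; for ASCII text the
ratio goes the other way.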

> >> Great. Only are you sure that BOMs are such a great idea?
> 
> It's an immensely stupid idea in a byte-oriented encoding like UTF-8.
> [Though it's pretty dumb in "wide" encodings, too]

I don't understand. In UTF-8, the BOM allows you to easily distinguish
between documents with UTF-8 encoding and a locale-dependent
byte encoding. For encodings with multi-byte code units (e.g. UTF-16)
it is impossible to decode the data without knowing the byte order. Do
you have some other solution with a feasible implementation?
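
For what it's worth, BOM sniffing takes only a few lines. The sketch
below uses the BOM constants from Python's codecs module; the function
name and the "latin-1" fallback are my own stand-ins for whatever
locale-dependent encoding applies, not anything a particular tool
actually does.

```python
import codecs

# Longer BOMs go first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, fallback="latin-1"):
    """Return (encoding, payload) with any leading BOM stripped.

    `fallback` is a hypothetical placeholder for the locale encoding,
    used when no BOM is present.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return fallback, data


encoding, payload = sniff_bom(codecs.BOM_UTF16_BE + "é".encode("utf-16-be"))
print(encoding, payload.decode(encoding))
```

Without a BOM the function can only guess, which is the point of the
question above.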

> > I don't really care about how screwed-up Unix Unicode handling is.
> 
> What's "screwed up" about it?

Did you read the OP's link?

Cheers,
Brian





