[Python-Dev] Unicode howto in the works - feedback appreciated
Wed, 01 May 2002 11:06:50 +0200
"Stephen J. Turnbull" wrote:
> [What Is Unicode?]
> 1. Characters are "atomic units of text" that have properties. Since
> they're atoms, we represent them by integers in computer programs.
> Among the properties are their glyphs (graphical representation),
> classes (alpha, num, whitespace, etc), and so on. It is a bad
> idea to identify characters with their glyphs.
> 2. Alphabets are abstract sets of characters. Coded character sets
> map characters to integer representations. "Encoding" is a
> reasonable synonym for "coded character set". Avoid "charset"
> except when talking about the charset parameter of Content-Type.
> 3. Typo in last sentence "I will suggest that YOU should use UTF-8."
You might also want to grab some ideas from the "Python and Unicode"
presentation I gave at Bordeaux last year:
It also explains the various Unicode terms and how they relate
to Python.
> 1. If you don't get a Content-Type charset parameter, you _must_ assume
1. If you don't get a Content-Type charset parameter in an HTTP request,
you _must_ assume Latin-1.
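A rough sketch of that fallback, in case it helps the howto. The
function name charset_of is made up for illustration, and the sketch
leans on the standard library's email.message.Message to parse the
header value rather than on any particular HTTP library:

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value.

    Falls back to Latin-1 (the `default` argument, an assumption of
    this sketch) when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns the
    # fallback object when the parameter is absent.
    return msg.get_content_charset(default)


print(charset_of("text/html; charset=UTF-8"))
print(charset_of("text/html"))
```

The first call reports the declared charset, the second one falls
back to Latin-1 because nothing was declared.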
I'd suggest changing the order of the encodings (e.g. putting
Latin-1 near the end isn't a good idea). Also, the fact that
decoding works doesn't necessarily mean that the input did
in fact use that encoding. A more appropriate way would be
to try to re-encode the decoded data in the given encoding and
compare the result with the original input, since that is likely
to fail for e.g. CP-1252 vs. Latin-1 if people use accented
characters.
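Something along these lines is what I have in mind. It is only a
sketch: guess_encoding is a hypothetical helper, the candidate list
and its order are left to the caller, and how often the re-encode
step rejects a candidate depends on the codec implementations in use.

```python
def guess_encoding(data, candidates):
    """Return the first candidate that decodes `data` AND re-encodes
    the decoded text back to exactly the same bytes, else None."""
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # not decodable at all in this encoding
        try:
            if text.encode(name) == data:
                return name  # survived the round trip
        except UnicodeEncodeError:
            pass  # decoded, but cannot be written back
    return None


print(guess_encoding(b"caf\xe9", ["ascii", "utf-8", "cp1252"]))
print(guess_encoding(b"\x81", ["ascii", "utf-8", "cp1252"]))
```

A passing round trip still only says the data is consistent with
that encoding, not that the sender actually used it.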
If you're more into guessing an encoding, you should probably
use an entropy approach:
> [Mildly Corrupt Data]
Same comment here: you have to test round-trips, not just
whether decoding fails. (Please note that not all codecs
are round-trip safe -- see test_unicode.py for a list
of the ones that are.)
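To illustrate the difference between "decoding fails" and "the round
trip fails", here is a small sketch. The function name and the three
result labels are made up for illustration. The last call uses
Python's utf-8-sig codec as an example of a codec that is not
round-trip safe, so a mismatch there says nothing about corruption.

```python
def check_round_trip(data, encoding):
    """Classify `data` as 'undecodable', 'mismatch' or 'ok'."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return "undecodable"
    try:
        encoded = text.encode(encoding)
    except UnicodeEncodeError:
        return "mismatch"
    # 'ok' only if re-encoding reproduces the input byte for byte.
    return "ok" if encoded == data else "mismatch"


print(check_round_trip(b"abc", "utf-8"))      # clean data
print(check_round_trip(b"ab\xff", "utf-8"))   # invalid UTF-8 byte
print(check_round_trip(b"abc", "utf-8-sig"))  # codec adds a BOM
```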
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/