[Python-Dev] Unicode howto in the works - feedback appreciated

M.-A. Lemburg mal@lemburg.com
Wed, 01 May 2002 11:06:50 +0200


"Stephen J. Turnbull" wrote:
> 
> [What Is Unicode?]
> 
> 1.  Characters are "atomic units of text" that have properties.  Since
>     they're atoms, we represent them by integers in computer programs.
>     Among the properties are their glyphs (graphical representation),
>     classes (alpha, num, whitespace, etc), and so on.  It is a bad
>     idea to identify characters with their glyphs.
> 
> 2.  Alphabets are abstract sets of characters.  Coded character sets
>     map characters to integer representations.  "Encoding" is a
>     reasonable synonym for "coded character set".  Avoid "charset"
>     except when talking about the charset parameter of Content-Type.
> 
> 3.  Typo in last sentence "I will suggest that YOU should use UTF-8."

You might also want to grab some ideas from the "Python and Unicode"
presentation I gave in Bordeaux last year:

	http://www.egenix.com/files/python/Unicode-Talk.pdf
 
This also explains the various terms used in Unicode space
and how they relate to Python.

> [Email]
> 
> 1.  If you don't get a Content-Type charset parameter, you _must_ assume
>     US-ASCII.

[WWW]

1. If you don't get a Content-Type charset parameter in an HTTP request,
   you _must_ assume Latin-1.
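
To make the two defaults concrete, here is a minimal sketch built on the
email package's header parsing. The function name and the "email"/"http"
labels are made up for illustration; only Message.get_content_charset()
comes from the library itself.

```python
from email.message import Message

# Fallbacks as stated above: US-ASCII for email, Latin-1 for HTTP.
DEFAULT_CHARSETS = {"email": "us-ascii", "http": "iso-8859-1"}

def charset_for(content_type, protocol):
    """Return the declared charset, or the protocol default if none is given.

    content_type is the raw header value (None if the header is absent);
    protocol is "email" or "http" (hypothetical labels for this sketch).
    """
    msg = Message()
    if content_type is not None:
        msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lowercased; None if not declared
    return declared if declared is not None else DEFAULT_CHARSETS[protocol]
```

A header without the parameter, such as "text/html", then falls back to
"iso-8859-1" for HTTP and to "us-ascii" for email.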

[Console Input]

I'd suggest changing the order of the encodings (e.g. putting
Latin-1 near the end isn't a good idea). Also, the fact that
decoding works doesn't necessarily mean that the input did
in fact use that encoding. A more appropriate way would be
to try to re-encode the decoded data in the given encoding
and compare the result with the original input, since that
is likely to fail for e.g. CP-1252 vs. Latin-1 if people use
accented characters.
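
A rough sketch of such a round-trip test follows. It assumes a Python
with separate bytes and str types, and the helper names are invented for
this example rather than taken from the howto.

```python
def round_trips(data, encoding):
    """True if data decodes with encoding and re-encodes to the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except (UnicodeError, LookupError):
        # UnicodeError: the bytes are not valid in this encoding.
        # LookupError: the codec name is unknown.
        return False

def plausible_encodings(data, candidates):
    """Keep the candidates (in the caller's order) that survive a round trip."""
    return [enc for enc in candidates if round_trips(data, enc)]
```

For instance, round_trips(b"+AGE-", "utf-7") is False even though
decoding succeeds, because the decoded text re-encodes to different
bytes. Passing the test still only makes an encoding a candidate:
plausible_encodings(b"caf\xc3\xa9", ["ascii", "utf-8", "latin-1"])
keeps both "utf-8" and "latin-1".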

If you're more into guessing an encoding, you should probably
use an entropy approach:

	http://www.familie-holtwick.de/python/
 
> [Mildly Corrupt Data]

Same comment here: you have to test round-trips, not just
whether decoding fails. (Please note that not all codecs
are round-trip safe -- see test_unicode.py for a list
of the ones that are.)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/