Becoming Unicode Aware

Carl Banks imbosol at
Thu Oct 28 11:13:46 CEST 2004

fuzzyman at (Michael Foord) wrote in message news:<6f402501.0410270256.13cf5727 at>...
> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them?

Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.
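
To give an idea of what such a guess might look like, here is a rough
sketch.  It is written for Python 3, and the function name and the list
of candidate encodings are my own assumptions, not anything standard.
It simply tries each candidate in turn and keeps the first one that
decodes without error, which proves nothing about what the author of
the text actually used.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    This is only a heuristic: a successful decode does not prove the
    guess is right.  latin-1 accepts any byte sequence, so it acts as a
    catch-all and should come last.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")
```

Strict encodings such as UTF-8 are worth trying first because they
reject most data that was written in something else.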

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file? I know
> Python scripts have a method - should I just use that?

It's a good method if you expect people to be editing the config file
with Emacs.  It's a good enough method if you haven't any good reason
to use another method.
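
If you do borrow the Python source convention (a comment such as
"# -*- coding: utf-8 -*-" on the first or second line), reading it back
could look roughly like this.  This is a Python 3 sketch; the function
name and the 'ascii' default are assumptions of mine, and the pattern
is modelled on the one Python uses for source files.  Note that it can
only work when the declaration itself is readable as ASCII, so it is no
help for something like UTF-16.

```python
import re

# Matches "coding: name" or "coding=name", as in Python source files.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def declared_encoding(raw, default="ascii"):
    """Look for a coding declaration in the first two lines of raw bytes."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default
```

For example, declared_encoding(b"# -*- coding: utf-8 -*-\nkey = value\n")
returns 'utf-8', and a file with no declaration falls back to the
default.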

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in :
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)

For most encodings, this'll work fine.  But there are some encodings,
for example UTF-16, that won't work with it.  UTF-16 fails for two
reasons: each character takes at least two bytes, so splitting the
data into lines on a single newline byte can cut a character in half,
and UTF-16 data is normally preceded by a two-byte byte order mark
indicating endianness, which would be at the beginning of the file but
not of each line.
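
A small demonstration of the problem, as a Python 3 sketch (the helper
names are made up for illustration).  It splits raw bytes on the
newline byte and decodes each piece separately, which is what decoding
line by line amounts to.

```python
import codecs

def decode_per_line(raw, encoding):
    """Naive approach: split on the newline byte, then decode each piece."""
    return [piece.decode(encoding) for piece in raw.split(b"\n")]

def fails_per_line(raw, encoding):
    """Return True if the naive per-line decode raises UnicodeDecodeError."""
    try:
        decode_per_line(raw, encoding)
    except UnicodeDecodeError:
        return True
    return False

text = "one\ntwo\n"
utf8_data = text.encode("utf-8")
# Byte order mark written once at the start, then two bytes per character.
utf16_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

With these definitions, fails_per_line(utf8_data, "utf-8") is False but
fails_per_line(utf16_data, "utf-16") is True: the split leaves half of
the two-byte newline attached to the next piece, and only the first
piece carries the byte order mark.  Decoding the whole thing at once
with utf16_data.decode("utf-16") works fine.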

Fortunately, most text files aren't in UTF-16.  I mention this so that
you are aware that, although afunction works in most cases, it is not
guaranteed to work for every encoding.

I believe it's the purpose of the StreamReader and StreamWriter
classes in the codecs module to deal with such situations.
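
Something along these lines should do it.  This is a minimal Python 3
sketch; read_lines is a hypothetical helper of my own, while
codecs.getreader is the real lookup function that returns the
StreamReader class for an encoding.  The reader decodes the byte stream
as a whole and only then splits it into lines, so the byte order mark
and the multi-byte characters are handled in one place.

```python
import codecs
import io

def read_lines(byte_stream, encoding):
    """Decode a binary stream with a codecs StreamReader, line by line."""
    reader = codecs.getreader(encoding)(byte_stream)
    # Iterating the reader yields already-decoded lines.
    return [line.rstrip("\r\n") for line in reader]

raw = "one\ntwo\n".encode("utf-16")  # includes a byte order mark
lines = read_lines(io.BytesIO(raw), "utf-16")
```

Here an io.BytesIO object stands in for a file opened in binary mode;
the same function works for UTF-8 or any other encoding the codecs
module knows about.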
