Becoming Unicode Aware
imbosol at aerojockey.com
Thu Oct 28 11:13:46 CEST 2004
fuzzyman at gmail.com (Michael Foord) wrote in message news:<6f402501.0410270256.13cf5727 at posting.google.com>...
> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them ?
Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.
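To give an idea of what such a guess might look like, here is a rough
sketch. It assumes a bytes/str split as in current Python 3, and the
function name guess_decode and the candidate list are my own invention,
not anything standard. It tries strict decodings in order and falls back
to latin-1, which accepts any byte sequence but may well be wrong.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used).

    This is only a heuristic: a successful decode does not prove the
    guess was right, it only means the bytes were valid in that encoding.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the resulting text may be nonsense.
    return raw.decode("latin-1"), "latin-1"
```

UTF-8 goes first because arbitrary non-UTF-8 text rarely happens to be
valid UTF-8, so a clean decode there is a fairly strong hint.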
> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file ? I know
> python scripts have a method - should I just use that ?
It's a good method if you expect people to be editing the config file
with Emacs. It's a good enough method if you haven't any good reason
to use another method.
> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in :
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)
For most encodings, this'll work fine. But there are some encodings,
for example UTF-16, that won't work with it. UTF-16 fails for two
reasons: the two-byte characters interfere with the line buffering
(a split on the newline byte can land in the middle of a character),
and UTF-16 data is normally preceded by a two-byte mark indicating
endianness, which would be at the beginning of the file but not of
each line.
Fortunately, most text files aren't in UTF-16. I mention this so that
you are aware that, although afunction works in most cases, it is not
guaranteed to work for every encoding.
I believe it's the purpose of the StreamReader and StreamWriter
classes in the codecs module to deal with such situations.
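Here is a small sketch of both the problem and that way around it,
written against current Python 3 (bytes and str are separate types
there). The helper decode_per_line is a made-up stand-in for the
per-line approach, and the sample data is little-endian UTF-16 with a
byte order mark, built by hand so the result doesn't depend on the
machine. Treat it as an illustration under those assumptions.

```python
import codecs
import io

text = "first\nsecond\n"
# Little-endian UTF-16 with a leading byte order mark
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def decode_per_line(data, encoding):
    """Naive approach: split the bytes on newline, decode each piece."""
    return [piece.decode(encoding) for piece in data.split(b"\n")]

# The newline is the two bytes 0x0A 0x00, so splitting on 0x0A leaves a
# stray 0x00 at the start of the next piece and the decode fails.
try:
    decode_per_line(raw, "utf-16")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# A stream reader decodes first and splits into lines afterwards,
# consuming the byte order mark once at the start of the stream.
reader = codecs.getreader("utf-16")(io.BytesIO(raw))
lines = list(reader)
```

With a real file you'd wrap the open binary file object in place of the
io.BytesIO used here.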