[Tutor] Standardizing on Unicode and utf8

Kent Johnson kent37 at tds.net
Fri Feb 20 12:45:05 CET 2009


On Fri, Feb 20, 2009 at 5:52 AM, Dinesh B Vadhia
<dineshbvadhia at hotmail.com> wrote:
> We want to standardize on unicode and utf8 and would like to clarify and
> verify their use to minimize encode()/decode()'ing:
>
> 1.  Python source files
> Use the header: # -*- coding: utf8 -*-
>
> 2.  Reading files
> In most cases, we don't know the source encoding of the files being read.
> Do we have to decode('utf8') after reading from file?

If you don't know the encoding of the file being read, it is difficult
to handle it correctly. A simple strategy is to try several encodings
and use the first one that decodes without error. Note that *any* text
can be decoded using iso-8859-1 (latin-1) or cp1252, so they must come
last in the tests. This strategy can distinguish utf-16-be, utf-16-le,
utf-8, and iso-8859-1, but it can't discriminate among the iso-8859-x
variants, because they will all decode anything (they have characters
at every code point).
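
For example, here is a minimal sketch of that try-each-encoding
approach (the file name and the candidate list are only examples;
latin-1 is kept last because it never fails):

def guess_decode(data, encodings=('utf-8', 'utf-16-le', 'utf-16-be',
                                  'iso-8859-1')):
    # Try each candidate in turn and return the first successful decode.
    # latin-1 accepts any byte string, so it acts as the fallback.
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            pass
    raise ValueError('none of the candidate encodings worked')

data = open('somefile.txt', 'rb').read()   # hypothetical file name
text, used = guess_decode(data)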

A more sophisticated strategy is to look for character patterns, see
Mark Pilgrim's Universal Encoding Detector:
http://chardet.feedparser.org/docs/
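
For example, a quick sketch assuming the chardet package is installed
(chardet.detect() returns a guessed encoding and a confidence value):

import chardet

data = open('somefile.txt', 'rb').read()   # hypothetical file name
guess = chardet.detect(data)    # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if guess['encoding'] is not None:
    text = data.decode(guess['encoding'])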

It is best not to get into this situation to begin with... Where are
the files coming from? If they are from web pages, they often carry
metadata that gives the charset.

> 3. Writing files
> We will always write to files in utf8.  Do we have to encode('utf8') before
> writing to file?

Yes. The codecs module can help with reading and writing files: it
creates file-like objects that encode and decode on the fly.
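
For example, a small sketch using codecs.open() (the file name is just
a placeholder):

import codecs

out = codecs.open('data.txt', 'w', encoding='utf-8')
out.write(u'caf\xe9\n')        # pass unicode; it is encoded to utf-8 on write
out.close()

inp = codecs.open('data.txt', 'r', encoding='utf-8')
text = inp.read()              # comes back already decoded to unicode
inp.close()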

> Is there anything else that we have to consider?

Console output also has to be encoded to the charset of the console
(sys.stdout.encoding).
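
For example, under Python 2 (as used elsewhere in this thread), and
allowing for sys.stdout.encoding being None when output is redirected:

import sys

text = u'caf\xe9'
encoding = sys.stdout.encoding or 'utf-8'   # fall back when encoding is None
print text.encode(encoding, 'replace')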

Kent

