[Python-3000] Pre-PEP: Easy Text File Decoding

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Sep 11 12:38:49 CEST 2006


"Paul Prescod" <paul at prescod.net> writes:

> Guido's goal was that quick and dirty text processing should "just
> work" for newbies and encoding-disinterested expert programmers.

What does 'guess' mean for creating files?

Consider a program which reads one file and writes data extracted
from it (e.g. with lines beginning with '#' removed) to another file.

With my proposal it will work if the encoding of the file is the same
as the locale encoding (or if they can be harmlessly confused).
It will just work most of the time.

It will not work in general if the encodings are different. In this
case the user of the script can override the encoding assumption
by temporarily changing the locale or by changing an environment
variable.
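
A minimal sketch of such a script in present-day Python, assuming the
locale encoding is used for both reading and writing. The function name
strip_comments and its optional encoding parameter are hypothetical,
made up for this illustration; they are not part of the proposal.

```python
import locale


def strip_comments(src_path, dst_path, encoding=None):
    """Copy src_path to dst_path, dropping lines that begin with '#'."""
    # Assumption: with no explicit encoding, use the locale's encoding
    # for both the input and the output file.
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # newline="" leaves line endings untranslated in both directions.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            if not line.startswith("#"):
                dst.write(line)
```

Because the same encoding is used on both sides, the bytes of the kept
lines come out unchanged whenever the assumption about the input is
right.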



OTOH when the encoding is guessed from file contents, what happens
depends on how it's designed. If the locale is ISO-8859-x:

1. Files are created in the locale encoding.

   Then some UTF-8 files will be silently recoded to a different
   encoding, and for other UTF-8 files writing will fail (if they
   contain characters not expressible in the locale encoding).

2. Files are created in UTF-8.

   Then files encoded with the locale encoding will be silently
   recoded to UTF-8, causing trouble for further work with the file
   (it can't even be displayed correctly on the terminal).
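
Both cases can be reproduced on plain byte strings. This is only a
sketch: ISO-8859-1 stands in for the ISO-8859-x locale, and the helper
name rewrite is hypothetical.

```python
LOCALE_ENCODING = "iso-8859-1"  # assumed locale encoding for this sketch


def rewrite(data, guessed, output):
    """Decode with the guessed encoding, re-encode with the output one."""
    return data.decode(guessed).encode(output)


# Case 1: input guessed as UTF-8, output written in the locale encoding.
utf8_ok = "caf\u00e9\n".encode("utf-8")                   # fits in ISO-8859-1
utf8_bad = "za\u017c\u00f3\u0142\u0107\n".encode("utf-8")  # does not fit

recoded = rewrite(utf8_ok, "utf-8", LOCALE_ENCODING)  # bytes silently change
try:
    rewrite(utf8_bad, "utf-8", LOCALE_ENCODING)
    write_failed = False
except UnicodeEncodeError:
    write_failed = True  # characters not expressible in the locale encoding

# Case 2: input in the locale encoding, output written in UTF-8.
latin1 = "caf\u00e9\n".encode(LOCALE_ENCODING)
as_utf8 = rewrite(latin1, LOCALE_ENCODING, "utf-8")   # bytes silently change
```

In neither case does the output file keep the encoding of the input
file.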

If the locale is UTF-8, but the reader assumes e.g. ISO-8859-1 when
it can't decode as UTF-8, there will be a silent recoding for these
files. If the file is in fact encoded in ISO-8859-2, the result will
be nonsensical: looking like UTF-8 but with characters substituted
according to ISO-8859-2/1 differences.
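
A small sketch of that failure, assuming a reader that falls back to
ISO-8859-1 when UTF-8 decoding fails. The function name
decode_with_fallback is hypothetical.

```python
def decode_with_fallback(data):
    """Try UTF-8 first; assume ISO-8859-1 when that fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


original = "\u017c\u00f3\u0142w\n"   # Polish word, spelled with escapes
data = original.encode("iso-8859-2")  # what is actually in the file

text = decode_with_fallback(data)     # not valid UTF-8, so ISO-8859-1 is used
output = text.encode("utf-8")         # well-formed UTF-8, wrong characters
```

Here 0xBF and 0xB3 are read with their ISO-8859-1 meanings instead of
the ISO-8859-2 ones, so the output is valid UTF-8 that no longer spells
the original word.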

In either case it's not clear what the user of the script can do
to preserve the encoding in the output file.

I claim that in my design the result is more easily predictable
and easier to fix when it goes wrong.



I've implemented a hack which allows simple programs to "just work" in
case of UTF-8. It's a modified encoder/decoder which escapes malformed
UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
sequences to round-trip UTF-8 decoding and encoding. It's not used by
default and it's never used when "UTF-8" is specified explicitly,
because it's not true UTF-8, but I have an environment variable
which says "if the locale is UTF-8, use the modified UTF-8 as the
default encoding".
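
The round-trip property can be illustrated in present-day Python with
the "surrogateescape" error handler. Note that this is a different
mechanism from the '\0' escaping described above (malformed bytes are
mapped to lone surrogate code points instead); it is shown only as a
sketch of what "arbitrary bytes survive decoding and encoding" means.

```python
def roundtrip(data):
    """Decode and re-encode as UTF-8, preserving malformed bytes."""
    # Assumption: surrogateescape stands in for the '\0' escaping hack.
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


raw = b"valid: \xc5\xbc, malformed: \xbf\xff\n"

# Strict UTF-8 rejects the malformed bytes outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

preserved = roundtrip(raw) == raw
```

As with the modified codec, the decoded text is not true UTF-8 content,
which is why such a mode is better kept opt-in.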

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

