[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 18:08:14 CEST 2006

"Paul Prescod" <paul at prescod.net> writes:

> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Here is a counter-proposal.

There is a variable sys.default_encoding. It's used, among other
places, by file opening functions when the encoding is not specified
explicitly.
Its initial value is set in site.py with a site-specific algorithm.

Two variants of the proposal:

1. The default site-specific algorithm queries the locale on Unix,
   uses "mbcs" on Windows (a special encoding which causes
   MultiByteToWideChar to be used as the decoding function), and
   something appropriate on other systems.

2. The default initial value is "locale" (or "system" or "default" or
   whatever, but the spelling is fixed), which is a special encoding
   name which means to use the system-specific encoding, as above.

I prefer variant 1: it's simpler and it allows programs to examine the
choice on Unix.

A Python-specific environment variable could be defined to override
the system-specific choice.
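
A minimal sketch of how variant 1 plus the override could look. The
function name and the environment variable name
PYTHON_DEFAULT_ENCODING are made up for illustration (the proposal
doesn't fix either), and the ASCII fallback is an assumption:

```python
import locale
import os
import sys

# Hypothetical name for the override variable; the proposal does not name one.
OVERRIDE_VAR = "PYTHON_DEFAULT_ENCODING"


def initial_default_encoding(environ=None, platform=None):
    """Pick the start-up value for the proposed sys.default_encoding."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    # An explicit override wins over any system-specific choice.
    override = environ.get(OVERRIDE_VAR)
    if override:
        return override

    # Windows: the special "mbcs" codec defers to the system code page.
    if platform.startswith("win"):
        return "mbcs"

    # Unix and others: ask the locale; fall back to ASCII (an assumption)
    # if the locale gives no answer.
    return locale.getpreferredencoding(False) or "ascii"
```

Since the result is an ordinary encoding name, a program can examine
it, which is the point of preferring variant 1.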

If MultiByteToWideChar on Windows doesn't handle UTF-8 even with a BOM
(I don't know whether it does), then the Windows default could be an
encoding which assumes UTF-8 when a UTF-8 BOM is present, and uses
MultiByteToWideChar otherwise. This applies only to Windows. Unix
rarely uses a BOM; OTOH on Unix you can have UTF-8 locales, which
Windows doesn't have as far as I know.
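
The BOM rule could be sketched roughly like this. The function name
is hypothetical, and the fallback parameter stands in for whatever
the system-specific decoding would be (MultiByteToWideChar on
Windows); it works on a whole byte string rather than a stream to
keep the example short:

```python
import codecs


def decode_bom_aware(data, fallback):
    """Decode bytes as UTF-8 if they start with a UTF-8 BOM, else use fallback."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop the BOM itself so it does not appear in the decoded text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: no guessing, just the system-specific encoding.
    return data.decode(fallback)
```

Note that this checks only for the BOM; it deliberately doesn't look
at the rest of the contents.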

Other than that, guessing the encoding from the contents of the text
stream, especially statistical guessing based on well-formed UTF-8
non-ASCII characters, shouldn't be encouraged, because its effect is
not predictable. There can be a separate function which guesses the
encoding for those who really want to do this.

If Python ever has dynamically-scoped variables, sys.default_encoding
should be dynamically scoped, so it's possible to set for the context
of a block of code.
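
As an approximation of what setting it for a block of code might
look like, here is a sketch using the standard contextvars and
contextlib modules. The names default_encoding, using_encoding and
read_text are stand-ins for illustration, not part of the proposal,
and a context variable is only an emulation of true dynamic scoping:

```python
import contextlib
import contextvars

# Stand-in for the proposed sys.default_encoding; not a real sys attribute.
default_encoding = contextvars.ContextVar("default_encoding", default="utf-8")


@contextlib.contextmanager
def using_encoding(name):
    """Set the default encoding for the duration of a with-block."""
    token = default_encoding.set(name)
    try:
        yield
    finally:
        # Restore whatever value was in effect before the block.
        default_encoding.reset(token)


def read_text(data):
    """Decode bytes with whatever default encoding is currently in effect."""
    return data.decode(default_encoding.get())
```

Code called inside "with using_encoding('ISO-8859-1'):" would see
the new value, and the previous one comes back when the block exits.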

sys.default_encoding also applies to filenames, to names and values of
environment variables, to program invocation parameters (both sys.argv
and os.exec*), to pwd.struct_passwd.pw_gecos, etc. There are a number
of Unix interfaces which don't specify the encoding of the text they
exchange (and of course pw_gecos doesn't contain a BOM if it's UTF-8).

Antoine Pitrou <solipsis at pitrou.net> writes:

> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix). Yet it must be possible to read/write text to and
> from them.

Here is how my language Kogut does this:

RawStdIn etc. are the underlying raw files (thin wrappers over file
descriptors). StdIn etc. are text files with encoding, buffering etc.
They are initialized the first time they are used, i.e. the first time
the StdIn variable is read. They are constructed with the default
encoding from that time.

This allows a script to set the default encoding before accessing
standard text streams.
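
A rough Python rendering of the idea (this is not Kogut's actual
code; the class name and the get_default_encoding callback are
invented for the example):

```python
import io


class LazyTextStream:
    """Build the text wrapper over a raw byte stream on first access only."""

    def __init__(self, raw, get_default_encoding):
        self._raw = raw
        self._get_default_encoding = get_default_encoding
        self._text = None

    @property
    def text(self):
        if self._text is None:
            # The encoding is looked up now, not when the object was created.
            self._text = io.TextIOWrapper(
                self._raw, encoding=self._get_default_encoding()
            )
        return self._text


# Usage: the default is changed after construction but before first use.
settings = {"encoding": "utf-8"}
std_in = LazyTextStream(io.BytesIO(b"\xe9\n"), lambda: settings["encoding"])
settings["encoding"] = "iso-8859-1"
line = std_in.text.readline()  # decoded as ISO-8859-1
```

Once the wrapper has been created, later changes to the default no
longer affect it.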

I don't know whether Python typically accesses stdin/stdout during
initialization, before the first line of the script is executed.
If it does, this design can't be used until this is changed.

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
>     python script.py in.txt
> or with:
>     python script.py < in.txt
> ?

With my design it's the same. It's also the same if the script does
sys.default_encoding = 'ISO-8859-1' at the beginning.

Note: in my design sys.argv is also initialized lazily (in fact it is
recomputed each time it is accessed, until it's assigned to, at which
point it starts to behave as a normal variable).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/