[Python-3000] locale-aware strings ?

Paul Prescod paul at prescod.net
Wed Sep 6 19:15:44 CEST 2006

On 9/6/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
> > Windows users do not "tell each program separately about the
> > encoding." The encoding varies by file type.
> There are lots of Unix file types which are based on text files
> and their encoding is not specified explicitly.

Of course. But you asserted that the Windows world was insane and I
made the point that it is not. They've just consciously and explicitly
moved away from the situation where the encoding is inferred from the
environment instead of from the file's context. I'm not starting a
Windows versus Unix debate. I'm talking about the direction in which the
world is moving.

Python need not move forward in that direction but it should not move
backwards. Today, Python does not use the locale in inferring a file's
type. Python also explicitly chose not to use the locale in inferring
string encodings when Unicode was added.

I'm not saying that Python programmers should be disallowed from using
the system locale. I'm saying that Python itself should "resist the
urge to guess" encodings. Python programmers who want to guess could
have an easy, one-line way, as C# programmers do.
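As an illustration of what such an explicit opt-in could look like, here
is a minimal sketch. It assumes present-day Python 3, where open()
accepts an encoding argument and locale.getpreferredencoding() reports
the locale's encoding. The scratch file name is hypothetical and exists
only so the sketch runs anywhere.

```python
import locale
import os
import tempfile

# The explicit opt-in: ask the system which encoding the locale prefers.
encoding = locale.getpreferredencoding(False)

# Hypothetical scratch file, used only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# The programmer, not the language, decides to trust the locale here.
with open(path, "w", encoding=encoding) as f:
    f.write("hello\n")

with open(path, encoding=encoding) as f:
    text = f.read()

print(encoding, repr(text))
```

The point of the sketch is that the guess is visible in the source: a
reader can see that this program chose to follow the locale.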

> But they do. It's a fact which is impossible to change with a
> decree.

I'm not trying to change tools. I'm asking that Python not emulate
their broken behaviour. If a Python programmer wants to do so, then
they should add one line of code.

> > What percentage of Ubuntu users change it from UTF-8 to something
> > specific?
> Why would it matter?

I said explicitly why it matters in my first message. If most Unix
users just accept system defaults then the feature is of no value to
them. Meanwhile, the feature actively hurts Windows programmers. So you
have decreasing value on one side and a steady amount of pain on the
other.

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

Either you are saying that Python is broken today, or you are saying
that Python should allow people to write programs that are "not
broken" according to your definition. In the former case, I disagree.
In the latter case, I agree. The only thing we could disagree on is
whether Python's default behaviour should be to guess the encodings
based upon locale, despite Python's long history of avoiding guessing
in general and guessing encodings in particular.

> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful. Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English speaking country.

Loudly and suddenly breaking is better than silently munging data.
There are vast application classes where using the system encoding is
the wrong thing. For example, an FTP server. An application working
with data from a remote socket. An application working with a file
from a remote server. An application working with incoming email.
Python cannot know whether you are building a client/server
application or a script for working with local files. It can't even
really know whether a file that it opens is truly local. So it
shouldn't guess.
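To make "silently munging" concrete, here is a small sketch in
present-day Python 3. The payload and the choice of Latin-1 as the
guessed encoding are assumptions for illustration; Latin-1 merely stands
in for whatever a local locale default might be.

```python
# Bytes as they might arrive from a remote peer that encoded "café" as UTF-8.
payload = "café".encode("utf-8")

# Decoding with the sender's actual encoding recovers the text.
correct = payload.decode("utf-8")

# Decoding with a guessed single-byte encoding raises no error,
# but the result is different text from what the sender wrote.
guessed = payload.decode("latin-1")

print(repr(correct))
print(repr(guessed))
print(correct == guessed)
```

Nothing in the second decode signals a problem, which is why the wrong
guess can travel a long way before anyone notices.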

> > If the answers are "few", then we are talking about a feature that
> > will break Windows programs and offer little value to Unix and
> > Macintosh users.
> How does it break more programs than assuming ASCII does? All
> encodings suitable as a system encoding are ASCII supersets, so if
> a file can't be read using the locale encoding, it can't be read
> in ASCII either.

If a program expecting ASCII sees an unknown character then it can
throw an exception and say: "You haven't thought through the
internationalization aspects properly. Read the Python docs for more
information." Silently munging data is worse. "In the face of
ambiguity, refuse the temptation to guess."
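A rough sketch of that fail-loudly behaviour, again in present-day
Python 3. The helper name read_ascii and the wording of its error
message are hypothetical, not anything Python ships.

```python
def read_ascii(raw: bytes) -> str:
    """Hypothetical helper: decode strictly as ASCII or fail with guidance."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # Point the programmer at the problem instead of guessing an encoding.
        raise ValueError(
            f"non-ASCII byte at offset {exc.start}; "
            "specify an encoding explicitly"
        ) from exc


print(read_ascii(b"plain text"))

try:
    read_ascii("café".encode("utf-8"))
except ValueError as err:
    print("refused:", err)
```

Pure ASCII input passes through unchanged, and anything else stops the
program at the point where the programmer has to make a decision.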

 Paul Prescod
