[Python-3000] locale-aware strings ?

Paul Prescod paul at prescod.net
Wed Sep 6 12:08:21 CEST 2006


On 9/5/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:
>
> > The whole idea of a default encoding is flawed. Ideally there would be
> > no default; programmers should be forced to think about the issue
> > on a case-by-case basis. In some cases they might choose to open a file
> > with the system encoding, but that should be an explicit decision.
>
> Perhaps this shows a difference between Unix and Windows culture.
>
> On Unix there is definitely a default encoding; this is what most good
> programs operating on text files assume by default. It would be insane
> to have to tell each program separately about the encoding. Locale is
> the OS mechanism used to provide this information in a uniform way.

Windows users do not "tell each program separately about the
encoding." The encoding varies by file type. It makes no more sense to
have a global variable that says "all of my files are Shift-JIS" than
it does to say "all of my files are PowerPoint files." Because someday
somebody is going to email you a Big-5 file (or a zipfile) and that
setting will be wrong. Once you know that a file is of type Zip then
you know that the "encoding" is zipped binary. Once you know that it
is an Office 2007 file, then you know that the encoding is Zipped XML
and that the XML will have its own encoding declaration. Once you know
that it is HTML, then you look for meta tags.

This is how real-world programs work. They shouldn't guess based on
system global variables.
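
A rough sketch of the kind of per-file-type detection described above,
in Python. The function name sniff_encoding and its return labels are
made up for illustration, and real programs check far more cases (byte
order marks, HTTP headers, and so on):

```python
import re

def sniff_encoding(data: bytes):
    """Guess how to interpret `data` from its own contents.

    Returns "zip" for a zip container, a lowercased encoding name when the
    file declares one, or None when nothing can be determined.
    """
    # Zip archives (including Office 2007 files) start with this signature.
    if data.startswith(b"PK\x03\x04"):
        return "zip"

    # XML carries its own encoding declaration on the first line.
    m = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode("ascii").lower()

    # HTML: look for a charset in a meta tag near the top of the file.
    m = re.search(rb'<meta[^>]*?charset=["\']?([A-Za-z0-9._-]+)',
                  data[:1024], re.IGNORECASE)
    if m:
        return m.group(1).decode("ascii").lower()

    return None
```

Note that nothing here consults the locale; the answer comes from the
file itself.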

May I ask an empirical question? In your experience, what percentage of
Macintosh users change the default encoding from US-ASCII to something
specific to their culture? What percentage of Ubuntu users change it
from UTF-8 to something specific?

If the answers are "few", then we are talking about a feature that
will break Windows programs and offer little value to Unix and
Macintosh users.

If "many" users change the global system encoding on their modern Unix
distributions then I propose the following. There should be a property
called something like "encodings.recommendedEncoding". On Windows it
should be ASCII. On Unix-like platforms it can be inferred from the
locale. Programmers who know what it means and want to take advantage
of it can do so like this:

opentext(filename, "r", encoding=encodings.recommendedEncoding)

This is almost exactly how C# does it, though it uses the confusing
term "default encoding" which implies a default behaviour.
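
A minimal sketch of how such a property might look, assuming Python's
existing locale module supplies the Unix value. The names
recommended_encoding and opentext are hypothetical stand-ins for the
proposed encodings.recommendedEncoding and opentext; neither exists in
the standard library:

```python
import locale
import sys

def recommended_encoding():
    """Hypothetical stand-in for the proposed encodings.recommendedEncoding."""
    if sys.platform.startswith("win"):
        # Per the proposal: no locale-derived guess on Windows.
        return "ascii"
    # On Unix-like platforms, infer the value from the locale settings.
    # False means: query the locale without changing it.
    return locale.getpreferredencoding(False) or "ascii"

def opentext(filename, mode="r", encoding="ascii"):
    """Hypothetical opentext(): the locale is used only if the caller asks."""
    return open(filename, mode, encoding=encoding)
```

A caller who wants the locale's value opts in explicitly with
opentext(filename, "r", encoding=recommended_encoding()); everyone
else gets the fixed default.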

The lack of an encoding argument should default to ASCII or perhaps
UTF-8. (Either one is relatively safe, because data in some other
encoding is unlikely to be processed incorrectly by accident.)
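
To illustrate why a strict ASCII or UTF-8 default is relatively safe,
here is a small Python sketch. The helper names are invented for the
example. Bytes written in another encoding fail loudly under ASCII or
UTF-8, whereas a permissive single-byte default accepts them and
returns the wrong text:

```python
def decode_strict(raw, encoding="ascii"):
    """Decode with a strict default so foreign bytes raise an error."""
    return raw.decode(encoding, errors="strict")

def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return decode_strict(raw, encoding)
    except UnicodeDecodeError:
        return None

# "cafe" with an acute accent on the e, stored as Latin-1: b'caf\xe9'
latin1_bytes = "caf\u00e9".encode("latin-1")

# Both strict defaults refuse the data instead of guessing.
ascii_result = try_decode(latin1_bytes, "ascii")
utf8_result = try_decode(latin1_bytes, "utf-8")

# A permissive default silently misreads UTF-8 data as two wrong characters.
misread = try_decode("caf\u00e9".encode("utf-8"), "latin-1")
```

Here ascii_result and utf8_result are both None, while misread is a
string with the wrong characters and no error at all.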

 Paul Prescod

