[Python-3000] locale-aware strings ?

Tue Sep 5 22:17:47 CEST 2006

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > Beyond all of that: It just seems wrong to me that I could send someone a
> > bunch of files and a Python program and their results processing them would
> > be different from mine, despite the fact that we run the same version of
> > Python on the same operating system.
>
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

My point is that most textual content in the world is NOT produced in vi or
notepad or other applications that read the system encoding. Most content is
produced in Word (future Word files will be zipped Unicode, not opaque
binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb,
etc.

I haven't created locale-relevant content in a generic text editor in a
very, very long time.

Applications like vi and emacs that "help" you to create content that other
people can't consume are not really helping at all. After all, we (now!)
live in a networked era and people don't just create documents and then
print them out on their local printers. Most of the time when I use text
editors I am editing HTML, XML or Python and using the default of CP437 is
wrong for all of those.

Even Python will puke if you take a naive approach to text encodings in
creating a Python program.

sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
c:\temp\testencoding.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
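
For illustration, here is a minimal sketch of what the PEP 263 declaration
changes. It assumes a current Python 3 interpreter, where compile() accepts
bytes and honors the coding line. The file name and variable names are made
up for the example; the byte '\xe0' is the one from the warning above.

```python
# Source bytes as a Latin-1 editor would save them: 0xE0 is "à" in Latin-1.
# The first line is the PEP 263 encoding declaration.
source_bytes = b"# -*- coding: latin-1 -*-\nword = '\xe0'\n"

namespace = {}
# "testencoding_sketch.py" is a hypothetical name, used only in tracebacks.
exec(compile(source_bytes, "testencoding_sketch.py", "exec"), namespace)

# With the declaration, the single byte 0xE0 is read as one character.
word = namespace["word"]
print(repr(word))
```

The point is only that the interpreter relies on an explicit declaration
inside the file rather than on whatever the local system encoding happens
to be.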

Are you going to change the Python interpreter so that it will "just work"
with content created in vi and notepad? Otherwise you're saying that Python
will take a modern collaboration-oriented approach to text processing but
encourage Python programmers to take a naive, obsolete approach.

It also isn't just a question of flexibility. I think that Brian Quinlan
made a good point: most English Windows users do not know what
encoding their computer is using. If these users make up 25% of the world's
Python users, and they run into UTF-8 data more often than CP437 data, then
Python will guess wrong more often than it will guess right for that 25%.
This is really dangerous because CP437 will happily read and munge
UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
for them.
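
A small sketch of the "happily read and munge" problem, assuming a current
Python 3 interpreter and its built-in "cp437" codec. The sample string and
the variable names are arbitrary. Python's cp437 codec assigns a character
to every byte value, so decoding never fails and the damage goes unnoticed.

```python
# Text with two non-ASCII characters ("naïve café"), stored as UTF-8 bytes.
text = "na\u00efve caf\u00e9"
utf8_bytes = text.encode("utf-8")

# Decoding the UTF-8 bytes as CP437 raises no exception; each byte of a
# multi-byte UTF-8 sequence silently becomes its own (wrong) character.
misread = utf8_bytes.decode("cp437")

# Every one of the 256 byte values decodes under cp437, so arbitrary
# binary data is accepted as "text" too.
all_bytes_decode = len(bytes(range(256)).decode("cp437")) == 256

# For contrast, a strict codec such as ASCII refuses the same input.
try:
    utf8_bytes.decode("ascii")
    strict_codec_rejected = False
except UnicodeDecodeError:
    strict_codec_rejected = True

print(len(text), len(misread), all_bytes_decode, strict_codec_rejected)
```

A default that fails loudly would at least tell the user something is
wrong; one that accepts everything does not.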

But it's worse than even that. GUI applications on Windows use a different
encoding than command line ones. So on the same box, Python-in-Tk and
Python-on-command line will answer that the system encoding is "cp437"
versus "cp1252". I just tested it.

http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
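
One way to reproduce this kind of comparison is sketched below, assuming a
current Python 3 interpreter. This is not necessarily the check I ran; the
function name report_encodings is made up for the example. Running it from
a console and from a GUI host on the same machine shows whether the answers
differ.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        # What the locale machinery suggests for text data.
        "locale preferred": locale.getpreferredencoding(False),
        # What the standard streams use; a stream may be missing (None)
        # under some GUI hosts, hence the getattr default.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # What is used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
    }


encodings = report_encodings()
for name, value in encodings.items():
    print(name, "=", value)
```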

Were it not for these issues I would say that it "isn't a big deal" because
modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
seems to use ASCII. So we're moving to international standards regardless.
But default encoding on Windows is totally broken.

The Mac is not totally consistent either. The console decodes UTF-8 for
display. TextEdit and vim munge the display in different ways (the same GUI
versus command-line issue again, I guess).

A question: what happens when Python is reading data from a socket or other
file-like object? Will that data also be decoded as if it came from the
user's locale?

I don't think that this discussion really has anything to do with being
compatible with "most of the files on a computer". It is about being
compatible with a certain set of Unix text processing applications.

 Paul Prescod