On 9/5/06, <b class="gmail_sendername">Guido van Rossum</b> &lt;<a href="mailto:guido@python.org">guido@python.org</a>&gt; wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On 9/5/06, Paul Prescod &lt;<a href="mailto:paul@prescod.net">paul@prescod.net</a>&gt; wrote:<br>&gt; Beyond all of that: It just seems wrong to me that I could send someone a<br>&gt; bunch of files and a Python program and their results processing them would

<br>&gt; be different from mine, despite the fact that we run the same version of<br>&gt; Python on the same operating system.<br><br>And it seems just as wrong if Python doesn't do what the user expects.<br>If I were a beginning Python user, I'd hate it if I had prepared a

simple data file in vi or notepad and my Python program wouldn't read it right because Python's idea of encoding differs from my editor's.</blockquote><div> My point is that most textual content in the world is NOT produced in vi or notepad or other applications that read the system encoding. Most content is produced in Word (future Word files will be zipped Unicode, not opaque binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb, etc.

I haven't created locale-relevant content in a generic text editor in a very, very long time. Applications like vi and emacs that &quot;help&quot; you to create content that other people can't consume are not really helping at all. After all, we (now!) live in a networked era and people don't just create documents and then print them out on their local printers. Most of the time when I use text editors I am editing HTML, XML or Python, and a default of CP437 is wrong for all of those.

<br><br>Even Python will puke if you take a naive approach to text encodings in creating a Python program.<br><br>sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file c:\temp\testencoding.py on line 1, but no encoding declared; see 

<a href="http://www.python.org/peps/pep-0263.html">http://www.python.org/peps/pep-0263.html</a> for details<br><br>Are you going to change the Python interpreter so that it will &quot;just work&quot; with content created in vi and notepad? Otherwise you're saying that Python will take a modern collaboration-oriented approach to text processing but encourage Python programmers to take a naive, obsolete approach.
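<br><br>A minimal sketch of how the PEP 263 declaration changes the outcome, using the standard library's <code>tokenize.detect_encoding</code>. It assumes a current Python 3, where a missing declaration means UTF-8 and the problem is reported as an error rather than the warning quoted above. The helper name <code>declared_encoding</code> and the sample sources are hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding inferred from a PEP 263 declaration.

    Hypothetical helper for illustration. Raises SyntaxError when the
    leading lines are not valid UTF-8 and carry no encoding declaration.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


# Byte 0xE0 preceded by an explicit declaration: the declaration is honored.
with_cookie = b"# -*- coding: latin-1 -*-\ns = '\xe0'\n"

# The same byte on line 1 with no declaration at all.
without_cookie = b"s = '\xe0'\n"

print(declared_encoding(with_cookie))

try:
    declared_encoding(without_cookie)
    rejected = False
except SyntaxError as exc:
    rejected = True
    print("rejected:", exc)
```

The point of the sketch is only that the interpreter's own tooling refuses to guess: the encoding has to be stated in the file.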

<br><br>It also isn't just a question of flexibility. I think that Brian Quinlan made a good point: most English Windows users do not know what encoding their computer is using. If this represents 25% of the world's Python users, and these users run into UTF-8 data more often than CP437, then Python will guess wrong more often than it will guess right for 25% of its users. This is really dangerous because CP437 will happily read and munge UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default for that 25%. 
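<br><br>The &quot;happily read and munge&quot; behaviour can be sketched as follows, assuming Python 3 and its bundled <code>cp437</code> codec. The sample string is illustrative only. CP437 assigns a character to every byte value, so decoding never fails, whereas UTF-8 rejects byte sequences that are not valid UTF-8.

```python
sample = "caf\u00e9"  # illustrative text containing one non-ASCII character
utf8_bytes = sample.encode("utf-8")

# Decoding UTF-8 bytes as CP437 raises no error; the text is silently garbled.
garbled = utf8_bytes.decode("cp437")
print(repr(sample), "->", repr(garbled))

# Arbitrary binary data also decodes without complaint.
any_binary = bytes(range(256)).decode("cp437")
print(len(any_binary), "characters decoded from 256 arbitrary bytes")

# The reverse mistake is caught: CP437 bytes are not valid UTF-8 here.
cp437_bytes = sample.encode("cp437")
try:
    cp437_bytes.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
print("UTF-8 decoder rejected the CP437 bytes:", utf8_rejected)
```

A wrong guess of CP437 corrupts data quietly, while a wrong guess of UTF-8 tends to fail loudly.<br><br>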

But it's worse than even that. GUI applications on Windows use a different encoding than command-line ones. So on the same box, Python on the command line and Python in Tk will report different system encodings: &quot;cp437&quot; versus &quot;cp1252&quot;. I just tested it.

<a href="http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx">http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx</a><br><br>Were it not for these issues I would say that it &quot;isn't a big deal&quot; because modern Linux distributions are moving to a UTF-8 default anyhow, and the Mac seems to use ASCII. So we're moving to international standards regardless. But the default encoding on Windows is totally broken.
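<br><br>A rough way to check the GUI-versus-console discrepancy on a given machine is to run the same script from both environments and compare what each reports. This is a sketch assuming Python 3. The function name <code>report_encodings</code> is hypothetical, and the values differ by platform, Python version and launch method, so no particular output is claimed.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings this interpreter reports (hypothetical helper)."""
    return {
        # Encoding the locale machinery suggests for text data.
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding attached to standard output; may be None if redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used to convert file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in report_encodings().items():
    print(f"{name}: {value}")
```

If the three values disagree, or change between a console session and a GUI session on the same box, there is no single &quot;system encoding&quot; for a default to follow.<br><br>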

The Mac is not totally consistent either. The console decodes UTF-8 for display. Textedit and vim munge the display in different ways (same GUI versus command-line issue again, I guess).<br><br>A question: what happens when Python is reading data from a socket or other file-like object? Will that data also be decoded as if it came from the user's locale?

I don't think that this discussion really has anything to do with being compatible with &quot;most of the files on a computer&quot;. It is about being compatible with a certain set of Unix text processing applications.

<br><br>&nbsp;Paul Prescod<br><br></div></div>