On 9/5/06, <b class="gmail_sendername">Guido van Rossum</b> <<a href="mailto:firstname.lastname@example.org">email@example.com</a>> wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
On 9/5/06, David Hopwood <<a href="mailto:firstname.lastname@example.org">email@example.com</a>> wrote:<br>> Guido van Rossum wrote:<br>> > On 9/5/06, Paul Prescod <<a href="mailto:firstname.lastname@example.org">
email@example.com</a>> wrote:<br>> ><br>> >> Beyond all of that: It just seems wrong to me that I could send someone a<br>> >> bunch of files and a Python program and their results processing them
<br>> >> would be different from mine, despite the fact that we run the same version of<br>> >> Python on the same operating system.<br>> ><br>> > And it seems just as wrong if Python doesn't do what the user expects.
<br>> > If I were a beginning Python user, I'd hate it if I had prepared a<br>> > simple data file in vi or notepad and my Python program wouldn't read<br>> > it right because Python's idea of encoding differs from my editor's.
<br>><br>> I don't know about vi, but notepad will open and save files that are not in<br>> the system ("ANSI") encoding just fine. On opening it checks for a BOM and<br>> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
<br>> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the<br>> Encoding drop-down box.<br>><br>> This is exactly the behaviour that most users would expect of a well-behaved
<br>> Unicode-aware app. It should be as easy as possible to match this behaviour<br>> in a Python program.<br><br>And this is exactly why I want the determination of the default<br>encoding (i.e. the encoding to be used when opening a file when no
<br>explicit encoding is specified by the Python code that does the<br>opening) to be open-ended, rather than picking some standard default<br>like UTF-8 and saying (like Paul seems to want to say) "this is it".</blockquote><br>
I never suggested that UTF-8 should be the default. In fact, I think it
was very wise of Python 2.x to make ASCII the default and I'm astounded
to hear that you regret that decision. "<a name="zen">In the face of ambiguity, refuse the temptation to guess."<br>
Python 2.x provided an option to allow users to change the default
system-wide and ever since then we've (almost unanimously) counselled
users against changing it.<br>
</a></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">> > Sorry Paul, I appreciate your standards-driven perspective, but in<br>
> > this area I'd rather build in more flexibility than strictly needed,<br>> > than too little. If it turns out that on a particular platform all<br>> > files are in UTF-8, making Python *on that platform* always choose
<br>> > UTF-8 is simple enough.<br>><br>> The problem is not the systems where all files are UTF-8, or all files are<br>> another known charset. The problem is the platforms where half of the files<br>> are UTF-8 and half are in some other charset, determined either by type or by
<br>> presence of a UTF-8 BOM. This is a *very* common situation, especially for<br>> European users.<br><br>Right. (And Paul appears to be ignorant of this.)</blockquote><div><br>
I don't see how the fact that an individual system can have half of the
files in one encoding and half in another could argue IN FAVOUR of a
system-global default. I would have thought it strengthens my argument AGAINST trying to apply a random encoding to files.<br>
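To make the BOM-based detection discussed in this thread concrete, here is a minimal sketch in modern Python 3 (an assumption, since it postdates this discussion). The helper names sniff_bom and read_text are hypothetical. It checks only for the UTF-8 and UTF-16 byte order marks, and when none is present it requires the caller to name an encoding rather than guessing one.<br>

```python
import codecs

# BOM signatures paired with a codec that consumes the BOM while decoding.
# UTF-32 is deliberately left out of this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def read_text(path, fallback=None):
    """Decode a file by its BOM, or by an explicitly supplied fallback."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding = sniff_bom(raw) or fallback
    if encoding is None:
        # No BOM and no explicit choice: refuse to guess.
        raise ValueError("no BOM found and no explicit encoding given")
    return raw.decode(encoding)
```

A caller who knows the file's charset passes it as fallback; a caller who does not gets an error instead of silently mis-decoded text.<br>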
"If on a particular box<br>
most files are encoded in encoding X, and the user did whatever is<br>
necessary to tell the tools that that's their preferred encoding, I<br>
want Python to honor that encoding when opening text files, unless the<br>
program makes other arrangements explicitly (such as specifying an<br>
explicit encoding as a parameter to open())."<br>
But there is nothing that "most users do" to tell their tools what
their preferred encoding is. Most users use some random (to them)
operating system default which on Windows is usually wrong and is
different (for no particular reason) on the Macintosh than on Linux.
Long-time Windows users in this thread cannot even agree on what the
default for US English Windows is, because there is no single default.
There are two.<br>
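As a rough illustration of why an ambiguous platform default matters, the sketch below decodes the same bytes under two code pages. The choice of cp1252 and cp437 is an assumption made for the example; this message does not name the two defaults it refers to. The helper decode_under_each is hypothetical.<br>

```python
# Two candidate "default" code pages, assumed here purely for illustration.
CANDIDATE_DEFAULTS = ("cp1252", "cp437")


def decode_under_each(raw, encodings=CANDIDATE_DEFAULTS):
    """Map each encoding name to the text it yields for the same bytes."""
    return {name: raw.decode(name) for name in encodings}


# Bytes that an editor saving "caf\u00e9" as cp1252 would write to disk.
raw = "caf\u00e9".encode("cp1252")
results = decode_under_each(raw)

for name, text in results.items():
    # ascii() keeps the output printable on any console encoding.
    print(name, ascii(text))
```

Both decodes succeed without error, yet they produce different text, so a program relying on the default cannot tell that it read the file incorrectly.<br>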
Can we at least agree that if LC_CHARSET is demonstrably wrong most of
the time on Windows, we should not use it (at least on Windows)?<br>