[Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

Paul Moore p.f.moore at gmail.com
Tue Jun 28 16:46:12 CEST 2011


On 28 June 2011 14:43, Victor Stinner <victor.stinner at haypocalc.com> wrote:
> As discussed before on this list, I propose to set the default encoding
> of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
> open() is called without an explicit encoding and if the locale encoding
> is not UTF-8. Using the warning, you will quickly notice the potential
> problem (using Python 3.2.2 and -Werror) on Windows or by using a
> different locale encoding (e.g. using LANG="C").

-1. This will make things harder for simple scripts which are not
intended to be cross-platform.

I use Windows, and come from the UK, so 99% of my text files are
ASCII. So the majority of my code will be unaffected. But in the
occasional situation where I use a £ sign, I'll get encoding errors,
where currently things will "just work". And the failures will be data
dependent, and hence intermittent (the worst type of problem). I'll
write a quick script, use it once and it'll be fine, then use it later
on some different data and get an error. :-(
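
To illustrate the data-dependent failure described above, here is a
minimal sketch. It assumes the file was written in cp1252 (a common
Western European Windows ANSI code page) and is then read back as
UTF-8, as would happen under the proposed default. The helper name
read_as_utf8 and the sample strings are hypothetical.

```python
import os
import tempfile

def read_as_utf8(data: bytes) -> str:
    """Write raw bytes to a temp file, then read them back as UTF-8 text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)

# Pure ASCII is byte-identical in cp1252 and UTF-8, so this read succeeds.
ascii_result = read_as_utf8("Price: 5 pounds".encode("cp1252"))

# "\u00a3" is the pound sign. In cp1252 it is the single byte 0xA3,
# which is not a valid UTF-8 sequence on its own, so this read fails.
try:
    read_as_utf8("Price: \u00a35".encode("cp1252"))
    pound_failed = False
except UnicodeDecodeError:
    pound_failed = True
```

The same script handles the first file and breaks on the second, and
the only difference is one character in the data.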

I appreciate that the point here is to make sure that people think a
bit more carefully about encoding issues. But doing so by making
Python less friendly for casual, ad hoc script use seems to me to be a
mistake.

> I don't think that Windows developers even know that they are writing
> files into the ANSI code page. MSDN documentation of
> WideCharToMultiByte() warns developers that the ANSI code page is not
> portable, even across Windows computers:

Probably true. But for many uses they also don't care. If you're
writing something solely for a one-off job on your own PC, the ANSI
code page is fine, and provides interoperability with other programs
on your PC, which is really what you care about. (UTF-8 without BOM
displays incorrectly in Vim, WordPad, and PowerShell Get-Content. MBCS
works fine in all of these. UTF-8 also displays incorrectly in CMD type,
but in a less familiar form than the incorrect display mbcs produces,
for what that's worth...)
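
As a rough sketch of what "displays incorrectly" means here, the
following shows the bytes produced for a pound sign under each
encoding, and what a tool sees if it interprets BOM-less UTF-8 as
cp1252. That the tool falls back to cp1252 is an assumption for the
example; the actual fallback depends on the program and the system
code page.

```python
pound = "\u00a3"  # the pound sign

cp1252_bytes = pound.encode("cp1252")        # one byte in the ANSI code page
utf8_bytes = pound.encode("utf-8")           # two bytes, no BOM
utf8_sig_bytes = pound.encode("utf-8-sig")   # BOM (EF BB BF) plus two bytes

# A tool that assumes cp1252 decodes each UTF-8 byte as its own
# character, so an extra character appears before the pound sign.
misread = utf8_bytes.decode("cp1252")

# The BOM is what lets BOM-aware tools recognise the file as UTF-8.
has_bom = utf8_sig_bytes.startswith(b"\xef\xbb\xbf")
```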

> It will always be possible to use ANSI code page using
> encoding="mbcs" (only works on Windows), or an explicit code page number
> (e.g. encoding="cp1252").

So, in effect, you propose making the default favour writing
multiplatform portable code at the expense of quick and dirty scripts?
My personal view is that this is the wrong choice ("practicality beats
purity") but I guess it's ultimately a question of Python's design
philosophy.
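
For reference, a minimal sketch of what the explicit form would look
like in a script that wants to keep writing in the ANSI code page. The
helper name ansi_encoding and the file name are hypothetical; "mbcs"
exists only on Windows, so the sketch falls back to the locale's
preferred encoding elsewhere.

```python
import locale
import os
import sys
import tempfile

def ansi_encoding() -> str:
    """Return "mbcs" on Windows, else the locale's preferred encoding."""
    if sys.platform == "win32":
        return "mbcs"
    return locale.getpreferredencoding(False)

# Hypothetical output file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Every open() call now has to name the encoding explicitly.
with open(path, "w", encoding=ansi_encoding()) as f:
    f.write("plain ASCII text\n")

with open(path, encoding=ansi_encoding()) as f:
    round_trip = f.read()
```

This is the extra step a quick script would need under the proposal to
keep today's behaviour.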

> The two other (rejected?) options to improve open() are:
>
> - raise an error if the encoding argument is not set: will break most
> programs
> - emit a warning if the encoding argument is not set

IMHO, you missed another option - open() does not need improving, the
current behaviour is better than any of the 3 options noted.

Paul.

