open(): set the default encoding to 'utf-8' in Python 3.3?

In Python 2, open() opens the file in binary mode (e.g. file.readline() returns a byte string). codecs.open() opens the file in binary mode by default; you have to specify an encoding name to open it in text mode.

In Python 3, open() opens the file in text mode by default. (It only opens in binary mode if the file mode contains "b".) The problem is that open() uses the locale encoding if the encoding is not specified, which is the case *by default*. The locale encoding can be:

- UTF-8 on Mac OS X and most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...

If you develop under Mac OS X or Linux, you may get surprises when you run your program on Windows, on the first non-ASCII character. You may not detect the problem if you only write text in English... until someone writes the first letter with a diacritic.

As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and to add a warning in Python 3.2 if open() is called without an explicit encoding and the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG="C").

I expect a lot of warnings from the Python standard library, and as many in third party modules and applications. So do you think that it is too late to change that in Python 3.3? One argument for changing it directly in Python 3.3 is that most users will not notice the change because their locale encoding is already UTF-8.
An alternative is to:
- Python 3.2: use the locale encoding, but emit a warning if the locale encoding is not UTF-8
- Python 3.3: use UTF-8 and emit a warning if the locale encoding is not UTF-8... or maybe always emit a warning?
- Python 3.4: use UTF-8 (but don't emit warnings anymore)

I don't think that Windows developers even know that they are writing files in the ANSI code page. The MSDN documentation of WideCharToMultiByte() warns developers that the ANSI code page is not portable, even across Windows computers:

"The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not."

It will always be possible to use the ANSI code page using encoding="mbcs" (which only works on Windows), or an explicit code page number (e.g. encoding="cp1252").

--

The two other (rejected?) options to improve open() are:
- raise an error if the encoding argument is not set: will break most programs
- emit a warning if the encoding argument is not set

--

Should I convert this email into a PEP, or is it not required?

Victor
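The locale-dependence described above can be demonstrated with a short sketch (file names here are illustrative):

```python
import locale

# Python 3's open() falls back to the locale's preferred encoding
# when no encoding argument is given:
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8', 'cp1252', 'ANSI_X3.4-1968'

# Portable: state the encoding explicitly.
with open("data_utf8.txt", "w", encoding="utf-8") as f:
    f.write("café\n")

# Non-portable: the bytes written here depend on the host's locale,
# and non-ASCII text may even fail with UnicodeEncodeError under LANG=C.
with open("data_locale.txt", "w") as f:
    f.write("cafe\n")  # ASCII-only here, to stay safe on any locale
```

The first file always contains the same bytes on every machine; the second does not.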

Victor Stinner wrote:
In Python 2, open() opens the file in binary mode (e.g. file.readline() returns a byte string). codecs.open() opens the file in binary mode by default; you have to specify an encoding name to open it in text mode.
In Python 3, open() opens the file in text mode by default. (It only opens in binary mode if the file mode contains "b".) The problem is that open() uses the locale encoding if the encoding is not specified, which is the case *by default*. The locale encoding can be:
- UTF-8 on Mac OS X and most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
If you develop under Mac OS X or Linux, you may get surprises when you run your program on Windows, on the first non-ASCII character. You may not detect the problem if you only write text in English... until someone writes the first letter with a diacritic.
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ? That'll make it compatible to the Py2 world again and avoid all the encoding guessing.

Making such default encodings depend on the locale has already failed to work when we first introduced a default encoding in Py2, so I don't understand why we are repeating the same mistake again in Py3 (only in a different area).

Note that in Py2, Unix applications often leave out the 'b' mode, since there's no difference between using it or not. Only on Windows, you'll see a difference.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 28 2011)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 6/28/2011 10:02 AM, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
That'll make it compatible to the Py2 world again
I disagree. I believe S = open('myfile.txt').read() now returns a text string in both Py2 and Py3, and a subsequent 'abc' in S works in both.
and avoid all the encoding guessing.
Making such default encodings depend on the locale has already failed to work when we first introduced a default encoding in Py2, so I don't understand why we are repeating the same mistake again in Py3 (only in a different area).
I do not remember any proposed change during the Py3 design discussions.
Note that in Py2, Unix applications often leave out the 'b' mode, since there's no difference between using it or not.
I believe it makes a difference now as to whether one gets str or bytes.
Only on Windows, you'll see a difference.
I believe the only difference now on Windows is the decoding used, not the return type. -- Terry Jan Reedy

On 28/06/2011 15:36, Terry Reedy wrote:
On 6/28/2011 10:02 AM, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
That'll make it compatible to the Py2 world again
I disagree. I believe S = open('myfile.txt').read() now returns a text string in both Py2 and Py3, and a subsequent 'abc' in S works in both.
Nope, it returns a bytestring in Python 2. Mistakenly treating bytestrings as text is one of the things we aimed to correct in the transition to Python 3. Michael
and avoid all the encoding guessing.
Making such default encodings depend on the locale has already failed to work when we first introduced a default encoding in Py2, so I don't understand why we are repeating the same mistake again in Py3 (only in a different area).
I do not remember any proposed change during the Py3 design discussions.
Note that in Py2, Unix applications often leave out the 'b' mode, since there's no difference between using it or not.
I believe it makes a difference now as to whether one gets str or bytes.
Only on Windows, you'll see a difference.
I believe the only difference now on Windows is the decoding used, not the return type.
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html

On 6/28/2011 10:48 AM, Michael Foord wrote:
On 28/06/2011 15:36, Terry Reedy wrote:
S = open('myfile.txt').read() now returns a text string in both Py2 and Py3 and a subsequent 'abc' in S works in both.
Nope, it returns a bytestring in Python 2.
Which, in Py2 is a str() object. In both Pythons, .read() in default mode returns an object of type str() and 'abc' is an object of type str() and so expressions involving undecorated string literals and input just work, but would not work if input defaulted to bytes in Py 3. Sorry if I was not clear enough. -- Terry Jan Reedy

On 28/06/2011 17:34, Terry Reedy wrote:
On 6/28/2011 10:48 AM, Michael Foord wrote:
On 28/06/2011 15:36, Terry Reedy wrote:
S = open('myfile.txt').read() now returns a text string in both Py2 and Py3 and a subsequent 'abc' in S works in both.
Nope, it returns a bytestring in Python 2.
Which, in Py2 is a str() object.
Yes, but not a "text string". The equivalent of the Python 2 str in Python 3 is bytes. Irrelevant discussion anyway.
In both Pythons, .read() in default mode returns an object of type str() and 'abc' is an object of type str() and so expressions involving undecorated string literals and input just work, but would not work if input defaulted to bytes in Py 3. Sorry if I was not clear enough.
Well, I think you're both right. Both semantics break some assumption or other. All the best, Michael

Michael Foord wrote:
On 28/06/2011 17:34, Terry Reedy wrote:
On 6/28/2011 10:48 AM, Michael Foord wrote:
On 28/06/2011 15:36, Terry Reedy wrote:
S = open('myfile.txt').read() now returns a text string in both Py2 and Py3 and a subsequent 'abc' in S works in both.
Nope, it returns a bytestring in Python 2.
Which, in Py2 is a str() object.
Yes, but not a "text string". The equivalent of the Python 2 str in Python 3 is bytes. Irrelevant discussion anyway.
Irrelevant to the OP, yes, but a Python 2 string *is not* the same as Python 3 bytes. If you don't believe me fire up your Python 3 shell and try b'xyz'[1] == 'y'. ~Ethan~

Ethan Furman wrote:
Michael Foord wrote:
On 28/06/2011 17:34, Terry Reedy wrote:
On 6/28/2011 10:48 AM, Michael Foord wrote:
On 28/06/2011 15:36, Terry Reedy wrote:
S = open('myfile.txt').read() now returns a text string in both Py2 and Py3 and a subsequent 'abc' in S works in both.
Nope, it returns a bytestring in Python 2.
Which, in Py2 is a str() object.
Yes, but not a "text string". The equivalent of the Python 2 str in Python 3 is bytes. Irrelevant discussion anyway.
Irrelevant to the OP, yes, but a Python 2 string *is not* the same as Python 3 bytes. If you don't believe me fire up your Python 3 shell and try b'xyz'[1] == 'y'.
er, make that b'xyz'[1] == b'y' :(
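Incidentally, even the corrected comparison is false in Python 3, because indexing a bytes object yields an int rather than a length-1 bytes. A quick sketch:

```python
data = b'xyz'

# Python 2: data[1] is the one-character str 'y'.
# Python 3: indexing bytes yields an integer code point.
assert data[1] == 121        # ord('y')
assert data[1] != b'y'       # int vs bytes: never compares equal

# Slicing, not indexing, returns a bytes object:
assert data[1:2] == b'y'
```

This is one more way the Python 2 str and Python 3 bytes types differ in behaviour, not just in name.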

Terry Reedy <tjreedy@udel.edu> wrote:
Making such default encodings depend on the locale has already failed to work when we first introduced a default encoding in Py2, so I don't understand why we are repeating the same mistake again in Py3 (only in a different area).
I do not remember any proposed change during the Py3 design discussions.
I certainly proposed it, more than once. Bill

M.-A. Lemburg <mal@egenix.com> wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
+1.
That'll make it compatible to the Py2 world again and avoid all the encoding guessing.
Yep. Bill

On 06/28/2011 12:52 PM, Bill Janssen wrote:
M.-A. Lemburg <mal@egenix.com> wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
+1.
That'll make it compatible to the Py2 world again and avoid all the encoding guessing.
Yep.
+1 from me, as well: "in the face of ambiguity, refuse the temptation to guess." Tres. -- Tres Seaver tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
I tried your suggested change: Python doesn't start. sysconfig uses the implicit locale encoding to read sysconfig.cfg, the Makefile and pyconfig.h. I think that it is correct to use the locale encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg.

Python requires more changes just to run "make". I was able to run "make" by using encoding='utf-8' in various functions (of distutils and setup.py). I didn't try the test suite; I expect too many failures.

--

Then I tried my suggestion (use "utf-8" by default): Python starts correctly, I can build it (run "make") and... the full test suite passes without any change. (I'm testing on Linux; my locale encoding is UTF-8.)

Victor

Victor Stinner wrote:
On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
I tried your suggested change: Python doesn't start.
No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible.
sysconfig uses the implicit locale encoding to read sysconfig.cfg, the Makefile and pyconfig.h. I think that it is correct to use the locale encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg.
Python requires more changes just to run "make". I was able to run "make" by using encoding='utf-8' in various functions (of distutils and setup.py). I didn't try the test suite; I expect too many failures.
This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup.
--
Then I tried my suggestion (use "utf-8" by default): Python starts correctly, I can build it (run "make") and... the full test suite passes without any change. (I'm testing on Linux; my locale encoding is UTF-8.)
I bet it would also pass with "ascii" in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding.

Linux is also a poor test candidate for this, since most user setups will use UTF-8 as the locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot more easily. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 29 2011)

On Wednesday 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
Victor Stinner wrote:
On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
I tried your suggested change: Python doesn't start.
No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible.
It means that all programs written for Python 3.0, 3.1, 3.2 will stop working with the new 3.x version (say 3.3). Users will have to migrate from Python 2 to Python 3.2, and then migrate from Python 3.2 to Python 3.3 :-( I would prefer a ResourceWarning (emitted if the encoding is not specified), hidden by default: it doesn't break compatibility, and -Werror gives exactly the same behaviour that you expect.
This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup.
I did more tests. I found some mistakes, and sometimes the binary mode can be used, but most functions really expect the locale encoding (it is the correct encoding to read and write such files). I agree that it would be nice to have an explicit encoding="locale", but making it mandatory is a little bit rude.
Then I tried my suggestion (use "utf-8" by default): Python starts correctly, I can build it (run "make") and... the full test suite passes without any change. (I'm testing on Linux; my locale encoding is UTF-8.)
I bet it would also pass with "ascii" in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding.
Linux is also a poor test candidate for this, since most user setups will use UTF-8 as the locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot more easily.
I also ran the test suite on my patched Python (open uses UTF-8 by default) with the ASCII locale encoding (LANG=C); the test suite also passes. Many tests use non-ASCII characters; some of them are skipped if the locale encoding is unable to encode the tested text. Victor

Victor Stinner wrote:
On Wednesday 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
Victor Stinner wrote:
On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ?
I tried your suggested change: Python doesn't start.
No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible.
It means that all programs written for Python 3.0, 3.1, 3.2 will stop working with the new 3.x version (say 3.3). Users will have to migrate from Python 2 to Python 3.2, and then migrate from Python 3.2 to Python 3.3 :-(
I wasn't suggesting doing this for 3.3, but we may want to start the usual feature change process to make the change eventually happen.
I would prefer a ResourceWarning (emitted if the encoding is not specified), hidden by default: it doesn't break compatibility, and -Werror gives exactly the same behaviour that you expect.
ResourceWarning is the wrong type of warning for this. I'd suggest using a UnicodeWarning, or perhaps creating a new EncodingWarning instead.
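A sketch of what such an EncodingWarning-based check could look like. The EncodingWarning class, the checked_open wrapper, and the locale_encoding parameter are all hypothetical names introduced here for illustration, not part of any real API:

```python
import builtins
import locale
import warnings

class EncodingWarning(UserWarning):
    """Hypothetical warning class, per the suggestion above."""

_real_open = builtins.open

def checked_open(file, mode='r', *args, encoding=None,
                 locale_encoding=None, **kwargs):
    # Warn only for text mode without an explicit encoding, and only
    # when the locale encoding is not already UTF-8.  The
    # locale_encoding parameter exists only to make the check testable.
    if 'b' not in mode and encoding is None:
        enc = locale_encoding or locale.getpreferredencoding(False)
        if enc.lower().replace('-', '') != 'utf8':
            warnings.warn("open() called without an explicit encoding",
                          EncodingWarning, stacklevel=2)
    return _real_open(file, mode, *args, encoding=encoding, **kwargs)
```

Run under -Werror, such a warning would turn every implicit-encoding open() into a hard failure, which is roughly what Victor's proposal aims for during the transition.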
This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup.
I did more tests. I found some mistakes, and sometimes the binary mode can be used, but most functions really expect the locale encoding (it is the correct encoding to read and write such files). I agree that it would be nice to have an explicit encoding="locale", but making it mandatory is a little bit rude.
Again: Using a locale based default encoding will not work out in the long run. We've had those discussions many times in the past. I don't think there's anything bad about requiring the user to set an encoding if he wants to read text. It makes him/her think twice about the encoding issue, which is good. And, of course, the stdlib should start using this explicit-is-better-than-implicit approach as well.
Then I tried my suggestion (use "utf-8" by default): Python starts correctly, I can build it (run "make") and... the full test suite passes without any change. (I'm testing on Linux; my locale encoding is UTF-8.)
I bet it would also pass with "ascii" in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding.
Linux is also a poor test candidate for this, since most user setups will use UTF-8 as the locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot more easily.
I also ran the test suite on my patched Python (open uses UTF-8 by default) with the ASCII locale encoding (LANG=C); the test suite also passes. Many tests use non-ASCII characters; some of them are skipped if the locale encoding is unable to encode the tested text.
Thanks for checking. So the build process and test suite are indeed not suitable test cases for the problem at hand. With just ASCII files to decode, Python will simply never fail to decode the content, regardless of whether you use an ASCII, UTF-8 or some Windows code page as locale encoding. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 29 2011)

On Tue, 28 Jun 2011 15:43:05 +0200 Victor Stinner <victor.stinner@haypocalc.com> wrote:
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
Why would utf-8 be the right thing in these cases? Regards Antoine.

On 6/28/2011 10:06 AM, Antoine Pitrou wrote:
On Tue, 28 Jun 2011 15:43:05 +0200 Victor Stinner<victor.stinner@haypocalc.com> wrote:
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
Why would utf-8 be the right thing in these cases?
Because utf-8 is the only way to write out any Python 3 text. By default, writing and reading an str object should work on all Python installations. And because other apps are (increasingly) using it for exactly the same reason. -- Terry Jan Reedy

On Tue, 28 Jun 2011 10:41:38 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
On 6/28/2011 10:06 AM, Antoine Pitrou wrote:
On Tue, 28 Jun 2011 15:43:05 +0200 Victor Stinner<victor.stinner@haypocalc.com> wrote:
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
Why would utf-8 be the right thing in these cases?
Because utf-8 is the only way to write out any Python 3 text.
Er, no, you also have utf-16, utf-32, utf-7 (and possibly others, including home-baked encodings).
By default, writing and reading an str object should work on all Python installations.
But that's only half of the problem. If the text is supposed to be read or processed by some other program, then writing it in some encoding that the other program doesn't expect doesn't really help. That's why we use the locale encoding: because it's a good guess as to what the system (and its users) expects text to be encoded in. Regards Antoine.

On 6/28/2011 9:43 AM, Victor Stinner wrote:
In Python 2, open() opens the file in binary mode (e.g. file.readline() returns a byte string). codecs.open() opens the file in binary mode by default; you have to specify an encoding name to open it in text mode.
In Python 3, open() opens the file in text mode by default. (It only opens in binary mode if the file mode contains "b".) The problem is that open() uses the locale encoding if the encoding is not specified, which is the case *by default*. The locale encoding can be:
- UTF-8 on Mac OS X and most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- the ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
If you develop under Mac OS X or Linux, you may get surprises when you run your program on Windows, on the first non-ASCII character. You may not detect the problem if you only write text in English... until someone writes the first letter with a diacritic.
As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and to add a warning in Python 3.2 if open() is called without an explicit encoding and the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG="C").
I expect a lot of warnings from the Python standard library, and as many in third party modules and applications. So do you think that it is too late to change that in Python 3.3? One argument for changing it directly in Python 3.3 is that most users will not notice the change because their locale encoding is already UTF-8.
An alternative is to:
- Python 3.2: use the locale encoding, but emit a warning if the locale encoding is not UTF-8
- Python 3.3: use UTF-8 and emit a warning if the locale encoding is not UTF-8... or maybe always emit a warning?
- Python 3.4: use UTF-8 (but don't emit warnings anymore)
I don't think that Windows developers even know that they are writing files in the ANSI code page. The MSDN documentation of WideCharToMultiByte() warns developers that the ANSI code page is not portable, even across Windows computers:
"The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not."
It will always be possible to use the ANSI code page using encoding="mbcs" (which only works on Windows), or an explicit code page number (e.g. encoding="cp1252").
--
The two other (rejected?) options to improve open() are:
- raise an error if the encoding argument is not set: will break most programs
- emit a warning if the encoding argument is not set
--
Should I convert this email into a PEP, or is it not required?
I think a PEP is needed. -- Terry Jan Reedy

On 28.06.2011 14:24, Terry Reedy wrote:
As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and to add a warning in Python 3.2 if open() is called without an explicit encoding and the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG="C").
[...]
Should I convert this email into a PEP, or is it not required?
I think a PEP is needed.
Absolutely. And I hope the hypothetical PEP would be rejected in this form.

We need to stop making incompatible changes to Python 3. We had the chance, and took it, to break all kinds of stuff, some of it gratuitous, with 3.0 and even 3.1. Now the users need a period of compatibility and stability (just like the language moratorium provided for one aspect of Python).

Think about porting: Python 3 uptake is not ahead of schedule (I don't want to say it's too slow, but it's certainly not too fast). For the sake of porters' sanity, 3.x should not be a moving target. New features are not so much of a problem, but incompatibilities like this one certainly are.

At the very least, a change like this needs a transitional strategy, like the one used during the 2.x series:

* In 3.3, accept "locale" as the encoding parameter, meaning the locale encoding
* In 3.4, warn if encoding isn't given and the locale encoding isn't UTF-8
* In 3.5, change the default encoding to UTF-8

It might be just enough to stress in the documentation that usage of the encoding parameter is recommended for cross-platform consistency.

cheers, Georg
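The transitional "locale" spelling in the first step could be prototyped today with a small wrapper. This is a sketch; the name open_locale_alias is made up for illustration:

```python
import locale

def open_locale_alias(file, mode='r', encoding=None, **kwargs):
    # Hypothetical: treat the string "locale" as an explicit,
    # greppable request for the locale's preferred encoding,
    # instead of leaving the choice implicit.
    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)
    return open(file, mode, encoding=encoding, **kwargs)
```

Such a spelling keeps scripts explicit about *wanting* locale behaviour, which is what makes it possible to change the implicit default later without silently changing those scripts.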

On 6/28/2011 5:42 PM, Georg Brandl wrote:
At the very least, a change like this needs a transitional strategy, like it has been used during the 2.x series:
* In 3.3, accept "locale" as the encoding parameter, meaning the locale encoding
* In 3.4, warn if encoding isn't given and the locale encoding isn't UTF-8
* In 3.5, change the default encoding to UTF-8
3.5 should be 4-5 years off. I actually would not propose anything faster than that. -- Terry Jan Reedy

On Wed, Jun 29, 2011 at 7:42 AM, Georg Brandl <g.brandl@gmx.net> wrote:
On 28.06.2011 14:24, Terry Reedy wrote:
I think a PEP is needed.
Absolutely. And I hope the hypothetical PEP would be rejected in this form.
We need to stop making incompatible changes to Python 3. We had the chance and took it to break all kinds of stuff, some of it gratuitous, with 3.0 and even 3.1. Now the users need a period of compatibility and stability (just like the language moratorium provided for one aspect of Python).
+1 to everything Georg said.

- nothing can change in 3.2
- perhaps provide a way for an application to switch the default behaviour between 'locale' and 'utf-8' in 3.3
- if this is done, also provide a way to explicitly request the 'locale' behaviour (likely via a locale dependent codec alias)
- maybe start thinking about an actual transition to 'utf-8' as default in the 3.4/5 time frame

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jun 28, 2011, at 09:42 PM, Georg Brandl wrote:
We need to stop making incompatible changes to Python 3. We had the chance and took it to break all kinds of stuff, some of it gratuitous, with 3.0 and even 3.1. Now the users need a period of compatibility and stability (just like the language moratorium provided for one aspect of Python).
+1. I think this is the #1 complaint I hear about Python in talking to users. I think in general we do a pretty good job of maintaining backward compatibility between releases, but not a perfect job, and the places where we miss can be painful for folks. It may be difficult to achieve in all cases, but compatibility should be carefully and thoroughly considered for all changes, especially in the stdlib, and clearly documented where deliberate decisions to break that are adopted. -Barry

On 28 June 2011 14:43, Victor Stinner <victor.stinner@haypocalc.com> wrote:
As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if open() is called without an explicit encoding and if the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG="C").
-1. This will make things harder for simple scripts which are not intended to be cross-platform. I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors, where currently things will "just work". And the failures will be data dependent, and hence intermittent (the worst type of problem). I'll write a quick script, use it once and it'll be fine, then use it later on some different data and get an error. :-( I appreciate that the point here is to make sure that people think a bit more carefully about encoding issues. But doing so by making Python less friendly for casual, ad hoc script use seems to me to be a mistake.
I don't think that Windows developers even know that they are writing files in the ANSI code page. The MSDN documentation of WideCharToMultiByte() warns developers that the ANSI code page is not portable, even across Windows computers:
Probably true. But for many uses they also don't care. If you're writing something solely for a one-off job on your own PC, the ANSI code page is fine, and provides interoperability with other programs on your PC, which is really what you care about. (UTF-8 without BOM displays incorrectly in Vim, wordpad, and powershell get-content. MBCS works fine in all of these. It also displays incorrectly in CMD type, but in a less familiar form than the incorrect display mbcs produces, for what that's worth...)
It will always be possible to use the ANSI code page using encoding="mbcs" (only works on Windows), or an explicit code page number (e.g. encoding="cp1252").
So, in effect, you propose making the default favour writing multiplatform portable code at the expense of quick and dirty scripts? My personal view is that this is the wrong choice ("practicality beats purity") but I guess it's ultimately a question of Python's design philosophy.
The two other (rejected?) options to improve open() are:
- raise an error if the encoding argument is not set: will break most programs
- emit a warning if the encoding argument is not set
IMHO, you missed another option - open() does not need improving, the current behaviour is better than any of the 3 options noted. Paul.
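For what it's worth, the explicit spellings mentioned above already work today. A tiny round-trip sketch (the filename is illustrative; encoding="mbcs" only exists on Windows, so cp1252, the Western European ANSI code page, is used here as a portable stand-in):

```python
# Explicitly selecting an ANSI code page by number works on any platform.
with open("example.txt", "w", encoding="cp1252") as f:
    f.write("£10")

# Reading it back with the same explicit encoding is lossless.
with open("example.txt", encoding="cp1252") as f:
    assert f.read() == "£10"
```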

@ Paul Moore <p.f.moore@gmail.com> wrote (2011-06-28 16:46+0200):
UTF-8 without BOM displays incorrectly in vim(1)
Stop right now (you're oh so wrong)! :-) (By the way: UTF-8 and BOM? Interesting things i learn on this list. And i hope in ten years we can laugh about this -> UTF-8 transition all over the place, 'cause it's simply working.) -- Ciao, Steffen sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments

On 28 June 2011 16:06, Steffen Daode Nurpmeso <sdaoden@googlemail.com> wrote:
@ Paul Moore <p.f.moore@gmail.com> wrote (2011-06-28 16:46+0200):
UTF-8 without BOM displays incorrectly in vim(1)
Stop right now (you're oh so wrong)! :-)
Sorry. Please add "using the default settings of gvim on Windows". My context throughout was Windows not Unix. Sorry I didn't make that clear.
(By the way: UTF-8 and BOM?
Windows uses it, I believe. My tests specifically used files with no BOM, just utf8-encoded text. I made this statement to head off people assuming that UTF8 can be detected in Windows by looking at the first few bytes.
Interesting things i learn on this list.
:-)
And i hope in ten years we can laugh about this -> UTF-8 transition all over the place, 'cause it's simply working.)
That would be good... Paul.

On Tue, Jun 28, 2011 at 03:46:12PM +0100, Paul Moore wrote:
On 28 June 2011 14:43, Victor Stinner <victor.stinner@haypocalc.com> wrote:
As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if open() is called without an explicit encoding and if the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG="C").
-1. This will make things harder for simple scripts which are not intended to be cross-platform.
I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors, where currently things will "just work". And the failures will be data dependent, and hence intermittent (the worst type of problem). I'll write a quick script, use it once and it'll be fine, then use it later on some different data and get an error. :-(
I don't think this change would make things "harder". It will just move where the pain occurs. Right now, the failures are intermittent A) on computers other than the one that you're using, or B) when run under a different user than yourself. Sys admins where I'm at are constantly writing ad hoc scripts in Python that break because you stick something in a cron job, the locale settings suddenly become "C", and therefore the script suddenly only deals with ASCII characters.

I don't know that Victor's proposed solution is the best (I personally would like it a whole lot more than the current guessing, but I never develop on Windows so I can certainly see that your environment can lead to the opposite assumption :-) but something should change here. Issuing a warning like "open used without explicit encoding may lead to errors" if open() is used without an explicit encoding would help a little (at least, people who get errors would then have an inkling that the culprit might be an open() call). If I read Victor's previous email correctly, though, he said this was previously rejected.

Another brainstorming solution would be to use different default encodings on different platforms. For instance, for writing files, utf-8 on *nix systems (including Mac OS X) and utf-16 on Windows. For reading files, check for a utf-16 BOM; if not present, operate as utf-8. That would seem to address your issue with detection by vim, etc., but I'm not sure about getting "£" in your input stream. I don't know where your input is coming from and how the Windows equivalent of locale plays into that. -Toshio
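The cron scenario above is easy to reproduce, since the encoding open() falls back on comes from locale.getpreferredencoding() and thus from the environment (the exact values in the comments vary by system):

```python
import locale

# In a desktop session with e.g. LANG=en_GB.UTF-8 this typically reports
# "UTF-8"; under cron, where LANG is often unset or "C", it can report an
# ASCII codec such as "ANSI_X3.4-1968" instead -- so the very same script
# decodes the very same file differently depending on who runs it, and how.
print(locale.getpreferredencoding())
```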

Le mardi 28 juin 2011 à 09:33 -0700, Toshio Kuratomi a écrit :
Issuing a warning like "open used without explicit encoding may lead to errors" if open() is used without an explicit encoding would help a little (at least, people who get errors would then have an inkling that the culprit might be an open() call). If I read Victor's previous email correctly, though, he said this was previously rejected.
Oh sorry, I used the wrong word. I listed two other possible solutions, but they were not really rejected. I just thought that changing the default encoding to UTF-8 was the most widely accepted idea. If I mix different suggestions together: another solution is to emit a warning if the encoding is not specified (not only if the locale encoding is different from UTF-8). Using encoding="locale" would make it quiet. It would be annoying if the warning were displayed by default ("This will make things harder for simple scripts which are not intended to be cross-platform." wrote Paul Moore). It only makes sense if we use the same policy as for unclosed files/sockets: hidden by default, but configurable using command line options (-Werror, yeah!).
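A minimal sketch of that idea as a wrapper rather than a change to the builtin (the helper name and warning message are my own, not part of any proposal):

```python
import locale
import warnings

def open_checked(file, mode="r", encoding=None, **kwargs):
    # Hypothetical helper: warn whenever a text-mode open() relies on the
    # implicit locale encoding. encoding="locale" is mapped to the locale
    # encoding explicitly and therefore stays quiet.
    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)
    elif "b" not in mode and encoding is None:
        warnings.warn("open() without an explicit encoding uses the "
                      "locale encoding", stacklevel=2)
    return open(file, mode, encoding=encoding, **kwargs)
```

With -Werror (or -Wd), every such call site would then show up during a test run.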
Another brainstorming solution would be to use different default encodings on different platforms. For instance, for writing files, utf-8 on *nix systems (including macosX) and utf-16 on windows.
I don't think that UTF-16 is a better choice than UTF-8 on Windows :-(
For reading files, check for a utf-16 BOM, if not present, operate as utf-8.
Oh oh. I already suggested reading the BOM. See http://bugs.python.org/issue7651 and read the email thread "Improve open() to support reading file starting with an unicode BOM" http://mail.python.org/pipermail/python-dev/2010-January/097102.html Reading the BOM is a can of worms, everybody expects something different. I gave up on the idea of changing that. Victor
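For reference, the BOM-sniffing idea boils down to something like the sketch below. The hard part is not the sniffing but the policy, e.g. whether an absent BOM should mean UTF-8 or the locale encoding (the function name and fallback are my own assumptions):

```python
import codecs

def guess_bom_encoding(path, default="utf-8"):
    # Peek at the first bytes and pick a codec if a known BOM is present.
    # UTF-32 must be checked before UTF-16, because the UTF-32-LE BOM
    # starts with the same two bytes as the UTF-16-LE BOM.
    with open(path, "rb") as f:
        head = f.read(4)
    boms = [(codecs.BOM_UTF32_LE, "utf-32-le"),
            (codecs.BOM_UTF32_BE, "utf-32-be"),
            (codecs.BOM_UTF8, "utf-8-sig"),
            (codecs.BOM_UTF16_LE, "utf-16-le"),
            (codecs.BOM_UTF16_BE, "utf-16-be")]
    for bom, name in boms:
        if head.startswith(bom):
            return name
    return default
```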

On 6/28/2011 10:46 AM, Paul Moore wrote:
I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors,
I do not understand this. With utf-8 you would never get a string encoding error.
where currently things will "just work".
As long as you only use the machine-dependent restricted character set.
And the failures will be data dependent, and hence intermittent (the worst type of problem).
That is the situation now, with platform/machine dependencies added in. Some people share code with other machines, even locally.
So, in effect, you propose making the default favour writing multiplatform portable code at the expense of quick and dirty scripts?
Let us frame it another way. Should Python installations be compatible with other Python installations, or with the other apps on the same machine? Part of the purpose of Python is to cover up platform differences, to the extent possible (and perhaps sensible -- there is the argument). This was part of the purpose of writing our own io module instead of using the compiler stdlib. The evolution of floating point math has gone in the same direction. For instance, float now expects uniform platform-independent Python-dependent names for infinity and nan instead of compiler-dependent names.

As for practicality: Notepad++ on Windows offers ANSI, utf-8 (with/without BOM), and utf-16 (big/little endian). I believe that ODF documents are utf-8 encoded xml (compressed or not). My original claim for this proposal was/is that even Windows apps are moving to utf-8 and that someday making that the default for Python everywhere will be the obvious and sensible thing. -- Terry Jan Reedy

On 28/06/2011 18:06, Terry Reedy wrote:
On 6/28/2011 10:46 AM, Paul Moore wrote:
I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors,
I do not understand this. With utf-8 you would never get a string encoding error.
I assumed he meant that files written out as utf-8 by python would then be read in using the platform encoding (i.e. not utf-8 on Windows) by the other applications he is inter-operating with. The error would not be in Python but in those applications.
where currently things will "just work".
As long as you only use the machine-dependent restricted character set.
Which is the situation he is describing. You do go into those details below, and which choice is "correct" depends on which trade-off you want to make. For the sake of backwards compatibility we are probably stuck with the current trade-off however - unless we deprecate using open(...) without an explicit encoding. All the best, Michael
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html

On 28 June 2011 18:22, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
On 28/06/2011 18:06, Terry Reedy wrote:
On 6/28/2011 10:46 AM, Paul Moore wrote:
I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors,
I do not understand this. With utf-8 you would never get a string encoding error.
I assumed he meant that files written out as utf-8 by python would then be read in using the platform encoding (i.e. not utf-8 on Windows) by the other applications he is inter-operating with. The error would not be in Python but in those applications.
That is correct. Or files written out (as platform encoding) by other applications, will later be read in as UTF-8 by Python, and be seen as incorrect characters, or worse raise decoding errors. (Sorry, in my original post I said "encoding" where I meant "decoding"...) I'm not interested in allocating "blame" for the "error". I'm not convinced that it *is* an error, merely 2 programs with incompatible assumptions. What I'm saying is that compatibility between various programs on a single machine can, in some circumstances, be more important than compatibility between (the same, or different) programs running on different machines or OSes. And that I, personally, am in that situation.
where currently things will "just work".
As long as you only use the machine-dependent restricted character set.
Which is the situation he is describing. You do go into those details below, and which choice is "correct" depends on which trade-off you want to make.
For the sake of backwards compatibility we are probably stuck with the current trade-off however - unless we deprecate using open(...) without an explicit encoding.
Backward compatibility is another relevant point. But other than that, it's a design trade-off, agreed. All I'm saying is that I see the current situation (which favours quick script use and beginner friendliness at the expense of conceptual correctness and of forcing the user to think about his choices) as preferable, and arguably more "Pythonic", in the sense that I see it as a case of "practicality beats purity" - although it's easy to argue that "in the face of ambiguity..." also applies here :-) Paul.

On Tue, 28 Jun 2011 13:06:44 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
As for practicality. Notepad++ on Windows offers ANSI, utf-8 (w,w/o BOM), utf-16 (big/little endian).
Well, that's *one* application. We would need much more data than that.
I believe that ODF documents are utf-8 encoded xml (compressed or not).
XML doesn't matter for this discussion, since it explicitly declares the encoding. What we are talking about is "raw" text files that don't have an encoding declaration and for which the data format doesn't specify any default encoding (which also rules out Python source code, by the way).
My original claim for this proposal was/is that even Windows apps are moving to uft-8 and that someday making that the default for Python everywhere will be the obvious and sensible thing.
True, but that may be 5 or 10 years from now. Regards Antoine.

On 28.06.2011 19:06, Terry Reedy wrote:
On 6/28/2011 10:46 AM, Paul Moore wrote:
I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors,
I do not understand this. With utf-8 you would never get a string encoding error.
Yes, but you'll get plenty of *decoding* errors. Georg

I don't think that Windows developers even know that they are writing files in the ANSI code page. The MSDN documentation of WideCharToMultiByte() warns developers that the ANSI code page is not portable, even across Windows computers:
Probably true. But for many uses they also don't care. If you're writing something solely for a one-off job on your own PC, the ANSI code page is fine, and provides interoperability with other programs on your PC, which is really what you care about. (UTF-8 without BOM displays incorrectly in Vim, wordpad, and powershell get-content.
I tried to open a text file encoded as UTF-8 (without BOM) on Windows 7. The default application displays it correctly: it's the well known builtin notepad program. gvim is unable to detect the encoding; it reads the file using the ANSI code page (WTF? UTF-8 is correctly detected on Linux!?). Wordpad reads the file using the ANSI code page; it is unable to detect the UTF-8 encoding. The "type" command in an MS-DOS shell (cmd.exe) doesn't display the UTF-8 correctly, but a file encoded in the ANSI code page is also displayed incorrectly. I suppose that the problem is that the terminal uses the OEM code page, not the ANSI code page. Visual C++ 2008 detects the UTF-8 encoding. I don't have other applications to test on my Windows 7 machine. I agree that UTF-8 is not well supported by "standard" Windows applications. I would at least expect Wordpad and gvim to be able to detect the UTF-8 encoding.
MBCS works fine in all of these. It also displays incorrectly in CMD type, but in a less familiar form than the incorrect display mbcs produces, for what that's worth...)
True, a text file encoded in the ANSI code page is correctly detected by all these applications (except "type" in a shell, which should be the OEM/ANSI code page conflict).
IMHO, you missed another option - open() does not need improving, the current behaviour is better than any of the 3 options noted.
My original need is to detect that my program will behave differently on Linux and Windows, because open() uses the implicit locale encoding. Antoine suggested that I monkeypatch __builtins__.open to do that. Victor

Le 28/06/2011 16:46, Paul Moore a écrit :
-1. This will make things harder for simple scripts which are not intended to be cross-platform.
+1 to all you said. I frequently use the python command prompt or "python -c" for various quick tasks (mostly on linux). I would hate to replace my ugly, but working
open('example.txt').read()
with the unnecessarily verbose
open('example.txt', encoding='utf-8').read()
When using python that way as a "swiss army knife", typing does matter. My preferred solution would be:
- emit a warning if the encoding argument is not set
By the way, I just thought that for real programming, I would love to have a -Wcrossplatform command switch, which would warn for all unportable constructs in one go. That way, I don't have to remember which parts of 'os' wrap posix-only functionality. Baptiste

Le mercredi 29 juin 2011 à 09:21 +0200, Baptiste Carvello a écrit :
By the way, I just thought that for real programming, I would love to have a -Wcrossplatform command switch, which would warn for all unportable constructs in one go. That way, I don't have to remember which parts of 'os' wrap posix-only functionality.
When I developed using PHP, error_reporting(E_ALL) was really useful. I would like the same kind of function in Python, but I realized that it is not necessary. Python is already strict *by default*. Python can help developers by warning them about some "corner cases". We already have the -bb option for bytes/str warnings (in Python 3), -Werror to convert warnings to exceptions, and ResourceWarning (since Python 3.2) for unclosed files/sockets. I "just" would like a new warning for an implicit locale encoding, so -Wcrossplatform would be as easy as -Werror. -Werror is like Perl's "use strict;" or PHP's error_reporting(E_ALL). Use -Wd if you prefer to display warnings instead of raising exceptions. See issues #11455 and #11470 for a new "CompatibilityWarning"; it's not "cross platform" but "cross Python" :-) It warns about implementation details like non-string keys in a type's dict. Victor

Victor Stinner, 28.06.2011 15:43:
In Python 2, open() opens the file in binary mode (e.g. file.readline() returns a byte string). codecs.open() opens the file in binary mode by default, you have to specify an encoding name to open it in text mode.
In Python 3, open() opens the file in text mode by default. (It only opens the binary mode if the file mode contains "b".) The problem is that open() uses the locale encoding if the encoding is not specified, which is the case *by default*. The locale encoding can be:
- UTF-8 on Mac OS X, most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...
If you develop under Mac OS X or Linux, you may have surprises when you run your program on Windows on the first non-ASCII character. You may not detect the problem if you only write text in English... until someone writes the first letter with a diacritic.
I agree that this is a *very* common source of problems. People write code that doesn't care about encodings all over the place, and are then surprised when it stops working at some point, either by switching environments or by changing the data. I've seen this in virtually all projects I've ever come to work in[1]. So, eventually, all of that code was either thrown away or got debugged and fixed to use an explicit (and usually configurable) encoding.

Consequently, I don't think it's a bad idea to break out of this ever recurring development cycle by either requiring an explicit encoding right from the start, or by making the default encoding platform independent. The opportunity to fix this was very unfortunately missed in Python 3.0. Personally, I don't buy the argument that it's harder to write quick scripts if an explicit encoding is required. Most code that gets written is not just quick scripts, and even those tend to live longer than initially intended. Stefan

[1] Admittedly, most of those projects were in Java, where the situation is substantially worse than in Python. Java entirely lacks a way to define a per-module source encoding, and it even lacks a straightforward way to encode/decode a file with an explicit encoding. So, by default, *both* input encodings are platform dependent, whereas in Python it's only the default file encoding, and properly decoding a file is straightforward there.
participants (16)
- Antoine Pitrou
- Baptiste Carvello
- Barry Warsaw
- Bill Janssen
- Ethan Furman
- Georg Brandl
- M.-A. Lemburg
- Michael Foord
- Nick Coghlan
- Paul Moore
- Stefan Behnel
- Steffen Daode Nurpmeso
- Terry Reedy
- Toshio Kuratomi
- Tres Seaver
- Victor Stinner