![](https://secure.gravatar.com/avatar/20f57595c7d4893e8ab834e6f69fc449.jpg?s=120&d=mm&r=g)
M.-A. Lemburg:
2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.
+1 on this one (s/ASCII/Python's default encoding).
I assume you mean the result of sys.getdefaultencoding() here. Unless much of the Python library is modified to use the default encoding, this will break. The problem is that different implicit encodings are being used for reading data and for accessing files. When calling a function, such as open, with a byte string, Python passes that byte string through to Windows which interprets it as being encoded in CP_ACP. When this differs from sys.getdefaultencoding() there will be a mismatch. Say I have been working on a machine set up for Australian English (or other Western European locale) but am working with Russian data so have set Python's default encoding to cp1251. With this simple script, g.py: import sys print file(sys.argv[1]).read() I process a file called '€.txt' with contents "European Euro" to produce C:\zed>python_d g.py €.txt European Euro With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251 to '\x88.txt' as the Euro is located at 0x88 in CP1251. The operating system is then asked to open '\x88.txt' which it interprets through CP_ACP to be u'\u02c6.txt' ('ˆ.txt') which then fails. If you are very unlucky there will be a file called 'ˆ.txt' so the call will succeed and produce bad data. Simulating with str(sys.argvu[1]): C:\zed>python_d g.py €.txt Traceback (most recent call last): File "g.py", line 2, in ? print file(str(sys.argvu[1])).read() IOError: [Errno 2] No such file or directory: '\x88.txt'
-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.
Features used to specify files (sys.argv, os.environ, ...) should match functions used to open and perform other operations with files as they do currently. This means their encodings should match. Neil