[Python-Dev] Python-3.0, unicode, and os.environ
Toshio Kuratomi
a.badger at gmail.com
Thu Dec 4 21:02:19 CET 2008
I opened up bug http://bugs.python.org/issue4006 a while ago and it was
suggested in the report that it's not a bug but a feature and so I
should come here to see about getting the feature changed :-)
I have a specific problem with os.environ and a somewhat less important
architectural issue with the unicode/bytes handling in certain os.*
modules. I'll start with the important one:
Currently in python3 there's no way to get at environment variables that
are not encoded in the system default encoding. My understanding is
that this isn't a problem on Windows systems but on *nix this is a huge
problem. Environment variables on *nix are sequences of non-null
bytes. These bytes are almost always "characters" but they do not have
to be. Further, there is nothing that requires that the characters be
in the same encoding; some of the characters could be encoded in UTF-8
while others are in latin-1, shift-jis, or big-5.
These mixed encodings can occur for a variety of reasons. Here's an
example that isn't too contrived :-)
Swallow is a multi-user shell server hosted at a university in Japan.
The OS installed is Fedora 10 where the encoding of all filenames
provided by the OS is UTF-8. The administrator of the OS has kept this
convention and, among other things, has created a directory to mount an
NFS directory from another computer. He calls that "ネットワーク"
("network" in Japanese). Since it's utf-8, that gets put on the
filesystem as
'\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'
Now the administrators of the fileserver have been maintaining it since
before Unicode was invented. Furthermore, they don't want to suffer
from the space loss of using utf-8 to encode Japanese so they use
shift-jis everywhere. They have a directory on the nfs share for
programs that are useful for people on the shell server to access. It's
called "プログラム" ("programs" in Japanese). Since they're using
shift-jis, the bytes on the filesystem are:
'\x83v\x83\x8d\x83O\x83\x89\x83\x80'
The system administrator of the shell server adds the directory of
programs to all his users' default PATH variables so then they have this:
PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80
(Note: this is Python syntax. In the unix shell you'd likely have octal
instead of hex.)
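To make the mixed encodings concrete, here is a small sketch that rebuilds that PATH value as bytes and tries to decode each entry as UTF-8. The helper name decodable_entries is made up for illustration, and UTF-8 is assumed to be the default encoding on the shell server.

```python
# The two directory names from the example, encoded the way each machine
# stores them on disk.
mount_point = 'ネットワーク'.encode('utf-8')
programs = 'プログラム'.encode('shift_jis')
raw_path = b'/bin:/usr/bin:/usr/local/bin:/mnt/' + mount_point + b'/' + programs


def decodable_entries(raw, encoding='utf-8'):
    """Map each PATH entry (bytes) to its decoded str, or None on failure.

    Hypothetical helper, not part of the os module.
    """
    result = {}
    for entry in raw.split(b':'):
        try:
            result[entry] = entry.decode(encoding)
        except UnicodeDecodeError:
            # The shift-jis bytes are not valid UTF-8, so this entry
            # cannot be represented as str in the default encoding.
            result[entry] = None
    return result


entries = decodable_entries(raw_path)
```

Only the last entry maps to None, but because the environment variable is a single value, that one entry is enough to make the whole PATH undecodable.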
Now comes the problematic part. One of the users on the system wants
to write a python3 program that needs to determine if a needed program
is in the user's PATH. He tries to code it like this::
    #!/usr/bin/python3.0
    import os
    # PATH is a single string; split it into its directories first.
    for directory in os.environ['PATH'].split(os.pathsep):
        programs = os.listdir(directory)
That code raises a KeyError because python3 has silently discarded the
PATH due to the shift-jis encoded path elements. Much more importantly,
there's no way the programmer can handle the KeyError and actually get
the PATH from within python.
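For comparison, here is a rough sketch of what that user could write if the raw bytes of PATH were reachable. How raw_path is obtained is left open on purpose, since that is exactly the missing piece; the function name find_in_path is hypothetical, and encoding a str program name as UTF-8 is an assumption.

```python
import os


def find_in_path(program, raw_path):
    """Return the first directory (bytes) in raw_path that contains program.

    raw_path is the undecoded PATH value; program may be str or bytes.
    Hypothetical helper, not an existing os function.
    """
    target = program if isinstance(program, bytes) else program.encode('utf-8')
    for directory in raw_path.split(os.pathsep.encode('ascii')):
        try:
            # A bytes argument makes os.listdir() return bytes names, so no
            # decoding happens and no entry is dropped.
            if target in os.listdir(directory):
                return directory
        except OSError:
            # Empty, missing, or unreadable PATH entries are skipped.
            continue
    return None
```

Everything stays in bytes from the environment to the filesystem, so the encoding of the individual path elements never matters.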
In the bug report I opened, I listed four ways to fix this along with
the pros and cons:
1) return mixed unicode and byte types in os.environ and os.getenv
- I think this one is a bad idea. It's the easiest for simple code
to deal with but it's repeating the major problem with python2's Unicode
handling: mixing unicode and byte types unpredictably.
2) return only byte types in os.environ
- This is conceptually correct but the most annoying option.
Technically we're receiving bytes from the C libraries and the C
libraries expect bytes in return. But in the common case we will be
dealing with things in one encoding so this causes needless effort to
the application programmer in the common case.
3) silently ignore non-decodable values when accessing os.environ['PATH']
as we do now but allow access to the full information via
os.environ[b'PATH'] and os.getenvb().
- This mirrors the practice of os.listdir('.') vs os.listdir(b'.') and
os.getcwd() vs os.getcwdb().
4) raise an exception when non-decodable values are *accessed* and
otherwise behave as in #3. This means that os.environ wouldn't be a simple dict
as it would need to decode the values when keys are accessed (although
it could cache the values).
- This mirrors the practice of open() which is to decode the value for
the common case but throw an exception and allow the programmer to
decide what to do if all values are not decodable.
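As a rough illustration of #4 (with the bytes access from #3), here is a minimal sketch of a mapping that decodes on access, caches the result, and lets the decoding error reach the caller. The class name LazyEnviron and its constructor arguments are hypothetical; this is not how os.environ is implemented.

```python
import sys


class LazyEnviron:
    """Hypothetical sketch of option #4: decode on access, raise on failure."""

    def __init__(self, raw, encoding=None):
        # raw maps bytes names to bytes values, the way the C library
        # hands the environment to the process.
        self._raw = dict(raw)
        self._encoding = encoding or sys.getdefaultencoding()
        self._cache = {}

    def __getitem__(self, key):
        if isinstance(key, bytes):
            # Bytes key: return the undecoded value, as in option #3.
            return self._raw[key]
        if key not in self._cache:
            value = self._raw[key.encode(self._encoding)]  # KeyError if unset
            # A UnicodeDecodeError propagates to the caller instead of the
            # variable silently disappearing.
            self._cache[key] = value.decode(self._encoding)
        return self._cache[key]


env = LazyEnviron({b'HOME': b'/home/badger',
                   b'PATH': b'/bin:/mnt/\x83v\x83\x8d'}, encoding='utf-8')
```

With this, env['HOME'] gives a str, env['PATH'] raises UnicodeDecodeError rather than KeyError, and env[b'PATH'] still gives the programmer the raw bytes to handle however they see fit.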
Either #3 or #4 will solve the major problem and both have precedent in
python3's current implementation. The difference between them is
whether to throw an exception when a non-decodable value is encountered.
Here's why I think that's appropriate:
One of the things I enjoy about python is the informative tracebacks
that make debugging easy. I think that the ease of debugging is lost
when we silently ignore an error. If we look at the difference in
coding and debugging for problems with files that aren't encoded in the
default encoding (where a traceback is issued) and os.listdir() when
filenames aren't in the default encoding (where the filenames are
silently ignored), I think we'll see that::
    #!/usr/bin/python3.0
    # Code with two unicode problems:
    import os, sys
    directory = sys.stdin.readline().strip()
    for filename in os.listdir(directory):
        myfile = open(os.path.join(directory, filename), 'r')
        print('%s: %s' % (os.path.join(directory, filename), myfile.readline()))
        myfile.close()
Let's say I write the above code and test it on a directory that's all
encoded in the default encoding. I release it to the world. Someone
uses it on a system that has files and filenames with mixed encodings.
They immediately get a traceback like this:
  File "./test.py", line 7, in <module>
    print(myfile.readline())
  [...]
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 24-26: invalid data
With that information I can diagnose that my program is failing to read
a line from a file because the file is not written in the default
encoding (utf8 in this case). It points out that myfile on line 7 of
test.py is the file object that has issues. I quickly fix it by doing this:
+unknown_encoded_files = []
 [...]
+    try:
-    print(myfile.readline())
+        print('%s: %s' % (os.path.join(directory, filename),
+                          myfile.readline()))
+    except UnicodeDecodeError:
+        unknown_encoded_files.append(filename)
     myfile.close()
+if unknown_encoded_files:
+    print('These files are not in the default encoding:\n  %s' %
+          '\n  '.join(unknown_encoded_files))
Very simple. The traceback has all the information I need to fix this.
A little later I get another report from that user that my code is
failing to list the first line of all the files in their home directory.
This time there's no traceback to point out which of my files is
failing, just that some files are being ignored. I ask for the list of
files in the directory and get back:
é.txt
ñ.txt
I create those files in a directory and they're processed fine. I tell
the user that and ask if there's anything special about what's in the
files or anything that makes them different. No... they're both text
files on his machine. One was created there, though, and the other was
copied from another machine. Hmm.. do the filenames show up mangled by
any chance? Yes, one of them does but he knows it's correct since it
shows up correctly on his machine at home.
Ah ha! That seems to point at an encoding problem. But where? After
writing a test and perusing my code for a while, I find my os.listdir()
call. directory has to be converted to bytes for this to work. So I
change the code like so:
-for filename in os.listdir(directory):
+for filename in os.listdir(directory.encode()):
 [...]
-        unknown_encoded_files.append(filename)
+        unknown_encoded_files.append(str(filename, errors='replace'))
The code for the fix is simple but the debugging to find the problem is
not. Raising an exception instead of silently failing is much better
for getting code that works correctly.
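Putting both fixes together, the script could look roughly like the sketch below. The function name first_lines, the explicit encoding parameter, and the isfile() check are additions for illustration and are not part of the script above.

```python
import os


def first_lines(directory, encoding='utf-8'):
    """Return ({display_name: first_line}, [names whose content won't decode]).

    Hypothetical rewrite of the example script, for illustration only.
    """
    results, undecodable = {}, []
    bdir = directory.encode(encoding) if isinstance(directory, str) else directory
    # Bytes in, bytes out: os.listdir() does not skip any filename.
    for name in os.listdir(bdir):
        path = os.path.join(bdir, name)
        if not os.path.isfile(path):
            continue
        # Only for display; undecodable bytes become U+FFFD.
        shown = name.decode(encoding, errors='replace')
        try:
            with open(path, 'r', encoding=encoding) as myfile:
                results[shown] = myfile.readline()
        except UnicodeDecodeError:
            # File content is not in the expected encoding.
            undecodable.append(shown)
    return results, undecodable
```

The filename side needed the bytes API to avoid silent loss, while the file content side only needed a try/except because open() already raises.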
The bug report I opened suggests creating a PEP to address this issue.
I think that's a good idea for deciding whether os.listdir() and friends
should be changed to raise an exception. Not having any way to get at
some environment variables, though, seems like it's just a bug that
needs to be addressed. What do other people think on both these issues?
-Toshio