[Python-Dev] Python-3.0, unicode, and os.environ
victor.stinner at haypocalc.com
Fri Dec 5 11:18:48 CET 2008
On Thursday 04 December 2008 21:02:19, Toshio Kuratomi wrote:
> I opened up bug http://bugs.python.org/issue4006 a while ago and it was
> suggested in the report that it's not a bug but a feature and so I
> should come here to see about getting the feature changed :-)
Yeah, I prefer to discuss such changes on the mailing list.
> These mixed encodings can occur for a variety of reasons. Here's an
> example that isn't too contrived :-)
> Furthermore, they don't want to suffer from the space loss of using
> utf-8 to encode Japanese so they use shift-jis everywhere.
"space loss"? Really? If you configure your server correctly, you should get
UTF-8 even if the file system is Shift-JIS. But it would be much easier to
use UTF-8 everywhere.
Hum... I don't think that the discussion is about one specific server, but
about the lack of bytes environment variables in Python3 :-)
> 1) return mixed unicode and byte types in ...
> 2) return only byte types in os.environ
Hum... Most users have UTF-8 everywhere (e.g. all Windows users ;-)), and
Python3 already uses Unicode everywhere (input(), open(), filenames, ...).
> 3) silently ignore non-decodable value when accessing os.environ['PATH']
> as we do now but allow access to the full information via
> os.environ[b'PATH'] and os.getenvb()
I don't like os.environ[b'PATH']. I prefer to always get the same result
type... But os.listdir() doesn't respect that :-(
os.listdir(str) -> list of str
os.listdir(bytes) -> list of bytes
I would prefer a similar API for easier migration from Python2/Python3
(unicode). os.environb sounds like the best choice for me.
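To make the os.listdir() behaviour above concrete, here is a minimal sketch. The temporary directory and the file name demo.txt are assumptions made up for the illustration, nothing more:

```python
import os
import tempfile


def listdir_both(path_str):
    """Call os.listdir() with a str path and with the same path as bytes."""
    as_str = os.listdir(path_str)                  # str argument -> list of str
    as_bytes = os.listdir(os.fsencode(path_str))   # bytes argument -> list of bytes
    return as_str, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    # "demo.txt" is a hypothetical file name used only for this example.
    open(os.path.join(tmp, "demo.txt"), "w").close()
    names_str, names_bytes = listdir_both(tmp)
    print(names_str, names_bytes)
```

The result type follows the argument type, which is the property a separate os.environb would mirror for environment variables.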
But there are open questions (already asked in the bug tracker):
(a) Should os.environ be updated if os.environb is changed? If yes, how?
os.environb['PATH'] = '\xff' (or any invalid string in the system encoding)
=> os.environ['PATH'] = ???
(b) Should os.environb be updated if os.environ is changed? If yes, how?
The problem comes with non-Unicode locales (e.g. latin-1 or ASCII): most
charsets are unable to encode the whole Unicode charset (e.g. codes >= 65535).
os.environ['PATH'] = chr(0x10000)
=> os.environb['PATH'] = ???
(c) Same question when a key is deleted (del os.environ['PATH']).
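Questions (a) and (b) come down to conversions that can fail in either direction. The sketch below only demonstrates the failure, it does not propose an answer. The helper names try_decode and try_encode are hypothetical, and "utf-8" and "latin-1" are assumed stand-ins for the system encoding:

```python
def try_decode(raw, encoding):
    """Return raw decoded to str, or None if raw is invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def try_encode(text, encoding):
    """Return text encoded to bytes, or None if `encoding` cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Question (a): a bytes value with no str equivalent in a UTF-8 locale.
print(try_decode(b"\xff", "utf-8"))
# Question (b): a str value with no bytes equivalent in a latin-1 locale.
print(try_encode(chr(0x10000), "latin-1"))
# A value that is valid on both sides converts without trouble.
print(try_decode(b"/usr/bin", "utf-8"))
```

Whatever the two mappings do when a helper like these returns None is exactly the "???" in the questions above.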
If Python 3.1 has both os.environ and os.environb, I'm quite sure that some
modules will use os.environ and others will prefer os.environb. If the two
environments are different, the two sets of modules will work differently :-/
It would maybe be easier if os.environ supported both bytes and unicode keys.
But we have to keep these assertions:
os.environ[bytes] -> bytes
os.environ[str] -> str
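A rough sketch of a single mapping that keeps both assertions, just to show the idea. MixedKeyEnviron is a hypothetical class (not an existing API), the encoding is assumed to be UTF-8, and hiding non-decodable values from str lookups is only one possible choice:

```python
class MixedKeyEnviron:
    """Hypothetical mapping: bytes keys return bytes, str keys return str."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = dict(raw)       # bytes -> bytes, as the OS provides them
        self._encoding = encoding   # assumed system encoding

    def __getitem__(self, key):
        if isinstance(key, bytes):
            return self._raw[key]   # bytes in, bytes out, never decoded
        try:
            raw_value = self._raw[key.encode(self._encoding)]
            return raw_value.decode(self._encoding)
        except UnicodeError:
            # Non-decodable entries are invisible to str lookups.
            raise KeyError(key) from None


# Made-up environment: one ordinary value and one non-decodable value.
env = MixedKeyEnviron({b"PATH": b"/usr/bin", b"EVIL": b"\xff"})
print(env["PATH"], env[b"PATH"], env[b"EVIL"])
```

Here env["EVIL"] raises KeyError while env[b"EVIL"] still gives the raw bytes, so the result type always matches the key type.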
> 4) raise an exception when non-decodable values are *accessed* and
> continue as in #3.
I like os.listdir() behaviour: just *ignore* non-decodable files. If you
really want to access these files, use a bytes directory name ;-)
> I think that the ease of debugging is lost when we silently ignore an error.
Guido gave a good example. Suppose your directory contains a non-decodable
filename (e.g. "???.txt"): if an exception were raised, glob('*.py') would
fail because of the evil filename. With the current behaviour, you're unable
to list all files, but glob('*.py') will list all Python scripts!
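The "ignore" behaviour in Guido's example can be imitated in a few lines. This is only a sketch: decodable_names is a hypothetical helper, the byte strings are a made-up directory listing, and UTF-8 is assumed as the file system encoding:

```python
import fnmatch


def decodable_names(raw_names, encoding="utf-8"):
    """Decode file names, silently skipping the ones that cannot be decoded."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # ignore the "evil" file name instead of failing
    return names


# Made-up listing; the second entry is not valid UTF-8.
raw_listing = [b"setup.py", b"\xff\xfe\xfd.txt", b"README", b"test.py"]
scripts = fnmatch.filter(decodable_names(raw_listing), "*.py")
print(scripts)
```

The evil file is missing from the str listing, but the search for Python scripts still succeeds.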
And now that Python3 is released, it may be a bad idea to change the
behaviour (of os.environ) in Python 3.1 :-/
> The bug report I opened suggests creating a PEP to address this issue.
Please try to answer my questions about os.environ and os.environb.
I also like bytes environment variables. I need them for my fuzzing program.
The lack of bytes variables is a regression from Python2 (for my program). On
UNIX, filenames are bytes and the environment variables are bytes. For the
best interoperability, Python3 should support bytes. But the default choice
should always be characters (unicode), and bytes and str should never be mixed.
As usual, it goes faster if someone writes a patch :-) I could try to work on
Victor Stinner aka haypo