[Python-Dev] Python-3.0, unicode, and os.environ

glyph at divmod.com
Fri Dec 5 04:52:36 CET 2008

On 02:08 am, tjreedy at udel.edu wrote:
>James Y Knight wrote:
>>On Dec 4, 2008, at 6:39 PM, Martin v. Löwis wrote:
>>>I'm in favour of a different, fifth solution:
>>>5) represent all environment variables in Unicode strings,
>>>   including the ones that currently fail to decode.
>>>   (then do the same to file names, then drop the byte-oriented
>>>    file operations again)

>>FWIW, I still agree with Martin that that's the most reasonable 
>>solution.
>FWIW2, I have much the same feeling.

And I still disagree, but I re-read the old thread and didn't see much 
of a clear argument there, so at least I'm not re-treading old ground.

The only strategy that would allow us to encode all inputs as unicode 
(including the invalid ones) is to abuse NUL to mean "ha ha, this isn't 
actually a unicode string, it's something I couldn't decode".  This is 
nice because it allows the type of the returned value to be the same, so 
a Python program that expects a unicode object will be able to 
manipulate this object (as long as it doesn't split it up too close to a 
smuggled NUL).

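To make the idea concrete, here is a minimal sketch of such a marker 
scheme. It is entirely hypothetical: Python ships no "nulescape" error 
handler; the name and the whole convention are invented here to 
illustrate the proposal and its splitting caveat.

```python
import codecs

# Hypothetical "nulescape" error handler (NOT a real Python codec): each
# undecodable byte is smuggled into the str as NUL followed by a
# character whose code point is the raw byte's value.
def _nul_escape(err):
    bad = err.object[err.start:err.end]
    return "".join("\x00" + chr(b) for b in bad), err.end

codecs.register_error("nulescape", _nul_escape)

def smuggle_decode(raw, encoding="utf-8"):
    """bytes -> str; never fails, invalid bytes hide behind NUL markers."""
    return raw.decode(encoding, errors="nulescape")

def smuggle_encode(s, encoding="utf-8"):
    """str -> bytes, recovering any smuggled raw bytes."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "\x00" and i + 1 < len(s):
            out.append(ord(s[i + 1]))      # the smuggled raw byte
            i += 2
        else:
            out.extend(s[i].encode(encoding))
            i += 1
    return bytes(out)

raw = b"caf\xe9"                   # latin-1 'e-acute': not valid UTF-8
s = smuggle_decode(raw)
assert smuggle_encode(s) == raw    # round-trips intact...
# ...but slicing between the marker and its byte silently corrupts it:
assert smuggle_encode(s[:4]) != raw[:4]
```

The round trip works only as long as every consumer treats the string as 
opaque; any code that slices or splits near a marker corrupts the data, 
which is exactly the caveat discussed below.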
It seems to me that this convenient, but clever-clever type distinction 
will inevitably be a bug magnet.  For the most basic example, see the 
caveat above.  But more realistically - not too much code splits 
filenames on anything but "." or os.sep, after all - if you pass this to 
an extension module which then wants to invoke a C library function 
which passes the file name to open() and friends, what is the right 
thing for the extension module to do?  There would need to be a new API 
which could get the "right" bytes out of a unicode string which 
potentially has NULs in it.  This can't just be an encoding, either, 
because you might need to get the Shift-JIS bytes (some foreign system's 
encoding) for some got-NULs-in-it filename even though your locale says 
the encoding is UTF-8.  And what if those bytes happen to be valid 
Shift-JIS?  Decoding bytes makes a lot more sense to me than transcoding 
unicode back into bytes through a guessed encoding.

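A concrete illustration, using only the standard codecs: the same two 
bytes are rejected outright by UTF-8 yet are a perfectly ordinary 
Shift-JIS character, so no amount of cleverness at the transcoding layer 
can detect that it guessed the wrong encoding.

```python
# Two bytes that are invalid UTF-8 but valid Shift-JIS: a "clever"
# encoding layer that guesses wrong produces no error, just mojibake.
raw = b"\x88\xea"
assert raw.decode("shift_jis") == "\u4e00"   # the kanji for "one"
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("invalid as UTF-8, valid as Shift-JIS")
```
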
Filenames and environment variables would all need to be encoded or 
decoded according to this magic encoding.  And what happens if you get 
some garbage data from elsewhere and pass it to a function that 
*generates* a filename?  Now, you get a pleasant error message, 
"TypeError: file() argument 1 must be (encoded string without NULL 
bytes), not str".  In the future, I can only assume (if you're lucky) 
that you'll get some weird thing out of the guts of an encoding module; 
or, more likely, some crazy mojibake filename containing PUA code points 
or whatever will silently get opened.  You can make this less likely 
(and harder to debug in the odd cases where it does happen) by making 
the encoding more clever, but eventually your luck will run out: most 
likely on the computer of somebody who doesn't speak English well enough 
to report the problem clearly.
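
For reference, CPython still refuses NUL at the open() boundary today, 
so a stray marker that leaked this far would fail loudly; the exact 
exception has shifted between versions (recent 3.x raises ValueError 
rather than the TypeError quoted above).

```python
# A filename carrying a stray NUL fails at open() itself; in recent
# CPython this is a ValueError ("embedded null character").
try:
    open("innocent\x00name")
except ValueError as exc:
    print(type(exc).__name__)
```
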

The scenario gets progressively more nightmarish as you start putting 
more libraries into the mix.  You pass some environment variable into 
some library which knows all about unicode and happily handles it 
correctly, but a second library which doesn't know about this proposed 
tricky NUL convention gets the unicode filename and transcodes it 
literally, causing an error return from open().  This puts the apparent 
error very far away from the responsible code.

Ultimately it makes sense to expose the underlying bytes as bytes 
without forcing everyone to pretend that they make sense as anything but 
bytes, and allow different applications to make appropriately educated 
guesses about their character format.  In any case, programmers who 
don't know about these kinds of issues are going to make mistakes in 
handling invalid filenames on UNIXy systems, and some users won't be 
able to open some files.  If there is an easy and straightforward way to 
get the bytes out, it's more likely that programmers who know what they 
are doing will be able to get the correct behavior.
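
This is, in fact, roughly the shape directory listing kept in Python 3: 
pass str, get decoded str names; pass bytes, get the raw bytes the 
kernel actually stores, with no guessing in between.

```python
import os
import tempfile

# str in, str out; bytes in, bytes out -- the caller chooses whether to
# see decoded names or raw bytes.
d = tempfile.mkdtemp()
open(os.path.join(d, "hello.txt"), "w").close()

print(os.listdir(d))               # ['hello.txt']
print(os.listdir(os.fsencode(d)))  # [b'hello.txt']
```
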

Of course, the NUL-encoding trick will make it *possible* to do the 
right thing, but our hypothetically savvy programmer now needs to learn 
about the bytes/unicode distinction between 
windows and mac+linux+everything-else, and Python's special convention for 
invalid data, and how to mix it with encoding/decoding/transcoding, 
rather than just Python's distinct API for the distinct types that may 
represent a filename.  I think this is significantly harder to document 
than just having two parallel APIs (environ, environb, open(str), 
open(bytes), listdir(str), listdir(bytes)) to reflect the very subtle, 
but nevertheless very real, distinction between the Windows and UNIX 
platforms.

This distinct API can still provide the same illusion of "it usually 
works" portability that the encoding convention can: for Windows, 
environb can be the representation of the environment in a particular 
encoding; for UNIX, environ(u) can be all of the variables which 
correctly decode.  And so on for each other API.
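
(As it happens, later Python versions grew exactly this pair on POSIX: 
os.environ is str-keyed, and os.environb, added in 3.2 and available 
only on POSIX builds, exposes the same variables as bytes.)

```python
import os

# os.environ and os.environb (POSIX-only) are two views of the same
# environment; setting through one side is visible through the other.
os.environ["DEMO_VAR"] = "caf\u00e9"
print(os.environb[b"DEMO_VAR"])    # the fsencode()d value,
                                   # e.g. b'caf\xc3\xa9' under UTF-8
```
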

At least this time I think I've encapsulated pretty much my entire 
argument here, so if you don't buy it, we can probably just agree to 
disagree :).
