[Python-Dev] Python-3.0, unicode, and os.environ

Toshio Kuratomi a.badger at gmail.com
Thu Dec 4 23:51:25 CET 2008


Terry Reedy wrote:
> Toshio Kuratomi wrote:
>> I opened up bug http://bugs.python.org/issue4006 a while ago and it was
>> suggested in the report that it's not a bug but a feature and so I
>> should come here to see about getting the feature changed :-)
> 
> It does you no good (and will irritate others) to conflate 'design
> decision I do not agree with' with 'mistaken documentation or
> implementation of a design decision'.  The former is opinion, the latter
> is usually fact (with occasional border cases).  The latter is what core
> developers mean by 'bug'.
> 
Noted.  However, there's also a difference between "Prevents us from
doing useful things" and "Allows doing a useful thing in a non-trivial
manner".  The latter I would call a difference in design decision and
the former I would call a bug in the design.

>> Currently in python3 there's no way to get at environment variables that
>> are not encoded in the system default encoding.  My understanding is
>> that this isn't a problem on Windows systems but on *nix this is a huge
>> problem.  environment variables on *nix are a sequence of non-null
>> bytes.  These bytes are almost always "characters" but they do not have
>> to be.  Further, there is nothing that requires that the characters be
>> in the same encoding; some of the characters could be in the UTF-8
>> character set while others are in latin-1, shift-jis, or big-5.
> 
> To me, mixing encodings within a string is at least slightly insane.  If
> by design, maybe even a 'design bug' ;-).
> 
As an application level developer I echo your sentiment :-)  I
recognize, though, that *nix filesystem semantics were designed many
years before unicode and the decision to treat filenames, environment
variables, and so much else as bytes follows naturally from the C
definition of a char.  It's up to a higher level than the OS to decide
how to display the bytes.

[shell server and fileserver result in this insane PATH]
>> PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80
>>
> 
> I would think life would be ultimately easier if either the file server
> or the shell server automatically translated file names from jis and
> utf8 and back, so that the PATH on the *nix shell server is entirely
> utf8.

This is not possible because no part of the computer knows what the
encoding is.  To the computer, it's just a sequence of bytes.  Unlike
xml or the windows filesystem (winfs? ntfs?) where the encoding is
specified as part of the document/filesystem there's nothing to tell
what encoding the filenames are in.

>  How would you ever display a mixture to users?

This is up to the application.  My recommendation would be to keep the
raw bytes (to access the file on the filesystem) and display the results
of str(filename, errors='replace') to the user.
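
A minimal sketch of that recommendation, assuming the display side
decodes as UTF-8 (the default for str() on bytes).  The byte string is
the last directory name from the PATH example above; display_name is a
hypothetical helper, not an existing API:

```python
# Raw bytes as the filesystem hands them over; this is the Shift-JIS
# directory name from the PATH example.
raw_name = b"\x83v\x83\x8d\x83O\x83\x89\x83\x80"


def display_name(raw):
    """Return a printable form of a byte filename; never raises."""
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return str(raw, encoding="utf-8", errors="replace")


label = display_name(raw_name)   # for showing to the user only
print(label)

# Keep raw_name for open()/os.listdir(); label is lossy and cannot be
# converted back to the original bytes.
```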

>  What if there
> were an ambiguous component that could be legally decoded more than one
> way?
> 
The ambiguity is the reason that the fileserver and shell server can't
automatically translate the filename: many encodings assign a character
to every one of the 2^8 values a C char can hold, so any byte sequence
decodes successfully in any one of those encodings.  In the application,
using only the raw bytes to access the file also avoids the ambiguity,
because the raw bytes reference exactly one file.
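
To illustrate the ambiguity, here is a small sketch that decodes the
same directory name from the PATH example with three codecs.  The
choice of codecs is mine, purely for illustration; latin-1 and cp437
define all 256 byte values, and shift_jis happens to accept these
particular bytes as well:

```python
raw = b"\x83v\x83\x8d\x83O\x83\x89\x83\x80"

# Illustrative candidates only; nothing on the system says which is right.
candidates = ["shift_jis", "latin-1", "cp437"]

decoded = {enc: raw.decode(enc) for enc in candidates}

for enc, text in decoded.items():
    # Every decode succeeds, and each one yields a different string.
    print(enc, ascii(text))
```

All three calls succeed without error, so a successful decode tells you
nothing about which encoding the creator of the file intended.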

>> Now comes the problematic part.  One of the users on the system wants
>> to write a python3 program that needs to determine if a needed program
>> is in the user's PATH.  He tries to code it like this::
>>
>> #!/usr/bin/python3.0
>>
>> import os
>>
>> for directory in os.environ['PATH'].split(os.pathsep):
>>     programs = os.listdir(directory)
>>
>> That code raises a KeyError because python3 has silently discarded the
>> PATH due to the shift-jis encoded path elements.  Much more importantly,
>> there's no way the programmer can handle the KeyError and actually get
>> the PATH from within python.
> 
> Have you tried os.system or os.popen or the subprocess module to use and
> get a response from a native *nix command?  On Windows
> 
Sure, you can subprocess your way out of a lot of sticky situations
since you're essentially delegating the task to a C routine.  But there
are drawbacks:

* You become dependent on an external program being available.  What
happens if your code is run in a chroot, for instance?
* Do we want anyone writing programs that access the environment on *NIX
to have to discover this pattern themselves and implement it?
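
For reference, the subprocess workaround looks roughly like the sketch
below.  It assumes an external printenv program is on the default
search path (exactly the dependency the first bullet complains about)
and uses subprocess.run, which only exists in later Python 3 releases;
path_via_subprocess is a hypothetical helper name:

```python
import subprocess


def path_via_subprocess():
    """Return PATH as raw bytes via an external printenv, or None."""
    try:
        result = subprocess.run(
            ["printenv", "PATH"],
            stdout=subprocess.PIPE,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # printenv is missing (e.g. in a chroot) or PATH is unset.
        return None
    # stdout is bytes because no text mode was requested.
    return result.stdout.rstrip(b"\n")


raw_path = path_via_subprocess()
directories = raw_path.split(b":") if raw_path is not None else []
```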

As for wrapping this up in os.*, that isn't necessary -- the python3
interpreter already knows about the byte-oriented environment; it just
isn't making it available to people programming in python.
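
For readers on later releases: Python 3.2 and newer expose a bytes view
of the environment as os.environb on POSIX platforms, guarded by
os.supports_bytes_environ.  Neither existed in 3.0 when this was
written, so the following is only a sketch of how the PATH program
above could be written today; path_directories is a hypothetical
helper:

```python
import os


def path_directories():
    """Return the PATH entries as bytes, skipping empty entries."""
    if os.supports_bytes_environ:
        # POSIX: the environment is natively bytes.
        raw = os.environb.get(b"PATH", b"")
    else:
        # e.g. Windows: the environment is natively text.
        raw = os.fsencode(os.environ.get("PATH", ""))
    separator = os.fsencode(os.pathsep)
    return [entry for entry in raw.split(separator) if entry]


for directory in path_directories():
    try:
        programs = os.listdir(directory)   # bytes in, bytes out
    except OSError:
        continue                           # missing or unreadable entry
```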

-Toshio
