[Python-Dev] Python-3.0, unicode, and os.environ

Thu Dec 4 22:51:52 CET 2008

Toshio Kuratomi wrote:
> I opened up bug http://bugs.python.org/issue4006 a while ago and it was
> suggested in the report that it's not a bug but a feature and so I
> should come here to see about getting the feature changed :-)

It does you no good (and will irritate others) to conflate 'design 
decision I do not agree with' with 'mistaken documentation or 
implementation of a design decision'.  The former is opinion, the latter 
is usually fact (with occasional border cases).  The latter is what core 
developers mean by 'bug'.

> Currently in python3 there's no way to get at environment variables that
> are not encoded in the system default encoding.  My understanding is
> that this isn't a problem on Windows systems but on *nix this is a huge
> problem.  environment variables on *nix are a sequence of non-null
> bytes.  These bytes are almost always "characters" but they do not have
> to be.  Further, there is nothing that requires that the characters be
> in the same encoding; some of the characters could be in the UTF-8
> character set while others are in latin-1, shift-jis, or big-5.

To me, mixing encodings within a string is at least slightly insane.  If 
by design, maybe even a 'design bug' ;-).

> These mixed encodings can occur for a variety of reasons.  Here's an
> example that isn't too contrived :-)
> 
> Swallow is a multi-user shell server hosted at a university in Japan.
> The OS installed is Fedora 10 where the encoding of all filenames
> provided by the OS is UTF-8.  The administrator of the OS has kept this
> convention and, among other things, has created a directory to mount an
> NFS directory from another computer.  He calls that "ネットワーク"
> ("network" in Japanese).  Since it's utf-8, that gets put on the
> filesystem as
> '\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'
> 
> Now the administrators of the fileserver have been maintaining it since
> before Unicode was invented.  Furthermore, they don't want to suffer
> from the space loss of using utf-8 to encode Japanese so they use
> shift-jis everywhere.  They have a directory on the nfs share for
> programs that are useful for people on the shell server to access.  It's
> called "プログラム" ("programs" in Japanese).  Since they're using
> shift-jis, the bytes on the filesystem are:
> '\x83v\x83\x8d\x83O\x83\x89\x83\x80'
> 
> The system administrator of the shell server adds the directory of
> programs to all his users' default PATH variables, so they have this:
> 
> PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

I would think life would be ultimately easier if either the file server 
or the shell server automatically translated file names between 
shift-jis and utf-8, so that the PATH on the *nix shell server is 
entirely utf-8.  How would you ever display a mixture to users?  What if there 
were an ambiguous component that could be legally decoded more than one way?
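
To make the ambiguity concrete, here is a minimal sketch using the two byte
strings from the quoted example.  The helper name try_decode is made up for
illustration.  It only shows that the same bytes can be rejected by one
codec, decoded as intended by a second, and accepted with the wrong meaning
by a third.

```python
# Byte strings copied from the example quoted in this message.
utf8_part = b'\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'
sjis_part = b'\x83v\x83\x8d\x83O\x83\x89\x83\x80'

def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# The shift-jis directory name is not valid UTF-8, so this gives None.
as_utf8 = try_decode(sjis_part, 'utf-8')
# Decoded with the codec the file server actually used.
as_sjis = try_decode(sjis_part, 'shift_jis')
# latin-1 accepts any byte sequence, so this "succeeds" with different text.
as_latin1 = try_decode(sjis_part, 'latin-1')
```

Nothing in the bytes themselves says which of the successful decodings is
the right one; that knowledge has to come from outside the string.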

> Now comes the problematic part.  One of the users on the system wants
> to write a python3 program that needs to determine if a needed program
> is in the user's PATH.  He tries to code it like this::
> 
> #!/usr/bin/python3.0
> 
> import os
> 
> for directory in os.environ['PATH'].split(os.pathsep):
>     programs = os.listdir(directory)
> 
> That code raises a KeyError because python3 has silently discarded the
> PATH due to the shift-jis encoded path elements.  Much more importantly,
> there's no way the programmer can handle the KeyError and actually get
> the PATH from within python.

Have you tried os.system, os.popen, or the subprocess module to run a 
native *nix command and read its response?  On Windows:

 >>> import subprocess as sp
 >>> s=sp.Popen('path', shell=True, stdout=sp.PIPE)
 >>> s.stdout.read()
b'PATH=C:\\temp\\WatconPermanent\\binnt;C:\\temp\\WatconPermanent\\binw;C:\\WINDOWS\\System32;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\Program Files\\PC-Doctor for Windows\\services;C:\\Program Files\\ATI Technologies\\ATI.ACE\\Core-Static;C:\\Program Files\\Python25;C:\\Program Files\\QuickTime\\QTSystem\\\r\n'

There are the bytes.  This took me 10 minutes and a few mistakes as a 
first-time subprocess user.

Another 10 minutes and I figured out how to get the entire environment 
as bytes *and* convert them to a dict.  This is a bit trickier:

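import subprocess as sp

s = sp.Popen('set', shell=True, stdout=sp.PIPE)  # 'set' with no arguments prints the whole environment
e1 = s.stdout.read()
e2 = e1.split(b'\r\n')
e2.pop()  # get rid of trailing b'' from trailing '\r\n'
e3 = [i.split(b'=', 1) for i in e2]  # split on the first '=' only; values may contain '='
env = dict(e3)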

Whether either of these should be wrapped in os, I'll leave for others 
to discuss and decide, but if you can do the same in *nix, you should be 
able to do what you need to for now.
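
For the *nix side, a rough and untested-on-your-setup sketch of the same
idea might look like the following.  It assumes an 'env' command on PATH
that prints one NAME=value pair per line; the function names
parse_env_output and read_env_bytes are made up for illustration.  A value
that itself contains a newline would be mis-parsed, so treat this as a
stopgap rather than a complete answer.

```python
import subprocess

def parse_env_output(raw, line_sep=b'\n'):
    """Turn the bytes printed by an 'env'-style command into a bytes dict."""
    env = {}
    for line in raw.split(line_sep):
        if b'=' not in line:
            continue  # skips the empty trailing piece and lines without NAME=
        name, value = line.split(b'=', 1)  # values may themselves contain '='
        env[name] = value
    return env

def read_env_bytes():
    """Assumption: 'env' is on PATH and prints NAME=value lines as raw bytes."""
    raw = subprocess.Popen(['env'], stdout=subprocess.PIPE).communicate()[0]
    return parse_env_output(raw)
```

With that, read_env_bytes()[b'PATH'].split(b':') would give the PATH
entries as bytes, whatever encoding each one happens to be in.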

Terry Jan Reedy