[Python-Dev] Python-3.0, unicode, and os.environ

Guido van Rossum guido at python.org
Fri Dec 5 06:14:39 CET 2008


>> On Dec 4, 2008, at 6:39 PM, Martin v. Löwis wrote:
>>> I'm in favour of a different, fifth solution:
>>>
>>> 5) represent all environment variables in Unicode strings,
>>>  including the ones that currently fail to decode.
>>>  (then do the same to file names, then drop the byte-oriented
>>>   file operations again)

> On Thu, Dec 4, 2008 at 6:14 PM, James Y Knight <foom at fuhm.net> wrote:
[...]
>> FWIW, I still agree with Martin that that's the most reasonable solution.

On Thu, Dec 4, 2008 at 6:32 PM, Adam Olsen <rhamph at gmail.com> wrote:
> It died because nobody presented a viable solution, and I maintain no
> solution is possible.  All suggestions involve arbitrary
> transformations that fail to round trip correctly at some point or
> another.  They're simply about shuffling the failure around to
> somewhere the poster happens to like.
>
> Please, if you have a *new* idea that doesn't have a failure mode, by
> all means post it.  But don't resurrect a pointless bikeshed.

I don't like Martin's solution at all. Glyph's message nails the
problem -- the "funny encoding" solution breaks as soon as filenames
get passed to other components, and as that's what Python is often all
about, it's likely to happen all the time.

The simplest example I can think of is a program that prints a
directory listing to stdout -- printing the "funny" encoding to stdout
isn't going to be what users expect. So the program has to be aware of
the possibility of "funny" encoded filenames, and the roundtripping
isn't useful at all.
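
A minimal sketch of that failure mode. The thread does not specify which
"funny" encoding would be used, so Python's "surrogateescape" error handler
stands in for it here as an assumption; the file name and the can_print()
helper are hypothetical.

```python
# A file name that is not valid UTF-8 (Latin-1 encoded "é").
raw_name = b"caf\xe9.txt"

# Stand-in for a "funny" encoding: undecodable bytes are smuggled into
# the str as lone surrogates so the original bytes can be recovered.
funny = raw_name.decode("utf-8", errors="surrogateescape")

# The round trip back to bytes works...
round_trips = funny.encode("utf-8", errors="surrogateescape") == raw_name


def can_print(text, encoding="utf-8"):
    """Return True if text can be encoded for an output stream."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# ...but the resulting str cannot be written to a UTF-8 stdout as-is.
printable = can_print(funny)
```

The round trip only helps a component that knows about the convention; a
component that simply prints the string still has to special-case it.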

At the risk of bringing up something that was already rejected, let me
propose something that follows the path taken in 3.0 for filenames,
rather than doubling back:

For os.environ, os.getenv() and os.putenv(), I think an approach
similar to the one used for os.listdir() and os.getcwd() makes sense:
let os.environ skip variables whose name or value is undecodable, and
have a separate os.environb which contains bytes; let os.getenv() and
os.putenv() do the right thing when the arguments passed in are bytes.
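
A rough sketch of how the proposed split could behave, using a plain dict
of bytes as the raw environment. The names decode_environ() and lookup()
and the fixed UTF-8 encoding are assumptions made for illustration; they
are not existing os APIs.

```python
def decode_environ(raw_env, encoding="utf-8"):
    """Build the str view: skip entries whose name or value won't decode."""
    text_env = {}
    for name, value in raw_env.items():
        try:
            text_env[name.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            continue  # only reachable through the bytes mapping
    return text_env


def lookup(raw_env, key, encoding="utf-8"):
    """Return bytes for a bytes key and str for a str key (None if absent)."""
    if isinstance(key, bytes):
        return raw_env.get(key)
    value = raw_env.get(key.encode(encoding))
    if value is None:
        return None
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return None  # undecodable values are invisible to the str API


# Hypothetical raw environment; LEGACY holds a Latin-1 byte.
raw = {b"HOME": b"/home/guido", b"LEGACY": b"caf\xe9"}
text = decode_environ(raw)
```

Here text contains only HOME, while the raw mapping still holds both
variables, so lookup(raw, b"LEGACY") returns the original bytes.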

For sys.argv, because it's positional, you can't skip undecodable
values, so I propose to use errors='replace' for the decoding; again,
we can add sys.argvb that contains the raw bytes values. The various
os.exec*() and os.spawn*() calls (as well as os.system(), os.popen()
and the subprocess module) should all accept bytes as well as strings.
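
A small sketch of the sys.argv part of the proposal: decoding with the
"replace" error handler keeps every argument in its position, and a parallel
bytes list keeps the original values. The decode_argv() helper, the sample
arguments, and the fixed UTF-8 encoding are assumptions for illustration.

```python
def decode_argv(raw_argv, encoding="utf-8"):
    """Decode each argument; undecodable bytes become U+FFFD."""
    return [arg.decode(encoding, errors="replace") for arg in raw_argv]


# Hypothetical raw command line; the second argument is not valid UTF-8.
raw_argv = [b"prog", b"caf\xe9", b"out.txt"]
argv = decode_argv(raw_argv)

# Positions line up, so argv[i] always corresponds to raw_argv[i].
same_length = len(argv) == len(raw_argv)
```

The decoded value is lossy (the bad byte becomes the replacement
character), which is why the raw bytes list is needed alongside it.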

On Windows, the bytes APIs should probably not exist.

I predict that most developers can get away with not using the bytes
APIs at all. The small minority that needs to be robust when not all
filenames use the system encoding can use the bytes APIs. This would
be developers on various Unix systems except OS X (which uses UTF-8
for its filesystems), and perhaps the occasional developer on OS X
whose app needs to work with files on mounted filesystems that use a
different encoding.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

