[Python-Dev] Python-3.0, unicode, and os.environ

Fri Dec 5 18:59:52 CET 2008

On Fri, Dec 5, 2008 at 2:27 AM, Ulrich Eckhardt <eckhardt at satorlaser.com> wrote:
> Seriously, what would you suggest to someone that
> wants to handle paths in a portable way? Using the Unicode variants of
> functions is fubar, because encoding/decoding is not universally possible.
> Using the byte variant is equally fubar, because e.g. on MS Windows it is not
> supported, except through a very lossy roundtrip through the locale's
> codepage, limiting your functionality.

Write a lightweight abstraction layer that uses Unicode when possible
and bytes otherwise. You'd need to write a few functions for the path
handling code you need, with a platform check or two sprinkled in.
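
A minimal sketch of what such a layer might look like, assuming a Python
version that provides os.fsencode() (added after this thread was written).
The names decode_if_possible() and list_names() are hypothetical and are
not part of any library:

```python
import os
import sys


def decode_if_possible(raw, encoding=None):
    """Return str when raw decodes cleanly, else the original bytes."""
    if encoding is None:
        # Assumption: the filesystem encoding is the right one to try.
        encoding = sys.getfilesystemencoding()
    try:
        return raw.decode(encoding)  # strict decoding, no replacement
    except UnicodeDecodeError:
        return raw  # keep the undecodable name usable as bytes


def list_names(directory):
    """List a directory, using Unicode when possible and bytes otherwise."""
    if sys.platform == "win32":
        # Platform check: Windows file names are natively Unicode.
        return os.listdir(directory)
    # Elsewhere, ask for raw bytes and decode each name individually.
    raw_names = os.listdir(os.fsencode(directory))
    return [decode_if_possible(name) for name in raw_names]
```

Callers then have to accept a mix of str and bytes names, which is
workable inside one application but is exactly what makes a general
solution hard.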

Writing such an abstraction for the purpose of one specific
application is usually simple enough. However, writing a similar
abstraction that serves all apps and all use cases is hard. I hope
that eventually someone will come up with one though -- the failure of
earlier path object proposals notwithstanding.

> I actually think it is about time to give up on trying to think about a path
> as a string. Ditto for data received from os.environ or sys.argv. There are
> only very few things that are universal to them and a reliable encoding is
> none of them. Then, once you have let that idea go, meditate a bit over the
> Zen.

This sounds too pessimistic to me. I expect that in five years it will
be universally accepted that these variables must be encoded in a
standard encoding. People are never going to give up thinking about
filenames etc. as strings, because that's what they are conceptually.
The problem is purely one of encoding, and that's where Unix/Linux are
behind the curve, since (so far) they haven't taken the plunge and
picked a universal standard encoding, the way Windows and Mac OS X
have done.

> What I propose is that paths must be treated as OS-specific, with the only
> common reliable operations being joining them, concatenating them and
> splitting them into segments divided by the (again, OS-specific) separator.
> Other operations, like e.g. appending a string or converting it to a string
> in order to display it can fail. And if they fail, they should fail noisily.

That's bad though, since filenames are being displayed all the time
(e.g. in error messages).

> In 99% of all cases, using the default encoding will work and do what people
> expect, which is why I would make this conversion automatic. In all other
> cases, it will at least not fail silently (which would lead to garbage and
> data loss) and allow more sophisticated applications to handle it.

I don't think "always fail noisily" is the best approach.
E.g. if I am globbing for *.py, and there's an undecodable .txt file
in a directory, its presence shouldn't cause the glob to fail.
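
One way to get that behavior, sketched under the assumption that the
directory listing is available as bytes: match the pattern on the raw
names first, and decode only the names that matched. glob_names() is a
hypothetical helper, and UTF-8 is an assumed default encoding:

```python
import fnmatch


def glob_names(raw_names, pattern, encoding="utf-8"):
    """Match bytes names against a text pattern without decoding them all."""
    raw_pattern = pattern.encode(encoding)
    matches = []
    for raw in raw_names:
        # Compare bytes with bytes, so undecodable names cannot raise here.
        if not fnmatch.fnmatchcase(raw, raw_pattern):
            continue
        try:
            matches.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # A matching but undecodable name is returned as bytes.
            matches.append(raw)
    return matches


names = [b"a.py", b"\xff\xfe.txt", b"b.txt"]
print(glob_names(names, "*.py"))  # the undecodable .txt entry is ignored
```

Whether a matching but undecodable name should be returned as bytes,
skipped, or reported is a policy decision; the sketch picks the first.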

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)