[Python-Dev] Python-3.0, unicode, and os.environ

Sun Dec 7 02:02:21 CET 2008

Oleg Broytmann wrote:
> My filemanager
> (Midnight Commander, for the matter) shows these files and directories as
> "?????.???", but I can chdir to such directories, and I can open such
> files. It would be a big bad blow for me if filemanagers (or other
> programs) start to filter these filenames.

Summary for those without the time to read the longer version below:
- File managers, backup managers and similar apps should use the binary
APIs worldwide
- Most apps in countries where encoding problems are common will also
need to use the binary APIs to be acceptable to their users
- Many apps in countries where the 'native' encoding is UTF-8, ASCII or
latin-1 will be able to use the Unicode APIs without any issues whatsoever
- Apps targeting a limited, well-controlled execution environment (e.g.
web services) will also be able to use the Unicode APIs
- I think the binary and Unicode APIs should be available (and fully
functional) on all platforms (including Windows) so that app developers
don't create portability problems for themselves when they make the
decision as to which API to use

-------------

The point about *filesystem* apps (i.e. file managers, backup tools,
indexing engines) needing to deal with the imperfect world of dodgy
filesystem encodings isn't in dispute at all - that's why the binary
alternative APIs were added.

The point is that there is a spectrum. At one end is a completely clean
solution that addresses only the ideal case of "file paths and other
items such as environment variable names and values retrieved from the
OS are always well-formed text in the appropriate default encoding"
(which will actually work for large chunks of the planet - those where
the locals are native ASCII speakers and those where computers didn't
start to enter widespread use until after Unicode was already
available). At the other end is a solution that addresses only the most
pessimistic case of "you can't trust the default encoding at all, and
need to assume that all strings retrieved from the OS contain arbitrary
binary data" (which is actually true for some parts of the planet, but
thankfully not for all of it).
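
As a rough illustration of the pessimistic case, here is a minimal
sketch of what happens when a name retrieved from the OS is not valid
in the assumed encoding. The byte string and the try_decode helper are
made up for the example; they are not part of any API discussed in
this thread.

```python
# Hypothetical file name written by a latin-1 application; the byte
# 0xE9 is not a valid UTF-8 sequence when followed by an ASCII ".".
raw_name = b"caf\xe9.txt"


def try_decode(raw, encoding="utf-8"):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw_name)               # decoding fails
as_latin1 = try_decode(raw_name, "latin-1")  # decoding succeeds
```

An application that only ever sees the decoded form has nothing to
work with in the first case, while the raw bytes are still enough to
open or rename the file.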

Hopefully people can at least agree that the first extreme is
unacceptable because that ideal world doesn't exist. I personally think
that the other extreme is *also* unacceptable, because it burdens every
single application developer with a potential problem that quite simply
may not apply to them: in their situation, the naive assumption of a
sane operating environment is actually a valid one for their particular
application.

The idea of parallel Unicode and bytes APIs means that for those with an
appropriately limited target environment and/or audience, the Unicode
APIs will "just work", while the developers that aren't so lucky can
rely on the binary APIs instead.
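
For readers who want to see the parallel APIs side by side, the sketch
below lists the same directory once with a str path and once with a
bytes path. It assumes a current Python 3; os.fsencode was added after
this thread (in 3.2), and the list_names helper and the file name are
made up for the example.

```python
import os
import tempfile


def list_names(directory):
    """Return (text_names, byte_names) for the same directory."""
    # str path in -> str names out (the Unicode API)
    text_names = sorted(os.listdir(directory))
    # bytes path in -> bytes names out (the binary API)
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names


with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name so both views agree.
    open(os.path.join(tmp, "report.txt"), "w").close()
    text_names, byte_names = list_names(tmp)
```

With a well-behaved name like this one the two results differ only in
type, which is the situation where the Unicode API "just works".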

That's actually the one place where I disagree with Guido: I agree with
Adam that the binary APIs *should* be available on Windows.

The difference would be that whereas on *nix-type systems the bytes
APIs are the 'lower level' that more accurately represents the
underlying OS, on Windows it would be the other way around, with the
Unicode APIs as the lower level ones, and the binary APIs as wrappers
around them that automatically decode the bytes representation to a
Unicode one when writing to the OS, and encode from Unicode to bytes
when reading from the OS.
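
A minimal sketch of that wrapper idea, purely for illustration: the
listdir_bytes helper is hypothetical, the choice of
sys.getfilesystemencoding() as the codec is an assumption, and strict
error handling is used for simplicity. It is not how any Python
release actually implements its bytes APIs.

```python
import os
import sys
import tempfile

# Assumption: use the interpreter's filesystem encoding for both directions.
FS_ENCODING = sys.getfilesystemencoding()


def listdir_bytes(path, encoding=FS_ENCODING):
    """Bytes-in, bytes-out listdir layered on top of the Unicode API."""
    text_path = path.decode(encoding)    # bytes -> Unicode when calling the OS
    names = os.listdir(text_path)        # the Unicode API does the real work
    return [name.encode(encoding) for name in names]  # Unicode -> bytes on return


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    result = listdir_bytes(tmp.encode(FS_ENCODING))
```

The caller only ever handles bytes, so code written against this
interface would not need to change between platforms, whichever layer
is the native one underneath.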

If the binary APIs are missing from a major platform (i.e. Windows) then
the choice to use them brings with it a major cross-platform portability
problem that should really be handled by the standard library.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------