Re: [Python-Dev] [Python-3000] Proposed Python 3.0 schedule

Oct. 7, 2008

On Tue, Oct 7, 2008 at 9:51 AM, James Y Knight <foom@fuhm.net> wrote:
...
On Oct 7, 2008, at 3:47 AM, Martin v. Löwis wrote:
...
...
- Having os.getcwdb isn't much use when you can't even run python in
the first place when the current directory has "bad" bytes in it.
That's not true: it *is* of much use. Python will live in /usr/bin,
which has a nicely-decodable path.
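
For reference, a minimal sketch of the two calls under discussion, assuming the Python 3 `os` module as it was eventually released: `os.getcwd()` returns the current directory as str, and `os.getcwdb()` returns it as raw bytes without any decoding step.

```python
import os

# str form: the directory name has been decoded by Python
cwd_text = os.getcwd()

# bytes form: the directory name exactly as the OS reports it
cwd_bytes = os.getcwdb()

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```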
...
Currently Python outputs:
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Aborted
I can't reproduce that. This happens (for me) when Python lives in
a directory that has an undecodable path - not when the current
directory is undecodable.
Sorry about that; this test was indeed in error. I ran "../python" from an
undecodable current directory, rather than "/full/path/to/python", or
putting python on the PATH and running it as "python". The first does not
work, but the other, more common ways to start it do.
...
...
I'm sure there are even more APIs dealing with pathnames, command line
arguments, or environment variables that ought to be able to handle both
bytes and strings, but currently don't.
Please, no.
I completely and totally agree with your distaste: it's rather gross to allow
bytes-or-str for every API that touches anything like
filenames/argv/environ. That's why I was pushing for the reversible
conversion to str. But if bytes-or-str is the solution that's been chosen
for this issue, it ought to either be fully committed to and implemented, or
at least fully recognized and documented as a half-baked solution.
Of course, if a reversible encoding into str is used instead,
none of these things would need special treatment: they would all work
already.
FWIW: Qt works fine with undecodable filenames, and it too uses unicode
strings everywhere in its API. I looked into what it does, and found that it
uses your (Martin's) original idea for solving this: it stores undecodable
bytes as characters from 0x10fe00 to 0x10feff (which is valid private-use
codespace).  While that might not be ideally correct, since you lose those
256 PUA characters, even that is IMO better than pushing out bytes to every
API, or worse, giving up and just leaving python unable to access files, as
it is now.
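
A minimal Python sketch of the scheme as described above, not a port of Qt's code: each undecodable byte is stored as the character 0x10fe00 + byte, and encoding reverses the mapping. The handler name "pua_escape" and the helpers decode_name/encode_name are made up for illustration.

```python
import codecs

PUA_BASE = 0x10FE00  # start of the 256-character range named above


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


# "pua_escape" is a hypothetical name chosen for this sketch
codecs.register_error("pua_escape", _pua_escape)


def decode_name(raw):
    """bytes -> str; bytes that are not valid UTF-8 become PUA characters."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_name(text):
    """str -> bytes; PUA characters from the range turn back into raw bytes."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if 0 <= offset <= 0xFF:
            out.append(offset)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"       # a Latin-1 file name, not valid UTF-8
name = decode_name(raw)    # an ordinary str that any str-only API can carry
assert encode_name(name) == raw
```

With this in place a str-only API can carry the name unchanged, and the original bytes are recovered only at the point where the OS is called.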
See lines 3074 (QString::toUtf8()) and 3408 (QString::fromUtf8()) of
http://www.google.com/codesearch?q=+show:o7fNK6SzOYs:NO-Bv-AR2rI:toIOngLf1V8&cs_p=http://ie.archive.ubuntu.com/trolltech/pub/qt/snapshots/qt-x11-opensource-src-4.4.0-snapshot-20070402.tar.bz2&cs_f=qt-x11-opensource-src-4.4.0-snapshot-20070402/src/corelib/tools/qstring.cpp
So what does Qt do when given a file name that already uses those PUA characters?
Looks like they get passed through untouched when decoded, but will
get translated into invalid names upon encoding.  So you still have
file names you can't open, and you're incompatible with what other
libraries do.
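
A small Python sketch of that failure mode. It models the behaviour described in this thread (PUA characters pass through decoding, but are turned into single raw bytes on encoding) and has not been checked against Qt itself; encode_name is a hypothetical helper.

```python
PUA_BASE = 0x10FE00  # range used for undecodable bytes in the scheme above


def encode_name(text):
    """str -> bytes; any character in the PUA range becomes one raw byte."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if 0 <= offset <= 0xFF:
            out.append(offset)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A file whose real name already contains a character from the PUA range,
# stored on disk as perfectly valid UTF-8.
on_disk = chr(PUA_BASE + 0xE9).encode("utf-8") + b".txt"

as_text = on_disk.decode("utf-8")   # valid UTF-8, so it passes through untouched
round_trip = encode_name(as_text)   # the PUA character collapses to one byte

print(on_disk, round_trip)
assert round_trip != on_disk        # the original file can no longer be named
```

The bytes that come back are a different name from the one on disk, and in this case they are not even valid UTF-8.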

The only thing going for Qt is that they seem specifically interested
in latin-1, rather than arbitrary bad names.  The latin-1 strings that
would correspond to the UTF-8 PUA used would include at least one
control character, as well as other unusual bits, so it's pretty
unlikely to encounter a real latin-1 file name like that.
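
The control-character claim can be checked with a short Python sketch: take each character in the range, encode it as UTF-8, and read the resulting bytes as latin-1. The helper names are made up for illustration.

```python
import unicodedata

PUA_BASE = 0x10FE00


def latin1_view(code_point):
    """UTF-8 bytes of a code point, reinterpreted as a latin-1 string."""
    return chr(code_point).encode("utf-8").decode("latin-1")


def control_chars(text):
    """Characters in text whose Unicode category is Cc (control)."""
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]


# Smallest number of control characters seen across the whole range
fewest = min(len(control_chars(latin1_view(PUA_BASE + i))) for i in range(256))
print(fewest)

# Bytes a latin-1 file name would need in order to look like 0x10fe00
print(latin1_view(PUA_BASE).encode("latin-1").hex())
```

Every character in the range encodes to four bytes whose second byte is 0x8f, which latin-1 reads as a C1 control character.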

-- 
Adam Olsen, aka Rhamphoryncus