On Oct 2, 2008, at 1:53 PM, Guido van Rossum wrote:
we have no choice of coming up with a way of encoding all possible byte sequences into Unicode strings, using a reversible encoding. This has been shown to be hard no matter what encoding you favor -- as soon as those "Unicode" strings are passed on to other libraries or programs nobody is sure how they will be treated.
Indeed; weird encoding heuristics would be unusable in practice, and
don't seem to offer benefits to those building higher-level
portability layers either. I see no future for that approach.
If we switch to the view that all filenames are bytes after all, Windows loses, because not all filenames are representable that way (unless you deviate from the encoding that Windows has chosen for you, which presents other problems). Also, it would be a *huge* project, since filenames are so ubiquitous.
As much as I'd like to say files and paths are bytes ('cause that's
easy), I agree that it doesn't work that way either. Paths are
platform-specific, and Windows and Unix might disagree just for the
principle of the thing for many years to come.
There are a number of ways out, but I don't think we'll be able to come up with a solution without doing a lot of experimentation. Therefore I believe the best thing to do is to release 3.0 with a low-level solution that makes it possible to carry out those experiments.
Agreed. Having it be possible to use whatever the "right" solution is
on each platform is about as good as it gets in the short term.
Getting good, portable abstractions on top of that will take time.
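
For illustration only, here is a minimal sketch of what working at that
low level can look like: the same directory listed once with str names
and once with bytes names. It assumes a modern Python 3 (os.fsencode was
added after 3.0), and the helper name list_names_both_ways is
hypothetical, not part of any library.

```python
import os
import tempfile


def list_names_both_ways(directory):
    """Return (str_names, bytes_names) for the entries in `directory`.

    Hypothetical helper, for illustration only.
    """
    # A str path yields str names, decoded by the interpreter.
    str_names = sorted(os.listdir(directory))
    # A bytes path yields bytes names, as bytes objects.
    bytes_names = sorted(os.listdir(os.fsencode(directory)))
    return str_names, bytes_names


# Try it on a scratch directory holding one plainly named file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    str_names, bytes_names = list_names_both_ways(tmp)
    print(str_names, bytes_names)
```

Which of the two views is the "right" one is exactly the
platform-specific question; the sketch only shows that both are
reachable from the same call.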
That doesn't mean it's not scary when thinking about writing portable
code in this environment. That's not entirely new, but the fact that
so many of these details are being addressed so late in the release
cycle *should* give cause for concern, especially to those of us who
are still a long way from stepping up to current versions.
-Fred
-- Fred Drake <fdrake at acm.org>