[Python-Dev] Python-3.0, unicode, and os.environ
Ulrich Eckhardt
eckhardt at satorlaser.com
Thu Dec 11 10:19:16 CET 2008
On Wednesday 10 December 2008, Adam Olsen wrote:
> On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt <eckhardt at satorlaser.com> wrote:
> > On Tuesday 09 December 2008, Adam Olsen wrote:
> >> The only thing separating this from a bikeshed discussion is that a
> >> bikeshed has many equally good solutions, while we have no good
> >> solutions. Instead we're trying to find the least-bad one. The
> >> unicode/bytes separation is pretty close to that. Adding a warning
> >> gets even closer. Adding magic makes it worse.
> >
> > Well, I see two cases:
> > 1. Converting from an uncertain representation to a known one.
> > 2. Converting from a known representation to a known one.
>
> Not quite:
> 1. Using a garbage file name locally (within a single process, not
> talking to any libs)
> 2. Using a unicode filename everywhere (libs, saved to config files,
> displayed to the user, etc.)
I think there is some misunderstanding. I was referring to conversions and
whether it is good to perform them implicitly. For that, I saw the above two
cases.
> On linux the bytes/unicode separation is perfect for this. You decide
> which approach you're using and use it consistently. If you mess up
> (mixing bytes and unicode) you'll consistently get an error.
>
> We currently don't follow this model on windows, so a garbage file
> name gets passed around as if it was unicode, but fails when passed to
> a lib, saved to a config file, is displayed to a user, etc.
I'm not sure I agree with this. Facts I know are:
1. On POSIX systems, there is no reliable encoding for filenames while the
system APIs use char/byte strings.
2. On MS Windows, the encoding for filenames is Unicode/UTF-16.
Returning Unicode strings from readdir() is wrong because it can't handle
case 1 above. Returning byte strings is wrong because it can't handle case 2
above: it forces useless roundtrips from UTF-16 to either UTF-8 or, worst
case, to the locale-dependent MBCS. Returning something different depending
on the system is also broken because that would make Python code that uses
this function and assumes a certain type unportable.
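A minimal sketch of case 1, assuming a made-up filename whose bytes were written as Latin-1: decoding succeeds or fails depending on which encoding the reader guesses, and nothing in the bytes says which guess is right. The name try_decode is a hypothetical helper, not an existing API.

```python
# Hypothetical filename bytes as a POSIX system might hand them back:
# "café.txt" encoded as Latin-1, which is not valid UTF-8.
raw_name = b"caf\xe9.txt"


def try_decode(name, encoding):
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE9 announces a multi-byte UTF-8 sequence, but "." is not a continuation byte.
as_utf8 = try_decode(raw_name, "utf-8")

# Latin-1 maps every byte to a character, so this always "works",
# but the result is only meaningful if the guess matches the writer's encoding.
as_latin1 = try_decode(raw_name, "latin-1")
```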
Note that this doesn't get much better if you provide a separate readdirb()
API or one that simply returns a byte string or Unicode string depending on
its argument. It just shifts the brokenness from readdir() to the code that
uses it, unless this code makes a distinction between the target systems.
Since way too many programmers are not aware of the problem, they will not
handle these systems differently, so code will become non-portable.
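As an illustration of an API whose return type follows its argument, os.listdir in Python 3 returns str names for a str path and bytes names for a bytes path. The sketch below uses a temporary directory and a hypothetical helper has_txt_suffix that silently assumes str, to show how the type question moves into the calling code.

```python
import os
import tempfile


def has_txt_suffix(name):
    """Hypothetical caller code that assumes it is given a str name."""
    return name.endswith(".txt")


with tempfile.TemporaryDirectory() as tmp:
    # Create one file with an ASCII-only name so both listings succeed.
    open(os.path.join(tmp, "example.txt"), "w").close()

    names_str = os.listdir(tmp)                 # str path in, str names out
    names_bytes = os.listdir(os.fsencode(tmp))  # bytes path in, bytes names out

# The helper works for the str listing ...
str_ok = has_txt_suffix(names_str[0])

# ... but breaks on the bytes listing, because bytes.endswith() rejects a str suffix.
try:
    has_txt_suffix(names_bytes[0])
    bytes_failed = False
except TypeError:
    bytes_failed = True
```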
What I'd just like some feedback on is the approach to return a distinct type
(neither a byte string nor a Unicode string) from readdir(). In order to use
this, a programmer will have to convert it explicitly, otherwise e.g.
printing it will just produce <env_string at 0x01234567>. This will
immediately confront every programmer with the issue of unknown encodings,
and they will have to make the application-specific choice whether an
approximation of the filename, an exception or ignoring the file is the right
response. Also, it presents the options for doing this conversion in a
single class, which I personally find much better than providing overloads
for hundreds of functions.
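A rough sketch of what such a distinct type could look like, purely as an assumption for discussion: the class name EnvString and all of its method names are hypothetical and do not exist in Python. It wraps the raw bytes, offers no implicit text conversion, and gathers the three choices mentioned above (exception, approximation, ignoring) in one place.

```python
class EnvString:
    """Hypothetical opaque wrapper around system-provided bytes (illustrative only)."""

    __slots__ = ("_raw",)

    def __init__(self, raw):
        if not isinstance(raw, bytes):
            raise TypeError("EnvString wraps raw bytes")
        self._raw = raw

    def __repr__(self):
        # Deliberately uninformative: printing does not guess an encoding.
        return "<env_string at 0x%08x>" % id(self)

    def to_bytes(self):
        """Return the untouched bytes, e.g. for passing back to the system."""
        return self._raw

    def decode_strict(self, encoding):
        """Raise UnicodeDecodeError if the bytes are invalid in this encoding."""
        return self._raw.decode(encoding)

    def decode_approx(self, encoding):
        """Return an approximation, replacing undecodable bytes with U+FFFD."""
        return self._raw.decode(encoding, errors="replace")

    def decode_or_none(self, encoding):
        """Return None so the caller can skip names that do not decode."""
        try:
            return self._raw.decode(encoding)
        except UnicodeDecodeError:
            return None


name = EnvString(b"caf\xe9.txt")   # made-up Latin-1 bytes, invalid as UTF-8
shown = repr(name)                 # opaque marker, no decoded text
approx = name.decode_approx("utf-8")
skipped = name.decode_or_none("utf-8")
```

Because str(name) falls back to the opaque repr, code that prints or concatenates such a value without choosing a conversion shows the problem immediately instead of on the first odd filename.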
Sorry for ranting, but I'm a bit confused and desperate, because either I'm
unable to explain what I mean or I'm really not understanding something that
everybody else here seems to agree upon. I just know that using a distinct
path type has helped me in C++ in the past, and I don't see why it shouldn't
in Python.
Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932