[Python-Dev] a suggestion ... Re: PEP 383 (again)

"Martin v. Löwis" martin at v.loewis.de
Thu Apr 30 17:35:19 CEST 2009


>     What's an analogous failure? Or, rather, why would a failure analogous
>     to the one I got when using System.IO.DirectoryInfo ever exist in
>     Python?
> 
> 
> Mono.Unix uses an encoder and a decoder that knows about special quoting
> rules.  System.IO uses a different encoder and decoder because it's a
> reimplementation of a Microsoft library and the Mono developers chose
> not to implement Mono.Unix quoting rules in it.  There is nothing
> technical preventing System.IO from using the Mono.Unix codec, it's just
> that the developers didn't want to change the behavior of an ECMA and
> Microsoft library.
> 
> The analogous phenomenon will exist in Python with PEP 383.  Let's say I
> have a C library with wide character interfaces and I pass it a unicode
> string from Python.(*)  That C library now turns that unicode string
> into UTF-8 for writing to disk using its internal UTF-8 converter.

What specific library do you have in mind? Would it always use UTF-8?
If so, it will fail in many other ways as well, namely whenever the
locale charset is different from UTF-8.

I fail to see the analogy. In Python, the standard library works,
and the extension fails; in Mono, it's actually vice versa, and not
at all analogous.

> So, I don't see any reason to prefer your half surrogate quoting to the
> Mono U+0000-based quoting.  Both seem to achieve the same goal with
> respect to round tripping file names, displaying them, etc., but Mono
> quoting actually results in valid unicode strings.  It works because
> null is the one character that's not legal in a UNIX path name.
> 
> So, why do you prefer half surrogate coding to U+0000 quoting?

If I pass a string with an embedded U+0000 to gtk, gtk will truncate
the string, and stop rendering it at this character. This is worse than
what it does for invalid UTF-8 sequences. Chances are fairly high that
other C libraries will fail in the same way, in particular if they
expect char* (which is very common in C).

So I prefer the half surrogate because its failure mode is better than
that of U+0000 quoting.
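
To make the two failure modes concrete, here is a minimal sketch. It
assumes the PEP 383 error handler under the name Python 3 gives it,
"surrogateescape", and it uses ctypes.c_char_p as a stand-in for any C
API that takes a NUL-terminated char*. The NUL-quoted string is only an
illustration of an embedded U+0000; it is not Mono's exact quoting
scheme.

```python
import ctypes

# A file name as raw bytes: Latin-1 encoded, so not valid UTF-8.
raw = b"caf\xe9.txt"

# Half surrogate quoting: the undecodable byte 0xE9 becomes U+DCE9,
# and encoding with the same handler restores the original bytes.
half_surrogate = raw.decode("utf-8", "surrogateescape")
round_tripped = half_surrogate.encode("utf-8", "surrogateescape")

# A strict UTF-8 encoder refuses the lone surrogate, so a consumer
# that is unaware of the quoting fails loudly instead of silently.
try:
    half_surrogate.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Illustrative U+0000-based quoting (assumed form, not Mono's scheme).
nul_quoted = "caf\x00e9.txt"

# Anything that reads the buffer as a C string stops at the first NUL.
seen_by_c = ctypes.c_char_p(nul_quoted.encode("utf-8")).value

print(round_tripped == raw, strict_encode_failed, seen_by_c)
```

In this sketch the half surrogate string either round-trips or raises,
while the NUL-quoted string is silently cut short by the char* view.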

> (*) There's actually a second, subtle issue.  PEP 383 intends utf-8b
> only to be used for file names.  But that means that I might have to
> bind the first argument to TIFFOpen with utf-8b conversion, while I
> might have to bind other arguments with utf-8 conversion.

I couldn't find a Python wrapper for libtiff. If a wrapper were written,
it would indeed have to use the file system encoding for the file name
parameters. However, it would have to do that even without PEP 383,
since the file name should be encoded in the locale's encoding, not
in UTF-8, anyway.
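
As a rough sketch of what such a wrapper would do with its arguments:
the helper name prepare_tiff_args and its description parameter are
hypothetical, no libtiff call is made, and os.fsencode/os.fsdecode are
the Python 3 functions that apply the file system encoding. The example
assumes a POSIX system, where those functions use the half surrogate
handler.

```python
import os

def prepare_tiff_args(filename, description):
    """Hypothetical helper: encode arguments for a char*-based C API.

    The file name goes through the file system encoding; other text
    (here an assumed free-form description) is encoded as UTF-8.
    """
    return os.fsencode(filename), description.encode("utf-8")

# A name whose bytes are not decodable in the locale still round-trips
# (assuming a POSIX system).
original_bytes = b"scan-\xff.tif"
name = os.fsdecode(original_bytes)
encoded_name, encoded_description = prepare_tiff_args(name, "page 1")

print(encoded_name == original_bytes, encoded_description)
```

The point of the sketch is only that the file name parameter and the
other string parameters are bound with different conversions.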

Regards,
Martin

