[Python-Dev] Python-3.0, unicode, and os.environ

Steve Holden steve at holdenweb.com
Thu Dec 11 13:13:49 CET 2008


Ulrich Eckhardt wrote:
> On Wednesday 10 December 2008, Adam Olsen wrote:
>> On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt
>> <eckhardt at satorlaser.com> wrote:
>>> On Tuesday 09 December 2008, Adam Olsen wrote:
>>>> The only thing separating this from a bikeshed discussion is that a
>>>> bikeshed has many equally good solutions, while we have no good
>>>> solutions.  Instead we're trying to find the least-bad one.  The
>>>> unicode/bytes separation is pretty close to that.  Adding a warning
>>>> gets even closer.  Adding magic makes it worse.
>>> Well, I see two cases:
>>> 1. Converting from an uncertain representation to a known one.
>>> 2. Converting from a known representation to a known one.
>> Not quite:
>> 1. Using a garbage file name locally (within a single process, not
>> talking to any libs)
>> 2. Using a unicode filename everywhere (libs, saved to config files,
>> displayed to the user, etc.)
> 
> I think there is some misunderstanding. I was referring to conversions and 
> whether it is good to perform them implicitly. For that, I saw the above two 
> cases.
> 
>> On Linux the bytes/unicode separation is perfect for this.  You decide
>> which approach you're using and use it consistently.  If you mess up
>> (mixing bytes and unicode) you'll consistently get an error.
>>
>> We currently don't follow this model on Windows, so a garbage file
>> name gets passed around as if it were unicode, but fails when passed to
>> a lib, saved to a config file, displayed to a user, etc.
> 
> I'm not sure I agree with this. Facts I know are:
> 1. On POSIX systems, there is no reliable encoding for filenames while the 
> system APIs use char/byte strings.
> 2. On MS Windows, the encoding for filenames is Unicode/UTF-16.
> 
> Returning Unicode strings from readdir() is wrong because it can't handle
> case 1 above. Returning byte strings is wrong because it can't handle case 2
> above: it forces useless round trips from UTF-16 to either UTF-8 or,
> worst case, to the locale-dependent MBCS. Returning something different
> depending on the system is also broken because that would make Python code
> that uses this function and assumes a certain type unportable.
> 
> Note that this doesn't get much better if you provide a separate readdirb() 
> API or one that simply returns a byte string or Unicode string depending on 
> its argument. It just shifts the brokenness from readdir() to the code that 
> uses it, unless this code makes a distinction between the target systems. 
> Since way too many programmers are not aware of the problem, they will not 
> handle these systems differently, so code will become non-portable.
> 
> What I'd just like some feedback on is the approach to return a distinct type 
> (neither a byte string nor a Unicode string) from readdir(). In order to use 
> this, a programmer will have to convert it explicitly, otherwise e.g. 
> printing it will just produce <env_string at 0x01234567>. This will
> immediately confront each programmer with the issue of unknown
> encodings and they will have to make the application-specific choice whether
> an approximation of the filename, an exception or ignoring the file is the 
> right choice. Also, it presents the options for doing this conversion in a 
> single class, which I personally find much better than providing overloads 
> for hundreds of functions.
> 
> 
> Sorry for ranting, but I'm a bit confused and desperate, because either I'm 
> unable to explain what I mean or I'm really not understanding something that 
> everybody else here seems to agree upon. I just know that using a distinct 
> path type has helped me in C++ in the past, and I don't see why it shouldn't 
> in Python.
> 
Seems to me this just threatens to add to the confusion.

If you know what your filesystem produces, you can take the appropriate
action to convert it into a type that makes sense to the user. If you
don't, then at least if you have the string in its bytes form you can
re-present it to the filesystem to manipulate the file. What are we
supposed to do with the "special type"?
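
To make the bytes round trip concrete, here is a minimal sketch. It assumes
a Python 3 release recent enough to have os.fsencode() (which did not exist
in 3.0), and the helper names list_raw() and display_name() are hypothetical,
chosen only for illustration. The raw bytes name is what goes back to the
filesystem; the decoded text is only an approximation for the user.

```python
import os
import tempfile


def list_raw(directory):
    """Return directory entries as bytes, exactly as the OS reports them."""
    # Passing a bytes path makes os.listdir() return bytes names.
    return os.listdir(os.fsencode(directory))


def display_name(raw_name, encoding="utf-8"):
    """Best-effort text for showing to a user (hypothetical helper).

    Bytes that cannot be decoded are replaced with U+FFFD, so the result
    is for display only and must not be used to reopen the file.
    """
    return raw_name.decode(encoding, errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    # An ASCII name keeps the demo runnable on any filesystem.
    with open(os.path.join(tmp, "example.txt"), "w") as f:
        f.write("data")

    for raw in list_raw(tmp):
        # Re-present the untouched bytes to the filesystem ...
        full_path = os.path.join(os.fsencode(tmp), raw)
        size = os.stat(full_path).st_size
        # ... and show the user only the decoded approximation.
        print(display_name(raw), size)
```

The point of the split is that a failed or lossy decode affects only what
the user sees, never whether the file can still be opened.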

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/


