[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
James Y Knight
foom at fuhm.net
Thu Oct 2 00:14:50 CEST 2008
On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:
> On approximately 10/1/2008 11:30 AM, came the following characters
> from the keyboard of James Y Knight:
>> BTW, Windows will cheerfully let you create and access files with
>> "garbage surrogates" in their names.
>> Try it yourself:
>>
>> import os
>> open(u"\ud8fd", 'w').close()
>> os.listdir(u'.')
>
> But Windows doesn't have the problem of non-Unicode sequences
> needing to be translated to something else in the first place. So
> this is mostly irrelevant to the problem at hand.
Well... either you consider lone surrogates valid Unicode sequences,
or else Windows *does* have the problem of non-Unicode sequences
needing to be translated to something else.

Currently, the answer is that lone surrogates are treated as valid
Unicode and allowed into Python via the Windows file APIs. Thus,
filename strings in Python are going to contain lone surrogates on
Windows anyway.

Therefore, any external library that freaks out upon seeing a lone
surrogate is already broken for some filenames on Windows.

So, it seems to me, converting invalid UTF-8 sequences into lone
surrogates on Unix doesn't actually add any new form of brokenness.
So why not just do that?
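
To make the proposal concrete: this thread predates it, but later Python 3
releases (3.1 and newer) ship a codec error handler named "surrogateescape"
that behaves along these lines. The sketch below assumes such a Python
version; the filename is invented for illustration.

```python
# A byte-string filename as Unix might return it: Latin-1 bytes,
# which are not valid UTF-8 (hypothetical example name).
raw_name = b"caf\xe9.txt"

# Decode as UTF-8, mapping each undecodable byte 0xNN to the lone
# low surrogate U+DCNN instead of raising an error.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Collect the code points that fall in the surrogate range.
lone_surrogates = [hex(ord(ch)) for ch in decoded if 0xD800 <= ord(ch) <= 0xDFFF]
print(lone_surrogates)  # ['0xdce9']
```

The rest of the name decodes normally; only the invalid byte ends up as a
lone surrogate, the same kind of code unit Windows already lets through.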
>> So, I'm back to favoring the lone surrogate plan over the U+0000
>> plan. But either one seems better than the alternatives.
>
> The original byte string must be preserved for use in actually
> opening files.
Or reversibly transformed.
> How it is displayed is another question. Doing something that works
> for both Unicode display and access to the file is basically
> impossible in all cases. Providing an encapsulation of the byte
> string that has display methods, together with new methods to
> transform the file path, and use parts of it to create other file
> paths, is the solution I described earlier.
This sounds like a fine solution. And it would work just as well with
a UTF-8b base API as with a dual string/byte string base API. The only
difference is the default behavior for people who don't use your
fancy new API. In the UTF-8b case, most things would work, even
with invalidly encoded filenames.
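
A rough sketch of what "most things would work" could look like, using the
"surrogateescape" error handler that later Python 3 releases provide as a
stand-in for UTF-8b. The helper names and the filename are hypothetical, and
posixpath is used only so the result is the same on every platform.

```python
import posixpath


def name_from_bytes(raw: bytes) -> str:
    """Hypothetical helper: bytes from the OS -> str, never failing."""
    return raw.decode("utf-8", errors="surrogateescape")


def name_to_bytes(name: str) -> bytes:
    """Hypothetical helper: str -> the exact bytes the OS expects."""
    return name.encode("utf-8", errors="surrogateescape")


raw_name = b"caf\xe9.txt"  # invalid UTF-8, made-up example
name = name_from_bytes(raw_name)

# Ordinary string manipulation works on the decoded name.
stem, ext = posixpath.splitext(name)
backup_name = stem + ".bak"

# The transformation is reversible, so the original file stays reachable.
print(name_to_bytes(name) == raw_name)  # True
print(name_to_bytes(backup_name))       # b'caf\xe9.bak'
```

"Most" is the operative word: encoding such a name strictly, for example to
print it on a UTF-8 terminal, still raises UnicodeEncodeError, which is where
a display-oriented API like the one described above would come in.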
James