[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Sat Oct 4 01:54:06 CEST 2008

On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 2:36 PM, came the following characters from the
> keyboard of Adam Olsen:
>>
>> UTF-8b produces an *invalid* unicode sequence, via lone surrogates.  Any
>> attempt to encode or decode using a validating UTF-8 (or
>> UTF-16/UTF-32) codec would reject them, which is why they can
>> unambiguously be used.
>>
>> In other words, it's not unicode (despite a resemblance), so it's easy
>> to be 1-to-1.
>
> Sort of.  There is no numerical reason they cannot be represented in a
> UTF-8-like numeric encoding scheme.  It is only rules and regulations that
> prevent it.  So FOOTRUTF8 can exist, just not legally.  If the expectation
> is that an illegal UTF-16 code can be used, to permit the UTF-8b translation
> scheme to work at all, then it seems reasonable to expect that an illegal
> translation of it to UTF-8 might happen also, which means that the
> transformation isn't 1-to-1!

No, UTF-8b can't be translated to UTF-8.  It's illegal.
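
A minimal sketch of that point. It assumes a Python that has an error
handler named "surrogateescape" which behaves like UTF-8b; the handler
and the example filename are assumptions for illustration, nothing in
2.6 or 3.0 provides them:

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# UTF-8b-style decode: the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9, so the result is not valid unicode.
name = raw.decode("utf-8", "surrogateescape")

# The mapping is 1-to-1: the same handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")

# A validating UTF-8 encoder refuses the lone surrogate outright.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The strict encoder never emits bytes for the escaped name, so there is
no second UTF-8 spelling that could collide with the original.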

> I think someone demonstrated the use of unpaired surrogates in the Windows
> filename context the other day.  Whether that is a bug or not, it is the
> current state of affairs, someone might read a name from Windows and want to
> create it on Posix... what happens?  If we implement UTF-8b, I know what
> would happen.  But what would happen if we don't, today, on a Posix Python
> 3?  Would it use FOOTRUTF8 or would it generate an error?  I don't suppose
> it matters a lot, it is stupidity to use such names whether or not the
> prevention of it is enforced.

If Python worked properly?  The illegal unicode object would get an
encoding error when you tried to translate to UTF-8 to send it over to
the Posix box.  You'd have to alter all the software that touches it to
use your looks-like-but-isn't-quite-unicode, rather than using the
real unicode.

That's why I favour validating the Windows API too, and making the raw
API be the raw UTF-16 (rather than letting it get encoded into a
single-byte encoding).  The rawness is what bytes need, not ASCII
similarity.
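
Roughly what I mean, as a sketch only.  The filename is made up, and
the "surrogatepass" handler is an assumption standing in for "hand me
the raw 16-bit code units"; it is not an existing Windows-facing API:

```python
# Hypothetical Windows filename containing an unpaired high surrogate.
win_name = "bad\ud800name"

# A validating UTF-16 codec rejects it...
try:
    win_name.encode("utf-16-le")
    valid = True
except UnicodeEncodeError:
    valid = False

# ...while the raw 16-bit code units survive unchanged as bytes.
raw_units = win_name.encode("utf-16-le", "surrogatepass")
restored = raw_units.decode("utf-16-le", "surrogatepass")
```

The validated path fails loudly, and the raw path keeps every code
unit without pretending the name is text.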

> But if someone on Posix is creating non-Python software that uses illegal
> lone surrogates, illegally UTF-8 coding them to create the file, and then
> giving them to a Python program to manipulate the content, things could get
> confused, if UTF-8b translations happen under the Python covers... the
> Python program would attempt to open a different file than the non-Python
> software created.

No, they can't illegally use UTF-8.  It's not UTF-8, period.  It's just garbage.

> Seems like attempts to manipulate and transform names are doomed to failure;
> the approach of having a bytes level interface seems to be the correct one,
> glad that seems to be the approach that Victor is implementing and Guido is
> favoring, although it is a pity that it can't be fully encapsulated into an
> object in time for 3.0, leaving us with multiple APIs for file access, and a
> potential future translation to an encapsulated object approach.

The bytes object covers 90% of the raw usage.  The other 10% is a
lossy encoding to unicode.  I much prefer that to be explicit, so an
attribute may do, say b.decode('UTF-8', 'replace')?  Or do we need a
subtype of bytes, just to reduce that to 5-8 characters?
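
To show why I call it lossy, here is a small sketch with two made-up
filenames (both hypothetical) that differ only in one non-UTF-8 byte:

```python
# Two distinct raw filenames, neither of them valid UTF-8.
a = b"report-\xe9.txt"
b = b"report-\xe8.txt"

# The explicit lossy decode: each bad byte becomes U+FFFD.
shown_a = a.decode("UTF-8", "replace")
shown_b = b.decode("UTF-8", "replace")

# Both names now look identical, so the text form is fine for display
# but cannot be used to get back to the original bytes.
collide = shown_a == shown_b
```

That's acceptable for showing a name to the user, as long as the bytes
object stays around for actually opening the file.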

-- 
Adam Olsen, aka Rhamphoryncus