[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Mon Apr 27 09:07:16 CEST 2009

On approximately 4/25/2009 5:22 AM, came the following characters from 
the keyboard of Martin v. Löwis:
>> The problem with this, and other preceding schemes that have been
>> discussed here, is that there is no means of ascertaining whether a
>> particular file name str was obtained from a str API, or was funny-
>> decoded from a bytes API... and thus, there is no means of reliably
>> ascertaining whether a particular filename str should be passed to a
>> str API, or funny-encoded back to bytes.
> 
> Why is it necessary that you are able to make this distinction?

It is necessary that programs (not me) can make the distinction, so that 
they know whether or not to do the funny-encoding.  If a name is 
funny-decoded when it is obtained from a directory listing, it needs 
to be funny-encoded again in order to open the file.
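
A minimal sketch of that round trip, using Python's `surrogateescape` 
error handler (available in Python 3.1 and later) as one concrete 
stand-in for the funny-decoding under discussion.  The helper names and 
the sample byte string are made up for illustration:

```python
# Illustrative only: "surrogateescape" stands in for the funny codec.

def funny_decode(raw: bytes) -> str:
    """Turn file-name bytes into a str, smuggling undecodable bytes through."""
    return raw.decode("utf-8", "surrogateescape")


def funny_encode(name: str) -> bytes:
    """Reverse funny_decode, recovering the original bytes."""
    return name.encode("utf-8", "surrogateescape")


# A made-up Latin-1 file name; the 0xE9 byte is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# What a directory listing would hand back as a str.
listed = funny_decode(raw_name)

# A plain (strict) encode cannot turn that str back into bytes...
try:
    listed.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...so the program has to know that funny-encoding is required.
restored = funny_encode(listed)
```

The point of the sketch is the last step: the original bytes come back 
only if the program knows that `listed` needs funny-encoding rather than 
an ordinary encode.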

>> Picking a character (I don't find U+F01xx in the
>> Unicode standard, so I don't know what it is)
> 
> It's a private use area. It will never carry an official character
> assignment.

I know that U+F0000 - U+FFFFD is a private use area.  I can't find a 
definition of U+F01xx to know what the notation means.  Are you picking 
a particular character within the private use area, or a particular 
range, or what?

>> As I realized in the email-sig, in talking about decoding corrupted
>> headers, there is only one way to guarantee this... to encode _all_
>> character sequences, from _all_ interfaces.  Basically it requires
>> reserving an escape character (I'll use ? in these examples -- yes, an
>> ASCII question mark -- happens to be illegal in Windows filenames so
>> all the better on that platform, but the specific character doesn't
>> matter... avoiding / \ and . is probably good, though).
> 
> I think you'll have to write an alternative PEP if you want to see
> something like this implemented throughout Python.

I'm certainly not experienced enough in Python development processes or 
internals to attempt such, as yet.  But somewhere in 25 years of 
programming, I picked up the knowledge that if you want to have a 1-to-1 
reversible mapping, you have to avoid data puns: mappings of two 
different data values into a single data value.  Your PEP, as first 
written, didn't seem to do that.  There are two interfaces from 
which to obtain data values: one performs a mapping from bytes to 
"funny invalid" Unicode, and the other performs no mapping but 
accepts any sort of Unicode, possibly including "funny invalid" 
Unicode.  So the possibility of data puns seems to exist.  I may be 
misunderstanding something about the use cases that prevents these two 
sources of "funny invalid" Unicode from ever coexisting, but if so, 
perhaps you could point it out, or clarify the PEP.  I'll try to reread 
it... could you post a URL to the most up-to-date version of the 
PEP, since I haven't seen one appear here, and the version I found via 
a Google search seems to be the original?
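
The kind of data pun described above can be sketched roughly as follows.  
The sketch assumes, purely for illustration, that an undecodable byte 
0xNN is mapped to the private use character U+F01NN.  `PUA_BASE`, 
`funny_decode`, and the error-handler name `pua_escape_sketch` are 
hypothetical and are not taken from the PEP:

```python
import codecs

# Assumption for illustration: undecodable byte 0xNN becomes U+F01NN.
PUA_BASE = 0xF0100


def _pua_escape(exc):
    """Decode-error handler: replace each bad byte with a PUA character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape_sketch", _pua_escape)


def funny_decode(raw: bytes) -> str:
    """Hypothetical funny-decoding of file-name bytes."""
    return raw.decode("utf-8", "pua_escape_sketch")


# Name 1: comes from a bytes API and contains an undecodable byte.
from_bytes_api = funny_decode(b"caf\xe9")

# Name 2: comes from a str API and really contains U+F01E9.
from_str_api = "caf" + chr(PUA_BASE + 0xE9)

# Name 3: valid UTF-8 bytes that spell out U+F01E9 on disk.
valid_utf8 = from_str_api.encode("utf-8")
from_valid_bytes = funny_decode(valid_utf8)

# All three are the same str, so the origin can no longer be recovered.
is_pun = from_bytes_api == from_str_api == from_valid_bytes
```

Two different byte strings (`b"caf\xe9"` and `valid_utf8`) end up as one 
str, so nothing in the str itself says which bytes to produce when 
encoding it back.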

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking