On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis:
The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes.
Why is it necessary that you are able to make this distinction?
It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding. If a name is funny-decoded when it is obtained from a directory listing, it needs to be funny-encoded again in order to open the file.
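To make the concern concrete, here is a minimal sketch. It uses Python's `surrogateescape` error handler purely as a stand-in for "funny-decoding" (an assumption for illustration; the scheme under discussion in this thread maps to private-use characters instead), and the file name is made up.

```python
# Stand-in for "funny-decoding": the surrogateescape error handler.
# Assumption for illustration only; the thread discusses private-use characters.
raw_name = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# Bytes API path: the undecodable byte is mapped to a "funny" code point.
funny = raw_name.decode("utf-8", errors="surrogateescape")

# Funny-encoding restores the original bytes, so the file can be opened.
round_trip = funny.encode("utf-8", errors="surrogateescape")

# Str API path: a name that already contains the same "funny" code point.
from_str_api = "caf\udce9"

# The two strings compare equal, so nothing in the str itself says
# which path it came from, or whether it should be funny-encoded.
indistinguishable = (funny == from_str_api)
```

The round trip works, but only if the program already knows the name came from the bytes path.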
Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is)
It's a private use area. It will never carry an official character assignment.
I know that U+F0000 - U+FFFFF is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what?
As I realized in the email-sig, in talking about decoding corrupted headers, there is only one way to guarantee this... to encode _all_ character sequences, from _all_ interfaces. Basically it requires reserving an escape character (I'll use ? in these examples -- yes, an ASCII question mark -- happens to be illegal in Windows filenames so all the better on that platform, but the specific character doesn't matter... avoiding / \ and . is probably good, though).
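A rough sketch of what such an escape scheme might look like follows. The exact encoding is hypothetical: it assumes the escape character is doubled when it occurs literally, that each undecodable byte becomes the escape character followed by two hex digits, and that names are otherwise UTF-8. The function names are invented for the example.

```python
ESC = "?"  # reserved escape character (the specific choice doesn't matter)

def escape_str_name(name: str) -> str:
    """Names from a str interface: only the escape character is doubled."""
    return name.replace(ESC, ESC + ESC)

def escape_bytes_name(raw: bytes) -> str:
    """Names from a bytes interface: undecodable bytes become ESC + 2 hex digits."""
    out = []
    # surrogateescape marks each undecodable byte 0xNN as U+DCNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        if ch == ESC:
            out.append(ESC + ESC)
        elif 0xDC80 <= ord(ch) <= 0xDCFF:
            out.append("%s%02X" % (ESC, ord(ch) - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

def unescape_to_bytes(name: str) -> bytes:
    """Reverse the mapping: ESC ESC is a literal ESC, ESC + hex is a raw byte."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] != ESC:
            out += name[i].encode("utf-8")
            i += 1
        elif name[i + 1:i + 2] == ESC:
            out += ESC.encode("ascii")
            i += 2
        else:
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
    return bytes(out)
```

With these assumptions, `escape_bytes_name(b"caf\xe9")` gives `"caf?E9"`, while a str-interface name that literally reads `caf?E9` becomes `"caf??E9"`, so the two can no longer collide.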
I think you'll have to write an alternative PEP if you want to see something like this implemented throughout Python.
I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value.

Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP.

I'll try to reread it again... could you post a URL to the most up-to-date version of the PEP, since I haven't seen such appear here, and the version I found via a Google search seems to be the original?

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking