(sent only to python-dev, as I am not a subscriber of tahoe-dev)
[Tahoe] currently uses utf-8 for its internal storage (note: nothing to do with reading or writing files from external sources -- only for storing filenames in the decentralized storage system which is accessed by Tahoe clients), and we can't start putting non-utf-8-valid sequences in the "filename" slot because other Tahoe clients would then get a UnicodeDecodeError exception when trying to read those directories.
So what do you do when someone has an existing file whose name is supposed to be in utf-8, but whose actual bytes are not valid utf-8?
If you have somehow solved that problem, then you're already done -- the PEP's encoding is a no-op on anything that is already valid unicode.
If you have not solved that problem, then those clients will already be getting a UnicodeDecodeError; all the PEP does is make it at least possible for them to recover.
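To make the two cases concrete, here is a minimal sketch. It assumes the PEP's handler is available under the name `surrogateescape` (as in Python 3.1 and later); the filenames are made up for illustration and say nothing about how Tahoe itself handles names.

```python
# Hypothetical filenames, for illustration only.
raw_valid = "na\u00efve.txt".encode("utf-8")   # well-formed UTF-8
raw_invalid = b"caf\xe9.txt"                    # Latin-1 byte, not valid UTF-8

# Case 1: the bytes are valid UTF-8, so the escaping decoder is a no-op
# and agrees with the strict decoder.
strict_valid = raw_valid.decode("utf-8")
escaped_valid = raw_valid.decode("utf-8", "surrogateescape")

# Case 2: the strict decoder raises; the escaping decoder maps the bad
# byte 0xE9 to the lone surrogate U+DCE9 instead.
try:
    raw_invalid.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

recovered = raw_invalid.decode("utf-8", "surrogateescape")
```

Here `strict_valid == escaped_valid`, while `recovered` is a string the client can at least carry around and later turn back into the original bytes.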
Requirement 1 (unicode): Each filename that you see needs to be valid unicode (it is stored internally in utf-8).
(repeating) What does Tahoe do if this is violated? Do you throw an exception right there and not let them copy the file to Tahoe? If so, then that same check means that utf8b will never differ from utf-8, and you have nothing to worry about.
Requirement 2 (faithful if unicode):
Doesn't the PEP meet this?
Requirement 3 (no file left behind):
Doesn't the PEP also meet this? I thought the concern was just that the name used would not be valid unicode, unless the original name was itself valid unicode.
Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):
Doesn't the PEP also support this? Only the invalid bytes get escaped, so only those need to be unescaped, and the escaping is reversible.
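A quick sketch of the round trip, again assuming the handler is spelled `surrogateescape` as in Python 3.1 and later. The helper names `escape_name` and `unescape_name` are hypothetical, not part of the PEP or of Tahoe.

```python
def escape_name(raw: bytes) -> str:
    """Decode a filename, escaping only the bytes that are not valid UTF-8."""
    return raw.decode("utf-8", "surrogateescape")


def unescape_name(name: str) -> bytes:
    """Reverse escape_name: turn escaped surrogates back into raw bytes."""
    return name.encode("utf-8", "surrogateescape")


# Made-up sample names: one valid, two containing invalid bytes.
samples = [
    "caf\u00e9".encode("utf-8"),  # valid UTF-8, passes through unchanged
    b"caf\xe9",                   # stray Latin-1 byte
    b"\xff\xfe.dat",              # bytes that can never appear in UTF-8
]

# Every sample should come back byte-for-byte identical.
round_trip_ok = all(unescape_name(escape_name(raw)) == raw for raw in samples)
```

Only the undecodable bytes are touched, so a name that was valid to begin with is left exactly as it was.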
- (handling collisions) In either case 2.a or 2.b the resulting
unicode string may already be present in the directory.
This collision is what the use of half-surrogates (as the escape characters) avoids. Such a collision can't occur unless a name already contains invalid unicode (a lone half-surrogate), in which case that name was itself the result of escaping (unless something other than Python is creating new invalid filenames).
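As a sketch of why the escaped form can't collide with a legitimately valid name, assuming the `surrogateescape` spelling from Python 3.1 and later: the helper `is_escaped` and the sample names are hypothetical.

```python
def is_escaped(name: str) -> bool:
    """True if the name contains a lone surrogate in U+DC80..U+DCFF."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)


valid_raw = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9", well-formed
invalid_raw = b"caf\xe9"                 # same name in Latin-1, invalid UTF-8

valid_name = valid_raw.decode("utf-8", "surrogateescape")
escaped_name = invalid_raw.decode("utf-8", "surrogateescape")

# The two names stay distinct, and only the escaped one carries a surrogate.
names_differ = valid_name != escaped_name

# A strict UTF-8 encoder refuses the lone surrogate, so a strict peer
# would reject the escaped name rather than confuse it with a valid one.
try:
    escaped_name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Since decoding valid UTF-8 never produces a lone surrogate, a name for which `is_escaped` is true can only have come from escaping (or from some other tool writing invalid names).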