(sent only to python-dev, as I am not a subscriber of tahoe-dev)
> [Tahoe] currently uses utf-8 for its internal storage (note: nothing to
> do with reading or writing files from external sources -- only for
> storing filenames in the decentralized storage system which is
> accessed by Tahoe clients), and we can't start putting non-utf-8-valid
> sequences in the "filename" slot because other Tahoe clients would
> then get a UnicodeDecodeError exception when trying to read those
So what do you do when someone has an existing file whose name is
supposed to be in utf-8, but whose actual bytes are not valid utf-8?
If you have somehow solved that problem, then you're already done --
the PEP's encoding is a no-op on anything that isn't already invalid
If you have not solved that problem, then those clients will already
be getting a UnicodeDecodeError; all the PEP does is make it at least
possible for them to recover.
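
To make that concrete, here is a minimal sketch. It assumes a Python
where the PEP is available as the "surrogateescape" error handler
(Python 3.1 or later). The file names are made up for illustration and
none of this is Tahoe code.

```python
# Sketch only: "surrogateescape" is the error handler that implements the
# PEP's utf8b behaviour; both file names below are hypothetical.

valid = "caf\u00e9.txt".encode("utf-8")   # well-formed utf-8
broken = b"caf\xe9.txt"                   # latin-1 bytes, not valid utf-8

# On valid input the PEP's decoding is identical to plain utf-8 (a no-op).
same = valid.decode("utf-8", "surrogateescape") == valid.decode("utf-8")

# Plain utf-8 decoding fails on the broken name ...
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ... while the PEP's decoding maps the bad byte 0xE9 to the lone
# surrogate U+DCE9 instead of raising.
escaped = broken.decode("utf-8", "surrogateescape")

# Encoding with the same handler gives back the original bytes, which is
# what lets a client recover.
recovered = escaped.encode("utf-8", "surrogateescape")
```

Here `same` and `strict_failed` are both true, and `recovered` equals
`broken`.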
> Requirement 1 (unicode): Each filename that you see needs to be valid
> unicode (it is stored internally in utf-8).
(Repeating myself) What does Tahoe do if this is violated? Do you raise
an exception right there and refuse to let them copy the file to Tahoe?
If so, then that same check means that utf8b will never differ from
utf-8, and you have nothing to worry about.
> Requirement 2 (faithful if unicode):
Doesn't the PEP meet this?
> Requirement 3 (no file left behind):
Doesn't the PEP also meet this? I thought the concern was just that
the name used would not be valid unicode, unless the original name was
itself valid unicode.
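
A rough sketch of how I read requirements 2 and 3 against the PEP,
again assuming the "surrogateescape" error handler of Python 3.1 or
later. The helper names decode_name and is_valid_unicode are mine, not
anything from Tahoe or the PEP, and the sample names are invented.

```python
# Sketch only: decode_name and is_valid_unicode are hypothetical helpers.

def decode_name(raw: bytes) -> str:
    """Decode a file name the way the PEP does; this never raises."""
    return raw.decode("utf-8", "surrogateescape")

def is_valid_unicode(name: str) -> bool:
    """True if the name holds no escaped bytes (no lone surrogates)."""
    try:
        name.encode("utf-8")  # strict encoding rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True

names = [
    b"report.txt",                            # ASCII
    "r\u00e9sum\u00e9.doc".encode("utf-8"),   # valid utf-8
    b"r\xe9sum\xe9.doc",                      # latin-1, invalid as utf-8
    b"\xff\xfe",                              # not text at all
]

# No file left behind: every name decodes to some str.
decoded = [decode_name(n) for n in names]

# Faithful if unicode: only the names that were invalid to begin with
# come out as invalid unicode.
validity = [is_valid_unicode(d) for d in decoded]

# Every name, valid or not, maps back to its original bytes.
round_trip = [d.encode("utf-8", "surrogateescape") for d in decoded]
```

The first two names come out as ordinary valid strings and the last
two do not, and `round_trip` matches `names` in every case.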
> Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
Doesn't the PEP also support this? (Only) the invalid bytes get
escaped and therefore must be unescaped, but the escapement is
> 3. (handling collisions) In either case 2.a or 2.b the resulting
> unicode string may already be present in the directory.
This collision is what the use of half-surrogates (as the escape
characters) avoids. Such collisions can't be present unless the data
was invalid unicode, in which case it was the result of an escapement
(unless something other than Python is creating new invalid