Mailman 3 PEP 383 and GUI libraries - Python-Dev

3 May 2009

(Sent only to python-dev, as I am not a subscriber of tahoe-dev.)

Zooko wrote:
...
> [Tahoe] currently uses utf-8 for its internal storage (note: nothing to
> do with reading or writing files from external sources -- only for
> storing filenames in the decentralized storage system which is
> accessed by Tahoe clients), and we can't start putting non-utf-8-valid
> sequences in the "filename" slot because other Tahoe clients would
> then get a UnicodeDecodeError exception when trying to read those
> directories.

So what do you do when someone has an existing file whose name is
supposed to be in utf-8, but whose actual bytes are not valid utf-8?

If you have somehow solved that problem, then you're already done --
the PEP's encoding is a no-op on anything that isn't already invalid
unicode.
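
A minimal sketch of the no-op claim, assuming the PEP's handler is reached through the name `surrogateescape` (the name it shipped under in Python 3.1 and later; this thread calls the scheme utf8b). The sample filename is made up for illustration.

```python
# Hypothetical sample name; any valid UTF-8 byte string behaves the same way.
raw = "r\u00e9sum\u00e9.txt".encode("utf-8")

strict = raw.decode("utf-8")                      # plain UTF-8 decoding
escaped = raw.decode("utf-8", "surrogateescape")  # PEP 383 error handler

# For valid input the handler is never invoked, so the results are identical
# and no escape characters (U+DC80..U+DCFF) appear in the decoded name.
same = strict == escaped
has_escapes = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in escaped)
```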

If you have not solved that problem, then those clients will already
be getting a UnicodeDecodeError; all the PEP does is make it at least
possible for them to recover.
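
To illustrate the difference, here is a sketch with a made-up Latin-1 filename, again assuming the handler name `surrogateescape` from Python 3.1 and later. Strict decoding fails; the PEP's decoding returns a string the client can still work with.

```python
# Hypothetical name written by a Latin-1 system: 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the PEP 383 handler the offending byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising.
name = raw.decode("utf-8", "surrogateescape")
```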

...
...
> Requirement 1 (unicode):  Each filename that you see needs to be valid
> unicode (it is stored internally in utf-8).

(repeating) What does Tahoe do if this is violated?  Do you throw an
exception right there and not let them copy the file to Tahoe?  If so,
then that same error check means that utf8b will never differ
from utf-8, and you have nothing to worry about.
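
One way such a check could look, as a sketch only: `is_storable` is a hypothetical helper, not anything from Tahoe's code. It relies on the fact that Python 3 refuses to encode lone surrogates as strict UTF-8, so any name carrying escaped bytes is rejected at the door.

```python
def is_storable(name: str) -> bool:
    """Hypothetical gate: True only if name encodes as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates (the PEP's escape characters) land here.
        return False
    return True


clean = b"notes.txt".decode("utf-8", "surrogateescape")
dirty = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
```

With a gate like this in place, every name that gets through decodes identically under utf-8 and utf8b.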
...
> Requirement 2 (faithful if unicode):

Doesn't the PEP meet this?
...
> Requirement 3 (no file left behind):

Doesn't the PEP also meet this?  I thought the concern was just that
the name used would not be valid unicode, unless the original name was
itself valid unicode.
...
> Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
> "round-tripping"):

Doesn't the PEP also support this?  (Only) the invalid bytes get
escaped and therefore must be unescaped, but the escaping is
reversible.
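
A round-trip sketch, assuming the `surrogateescape` handler name from Python 3.1 and later and a made-up filename with two invalid bytes:

```python
# Hypothetical name containing two bytes that are not valid UTF-8.
raw = b"caf\xe9-\xff.txt"

# Decode: only the invalid bytes are turned into escape characters.
name = raw.decode("utf-8", "surrogateescape")
escapes = [ch for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encode with the same handler: the escapes turn back into the raw bytes.
restored = name.encode("utf-8", "surrogateescape")
```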
...
> 3. (handling collisions)  In either case 2.a or 2.b the resulting
> unicode string may already be present in the directory.

This collision is what the use of half-surrogates (as the escape
characters) avoids.  Such collisions can't be present unless the data
was invalid unicode, in which case it was the result of an escaping
(unless something other than Python is creating new invalid
filenames).
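
A sketch of why the escaped name cannot collide with a legitimately encoded one, assuming the `surrogateescape` handler name from Python 3.1 and later; the filenames are made up. The byte sequence that would "spell" a lone surrogate in UTF-8 is itself invalid, so it gets escaped byte by byte rather than decoding to the surrogate.

```python
valid = "caf\u00e9.txt".encode("utf-8")   # proper UTF-8 for "café.txt"
invalid = b"caf\xe9.txt"                  # Latin-1 bytes, not valid UTF-8

name_valid = valid.decode("utf-8", "surrogateescape")
name_invalid = invalid.decode("utf-8", "surrogateescape")

# Bytes that would be the UTF-8 form of U+DCE9 if surrogates were allowed.
# Strict UTF-8 forbids them, so they are escaped instead of producing U+DCE9.
fake = b"\xed\xb3\xa9"
name_fake = fake.decode("utf-8", "surrogateescape")
fake_restored = name_fake.encode("utf-8", "surrogateescape")
```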

-jJ

Jim Jewett