Re: [Python-Dev] PEP 383 and GUI libraries

May 3, 2009

[cross-posting to python-dev and tahoe-dev]

On Fri, May 1, 2009 at 8:12 PM, James Y Knight <foom@fuhm.net> wrote:
...
> If I were designing a new system such as this, I'd probably just go for
> utf8b *always*.

Ah, this would be a very tempting possibility -- abandon all unix
users who are slow to embrace our utf-8b future!

However, it is moot because Tahoe is not a new system. It is currently
at v1.4.1, has a strong policy of backwards-compatibility, and already
has lots of data, lots of users, and programmers building on top of
it. It currently uses utf-8 for its internal storage (note: nothing to
do with reading or writing files from external sources -- only for
storing filenames in the decentralized storage system which is
accessed by Tahoe clients), and we can't start putting non-utf-8-valid
sequences in the "filename" slot because other Tahoe clients would
then get a UnicodeDecodeError exception when trying to read those
directories.
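
To illustrate the failure mode described above, here is a minimal sketch (Python 3 syntax; the sample byte string is an invented example, not taken from Tahoe) of what a client that strictly decodes the "filename" slot as utf-8 would run into:

```python
# Hypothetical filename bytes: "café" encoded in latin-1, which is not valid utf-8.
raw_name = b"caf\xe9"

try:
    raw_name.decode("utf-8", "strict")
    decoded_ok = True
except UnicodeDecodeError:
    # This is the exception an existing client would hit when reading
    # a directory that contains such an entry.
    decoded_ok = False

print(decoded_ok)
```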

We *could* create a new metadata entry to hold things other than
utf-8. Current Tahoe clients would never look at that entry (the
metadata is a JSON-serialized dictionary, so we can add a new key name
into it without disturbing the existing clients), but future Tahoe
clients could look for that new key. That is where it is possible that
future versions of Tahoe might be able to benefit from utf-8b or PEP
383, although what PEP 383 offers for this use case remains unclear to
me.
...
> But if you don't do that, then, I still don't see what purpose your
> requirements serve. If I have two systems: one with a UTF-8 locale, and one
> with a Latin-1 locale, why should transmitting filenames from system 1 to
> system 2 through tahoe preserve the raw bytes, but doing the reverse *not*
> preserve the raw bytes? (all byte-sequences are valid in latin-1, remember,
> so they'll all decode into unicode without error, and then be reencoded in
> utf-8...). This seems rather a useless behavior to me.

I see I'm not explaining the Tahoe requirements clearly. It's probably
that I'm not understanding them clearly myself. Hopefully the
following will help.

There are two different things stored in Tahoe for each directory
entry: the filename and the metadata.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system
and then you inspect the files in the Tahoe filesystem, such as by
examining the web interface [1] or by running "tahoe ls", either of
which you could do either from the same machine where you ran "tahoe
cp" or from a different machine (which could be using any operating
system). We have the following requirements about what ends up in your
Tahoe directory after that cp -r.

Requirement 1 (unicode):  Each filename that you see needs to be valid
unicode (it is stored internally in utf-8). This eliminates utf-8b and
PEP 383 from being directly applicable to the filename part, although
perhaps they could be useful for the metadata part (about which more
below).

Requirement 2 (faithful if unicode):  For each filename (byte string)
in your myfiles directory, if that bytestring is the valid encoding of
some string in your stated locale, then the resulting filename in
Tahoe is that (unicode) string. Nobody ever doesn't want this, right?
Well, maybe some people don't want this sometimes, because it could be
that the locale was wrong for this byte string and the resulting
successfully-decoded unicode name is gibberish. This is especially
acute if the locale is an 8-bit encoding such as latin-1 or
windows-1252. However, what's the alternative?  Guessing that their
locale shouldn't be set to latin-1 and instead decoding their bytes
some other way?  It seems like we're not going to do better than
requirement 2 (faithful if unicode).
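
A small sketch of why 8-bit locales make this acute (Python 3 syntax; the sample name is an invented example): every byte sequence decodes "successfully" under latin-1, so a wrong locale setting can never be detected by a decoding error.

```python
# Every possible byte value is a valid latin-1 character.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1", "strict")  # never raises

# Hypothetical example: the utf-8 bytes of a Japanese character, read on a
# system whose locale claims latin-1. Decoding succeeds, but yields gibberish.
utf8_name = "\u3042".encode("utf-8")
gibberish = utf8_name.decode("latin-1", "strict")

print(len(as_latin1), len(gibberish))
```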

Requirement 3 (no file left behind):  For each filename (byte string)
in your myfiles directory, whether or not that byte string is the
valid encoding of anything in your stated locale, that file will
be added into the Tahoe filesystem under *some* name (a good candidate
would be mojibake, e.g. decode the bytes with latin-1, but that is not
the only possibility). I have heard some developers say that they
don't want to support this requirement and would rather tell the users
to fix their filenames before they can back up or share those files
through Tahoe. On the other hand, users have said that they require
this and they are not going to go mucking about with all their
filenames just so that they can use my backup and filesharing tool.

Now already we can say that these three requirements mean that there
can be collisions -- for example a directory could have two entries,
one of which is not a valid encoding in the locale, and whatever
unicode string we invent to name it with in order to satisfy
requirements 3 (no file left behind) and 1 (unicode) might happen to
be the same as the (correctly-encoded) name of the other file.
Therefore these three requirements imply that we have to detect such
collisions and deal with them somehow. (Thanks to Martin v. Löwis for
reminding me of this.)
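
Here is a minimal sketch of such a collision (Python 3 syntax). The helper name to_unicode and the two sample names are invented for illustration, and the latin-1 fallback is just the mojibake candidate mentioned under requirement 3:

```python
def to_unicode(raw, alleged_encoding):
    """Decode with the locale's encoding, falling back to latin-1 mojibake."""
    try:
        return raw.decode(alleged_encoding, "strict")
    except UnicodeDecodeError:
        return raw.decode("latin-1", "strict")

# Two distinct byte-string names in one directory, on a utf-8 locale:
correctly_encoded = b"caf\xc3\xa9"  # valid utf-8 for "café"
badly_encoded = b"caf\xe9"          # not valid utf-8; latin-1 for "café"

name_a = to_unicode(correctly_encoded, "utf-8")
name_b = to_unicode(badly_encoded, "utf-8")

# Different bytes on disk, same unicode string in Tahoe.
collision = (correctly_encoded != badly_encoded) and (name_a == name_b)
print(collision)
```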

Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
"round-tripping"): Suppose you have a directory with some files with
Japanese names, encoded using shift-jis, and some files with Russian
names, encoded using koi8-r. Suppose your locale is set to shift-jis,
and then you do "tahoe cp -r myfiles/ tahoe:". Then suppose you or
someone else does "tahoe cp -r tahoe: copy_of_myfiles/". The
"round-tripping" feature is that the files with Russian names that did
not accidentally decode cleanly with shift-jis still have the same
bytes in their names as they did in the original myfiles directory.

As I write this, I am becoming skeptical of this (faithful bytes if
not unicode, a.k.a. "round-tripping"), thanks in part to criticism
from James Knight, MvL, Thomas Breuel, and others. One reason to be
skeptical is that about a third of the Russian files will happen to
decode cleanly as shift-jis anyway, and will therefore come out as
something entirely different if the target filesystem's encoding is
something other than shift-jis. But an even worse problem -- the
show-stopper for me -- is that I don't want what Tahoe shows when you
do "tahoe ls" or view it in a web browser to differ from what it
writes out when you do "tahoe cp -r tahoe: newfiles/". So I'm ready to
reject this one.
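
To make the first objection concrete, here is a sketch (Python 3 syntax; the Russian word is an arbitrary example) of a koi8-r name that happens to decode cleanly as shift-jis. Lowercase Cyrillic letters in koi8-r occupy the same byte range that shift-jis uses for single-byte half-width katakana, so no error is raised and the name silently becomes something else:

```python
# Hypothetical filename: a lowercase Russian word stored as koi8-r bytes.
raw = "\u043c\u0438\u0440".encode("koi8-r")

# The locale claims shift-jis, and the bytes happen to be valid shift-jis.
as_japanese = raw.decode("shift_jis", "strict")

# Writing back out on a shift-jis system reproduces the original bytes...
same_on_sjis_target = as_japanese.encode("shift_jis") == raw

# ...but on a target with a different encoding (utf-8 here) the bytes differ
# and the Russian name is not recovered.
same_on_utf8_target = as_japanese.encode("utf-8") == raw

print(same_on_sjis_target, same_on_utf8_target)
```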

Now about the "metadata" part which is separate from the filename
itself. I have another requirement:

Requirement 5 (no loss of information):  I don't want Tahoe to destroy
information -- every transformation should be (in principle)
reversible by some future computer-augmented archaeologist. For
example, if a bytestring decodes cleanly with the locale's suggested
encoding, and we use the resulting unicode as the filename, then we
also store the original byte string in the metadata since we don't
know if the locale's suggested encoding was good. This allows the
later invention of a tool which shows the user what the filename would
have been with other encodings and lets the user choose one that makes
sense. It is important to note that this does not impose any
requirement on the *filename* itself -- all such information can be
stored in the metadata.
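
A rough sketch of what such a later tool might look like (Python 3 syntax). The "original_bytes" key is from the proposal below; storing it base64-encoded inside the JSON metadata is purely my assumption for this sketch, since JSON cannot hold raw bytes directly, and the list of candidate encodings is arbitrary:

```python
import base64
import json

# Hypothetical entry: koi8-r bytes that were decoded with a shift-jis locale.
raw = "\u043c\u0438\u0440".encode("koi8-r")
filename = raw.decode("shift_jis")

# Assumption: the original bytes are kept base64-encoded in the JSON metadata.
metadata_json = json.dumps(
    {"original_bytes": base64.b64encode(raw).decode("ascii")}
)

# Later: recover the bytes and show what the name would be in other encodings.
recovered = base64.b64decode(json.loads(metadata_json)["original_bytes"])
candidates = {}
for encoding in ("shift_jis", "koi8-r", "utf-8"):
    try:
        candidates[encoding] = recovered.decode(encoding, "strict")
    except UnicodeDecodeError:
        pass  # this encoding cannot explain the bytes at all

for encoding, name in candidates.items():
    print(encoding, repr(name))
```

The user (or the archaeologist) can then pick whichever candidate makes sense; nothing about the stored filename itself has to change.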

Okay, in light of the above four requirements and the rejection of #4,
I hereby propose to change from the previous Tahoe design [2] to the
following:

To copy an entry from a local filesystem into Tahoe:

1. On Windows or Mac read the filename with the unicode APIs.
Normalize the string with filename = unicodedata.normalize('NFC',
filename). Leave the "original_bytes" key and the "failed_decode" flag
out of the metadata.

2. On Linux or Solaris read the filename with the byte-string APIs, and
store the result in the "original_bytes" part of the metadata. Call
sys.getfilesystemencoding() to get an alleged_encoding. Then, call
bytes.decode(alleged_encoding, 'strict') to try to get a unicode
object.

2.a. If this decoding succeeds then normalize the unicode filename
with filename = unicodedata.normalize('NFC', filename), store the
resulting filename and leave the "failed_decode" flag out of the
metadata.

2.b. If this decoding fails, then we decode it again with
bytes.decode('latin-1', 'strict'). Do not normalize it. Store the
resulting unicode object into the "filename" part and set the
"failed_decode" flag to True. This is mojibake!

3. (handling collisions)  In either case 2.a or 2.b the resulting
unicode string may already be present in the directory. If so, check
the failed_decode flags on the current entry and the new entry. If
they are both set or both unset then the new entry overwrites the old
entry -- they had the same name. If the failed_decode flags differ
then this is a case of collision -- the old entry and the new entry
had (as far as we are concerned) different names that accidentally
generated the same unicode. Alter the new entry's name, for example by
appending "~1" and then trying again and incrementing the number until
it doesn't match any extant entry.
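
Steps 2 and 3 above can be sketched roughly as follows (Python 3 syntax). This is an illustration under stated assumptions, not Tahoe code: the function names decode_local_name and add_entry are invented, a directory is modelled as a plain dict from filename to metadata, and "trying again" is read as "pick the first ~N suffix that is not already taken".

```python
import sys
import unicodedata

def decode_local_name(raw, alleged_encoding=None):
    """Steps 2, 2.a and 2.b: turn one byte-string name into (filename, metadata)."""
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    metadata = {"original_bytes": raw}
    try:
        filename = raw.decode(alleged_encoding, "strict")
    except UnicodeDecodeError:
        # 2.b: mojibake via latin-1, not normalized, flagged as a failed decode.
        filename = raw.decode("latin-1", "strict")
        metadata["failed_decode"] = True
    else:
        # 2.a: clean decode, normalize, leave the flag out.
        filename = unicodedata.normalize("NFC", filename)
    return filename, metadata

def add_entry(directory, filename, metadata):
    """Step 3: insert into a dict-based directory, renaming on a true collision."""
    old = directory.get(filename)
    if old is not None:
        old_failed = old.get("failed_decode", False)
        new_failed = metadata.get("failed_decode", False)
        if old_failed != new_failed:
            # Different origins, same unicode: find a free "~N" name.
            n = 1
            while "%s~%d" % (filename, n) in directory:
                n += 1
            filename = "%s~%d" % (filename, n)
    # Same flags (or no old entry): the new entry simply takes the name.
    directory[filename] = metadata
    return filename

# Usage with two invented names on a utf-8 locale:
directory = {}
first = add_entry(directory, *decode_local_name(b"caf\xc3\xa9", "utf-8"))
second = add_entry(directory, *decode_local_name(b"caf\xe9", "utf-8"))
print(first, second)
```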

To copy an entry from Tahoe into a local filesystem:

Always use the Python unicode API. The original_bytes field and the
failed_decode field in the metadata are not consulted.

Now a question for python-dev people: could utf-8b or PEP 383 be
useful for requirements like the four requirements listed above?  If
not, what requirements does PEP 383 help with?  I'm sure that it can
help with the use case of "I'm doing os.listdir() and then I'm going
to turn around and use the resulting unicode objects on the same local
filesystem in the same Python process". I'm not sure that it can help
if you are going to store the results of your os.listdir()
persistently or if you are going to transmit them over a network.
Indeed, using the results that way could lead to unpleasant surprises.
Does that sound right to you?  Perhaps this could be documented
somehow to help other programmers along the way.
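
As an illustration of the distinction drawn above, here is a sketch using the "surrogateescape" error handler, which is how PEP 383 is exposed in Python versions that implement it (3.1 and later). The sample bytes are an invented example. Within one process the bytes round-trip, but the resulting string is not valid unicode for strict utf-8 storage or transmission:

```python
# Hypothetical undecodable filename bytes on a utf-8 system.
raw = b"caf\xe9"

# PEP 383 style decoding: the bad byte becomes a lone surrogate (U+DCE9).
name = raw.decode("utf-8", "surrogateescape")

# Same process, same filesystem: the original bytes come back exactly.
round_trips = name.encode("utf-8", "surrogateescape") == raw

# Persisting or transmitting as strict utf-8 does not work.
try:
    name.encode("utf-8", "strict")
    storable_as_utf8 = True
except UnicodeEncodeError:
    storable_as_utf8 = False

print(round_trips, storable_as_utf8)
```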

Thanks very much for your help, everyone.

Regards,

Zooko

[1] http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...
[2] http://allmydata.org/trac/tahoe/ticket/534#comment:47

Zooko O'Whielacronx