[cross-posting to python-dev and tahoe-dev] On Fri, May 1, 2009 at 8:12 PM, James Y Knight <foom@fuhm.net> wrote:
If I were designing a new system such as this, I'd probably just go for utf8b *always*.
Ah, this would be a very tempting possibility -- abandon all unix users who are slow to embrace our utf-8b future! However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it. It currently uses utf-8 for its internal storage (note: nothing to do with reading or writing files from external sources -- only for storing filenames in the decentralized storage system which is accessed by Tahoe clients), and we can't start putting non-utf-8-valid sequences in the "filename" slot because other Tahoe clients would then get a UnicodeDecodeError exception when trying to read those directories. We *could* create a new metadata entry to hold things other than utf-8. Current Tahoe clients would never look at that entry (the metadata is a JSON-serialized dictionary, so we can add a new key name into it without disturbing the existing clients), but future Tahoe clients could look for that new key. That is where it is possible that future versions of Tahoe might be able to benefit from utf-8b or PEP 383, although what PEP 383 offers for this use case remains unclear to me.
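To illustrate the point about adding a metadata key without disturbing existing clients, here is a minimal Python 3 sketch. It assumes the metadata is a plain JSON dictionary; the "ctime"/"mtime" keys, the helper names, and the base64 encoding of the raw bytes are hypothetical choices made for this example, not Tahoe's actual format.

```python
import base64
import json

# Hypothetical metadata as an existing client might have written it.
old_metadata = json.dumps({"ctime": 1241222400, "mtime": 1241222400})

def add_original_bytes(metadata_json, raw_name):
    """Return new metadata JSON with the raw filename bytes under a new key."""
    metadata = json.loads(metadata_json)
    # JSON cannot carry raw bytes, so this sketch base64-encodes them
    # (an assumption of the example, not something the design specifies).
    metadata["original_bytes"] = base64.b64encode(raw_name).decode("ascii")
    return json.dumps(metadata)

def old_client_read(metadata_json):
    """Model an existing client: it only looks at the keys it knows about."""
    metadata = json.loads(metadata_json)
    return {k: metadata[k] for k in ("ctime", "mtime") if k in metadata}

new_metadata = add_original_bytes(old_metadata, b"caf\xe9.txt")
```

An old client sees the same values before and after the new key is added, while a newer client can recover the raw bytes from the extra entry.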
But if you don't do that, then, I still don't see what purpose your requirements serve. If I have two systems: one with a UTF-8 locale, and one with a Latin-1 locale, why should transmitting filenames from system 1 to system 2 through tahoe preserve the raw bytes, but doing the reverse *not* preserve the raw bytes? (all byte-sequences are valid in latin-1, remember, so they'll all decode into unicode without error, and then be reencoded in utf-8...). This seems rather a useless behavior to me.
I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself. Hopefully the following will help.

There are two different things stored in Tahoe for each directory entry: the filename and the metadata.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.

Requirement 1 (unicode): Each filename that you see needs to be valid unicode (it is stored internally in utf-8). This eliminates utf-8b and PEP 383 from being directly applicable to the filename part, although perhaps they could be useful for the metadata part (about which more below).

Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory, if that bytestring is the valid encoding of some string in your stated locale, then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, because it could be that the locale was wrong for this byte string and the resulting successfully-decoded unicode name is gibberish. This is especially acute if the locale is an 8-bit encoding such as latin-1 or windows-1252. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way? It seems like we're not going to do better than requirement 2 (faithful if unicode).

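The caveat in requirement 2 can be seen with a few lines of Python 3. This is only an illustrative sketch: the sample name and the helper decodes_cleanly() are made up for the example.

```python
# A filename that was really written as UTF-8 ...
raw = "caf\u00e9".encode("utf-8")          # the bytes b'caf\xc3\xa9'

# ... still decodes "successfully" under a latin-1 locale, because every
# byte value is a valid latin-1 character. The result is gibberish.
as_latin1 = raw.decode("latin-1", "strict")

def decodes_cleanly(name_bytes, encoding):
    """Return True if name_bytes is a valid encoding in the given codec."""
    try:
        name_bytes.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True

every_byte = bytes(range(256))
latin1_accepts_everything = decodes_cleanly(every_byte, "latin-1")

# By contrast, a latin-1 encoded name is rejected by a strict UTF-8 decode.
utf8_accepts_latin1_name = decodes_cleanly(b"caf\xe9", "utf-8")
```

Here as_latin1 is a six-character string with two wrong characters in place of the accented letter, and nothing in the decoding step signals that the locale was the wrong one.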
Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility). I have heard some developers say that they don't want to support this requirement and would rather tell the users to fix their filenames before they can back up or share those files through Tahoe. On the other hand, users have said that they require this and they are not going to go mucking about with all their filenames just so that they can use my backup and filesharing tool.

Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file. Therefore these three requirements imply that we have to detect such collisions and deal with them somehow. (Thanks to Martin v. Löwis for reminding me of this.)

Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"): Suppose you have a directory with some files with Japanese names, encoded using shift-jis, and some files with Russian names, encoded using koi8-r. Suppose your locale is set to shift-jis, and then you do "tahoe cp -r myfiles/ tahoe:". Then suppose you or someone else does "tahoe cp -r tahoe: copy_of_myfiles/". The "round-tripping" feature is that the files with Russian names that did not accidentally decode cleanly with shift-jis still have the same bytes in their names as they did in the original myfiles directory.

As I write this, I am becoming skeptical of this requirement (faithful bytes if not unicode, a.k.a. "round-tripping"), thanks in part to criticism from James Knight, MvL, Thomas Breuel, and others. One reason to be skeptical is that about a third of the Russian files will happen to decode cleanly as shift-jis anyway, and will therefore come out as something entirely different if the target filesystem's encoding is something other than shift-jis. But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/". So I'm ready to reject this one.

Now about the "metadata" part, which is separate from the filename itself. I have another requirement:

Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. For example, if a bytestring decodes cleanly with the locale's suggested encoding, and we use the resulting unicode as the filename, then we also store the original byte string in the metadata since we don't know if the locale's suggested encoding was good. This allows the later invention of a tool which shows the user what the filename would have been with other encodings and lets the user choose one that makes sense. It is important to note that this does not impose any requirement on the *filename* itself -- all such information can be stored in the metadata.

Okay, in light of the above four requirements and the rejection of #4, I hereby propose to change from the previous Tahoe design [2] to the following:

To copy an entry from a local filesystem into Tahoe:

1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata.

2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata.

2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!

3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry.

To copy an entry from Tahoe into a local filesystem:

Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.

Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with? I'm sure that it can help with the use case of "I'm doing os.listdir() and then I'm going to turn around and use the resulting unicode objects on the same local filesystem in the same Python process".
I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.

Does that sound right to you? Perhaps this could be documented somehow to help other programmers along the way.

Thanks very much for your help, everyone.

Regards,

Zooko

[1] http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...
[2] http://allmydata.org/trac/tahoe/ticket/534#comment:47
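
For readers who prefer code, the copy-in procedure proposed above (steps 1 to 3) might be sketched roughly as follows in Python 3. This is an illustration under stated assumptions, not Tahoe code: the function names are invented, and a directory is modelled as a plain dict mapping each unicode filename to its metadata dict.

```python
import sys
import unicodedata

def name_from_unicode(filename):
    """Step 1 (Windows/Mac): the name is already unicode, so just normalize."""
    return unicodedata.normalize("NFC", filename), {}

def name_from_bytes(original_bytes, alleged_encoding=None):
    """Step 2 (Linux/Solaris): return (filename, metadata) for a byte name."""
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    metadata = {"original_bytes": original_bytes}
    try:
        name = original_bytes.decode(alleged_encoding, "strict")
    except UnicodeDecodeError:
        # 2.b: fall back to latin-1 (mojibake), unnormalized, and flag it.
        metadata["failed_decode"] = True
        return original_bytes.decode("latin-1", "strict"), metadata
    # 2.a: clean decode, so normalize and leave the flag out.
    return unicodedata.normalize("NFC", name), metadata

def add_entry(directory, filename, metadata):
    """Step 3: store the entry, renaming it if it collides. Returns the name used."""
    new_flag = metadata.get("failed_decode", False)
    existing = directory.get(filename)
    name = filename
    if existing is not None and existing.get("failed_decode", False) != new_flag:
        # Flags differ: two different names produced the same unicode string.
        counter = 1
        name = "%s~%d" % (filename, counter)
        while name in directory:
            counter += 1
            name = "%s~%d" % (filename, counter)
    # Flags equal (or no existing entry): plain store, overwriting if present.
    directory[name] = metadata
    return name
```

For example, with an alleged encoding of utf-8, the byte names b"caf\xc3\xa9" (valid utf-8) and b"caf\xe9" (invalid utf-8, decoded as latin-1) both yield the same unicode string; the second one is stored with "~1" appended because its failed_decode flag differs from the first entry's.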