Thank you for sharing your extensive knowledge of these issues, SJT.

On Sun, May 3, 2009 at 3:32 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Zooko O'Whielacronx writes:
However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards- compatibility, and already has lots of data, lots of users, and programmers building on top of it.
Cool!
Thanks! Actually, yes, it is extremely cool that it really does this encryption, erasure-encoding, capability-based access control, and decentralized topology, all in a fully functional, stable system. If you're interested in such stuff then you should definitely check it out!
Question: is there a way to negotiate versions, or better yet, features?
For the peer-to-peer protocol there is, but the persistent storage is an inherently one-way communication. A Tahoe client writes down information, and at a later point a Tahoe client, possibly of a different version, reads it. There is no way for the original writer to ask what versions or features the readers may eventually have. But the writer can write down optional information which will be invisible to readers that don't know to look for it, by adding it into the "metadata" dictionary. For example:

http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...

renders the directory contents into json and results in this:

    "r\u00e9sum\u00e9.html": [ "filenode", {
        "mutable": false,
        "verify_uri": "URI:CHK-Verifier:63y4b5bziddi73jc6cmyngyqdq:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "metadata": {
            "ctime": 1241365319.0695441,
            "mtime": 1241365319.0695441
        },
        "ro_uri": "URI:CHK:no2l46woyeri6xmhcrhhomgr5a:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "size": 8328
    } ],

A new version of Tahoe writing entries like this is constrained to make the primary key (the filename) be a valid unicode string (if it wants older Tahoe clients to be able to read the directory at all). However, it is not constrained about what new keys it may add to the "metadata" dict, which is where we propose to add the "failed_decode" flag and the "original_bytes".
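To make that extensibility point concrete, here is a minimal sketch (not Tahoe's real API; only the two proposed key names are from the proposal, everything else is made up):

    # Minimal sketch of the optional-keys idea; not Tahoe's real API.
    metadata = {"ctime": 1241365319.0695441, "mtime": 1241365319.0695441}
    # A newer writer adds the proposed keys; an older reader never looks
    # for them, so they are invisible to it:
    metadata["failed_decode"] = False
    metadata["original_bytes"] = None  # only populated when failed_decode is True
    # A newer reader defaults sanely when the keys are absent (old data):
    failed = metadata.get("failed_decode", False)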
Well, it's a high-dimensional problem. Keeping track of all the variables is hard.
Well put.
That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable.
Would that it were so! The possibility that PEP 383 could help me or other like me is why I am trying so hard to explain what kind of help I need. :-)
Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.
Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology.
Okay here's some more detail. There exists a Tahoe directory, the bytes of which are encrypted, erasure-coded, and spread out over multiple Tahoe servers. (To the servers it is utterly opaque, since it is encrypted with a symmetric encryption key that they don't have.) A Tahoe client has the decryption key and it recovers the cleartext bytes. (Note: the internal storage format is not the json encoding shown above -- it is a custom format -- the json format above is what is produced to be exported through the API, and it serves as a useful example for e-mail discussions.) Then for each bytestring childname in the directory it decodes it with utf-8 to get the unicode childname. Does that all make sense?
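(In code, that last decode step is just the following sketch; the variable names and the toy directory contents are mine:)

    # Hypothetical decrypted directory contents: bytestring childname -> entry.
    directory_entries = {b'r\xc3\xa9sum\xc3\xa9.html': ["filenode", {"size": 8328}]}
    for childname_bytes in directory_entries:
        childname = childname_bytes.decode('utf-8')  # u'r\xe9sum\xe9.html'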
Requirement 1 (unicode): Each filename that you see needs to be valid unicode
What does "see" mean? In directory listings?
Yes: either with "tahoe ls", with a FUSE plugin, or with the web UI. Remove the trailing "?t=json" from the URL above to see an example.
Under what circumstances, if any, can what I see be different from what I get?
This is a good question! In the previous iteration of the Tahoe design, you could sometimes get something from "tahoe cp" which was different from what you saw with "tahoe ls". In the current design -- http://allmydata.org/trac/tahoe/ticket/534#comment:66 -- this is no longer the case, because we abandoned the requirement to have "round-trip fidelity of bytes".
Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory,
My local myfiles directory, or my Tahoe myfiles directory?
The local one.
if that bytestring is the valid encoding of some string in your stated locale,
Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right?
Yes, and yes.
Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility).
That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....)
Perhaps windows-1252 would be a better codec for this purpose? However, it would be clearer for the purposes of this discussion, and perhaps also for actual users of Tahoe, if instead of decoding with windows-1252 in order to get a mojibake name, Tahoe simply generated a name like "badly_encoded_filename_#1". Let's run with that: for clarity, assume that the arbitrary unicode filename that Tahoe comes up with is "badly_encoded_filename_#1". This doesn't change anything in this story. In particular, it doesn't change the fact that there might already be an entry in the directory which is named "badly_encoded_filename_#1" even though it was *not* a badly encoded filename, but a correctly encoded one.
Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file.
This is false with rather high probability, but you need some extra structure to deal with it. First, claim the Unicode private planes for Tahoe. [snip: long and intriguing instructions to perform unicode magic that I don't understand]
Wait, wait. What good would this do? The current plan is that if the filenames collide we increment the number at the end "#$NUMBER", if we are just naming them "badly_encoded_filename_#1", or that we append "~1" if we are naming them by mojibake. And the current plan is that the original bytes are saved in the metadata for future cyborg archaeologists. How would this complex unicode magic that I don't understand improve the current plan? Would it provide filenames that are more meaningful or useful to the users than the "badly_encoded_filename_#1" or the mojibake?
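(As a sketch of that current collision rule; the helper function is mine, not Tahoe code:)

    def unique_name(candidate, existing_names):
        """Append '~N' until the invented mojibake name is free; the
        'badly_encoded_filename_#N' variant would instead bump its
        trailing number."""
        if candidate not in existing_names:
            return candidate
        n = 1
        while u'%s~%d' % (candidate, n) in existing_names:
            n += 1
        return u'%s~%d' % (candidate, n)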
The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably.
There is no server, so to implement such a registry we would probably have to include a copy of the registry inside each (encrypted, erasure-encoded) directory.
Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):
PEP 383 gives you this, but you must store the encoding used for each such file name.
Well, at this point this has become an anti-requirement, because it causes the filename as displayed when examining the directory to be different from the filename that results when cp'ing the directory. Also, I don't see why PEP 383's implementation of this would be better than the previous iteration of the design, in which this was accomplished by simply storing the original bytes and then writing them back out again on demand, or the design before that, in which this was accomplished by mojibake'ing the bytes (decoding them with windows-1252) and setting a flag indicating that this had been done. I think I understand now that PEP 383 is better for the case where you can't store extra metadata (such as our failed_decode flag or our original_bytes), but you can ensure that the encoding that will be used later matches the one that was used for decoding now. Neither of these two criteria applies to Tahoe, and I suspect that neither of them applies to most uses other than the entirely local and non-persistent "for x in os.listdir(): open(x)".
But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/".
But as a requirement, that's incoherent. What you are "seeing" is Unicode, what it will write out is bytes.
In the new plan, we write the unicode filename out using Python's unicode filesystem APIs, so Python will attempt to encode it into the appropriate filesystem encoding (raising UnicodeEncodeError if it won't fit).
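(As a sketch, with an illustrative directory name:)

    import os, sys
    # Python encodes a unicode path with sys.getfilesystemencoding() on its
    # way into the OS call; a name that cannot be encoded raises
    # UnicodeEncodeError rather than silently writing different bytes.
    sys.getfilesystemencoding()     # e.g. 'UTF-8' under a UTF-8 locale
    os.mkdir(u'r\u00e9sum\u00e9s')  # raises UnicodeEncodeError if the name won't fit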
That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and *different* mojibake on the restore system (system ls, again).
Let's see... Tahoe is a user-space program and lets Python determine what the appropriate "sys.getfilesystemencoding()" is based on what the user's locale was at Python startup. So I don't think what you wrote above is correct. I think that in the first transition, from source system to Tahoe, either the name will be correctly transcoded (i.e., it looks the same to the user as long as the locale they are using to "look" at it, e.g. with "ls" or Nautilus or whatever, is the same as the locale that was set when their Python process started up), or else it will be undecodable under their current locale and will instead be replaced with either mojibake or "badly_encoded_filename_#1".

Hm, here is a good argument in favor of using mojibake to generate the arbitrary unicode name instead of naming it "badly_encoded_filename_#1": because that's probably what ls and Nautilus will show! Let me try that... Oh, cool: Nautilus and GNU ls both replace invalid chars with U+FFFD (like the 'replace' error handler does in Python's decode()) and append " (invalid encoding)" to the end. That sounds like an even better way to handle it than either mojibake or "badly_encoded_filename_#1", and it also means that it will look the same in Tahoe as it does in GNU ls and Nautilus. Excellent.

On the next transition, from Tahoe to system, Tahoe uses the Python unicode API, which will attempt to encode the unicode filename into the local filesystem encoding and raise UnicodeEncodeError if it can't.
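(Going back to the " (invalid encoding)" discovery for a moment, here is what that GNU-ls-like display looks like as a sketch; the byte string and the suffix are illustrative:)

    raw = b'r\xe9sum\xe9.html'             # latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'replace')  # u'r\ufffdsum\ufffd.html', one U+FFFD per bad byte here
    display = name + u' (invalid encoding)'  # imitating what Nautilus and GNU ls show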
Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. ...

UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use.
It isn't -- Tahoe doesn't run on Python 3. Also, Tahoe is increasingly interoperating with tools written in completely different languages. It is much easier for me to tell all of those programmers (in my documentation) that the filename slot holds the (normal, valid, standard) unicode and the metadata slot holds the bytes, than to tell them about utf-8b (which is not even implemented in their tools: JavaScript, JSON, C#, C, and Ruby). I imagine that it would be a deal-killer for many or most of them if I said they couldn't use Tahoe reliably without first implementing utf-8b for their toolsets.
1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', ...

NFD is probably better for fuzzy matching and display on legacy terminals.
I don't know anything about them, other than that Macintosh uses NFD and everything else uses NFC. Should I specify NFD? What are these "legacy terminals" of which you speak? Will NFD make it look better when I cat it to my vt102? (Just kidding -- I don't have one.)
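(For concreteness, here is what the two normal forms do to an example name, so we're talking about the same thing:)

    import unicodedata
    # NFC composes: u'\u00e9' is one code point. NFD decomposes: 'e' plus
    # a combining acute accent. Same rendered text, different code points.
    nfc = unicodedata.normalize('NFC', u'r\u00e9sum\u00e9')
    nfd = unicodedata.normalize('NFD', u'r\u00e9sum\u00e9')
    assert nfc != nfd and len(nfd) == len(nfc) + 2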
Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always.
Right -- got it.
2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!
Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct.
Ah. What is the Japanese word for "word with some characters right and other characters mojibake!"? :-)
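(As code, that fallback step comes out something like this sketch; the function name is mine, and the strict first-pass decode with the alleged_encoding is inferred from the surrounding discussion:)

    def decode_childname(raw_bytes, alleged_encoding):
        """Return (unicode filename, failed_decode flag)."""
        try:
            # First pass: strict decode with the alleged (locale) encoding.
            return raw_bytes.decode(alleged_encoding, 'strict'), False
        except UnicodeDecodeError:
            # Step 2.b: mojibake fallback. Python's latin-1 codec maps all
            # 256 byte values, so this decode cannot fail.
            return raw_bytes.decode('latin-1', 'strict'), True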
Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?
By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them.
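(Concretely, the PEP's mechanism -- the error handler is spelled 'surrogateescape' in Python 3.1 -- round-trips like this sketch; the example bytes are illustrative:)

    # Undecodable bytes become lone surrogates, which invert exactly --
    # but only when the same handler is used on the way back out.
    raw = b'r\xe9sum\xe9.html'                     # latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'surrogateescape')  # 'r\udce9sum\udce9.html'
    assert name.encode('utf-8', 'surrogateescape') == raw
    # name.encode('utf-8') raises UnicodeEncodeError because of the lone
    # surrogates: the round trip requires the same handler on both legs.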
So, it is invertible only if you can assume that the same encoding will be used on the second leg of the trip, right? Which you can do by writing down what encoding was used on this leg of the trip and forcing it to use the same encoding on the other leg. Except that we can't force that to happen on Windows at all, as far as I understand, which is a show-stopper right there.

But even if we could, this would require us to write down a bit of information and transmit it to the other side and use it to do the encoding. And if we are going to do that, why don't we just transmit the original bytes? Okay, maybe because that would roughly double the amount of data we have to transmit, and maybe we are stingy. But if we are stingy, we could instead transmit a single added bit to indicate whether the name is normal or mojibake, and then use windows-1252 to stuff the bytes into the name. One of those options has the advantage of simplicity to the programmer ("There is the unicode, and there are the bytes."), and the other has the advantage of good compression. Both of them have the advantage that nobody involved has to understand and possibly implement a non-standard unicode hack.

I'm trying not to be too pushy about this (heaven knows I've been completely wrong about things a dozen times in a row so far in this design process), but as far as I can understand it, PEP 383 can be used only when you can force the same encoding on both sides (the PEP says that encoding "only 'works' if the data get converted back to bytes with the python-escape error handler also"). That happens naturally when both sides are in the same Python process, so PEP 383 naturally looks good in that context. However, if the filenames are going to be stored persistently or transmitted over a network, then it seems simpler, easier, and more portable to use some other method than PEP 383 to handle badly encoded names.
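(A sketch of that single-added-bit scheme; the names are mine, and I substitute latin-1 for windows-1252 here because Python's latin-1 codec maps all 256 byte values, so the stuffing leg cannot fail:)

    def pack_name(name_bytes):
        """Sender side: return (unicode name, is_mojibake bit)."""
        try:
            return name_bytes.decode('utf-8'), False
        except UnicodeDecodeError:
            return name_bytes.decode('latin-1'), True  # stuffed bytes

    def unpack_name(name, is_mojibake):
        """Receiver side: the bit tells us which encoding recovers the bytes."""
        return name.encode('latin-1' if is_mojibake else 'utf-8')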
I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.
No more than any other system for giving a canonical Unicode spelling to the results of an OS call.
I think PEP 383 yields more surprises than the alternative of decoding with error handler 'replace' and then including the original bytes along with the unicode. During the course of this process I have also considered two other mechanisms instead of decoding with error handler 'replace': mojibake using windows-1252, or a simple placeholder like "badly_encoded_filename_#1". Any of these three seems less surprising than, and similarly functional to, PEP 383.

I have to admit that they are not as elegant. Utf-8b is a really neat hack, and MvL's generalization of it to all unicode encodings is, too. I'm still being surprised by it after trying to understand it for many days now. For example, what happens if you decode a filename with PEP 383, store that filename somewhere, and then later try to write a file under that name on Windows? If it only 'works' if the data get converted back to bytes with the python-escape error handler, then can you use the python-escape error handler when trying to, say, create a new file on Windows?

Regards,

Zooko