Thank you for sharing your extensive knowledge of these issues, SJT.

On Sun, May 3, 2009 at 3:32 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Zooko O'Whielacronx writes:
However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards- compatibility, and already has lots of data, lots of users, and programmers building on top of it.
Cool!
Thanks! Actually, yes, it is extremely cool that it really does this encryption, erasure-encoding, capability-based access control, and decentralized topology, all in a fully functional, stable system. If you're interested in such stuff then you should definitely check it out!
Question: is there a way to negotiate versions, or better yet, features?
For the peer-to-peer protocol there is, but the persistent storage is an inherently one-way communication. A Tahoe client writes down information, and at a later point a Tahoe client, possibly of a different version, reads it. There is no way for the original writer to ask what versions or features the readers may eventually have. But the writer can write down optional information which will be invisible to readers that don't know to look for it, by adding it into the "metadata" dictionary. For example:

http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...

renders the directory contents into json and results in this:

    "r\u00e9sum\u00e9.html": [ "filenode", {
        "mutable": false,
        "verify_uri": "URI:CHK-Verifier:63y4b5bziddi73jc6cmyngyqdq:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "metadata": {
            "ctime": 1241365319.0695441,
            "mtime": 1241365319.0695441
        },
        "ro_uri": "URI:CHK:no2l46woyeri6xmhcrhhomgr5a:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "size": 8328
    } ],

A new version of Tahoe writing entries like this is constrained to make the primary key (the filename) be a valid unicode string (if it wants older Tahoe clients to be able to read the directory at all). However, it is not constrained about what new keys it may add to the "metadata" dict, which is where we propose to add the "failed_decode" flag and the "original_bytes".
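To make that extensibility point concrete, here is a minimal sketch (not Tahoe's real API; only the two proposed key names are from the proposal, everything else is made up):

    # Minimal sketch of the optional-keys idea; not Tahoe's real API.
    metadata = {"ctime": 1241365319.0695441, "mtime": 1241365319.0695441}
    # A newer writer adds the proposed keys; an older reader never looks
    # for them, so they are invisible to it:
    metadata["failed_decode"] = False
    metadata["original_bytes"] = None  # only populated when failed_decode is True
    # A newer reader defaults sanely when the keys are absent (old data):
    failed = metadata.get("failed_decode", False)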
Well, it's a high-dimensional problem. Keeping track of all the variables is hard.
Well put.
That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable.
Would that it were so! The possibility that PEP 383 could help me or other like me is why I am trying so hard to explain what kind of help I need. :-)
Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.
Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology.
Okay here's some more detail. There exists a Tahoe directory, the bytes of which are encrypted, erasure-coded, and spread out over multiple Tahoe servers. (To the servers it is utterly opaque, since it is encrypted with a symmetric encryption key that they don't have.) A Tahoe client has the decryption key and it recovers the cleartext bytes. (Note: the internal storage format is not the json encoding shown above -- it is a custom format -- the json format above is what is produced to be exported through the API, and it serves as a useful example for e-mail discussions.) Then for each bytestring childname in the directory it decodes it with utf-8 to get the unicode childname. Does that all make sense?
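(In code, that last decode step is just the following sketch; the variable names and the toy directory contents are mine:)

    # Hypothetical decrypted directory contents: bytestring childname -> entry.
    directory_entries = {b'r\xc3\xa9sum\xc3\xa9.html': ["filenode", {"size": 8328}]}
    for childname_bytes in directory_entries:
        childname = childname_bytes.decode('utf-8')  # u'r\xe9sum\xe9.html'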
Requirement 1 (unicode): Each filename that you see needs to be valid unicode
What does "see" mean? In directory listings?
Yes: either with "tahoe ls", with a FUSE plugin, or with the web UI. Remove the trailing "?t=json" from the URL above to see an example.
Under what circumstances, if any, can what I see be different from what I get?
This is a good question! In the previous iteration of the Tahoe design, you could sometimes get something from "tahoe cp" which was different from what you saw with "tahoe ls". In the current design -- http://allmydata.org/trac/tahoe/ticket/534#comment:66 -- this is no longer the case, because we abandoned the requirement to have "round-trip fidelity of bytes".
Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory,
My local myfiles directory, or my Tahoe myfiles directory?
The local one.
if that bytestring is the valid encoding of some string in your stated locale,
Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right?
Yes, and yes.
Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility).
That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....)
Perhaps windows-1252 would be a better codec for this purpose? However, it would be clearer for the purposes of this discussion, and perhaps also for actual users of Tahoe, if instead of decoding with windows-1252 in order to get a mojibake name, Tahoe simply generated a name like "badly_encoded_filename_#1". Let's run with that: for clarity, assume that the arbitrary unicode filename that Tahoe comes up with is "badly_encoded_filename_#1". This doesn't change anything in this story. In particular, it doesn't change the fact that there might already be an entry in the directory which is named "badly_encoded_filename_#1" even though it was *not* a badly encoded filename, but a correctly encoded one.
Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file.
This is false with rather high probability, but you need some extra structure to deal with it. First, claim the Unicode private planes for Tahoe. [snip: long and intriguing instructions to perform unicode magic that I don't understand]
Wait, wait. What good would this do? The current plan is that if the filenames collide we increment the number at the end "#$NUMBER", if we are just naming them "badly_encoded_filename_#1", or that we append "~1" if we are naming them by mojibake. And the current plan is that the original bytes are saved in the metadata for future cyborg archaeologists. How would this complex unicode magic that I don't understand improve the current plan? Would it provide filenames that are more meaningful or useful to the users than the "badly_encoded_filename_#1" or the mojibake?
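(As a sketch of that current collision rule; the helper function is mine, not Tahoe code:)

    def unique_name(candidate, existing_names):
        """Append '~N' until the invented mojibake name is free; the
        'badly_encoded_filename_#N' variant would instead bump its
        trailing number."""
        if candidate not in existing_names:
            return candidate
        n = 1
        while u'%s~%d' % (candidate, n) in existing_names:
            n += 1
        return u'%s~%d' % (candidate, n)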
The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably.
There is no server, so to implement such a registry we would probably have to include a copy of the registry inside each (encrypted, erasure-encoded) directory.
Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):
PEP 383 gives you this, but you must store the encoding used for each such file name.
Well, at this point this has become an anti-requirement, because it causes the filename as displayed when examining the directory to be different from the filename that results when cp'ing the directory. Also, I don't see why PEP 383's implementation of this would be better than the previous iteration of the design, in which this was accomplished by simply storing the original bytes and then writing them back out again on demand, or the design before that, in which this was accomplished by mojibake'ing the bytes (decoding them with windows-1252) and setting a flag indicating that this had been done. I think I understand now that PEP 383 is better for the case where you can't store extra metadata (such as our failed_decode flag or our original_bytes), but you can ensure that the encoding that will be used later matches the one that was used for decoding now. Neither of these two criteria applies to Tahoe, and I suspect that neither of them applies to most uses other than the entirely local and non-persistent "for x in os.listdir(): open(x)".
But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/".
But as a requirement, that's incoherent. What you are "seeing" is Unicode, what it will write out is bytes.
In the new plan, we write the unicode filename out using Python's unicode filesystem APIs, so Python will attempt to encode it into the appropriate filesystem encoding (raising UnicodeEncodeError if it won't fit).
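(As a sketch, with an illustrative directory name:)

    import os, sys
    # Python encodes a unicode path with sys.getfilesystemencoding() on its
    # way into the OS call; a name that cannot be encoded raises
    # UnicodeEncodeError rather than silently writing different bytes.
    sys.getfilesystemencoding()     # e.g. 'UTF-8' under a UTF-8 locale
    os.mkdir(u'r\u00e9sum\u00e9s')  # raises UnicodeEncodeError if the name won't fit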
That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and *different* mojibake on the restore system (system ls, again).
Let's see... Tahoe is a user-space program and lets Python determine what the appropriate "sys.getfilesystemencoding()" is based on what the user's locale was at Python startup. So I don't think what you wrote above is correct. I think that in the first transition, from source system to Tahoe, either the name will be correctly transcoded (i.e., it looks the same to the user as long as the locale they are using to "look" at it, e.g. with "ls" or Nautilus or whatever, is the same as the locale that was set when their Python process started up), or else it will be undecodable under their current locale and will instead be replaced with either mojibake or "badly_encoded_filename_#1".

Hm, here is a good argument in favor of using mojibake to generate the arbitrary unicode name instead of naming it "badly_encoded_filename_#1": because that's probably what ls and Nautilus will show! Let me try that... Oh, cool: Nautilus and GNU ls both replace invalid chars with U+FFFD (like the 'replace' error handler does in Python's decode()) and append " (invalid encoding)" to the end. That sounds like an even better way to handle it than either mojibake or "badly_encoded_filename_#1", and it also means that it will look the same in Tahoe as it does in GNU ls and Nautilus. Excellent.

On the next transition, from Tahoe to system, Tahoe uses the Python unicode API, which will attempt to encode the unicode filename into the local filesystem encoding and raise UnicodeEncodeError if it can't.
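(Going back to the " (invalid encoding)" discovery for a moment, here is what that GNU-ls-like display looks like as a sketch; the byte string and the suffix are illustrative:)

    raw = b'r\xe9sum\xe9.html'             # latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'replace')  # u'r\ufffdsum\ufffd.html', one U+FFFD per bad byte here
    display = name + u' (invalid encoding)'  # imitating what Nautilus and GNU ls show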
Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. ...

UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use.
It isn't -- Tahoe doesn't run on Python 3. Also, Tahoe is increasingly interoperating with tools written in completely different languages. It is much easier for me to tell all of those programmers (in my documentation) that the filename slot holds the (normal, valid, standard) unicode and the metadata slot holds the bytes, than to tell them about utf-8b (which is not even implemented in their tools: JavaScript, JSON, C#, C, and Ruby). I imagine that it would be a deal-killer for many or most of them if I said they couldn't use Tahoe reliably without first implementing utf-8b for their toolsets.
1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', ...

NFD is probably better for fuzzy matching and display on legacy terminals.
I don't know anything about them, other than that Macintosh uses NFD and everything else uses NFC. Should I specify NFD? What are these "legacy terminals" of which you speak? Will NFD make it look better when I cat it to my vt102? (Just kidding -- I don't have one.)
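(For concreteness, here is what the two normal forms do to an example name, so we're talking about the same thing:)

    import unicodedata
    # NFC composes: u'\u00e9' is one code point. NFD decomposes: 'e' plus
    # a combining acute accent. Same rendered text, different code points.
    nfc = unicodedata.normalize('NFC', u'r\u00e9sum\u00e9')
    nfd = unicodedata.normalize('NFD', u'r\u00e9sum\u00e9')
    assert nfc != nfd and len(nfd) == len(nfc) + 2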
Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always.
Right -- got it.
2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!
Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct.
Ah. What is the Japanese word for "word with some characters right and other characters mojibake!"? :-)
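(As code, that fallback step comes out something like this sketch; the function name is mine, and the strict first-pass decode with the alleged_encoding is inferred from the surrounding discussion:)

    def decode_childname(raw_bytes, alleged_encoding):
        """Return (unicode filename, failed_decode flag)."""
        try:
            # First pass: strict decode with the alleged (locale) encoding.
            return raw_bytes.decode(alleged_encoding, 'strict'), False
        except UnicodeDecodeError:
            # Step 2.b: mojibake fallback. Python's latin-1 codec maps all
            # 256 byte values, so this decode cannot fail.
            return raw_bytes.decode('latin-1', 'strict'), True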
Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?
By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them.
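(Concretely, the PEP's mechanism -- the error handler is spelled 'surrogateescape' in Python 3.1 -- round-trips like this sketch; the example bytes are illustrative:)

    # Undecodable bytes become lone surrogates, which invert exactly --
    # but only when the same handler is used on the way back out.
    raw = b'r\xe9sum\xe9.html'                     # latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'surrogateescape')  # 'r\udce9sum\udce9.html'
    assert name.encode('utf-8', 'surrogateescape') == raw
    # name.encode('utf-8') raises UnicodeEncodeError because of the lone
    # surrogates: the round trip requires the same handler on both legs.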
So, it is invertible only if you can assume that the same encoding will be used on the second leg of the trip, right? Which you can do by writing down what encoding was used on this leg of the trip and forcing it to use the same encoding on the other leg. Except that we can't force that to happen on Windows at all, as far as I understand, which is a show-stopper right there.

But even if we could, this would require us to write down a bit of information and transmit it to the other side and use it to do the encoding. And if we are going to do that, why don't we just transmit the original bytes? Okay, maybe because that would roughly double the amount of data we have to transmit, and maybe we are stingy. But if we are stingy, we could instead transmit a single added bit to indicate whether the name is normal or mojibake, and then use windows-1252 to stuff the bytes into the name. One of those options has the advantage of simplicity to the programmer ("There is the unicode, and there are the bytes."), and the other has the advantage of good compression. Both of them have the advantage that nobody involved has to understand and possibly implement a non-standard unicode hack.

I'm trying not to be too pushy about this (heaven knows I've been completely wrong about things a dozen times in a row so far in this design process), but as far as I can understand it, PEP 383 can be used only when you can force the same encoding on both sides (the PEP says that encoding "only 'works' if the data get converted back to bytes with the python-escape error handler also"). That happens naturally when both sides are in the same Python process, so PEP 383 naturally looks good in that context. However, if the filenames are going to be stored persistently or transmitted over a network, then it seems simpler, easier, and more portable to use some other method than PEP 383 to handle badly encoded names.
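(A sketch of that single-added-bit scheme; the names are mine, and I substitute latin-1 for windows-1252 here because Python's latin-1 codec maps all 256 byte values, so the stuffing leg cannot fail:)

    def pack_name(name_bytes):
        """Sender side: return (unicode name, is_mojibake bit)."""
        try:
            return name_bytes.decode('utf-8'), False
        except UnicodeDecodeError:
            return name_bytes.decode('latin-1'), True  # stuffed bytes

    def unpack_name(name, is_mojibake):
        """Receiver side: the bit tells us which encoding recovers the bytes."""
        return name.encode('latin-1' if is_mojibake else 'utf-8')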
I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.
No more than any other system for giving a canonical Unicode spelling to the results of an OS call.
I think PEP 383 yields more surprises than the alternative of decoding with error handler 'replace' and then including the original bytes along with the unicode. During the course of this process I have also considered two other mechanisms instead of decoding with error handler 'replace': mojibake using windows-1252, or a simple placeholder like "badly_encoded_filename_#1". Any of these three seems less surprising than, and similarly functional to, PEP 383.

I have to admit that they are not as elegant. Utf-8b is a really neat hack, and MvL's generalization of it to all unicode encodings is, too. I'm still being surprised by it after trying to understand it for many days now. For example, what happens if you decode a filename with PEP 383, store that filename somewhere, and then later try to write a file under that name on Windows? If it only 'works' if the data get converted back to bytes with the python-escape error handler, then can you use the python-escape error handler when trying to, say, create a new file on Windows?

Regards,

Zooko