I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console:

    /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()
      self.window.show()

and then the widget contains three crossed boxes. wxpython (in its wxgtk version) behaves the same way. PyQt displays a single square box.

Regards,
Martin
On approximately 4/30/2009 1:48 AM, came the following characters from the keyboard of Martin v. Löwis:
I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console
    /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()
      self.window.show()
and then the widget contains three crossed boxes.
wxpython (in its wxgtk version) behaves the same way.
PyQt displays a single square box.
Interesting. Did you use a name with other characters? Were they displayed? Both before and after the surrogates?

Did you use one or three half surrogates, to produce the three crossed boxes?

Did you use one or three half surrogates, to produce the single square box?

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Did you use a name with other characters? Were they displayed? Both before and after the surrogates?
Yes, yes, and yes (IOW, I put the surrogate in the middle).
Did you use one or three half surrogates, to produce the three crossed boxes?
Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid.
Did you use one or three half surrogates, to produce the single square box?
Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8.

Regards,
Martin
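[For anyone following along, a short Python 2 session suggests where the three boxes come from; Python 2's UTF-8 codec encodes a lone surrogate as three bytes instead of raising:

    >>> u'\udc80'.encode('utf-8')
    '\xed\xb2\x80'

Pango rejects each of those three bytes as invalid UTF-8, hence one crossed box per byte; Qt, which presumably receives the string as 16-bit code units rather than as UTF-8, shows a single box for the single lone surrogate.]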
FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation.

On Thu, Apr 30, 2009 at 2:12 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Did you use a name with other characters? Were they displayed? Both before and after the surrogates?
Yes, yes, and yes (IOW, I put the surrogate in the middle).
Did you use one or three half surrogates, to produce the three crossed boxes?
Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid.
Did you use one or three half surrogates, to produce the single square box?
Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.

So far, it looks like PEP 383 doesn't provide both of these requirements, so I am going to have to continue working around the Python API even after PEP 383. In fact, it might actually increase the amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be changed by PEP 383. A new error handler is added, so .decode('utf-8', 'python-escape') performs the utf-8b decoding. Am I right so far?

Therefore if I have a string of bytes, I can attempt to decode it with 'strict', and if that fails I can set the flag showing that it was not a valid byte string in the expected encoding, and then I can invoke .decode('utf-8', 'python-escape') on it. So far, so good. (Note that I never want to do .decode(expected_encoding, 'python-escape') -- if it wasn't a valid bytestring in the expected_encoding, then I want to decode it with utf-8b, regardless of what the expected encoding was.) Anyway, I can use it like this:

    class FName:
        def __init__(self, name, failed_decode=False):
            self.name = name
            self.failed_decode = failed_decode

    def fs_to_unicode(bytes):
        try:
            return FName(bytes.decode(sys.getfilesystemencoding(), 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()? Uh-oh, the PEP says that on systems with locale 'utf-8', it will automatically be changed to 'utf-8b'. This means I can't reliably find out whether the entries in the directory *were* named with valid encodings in utf-8? That's not acceptable for my use case. I would have to refrain from using the unicode-oriented os.listdir() on POSIX, and instead do something like this:

    if platform.system() in ('Windows', 'Darwin'):
        def listdir(d):
            return [FName(n) for n in os.listdir(d)]
    elif platform.system() in ('Linux', 'SunOS'):
        def listdir(d):
            bytesd = d.encode(sys.getfilesystemencoding())
            return [fs_to_unicode(n) for n in os.listdir(bytesd)]
    else:
        raise NotImplementedError("Please classify platform.system() == %s "
                                  "as either unicode-safe or unicode-unsafe."
                                  % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when *decoding* as well as encoding, then I would have to change my fs_to_unicode() function to check for that and make sure to use strict utf-8 in the first attempt:

    def fs_to_unicode(bytes):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        try:
            return FName(bytes.decode(fse, 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was present? That would serve the same purpose as my "failed_decode" flag above, and would basically allow me to use the Python APIs directly and make all this work-around code disappear.

Failing that, I can't see any way to use os.listdir() in its unicode-oriented mode to satisfy Tahoe's requirements.

If you take the above code and then add the fact that you want to use the failed_decode flag when *encoding* the d argument to os.listdir(), then you get this code: [2].
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this:

    def listdir(d):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        ns = []
        for fn in os.listdir(d):
            bytes = fn.encode(fse, 'python-escape')
            try:
                ns.append(FName(bytes.decode(fse, 'strict')))
            except UnicodeDecodeError:
                ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                                failed_decode=True))
        return ns

(And I guess I could define listdir() like this only on the non-unicode-safe platforms, as above.)

However, that strikes me as even more horrible than the previous "listdir()" work-around, in part because it means decoding, re-encoding, and re-decoding every name, so I think I would stick with the previous version.

Oh, one more note: for Tahoe's purposes you can, in all of the code above, replace ".decode('utf-8', 'python-escape')" with ".decode('windows-1252')" and it works just as well. While UTF-8b seems like a really cool hack, and it would produce more legible results if utf-8-encoded strings were partially corrupted, I guess I should just use 'windows-1252', which is already implemented in Python 2 (as well as in all other software in the world).

I guess this means that PEP 383, which I have approved of and liked so far in this discussion, would actually not help Tahoe at all and would in fact harm Tahoe -- I would have to remember to detect and work around the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3.

If anyone else has a concrete, real use case which would be helped by PEP 383, I would like to hear about it. Perhaps Tahoe can learn something from it.

Oh, if this PEP could be extended to add a flag to each unicode object indicating whether it was created with the python-escape handler or not, then it would be useful to me.

Regards,
Zooko

[1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
[2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py
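[One caveat worth checking before leaning on the windows-1252 substitution above: Python's 'windows-1252' codec, unlike 'latin-1', leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) unmapped, so a strict decode of arbitrary bytes can still raise. A quick Python 3 sketch of the difference:

    # 'latin-1' maps every byte value, so this round trip never fails:
    all_bytes = bytes(range(256))
    assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

    # Python's 'windows-1252' codec leaves e.g. 0x81 undefined:
    try:
        b'\x81'.decode('windows-1252')
    except UnicodeDecodeError:
        pass  # strict cp1252 decoding can fail on arbitrary bytes
]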
Zooko O'Whielacronx wrote:
[snip...] Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was present? That would serve the same purpose as my "failed_decode" flag above, and would basically allow me to use the Python APIs directly and make all this work-around code disappear.
Failing that, I can't see any way to use the os.listdir() in its unicode-oriented mode to satisfy Tahoe's requirements.
If you take the above code and then add the fact that you want to use the failed_decode flag when *encoding* the d argument to os.listdir(), then you get this code: [2].
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this:
    def listdir(d):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        ns = []
        for fn in os.listdir(d):
            bytes = fn.encode(fse, 'python-escape')
            try:
                ns.append(FName(bytes.decode(fse, 'strict')))
            except UnicodeDecodeError:
                ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                                failed_decode=True))
        return ns
(And I guess I could define listdir() like this only on the non-unicode-safe platforms, as above.)
However, that strikes me as even more horrible than the previous "listdir()" work-around, in part because it means decoding, re-encoding, and re-decoding every name, so I think I would stick with the previous version.
The current unicode mode would skip the filenames you are interested in (those that fail to decode correctly) - so you would have been forced to use the bytes mode. If you need access to the original bytes then you should continue to do this. PEP-383 is entirely neutral for your use case as far as I can see.

Michael
Oh, one more note: for Tahoe's purposes you can, in all of the code above, replace ".decode('utf-8', 'python-escape')" with ".decode('windows-1252')" and it works just as well. While UTF-8b seems like a really cool hack, and it would produce more legible results if utf-8-encoded strings were partially corrupted, I guess I should just use 'windows-1252' which is already implemented in Python 2 (as well as in all other software in the world).
I guess this means that PEP 383, which I have approved of and liked so far in this discussion, would actually not help Tahoe at all and would in fact harm Tahoe -- I would have to remember to detect and work-around the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3.
If anyone else has a concrete, real use case which would be helped by PEP 383, I would like to hear about it. Perhaps Tahoe can learn something from it.
Oh, if this PEP could be extended to add a flag to each unicode object indicating whether it was created with the python-escape handler or not, then it would be useful to me.
Regards,
Zooko
[1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
[2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py
-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
On Thu, 30 Apr 2009 at 23:44, Zooko O'Whielacronx wrote:
Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was present? That
Unless I'm misunderstanding something, couldn't you implement what you need by looking in a given string for the half surrogates? If you find one, you have a string that python-escape modified; if you don't, it didn't.

What does Tahoe do on Windows when it gets a filename that is not valid Unicode? You might not even have to conditionalize the above code on platform (ie: instead you have a generalized is_valid_unicode test function that you always use).

--David
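[For example, a sketch of such a test; the helper name is hypothetical, and the range U+DC80..U+DCFF is the one PEP 383 assigns to undecodable bytes:

    def is_valid_unicode(name):
        # Names produced by the 'python-escape' handler contain lone
        # low surrogates in U+DC80..U+DCFF; a clean decode never does.
        return not any(u'\udc80' <= ch <= u'\udcff' for ch in name)
]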
Following up on my own post to correct a major error:

On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zookog@gmail.com> wrote:
Folks:
My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.
Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this:

Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.

That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:

Requirement 1: the byte string was valid in the encoding of the source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).

Requirement 2: the byte string was not valid in the encoding of the source system, in which case the bytes are faithfully transmitted (i.e. the bytes that finally land on the target system are the same as the bytes that originated in the source system).

Now I finally understand how fiendishly clever MvL's PEP 383 generalization of Markus Kuhn's utf-8b trick is! The only thing necessary to achieve both of those requirements above is that the 'python-escape' error handler is used on the target system .encode() as well as on the source system .decode()!

Well, I'm going to have to let this sink in and maybe write some code to see if I really understand it.

But if this is right, then I can do away with some of the mechanism that I've built up, and instead:

Backport PEP 383 to Python 2.

And, document the PEP 383 trick in some generic, widely respected format such as an Internet Draft so that I can explain to other users of the Tahoe data (many of whom use other languages than Python) what they have to do if they find invalid utf-8 in the data. Oh good, I just realized that Tahoe emits only utf-8, so all I have to do is point them to the utf-8b documents (such as they are) and explain that to read filenames produced by Tahoe they have to implement utf-8b. That's really good that they don't have to implement MvL's generalization of that trick to other encodings, since utf-8b is already understood by some folks.

Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes? That is my Requirement 2.

Regards,
Zooko
Zooko O'Whielacronx wrote:
Following-up to my own post to correct a major error:
On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zookog@gmail.com> wrote:
Folks:
My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.
Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this:
Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.
That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:
Requirement 1: the byte string was valid in the encoding of the source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).
Requirement 2: the byte string was not valid in the encoding of source system, in which case the bytes are faithfully transmitted (i.e. the bytes that finally land on the target system are the same as the bytes that originated in the source system).
Now I finally understand how fiendishly clever MvL's PEP 383 generalization of Markus Kuhn's utf-8b trick is! The only thing necessary to achieve both of those requirements above is that the 'python-escape' error handler is used on the target system .encode() as well as on the source system .decode()!
Well, I'm going to have to let this sink in and maybe write some code to see if I really understand it.
But if this is right, then I can do away with some of the mechanism that I've built up, and instead:
Backport PEP 383 to Python 2.
And, document the PEP 383 trick in some generic, widely respected format such as an Internet Draft so that I can explain to other users of the Tahoe data (many of whom use other languages than Python) what they have to do if they find invalid utf-8 in the data. Oh good, I just realized that Tahoe emits only utf-8, so all I have to do is point them to the utf-8b documents (such as they are) and explain that to read filenames produced by Tahoe they have to implement utf-8b. That's really good that they don't have to implement MvL's generalization of that trick to other encodings, since utf-8b is already understood by some folks.
Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes ? That is my Requirement 2.
No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8', 'python-escape') == srcbytes. The encodings on both ends need to be the same. For example:

    >>> b'\x80'.decode('windows-1252')
    u'\u20ac'
    >>> u'\u20ac'.encode('utf-8')
    '\xe2\x82\xac'

Currently:

    >>> b'\x80'.decode('utf-8')
    Traceback (most recent call last):
      File "<pyshell#7>", line 1, in <module>
        b'\x80'.decode('utf-8')
      File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

But under this PEP:

    >>> b'\x80'.decode('utf-8', 'python-escape')
    u'\udc80'
    >>> u'\udc80'.encode('utf-8', 'python-escape')
    '\x80'
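[A round-trip sketch of the same point, using 'surrogateescape', the name under which the PEP's 'python-escape' handler eventually shipped in Python 3.1; decode and encode must use the same encoding for the bytes to survive:

    # Python 3: invalid-UTF-8 bytes survive a decode/encode round trip
    # when the same encoding and error handler are used on both sides.
    data = b'\xff\xfeabc'
    name = data.decode('utf-8', 'surrogateescape')   # u'\udcff\udcfeabc'
    assert name.encode('utf-8', 'surrogateescape') == data
]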
Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this:
Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.
I don't understand this requirement very well, in particular not the "faithfully" part.
That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:
What do you mean by "use it"? Things like opening files? How does that work? In general, a file name valid on one system is invalid on a different system - or, at least, refers to a different file over there. This is independent of encodings.
Requirement 1: the byte string was valid in the encoding of the source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).
In all your descriptions, I'm puzzled as to where exactly you get the source bytes from. If you use the PEP 383 interfaces, you will start with character strings, not byte strings, always.
Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes ?
I think you mixed up bytes and unicode here: if srcbytes is indeed a bytes object, then you can't apply .encode to it.

Regards,
Martin
On 01May2009 18:38, Martin v. Löwis <martin@v.loewis.de> wrote:
| > Okay, I am wrong about this. Having a flag to remember whether I had
| > to fall back to the utf-8b trick is one method to implement my
| > requirement, but my actual requirement is this:
| >
| > Requirement: either the unicode string or the bytes are faithfully
| > transmitted from one system to another.
|
| I don't understand this requirement very well, in particular not
| the "faithfully" part.
|
| > That is: if you read a filename from the filesystem, and transmit
| > that filename to another system and use it, then there are two cases:
|
| What do you mean by "use it"? Things like opening files? How does
| that work? In general, a file name valid on one system is invalid
| on a different system - or, at least, refers to a different file
| over there. This is independent of encodings.

I think he's doing a file transfer of some kind and needs to preserve the names. Or I would guess the two systems are not both UNIX, or there is some subtlety not yet mentioned, or he'd just use tar or some other byte-level UNIX tool.

| > Requirement 1: the byte string was valid in the encoding of the
| > source system, in which case the unicode name is faithfully
| > transmitted (i.e. the bytes that finally land on the target system
| > are the result of
| > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).
|
| In all your descriptions, I'm puzzled as to where exactly you get
| the source bytes from. If you use the PEP 383 interfaces, you will
| start with character strings, not byte strings, always.

But if both systems do present POSIX layers, it's bytes underneath and the system tools will natively use bytes. He wants to ensure that he can read using python, using listdir, and that elsewhere, when he is writing using python, the bytes layer is preserved. I think. In fact it sounds like he may be translating valid unicode and carefully not altering byte names that don't decode. That in turn implies that the codec may be different on the two systems.

| > Okay, I find it surprisingly easy to make subtle errors in this
| > encoding stuff, so please let me know if you spot one. Is it true
| > that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
| > 'python-escape') will always produce srcbytes ?
|
| I think you mixed up bytes and unicode here: if srcbytes is indeed
| a bytes object, then you can't apply .encode to it.

I think he has encode/decode swapped (I did too back in the uber-thread; if your mapping is one-to-one the distinction is almost arbitrary). However, his assertion/hope is true only if srcencoding == 'utf-8'. The PEP itself says that it works if the decode and encode use the same mapping.
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

"How do you know I'm Mad?" asked Alice.
"You must be," said the Cat, "or you wouldn't have come here."
Folks:

Being new to the use of gmail, I accidentally sent the following only to MvL and not to the list. He promptly replied with a helpful counterexample showing that my design can suffer collisions. :-)

Regards,
Zooko

On Fri, May 1, 2009 at 10:38 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.
I don't understand this requirement very well, in particular not the "faithfully" part.
That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:
What do you mean by "use it"? Things like opening files? How does that work? In general, a file name valid on one system is invalid on a different system - or, at least, refers to a different file over there. This is independent of encodings.
Tahoe is a backup and filesharing program, so you might, for example, execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of your "Motörhead" directory to your Tahoe filesystem. Later you or a friend might execute "tahoe cp -r tahoe:Motörhead ." to copy everything from that directory within your Tahoe filesystem to your local filesystem. So in this case the flow of information is local_system_1 -> Tahoe -> local_system_2.

Requirement 1 is that for each filename encountered which is a valid encoding in local_system_1, the resulting (unicode) name is transmitted through the Tahoe filesystem and then written out into local_system_2 in the expected way (i.e. just by using the Python unicode APIs and passing the unicode object to them). Requirement 2 is that for each filename encountered which is not a valid encoding in local_system_1, the original bytes are transmitted through the Tahoe filesystem and then, if the target system is a byte-oriented system such as Linux, the original bytes are written into the target filesystem. (If the target is not Linux then mojibake! But we don't have to go into that now.)

Does that make sense?
In all your descriptions, I'm puzzled as to where exactly you get the source bytes from. If you use the PEP 383 interfaces, you will start with character strings, not byte strings, always.
On Mac and Windows, we use the Python unicode APIs e.g. os.listdir(u"Motörhead"). On Linux and Solaris, we use the Python bytestring APIs e.g. os.listdir("Motörhead".encode(sys.getfilesystemencoding())).
Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes ?
I think you mixed up bytes and unicode here: if srcbytes is indeed a bytes object, then you can't apply .encode to it.
Yep, I reversed the order of encode() and decode(). However, my whole statement was utterly wrong and shows that I still didn't fully get it yet. I have flip-flopped again and currently think that PEP 383 is useless for this use case and that my original plan [1] is still the way to go. Please let me know if you spot a flaw in my plan or a ridiculousity in my requirements, or if you see a way that PEP 383 can help me.

Thank you very much.

Regards,
Zooko

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:47
On May 1, 2009, at 9:42 PM, Zooko O'Whielacronx wrote:
Yep, I reversed the order of encode() and decode(). However, my whole statement was utterly wrong and shows that I still didn't fully get it yet. I have flip-flopped again and currently think that PEP 383 is useless for this use case and that my original plan [1] is still the way to go. Please let me know if you spot a flaw in my plan or a ridiculousity in my requirements, or if you see a way that PEP 383 can help me.
If I were designing a new system such as this, I'd probably just go for utf8b *always*. That is, set the filesystem encoding to utf-8b. The end. All files always keep the same bytes transferring between unix systems. Thus, for the 99% of the world that uses either windows or a utf-8 locale, they get useful filenames inside tahoe. The other 1% of the world that uses something like latin-1, EUC_JP, etc. on their local system sees mojibake filenames in tahoe, but will see the same filename that they put in when they take it back out. Gnome has already used only utf-8 for filename display for a few years now, for example, so this isn't exactly an unheard-of position to take...

But if you don't do that, then I still don't see what purpose your requirements serve. If I have two systems: one with a UTF-8 locale, and one with a Latin-1 locale, why should transmitting filenames from system 1 to system 2 through tahoe preserve the raw bytes, but doing the reverse *not* preserve the raw bytes? (All byte sequences are valid in latin-1, remember, so they'll all decode into unicode without error, and then be re-encoded in utf-8...) This seems rather a useless behavior to me.

James
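[James's asymmetry is easy to demonstrate; a Python 3 sketch, using 'surrogateescape', the eventual name of 'python-escape':

    raw = b'\xc3\x28'  # not valid UTF-8
    # From a UTF-8 locale the bytes are escaped, and survive:
    via_utf8 = raw.decode('utf-8', 'surrogateescape')
    assert via_utf8.encode('utf-8', 'surrogateescape') == raw
    # From a Latin-1 locale the decode "succeeds", so re-encoding as
    # UTF-8 on the other side produces different bytes:
    via_latin1 = raw.decode('latin-1')
    assert via_latin1.encode('utf-8') != raw
]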
[cross-posting to python-dev and tahoe-dev] On Fri, May 1, 2009 at 8:12 PM, James Y Knight <foom@fuhm.net> wrote:
If I were designing a new system such as this, I'd probably just go for utf8b *always*.
Ah, this would be a very tempting possibility -- abandon all unix users who are slow to embrace our utf-8b future! However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it. It currently uses utf-8 for its internal storage (note: nothing to do with reading or writing files from external sources -- only for storing filenames in the decentralized storage system which is accessed by Tahoe clients), and we can't start putting non-utf-8-valid sequences in the "filename" slot because other Tahoe clients would then get a UnicodeDecodeError exception when trying to read those directories.

We *could* create a new metadata entry to hold things other than utf-8. Current Tahoe clients would never look at that entry (the metadata is a JSON-serialized dictionary, so we can add a new key name into it without disturbing the existing clients), but future Tahoe clients could look for that new key. That is where it is possible that future versions of Tahoe might be able to benefit from utf-8b or PEP 383, although what PEP 383 offers for this use case remains unclear to me.
But if you don't do that, then, I still don't see what purpose your requirements serve. If I have two systems: one with a UTF-8 locale, and one with a Latin-1 locale, why should transmitting filenames from system 1 to system 2 through tahoe preserve the raw bytes, but doing the reverse *not* preserve the raw bytes? (all byte-sequences are valid in latin-1, remember, so they'll all decode into unicode without error, and then be reencoded in utf-8...). This seems rather a useless behavior to me.
I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself. Hopefully the following will help.

There are two different things stored in Tahoe for each directory entry: the filename and the metadata.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.

Requirement 1 (unicode): Each filename that you see needs to be valid unicode (it is stored internally in utf-8). This eliminates utf-8b and PEP 383 from being directly applicable to the filename part, although perhaps they could be useful for the metadata part (about which more below).

Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory, if that bytestring is the valid encoding of some string in your stated locale, then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, because it could be that the locale was wrong for this byte string and the resulting successfully-decoded unicode name is gibberish. This is especially acute if the locale is an 8-bit encoding such as latin-1 or windows-1252. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way? It seems like we're not going to do better than requirement 2 (faithful if unicode).

Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility). I have heard some developers say that they don't want to support this requirement and would rather tell the users to fix their filenames before they can back up or share those files through Tahoe. On the other hand, users have said that they require this, and they are not going to go mucking about with all their filenames just so that they can use my backup and filesharing tool.

Now already we can say that these three requirements mean that there can be collisions -- for example, a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file. Therefore these three requirements imply that we have to detect such collisions and deal with them somehow. (Thanks to Martin v. Löwis for reminding me of this.)

Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"): Suppose you have a directory with some files with Japanese names, encoded using shift-jis, and some files with Russian names, encoded using koi8-r. Suppose your locale is set to shift-jis, and then you do "tahoe cp -r myfiles/ tahoe:". Then suppose you or someone else does "tahoe cp -r tahoe: copy_of_myfiles/".
The "round-tripping" feature is that the files with Russian names that did not accidentally decode cleanly with shift-jis still have the same bytes in their names as they did in the original myfiles directory. As I write this, I am becoming skeptical of this (faithful bytes if not unicode, a.k.a. "round-tripping"), thanks in part to criticism from James Knight, MvL, Thomas Breuel, and others. One reason to be skeptical is that about a third of the Russian files will happen to decode cleanly as shift-jis anyway, and will therefore come out as something entirely different if the target filesystem's encoding is something other than shift-jis. But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/". So I'm ready to reject this one. Now about the "metadata" part which is separate from the filename itself. I have another requirement: Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. For example, if a bytestring decodes cleanly with the locale's suggested encoding, and we use the resulting unicode as the filename, then we also store the original byte string in the metadata since we don't know if the locale's suggested encoding was good. This allows the later invention of a tool which shows the user what the filename would have been with other encodings and let the user choose one that makes sense. It is important to note that this does not impose any requirement on the *filename* itself -- all such information can be stored in the metadata. Okay, in light of the above four requirements and the rejection of #4, I hereby propose to change from the previous Tahoe design [2] to the following: To copy an entry from a local filesystem into Tahoe: 1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata. 2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object. 2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata. 2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake! 3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry. 
To copy an entry from Tahoe into a local filesystem: always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.

Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?

I'm sure that it can help with the use case of "I'm doing os.listdir() and then I'm going to turn around and use the resulting unicode objects on the same local filesystem in the same Python process". I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises. Does that sound right to you? Perhaps this could be documented somehow to help other programmers along the way.

Thanks very much for your help, everyone.

Regards,
Zooko

[1] http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...
[2] http://allmydata.org/trac/tahoe/ticket/534#comment:47
Zooko O'Whielacronx writes:
However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it.
Cool! Question: is there a way to negotiate versions, or better yet, features?
I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself.
Well, it's a high-dimensional problem. Keeping track of all the variables is hard. That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable.
Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.
Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology.
Requirement 1 (unicode): Each filename that you see needs to be valid unicode
What does "see" mean? In directory listings? Under what circumstances, if any, can what I see be different from what I get?
Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory,
My local myfiles directory, or my Tahoe myfiles directory?
if that bytestring is the valid encoding of some string in your stated locale,
Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right?
then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, [...]. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way?
Sure. Emacsen do that, you know. Of course it's hard to guess something else if ISO-8859/1 is the preferred encoding, but it does happen. This probably cannot be done accurately enough for Tahoe, though.
It seems like we're not going to do better than requirement 2 (faithful if unicode).
Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, then that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility).
That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....)
Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file.
This is false with rather high probability, but you need some extra structure to deal with it.

First, claim the Unicode private planes for Tahoe. Then allocate characters from the private planes on demand as encountered, *including* such characters encountered in external file names to be stored in Tahoe *and* the surrogates used by PEP 383. "Display names" using these private characters would be valid Unicode, but not very useful. However, an algorithmically generated font (like the 4-hex-digit square used to give a glyph to unknown code points in the BMP) could be used by those who care. Also store mappings from (system encoding, UTF-8b representation) to private char and back. For simplicity, that could be global on your server (IIRC, there are at least two private planes up there, so you'd need to run into almost 128Ki *unique* such characters to run out). I guess you'd be subject to a DoS attack where somebody decided to map all of 80000-odd CNS characters into private space, and then write 80000 files, each with a different 1-character name....

Note that Martin does *not* do this in PEP 383, because PEP 383 only cares about the semantics that a filename read from a directory can be used to access the file associated with it in that directory. For that, a private, non-Unicode encoding is perfectly acceptable. But you want valid Unicode. This scheme gives it to you; a rough sketch follows. The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably.
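[A rough Python 3 sketch of the registry idea; the choice of Plane 15 starting at U+F0000 and the in-memory dicts are assumptions for illustration, not details Stephen specified:

    # (encoding, byte value) -> private-use character, and back.
    _registry = {}
    _reverse = {}
    _next_code_point = 0xF0000  # first code point of private-use Plane 15

    def private_char(encoding, byte_value):
        # Allocate one private-use character per unseen
        # (encoding, byte) pair, so equal names compare equal.
        global _next_code_point
        key = (encoding, byte_value)
        if key not in _registry:
            ch = chr(_next_code_point)
            _next_code_point += 1
            _registry[key] = ch
            _reverse[ch] = key
        return _registry[key]
]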
Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):
PEP 383 gives you this, but you must store the encoding used for each such file name.
One reason to be skeptical is that about a third of the Russian files will happen to decode cleanly as shift-jis anyway, and will therefore come out as something entirely different if the target filesystem's encoding is something other than shift-jis.
The only way to handle this is to store the encoding used to convert to Unicode as part of *every* file's metadata. This could be also used in Tahoe to warn the user that the current system encoding does not match the alleged_encoding used to make the backup. Some users might prefer to use the alleged_encoding on restore.
But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/".
But as a requirement, that's incoherent. What you are "seeing" is Unicode; what it will write out is bytes. That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and *different* mojibake on the restore system (system ls, again). Note that "use Tahoe, not system, ls" doesn't help at all (unless the weirdo has learned to read mojibake, which actually does happen, but it's not worth betting on).

How likely is that? Hate to tell you this: if you need the "unknown bytes" scheme at all, this scenario is *extremely* likely. How do you think that KOI8-R got into a directory on a Shift-JIS system in the first place? Yup, a Russian visiting professor in Tokyo who set his personal locale to ru_RU.KOI8-R wrote it there. And he's very likely to have the same personal locale on a very up-to-date system with a UTF-8 system encoding when he gets back to Moscow. Bingo! It's mojibake all the way to Moscow.
Now about the "metadata" part which is separate from the filename itself. I have another requirement:
Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. For example, if a bytestring decodes cleanly with the locale's suggested encoding, and we use the resulting unicode as the filename, then we also store the original byte string in the metadata since we don't know if the locale's suggested encoding was good.
UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use.
This allows the later invention of a tool
It will be called "Emacs", by the way.<wink>
which shows the user what the filename would have been with other encodings and let the user choose one that makes sense.
To copy an entry from a local filesystem into Tahoe:
1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata.
NFD is probably better for fuzzy matching and display on legacy terminals.
2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.
2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata.
Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always. Note that you should *also* store the failed_decode flag, because the presence of multiple fail_decodes is a very strong indication that some of the users had default encoding != system encoding. If you use the scheme I propose above, of course you have the same information by scanning the file name for Tahoe-only private use characters, but that would be relatively expensive.
2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!
Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct.
3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name.
If both are set, you're OK, because you are forcing ISO-8859/1. If both are unset, however, you don't know for sure because alleged_encoding is not necessarily a constant.
To copy an entry from Tahoe into a local filesystem:
Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.
Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?
By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them.
I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.
No more than any other system for giving a canonical Unicode spelling to the results of an OS call.
Thank you for sharing your extensive knowledge of these issues, SJT.

On Sun, May 3, 2009 at 3:32 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Zooko O'Whielacronx writes:
However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards- compatibility, and already has lots of data, lots of users, and programmers building on top of it.
Cool!
Thanks! Actually yes it is extremely cool that it really does this encryption, erasure-encoding, capability-based access control, and decentralized topology all in a fully functional, stable system. If you're interested in such stuff then you should definitely check it out!
Question: is there a way to negotiate versions, or better yet, features?
For the peer-to-peer protocol there is, but the persistent storage is an inherently one-way communication. A Tahoe client writes down information, and at a later point a Tahoe client, possibly of a different version, reads it. There is no way for the original writer to ask what versions or features the readers may eventually have. But the writer can write down optional information which will be invisible to readers that don't know to look for it, by adding it into the "metadata" dictionary. For example:

http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz...

renders the directory contents into json and results in this:

    "r\u00e9sum\u00e9.html": [
      "filenode",
      {
        "mutable": false,
        "verify_uri": "URI:CHK-Verifier:63y4b5bziddi73jc6cmyngyqdq:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "metadata": {
          "ctime": 1241365319.0695441,
          "mtime": 1241365319.0695441
        },
        "ro_uri": "URI:CHK:no2l46woyeri6xmhcrhhomgr5a:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328",
        "size": 8328
      }
    ],

A new version of Tahoe writing entries like this is constrained to making the primary key (the filename) be a valid unicode string (if it wants older Tahoe clients to be able to read the directory at all). However, it is not constrained about what new keys it may add to the "metadata" dict, which is where we propose to add the "failed_decode" flag and the "original_bytes".
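[A minimal sketch of that compatibility argument; the key names "failed_decode" and "original_bytes" are the ones proposed in this thread, while the base32 serialization of the bytes is an assumption, since the thread does not fix one:

    import base64

    def annotate_metadata(metadata, original_bytes, failed_decode):
        # Older Tahoe readers ignore unknown keys in the JSON
        # metadata dict, so these additions are backward compatible.
        metadata['failed_decode'] = failed_decode
        metadata['original_bytes'] = base64.b32encode(original_bytes).decode('ascii')
        return metadata
]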
Well, it's a high-dimensional problem. Keeping track of all the variables is hard.
Well put.
That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable.
Would that it were so! The possibility that PEP 383 could help me or other like me is why I am trying so hard to explain what kind of help I need. :-)
Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.
Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology.
Okay, here's some more detail. There exists a Tahoe directory, the bytes of which are encrypted, erasure-coded, and spread out over multiple Tahoe servers. (To the servers it is utterly opaque, since it is encrypted with a symmetric encryption key that they don't have.) A Tahoe client has the decryption key and it recovers the cleartext bytes. (Note: the internal storage format is not the json encoding shown above -- it is a custom format -- the json format above is what is produced to be exported through the API, and it serves as a useful example for e-mail discussions.) Then for each bytestring childname in the directory it decodes it with utf-8 to get the unicode childname.

Does that all make sense?
Requirement 1 (unicode): Each filename that you see needs to be valid unicode
What does "see" mean? In directory listings?
Yes: either with "tahoe ls", with a FUSE plugin, or with the web UI. Remove the trailing "?t=json" from the URL above to see an example.
Under what circumstances, if any, can what I see be different from what I get?
This is a good question! In the previous iteration of the Tahoe design, you could sometimes get something from "tahoe cp" which was different from what you saw with "tahoe ls". In the current design -- http://allmydata.org/trac/tahoe/ticket/534#comment:66 -- this is no longer the case, because we abandon the requirement to have "round-trip fidelity of bytes".
Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory,
My local myfiles directory, or my Tahoe myfiles directory?
The local one.
if that bytestring is the valid encoding of some string in your stated locale,
Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right?
Yes, and yes.
Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, then that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility).
That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....)
Perhaps windows-1252 would be a better codec for this purpose? However, it would be clearer for the purposes of this discussion, and also perhaps for actual users of Tahoe, if instead of decoding with windows-1252 in order to get a mojibake name, Tahoe would simply generate a name like "badly_encoded_filename_#1". Let's run with that.

For clarity, assume that the arbitrary unicode filename that Tahoe comes up with is "badly_encoded_filename_#1". This doesn't change anything in this story. In particular, it doesn't change the fact that there might already be an entry in the directory which is named "badly_encoded_filename_#1" even though it was *not* a badly encoded filename, but a correctly encoded one.
Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file.
This is false with rather high probability, but you need some extra structure to deal with it. First, claim the Unicode private planes for Tahoe. [snip on long and intriguing instructions to perform unicode magic that I don't understand]
Wait, wait. What good would this do? The current plan is that if the filenames collide we increment the number at the end "#$NUMBER", if we are just naming them "badly_encoded_filename_#1", or that we append "~1" if we are naming them by mojibake. And the current plan is that the original bytes are saved in the metadata for future cyborg archaeologists. How would this complex unicode magic that I don't understand improve the current plan? Would it provide filenames that are more meaningful or useful to the users than the "badly_encoded_filename_#1" or the mojibake?
The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably.
There is no server, so to implement such a registry we would probably have to include a copy of the registry inside each (encrypted, erasure-encoded) directory.
Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):
PEP 383 gives you this, but you must store the encoding used for each such file name.
Well, at this point this has become an anti-requirement, because it causes the filename as displayed when examining the directory to be different from the filename that results when cp'ing the directory. Also, I don't see why PEP 383's implementation of this would be better than the previous iteration of the design, in which this was accomplished by simply storing the original bytes and then writing them back out again on demand, or the design before that, in which this was accomplished by mojibake'ing the bytes (decoding them with windows-1252) and setting a flag indicating that this had been done.

I think I understand now that PEP 383 is better for the case where you can't store extra metadata (such as our failed_decode flag or our original_bytes), but you can ensure that the encoding that will be used later matches the one that was used for decoding now. Neither of these two criteria applies to Tahoe, and I suspect that neither of them applies to most uses other than the entirely local and non-persistent "for x in os.listdir(): open(x)".
But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/".
But as a requirement, that's incoherent. What you are "seeing" is Unicode; what it will write out is bytes.
In the new plan, we write the unicode filename out using Python's unicode filesystem APIs, so Python will attempt to encode it into the appropriate filesystem encoding (raising UnicodeEncodeError if it won't fit).
That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and *different* mojibake on the restore system (system ls, again).
Let's see... Tahoe is a user-space program and lets Python determine the appropriate "sys.getfilesystemencoding()" based on what the user's locale was at Python startup. So I don't think what you wrote above is correct. I think that in the first transition, from source system to Tahoe, either the name will be correctly transcoded (i.e., it looks the same to the user, as long as the locale they are using to "look" at it -- e.g. with "ls" or Nautilus -- is the same as the locale that was set when their Python process started up), or else it will be undecodable under their current locale and will instead be replaced with either mojibake or "badly_encoded_filename_#1".

Hm, here is a good argument in favor of using mojibake to generate the arbitrary unicode name instead of naming it "badly_encoded_filename_#1": because that's probably what ls and Nautilus will show! Let me try that... Oh, cool, Nautilus and GNU ls both replace invalid chars with U+FFFD (like the 'replace' error handler does in Python's decode()) and append " (invalid encoding)" to the end. That sounds like an even better way to handle it than either mojibake or "badly_encoded_filename_#1", and it also means that it will look the same in Tahoe as it does in GNU ls and Nautilus. Excellent.

On the next transition, from Tahoe to system, Tahoe uses the Python unicode API, which will attempt to encode the unicode filename into the local filesystem encoding and raise UnicodeEncodeError if it can't.
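A minimal sketch of that GNU-ls-style display rule (the helper name display_name is illustrative, not Tahoe's actual code):

    def display_name(raw_bytes, encoding):
        try:
            return raw_bytes.decode(encoding, 'strict')
        except UnicodeDecodeError:
            # mimic GNU ls / Nautilus: undecodable bytes become U+FFFD
            # and the name is visibly marked as badly encoded
            return raw_bytes.decode(encoding, 'replace') + u" (invalid encoding)"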
Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. ... UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use.
It isn't -- Tahoe doesn't run on Python 3. Also, Tahoe is increasingly interoperating with tools written in completely different languages. It is much easier for me to tell all of those programmers (in my documentation) that the filename slot holds the (normal, valid, standard) unicode and the metadata slot holds the bytes, than to tell them about utf-8b (which is not even implemented in their tools: JavaScript, JSON, C#, C, and Ruby). I imagine that it would be a deal-killer for many or most of them if I said they couldn't use Tahoe reliably without first implementing utf-8b for their toolsets.
1. On Windows or Mac, read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). ... NFD is probably better for fuzzy matching and display on legacy terminals.
I don't know anything about them, other than that Macintosh uses NFD and everything else uses NFC. Should I specify NFD? What are these "legacy terminals" of which you speak? Will NFD make it look better when I cat it to my vt102? (Just kidding -- I don't have one.)
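For reference, a tiny illustration of the NFC/NFD difference (a sketch; both forms represent the same abstract text):

    import unicodedata
    name = u'e\u0301'                          # 'e' + COMBINING ACUTE ACCENT (decomposed)
    nfc = unicodedata.normalize('NFC', name)   # u'\xe9' -- one precomposed code point
    nfd = unicodedata.normalize('NFD', nfc)    # back to u'e\u0301'
    assert nfc == u'\xe9' and nfd == u'e\u0301'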
Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always.
Right -- got it.
2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!
Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct.
Ah. What is the Japanese word for "word with some characters right and other characters mojibake!"? :-)
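Putting steps 2.a and 2.b together, a rough sketch (this reuses the FName class sketched earlier in the thread; read_name and alleged_encoding are illustrative names, and step 2.a is assumed to be a strict decode with the alleged encoding, followed by the step-1 normalization):

    import unicodedata

    def read_name(raw_bytes, alleged_encoding):
        try:
            # step 2.a: strict decode with the alleged encoding, then normalize
            name = raw_bytes.decode(alleged_encoding, 'strict')
            return FName(unicodedata.normalize('NFC', name))
        except UnicodeDecodeError:
            # step 2.b: 'latin-1' maps every byte value, so this decode cannot
            # fail; the result may well be mojibake, as discussed above
            return FName(raw_bytes.decode('latin-1', 'strict'), failed_decode=True)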
Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?
By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them.
So, it is invertible only if you can assume that the same encoding will be used on the second leg of the trip, right? Which you can do by writing down what encoding was used on this leg of the trip and forcing the same encoding on the other leg. Except that we can't force that to happen on Windows at all, as far as I understand, which is a show-stopper right there.

But even if we could, this would require us to write down a bit of information, transmit it to the other side, and use it to do the encoding. And if we are going to do that, why don't we just transmit the original bytes? Okay, maybe because that would roughly double the amount of data we have to transmit, and maybe we are stingy. But if we are stingy we could instead transmit a single added bit to indicate whether the name is normal or mojibake, and then use windows-1252 to stuff the bytes into the name. One of those options has the advantage of simplicity to the programmer ("There is the unicode, and there are the bytes."), and the other has the advantage of good compression. Both of them have the advantage that nobody involved has to understand and possibly implement a non-standard unicode hack.

I'm trying not to be too pushy about this (heaven knows I've been completely wrong about things a dozen times in a row so far in this design process), but as far as I can understand it, PEP 383 can be used only when you can force the same encoding on both sides (the PEP says that encoding "only 'works' if the data get converted back to bytes with the python-escape error handler also"). That happens naturally when both sides are in the same Python process, so PEP 383 naturally looks good in that context. However, if the filenames are going to be stored persistently or transmitted over a network, then it seems simpler, easier, and more portable to use some other method than PEP 383 to handle badly encoded names.
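To make the same-codec requirement concrete, a sketch in Python 3 syntax (the error handler this thread calls 'python-escape' shipped as 'surrogateescape'; that implementation is assumed here):

    raw = b'caf\xc3\xa9'    # 'café' encoded as UTF-8
    name = raw.decode('utf-8', 'surrogateescape')    # decodes cleanly to 'café'
    # Same codec on both legs: the original bytes come back.
    assert name.encode('utf-8', 'surrogateescape') == raw
    # A different codec on the second leg yields different bytes:
    assert name.encode('latin-1', 'surrogateescape') == b'caf\xe9'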
I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.
No more than any other system for giving a canonical Unicode spelling to the results of an OS call.
I think PEP 383 yields more surprises than the alternative of decoding with error handler 'replace' and then including the original bytes along with the unicode. During the course of this process I have also considered two other mechanisms instead of decoding with error handler 'replace' -- mojibake using windows-1252, or a simple placeholder like "badly_encoded_filename_#1". Any of these three seems less surprising than, and similarly functional to, PEP 383.

I have to admit that they are not as elegant. Utf-8b is a really neat hack, and MvL's generalization of it to all unicode encodings is, too. I'm still being surprised by it after trying to understand it for many days now. For example, what happens if you decode a filename with PEP 383, store that filename somewhere, and then later try to write a file under that name on Windows? If it only 'works' if the data get converted back to bytes with the python-escape error handler, then can you use the python-escape error handler when trying to, say, create a new file on Windows?

Regards, Zooko
Zooko O'Whielacronx wrote:
Following up on my own post to correct a major error:
Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes? That is my Requirement
If you start with bytes, decode with utf-8b to unicode (possibly 'invalid' unicode), and encode the result back to bytes with utf-8b, you should get the original bytes, regardless of what they were. That is the point of PEP 383 -- to reliably round-trip file 'names' that start as bytes and must end as the same bytes, but which may not otherwise have a unicode decoding.

If you start with invalid unicode text, encode to bytes with utf-8b, and decode back to unicode, you might instead get a different, valid unicode text. An example was given in the discussion. I believe this would be hard to avoid. In any case, it does not matter for the use case of starting with bytes that one wants to temporarily, but reliably, work with as text.

Terry Jan Reedy
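That caveat can be made concrete (a sketch, again using Python 3's 'surrogateescape' as the implementation of the handler the thread calls 'python-escape'):

    bad = '\udcc3\udca9'    # 'invalid' unicode: two escaped bytes, 0xC3 and 0xA9
    raw = bad.encode('utf-8', 'surrogateescape')     # b'\xc3\xa9'
    back = raw.decode('utf-8', 'surrogateescape')    # b'\xc3\xa9' is valid UTF-8 for 'é'
    assert back == '\xe9' and back != bad            # different, valid text comes back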
participants (12)
- "Martin v. Löwis"
- Cameron Simpson
- Glenn Linderman
- Guido van Rossum
- James Y Knight
- Michael Foord
- Mike Klaas
- MRAB
- R. David Murray
- Stephen J. Turnbull
- Terry Reedy
- Zooko O'Whielacronx