Zooko O'Whielacronx wrote:
Following-up to my own post to correct a major error:
On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zookog@gmail.com> wrote:
Folks:
My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.
Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this:
Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.
That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:
Requirement 1: the byte string was valid in the encoding of the source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).
Requirement 2: the byte string was not valid in the encoding of the source system, in which case the bytes are faithfully transmitted (i.e. the bytes that finally land on the target system are the same as the bytes that originated in the source system).
Now I finally understand how fiendishly clever MvL's PEP 383 generalization of Markus Kuhn's utf-8b trick is! The only thing necessary to achieve both of those requirements above is that the 'python-escape' error handler is used on the target system .encode() as well as on the source system .decode()!
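A minimal sketch of the two requirements, assuming Python 3.1 or later, where the PEP 383 handler shipped under the name 'surrogateescape' rather than the draft name 'python-escape'. The helper transfer_name is hypothetical and only illustrates using the handler on both the decode and the encode side:

```python
# Assumption: Python 3.1+, where PEP 383's handler is named 'surrogateescape'
# (the draft name in this thread was 'python-escape').
HANDLER = "surrogateescape"


def transfer_name(src_bytes, src_encoding, dst_encoding):
    """Hypothetical helper: decode on the source side, encode on the target side."""
    name = src_bytes.decode(src_encoding, HANDLER)
    return name.encode(dst_encoding, HANDLER)


# Requirement 1: the bytes are valid in the source encoding, so the
# unicode name is what gets transmitted (here latin-1 -> utf-8).
valid = transfer_name(b"caf\xe9", "latin-1", "utf-8")

# Requirement 2: the same bytes are NOT valid utf-8, so the raw bytes
# are carried through unchanged when both ends use utf-8.
invalid = transfer_name(b"caf\xe9", "utf-8", "utf-8")

print(valid)    # transcoded name
print(invalid)  # original bytes
```

Note that the second case only round-trips because both ends use the same encoding.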
Well, I'm going to have to let this sink in and maybe write some code to see if I really understand it.
But if this is right, then I can do away with some of the mechanism that I've built up, and instead:
Backport PEP 383 to Python 2.
And, document the PEP 383 trick in some generic, widely respected format such as an Internet Draft so that I can explain to other users of the Tahoe data (many of whom use other languages than Python) what they have to do if they find invalid utf-8 in the data. Oh good, I just realized that Tahoe emits only utf-8, so all I have to do is point them to the utf-8b documents (such as they are) and explain that to read filenames produced by Tahoe they have to implement utf-8b. That's really good that they don't have to implement MvL's generalization of that trick to other encodings, since utf-8b is already understood by some folks.
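For readers in other languages, the utf-8b rule can be sketched as follows: each undecodable byte b (0x80 to 0xFF) is represented by the lone low surrogate U+DC00 + b, so escaped bytes show up as code points in U+DC80..U+DCFF. The snippet assumes Python 3.1+ and its 'surrogateescape' handler; escaped_bytes is a hypothetical helper name, not part of Tahoe or of any library:

```python
def escaped_bytes(name):
    """Return the raw byte values that were smuggled into `name` as lone surrogates."""
    return [ord(ch) - 0xDC00 for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]


raw = b"ok\xff\xfe"  # not valid utf-8
name = raw.decode("utf-8", "surrogateescape")

# A non-empty list means the name was not valid utf-8 and the
# listed bytes must be restored when writing it back out.
smuggled = escaped_bytes(name)
clean = escaped_bytes("ok")
```

This also gives a cheap way to know after the fact whether a name was valid in the expected encoding, without keeping a separate flag.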
Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.decode(srcencoding, 'python-escape').encode('utf-8', 'python-escape') will always produce srcbytes? That is my Requirement 2.
No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8', 'python-escape') == srcbytes. The encodings on both ends need to be the same. For example:
>>> b'\x80'.decode('windows-1252')
u'\u20ac'
>>> u'\u20ac'.encode('utf-8')
'\xe2\x82\xac'
Currently:
>>> b'\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    b'\x80'.decode('utf-8')
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte
But under this PEP:
>>> b'\x80'.decode('utf-8', 'python-escape')
u'\udc80'
>>> u'\udc80'.encode('utf-8', 'python-escape')
'\x80'
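The same session can be reproduced as a runnable sketch, assuming Python 3.1+ and the handler's final name 'surrogateescape' in place of 'python-escape'. The variable names are illustrative only:

```python
raw = b"\x80"

# Different encodings on each end: the byte is valid windows-1252,
# so the *character* is transmitted and the bytes change.
euro = raw.decode("windows-1252")
transcoded = euro.encode("utf-8")

# Strict utf-8 decoding rejects the byte outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the PEP 383 handler on both ends and the same encoding,
# the byte survives the round trip as a lone surrogate.
escaped = raw.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```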