[Python-Dev] PEP 383 and GUI libraries

Fri May 1 17:52:50 CEST 2009

Zooko O'Whielacronx wrote:
> Following-up to my own post to correct a major error:
> 
> 
> On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zookog at gmail.com> wrote:
>> Folks:
>>
>> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
>> binary names from the filesystem and store them so that I can regenerate
>> the same byte string later, but it also requires that I *know* whether
>> what I got was a valid string in the expected encoding (which might be
>> utf-8) or whether it was not and I need to fall back to storing the
>> bytes.
> 
> Okay, I am wrong about this.  Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
> 
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.
> 
> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:
> 
> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).
> 
> Requirement 2: the byte string was not valid in the encoding of source
> system, in which case the bytes are faithfully transmitted (i.e. the
> bytes that finally land on the target system are the same as the bytes
> that originated in the source system).
> 
> Now I finally understand how fiendishly clever MvL's PEP 383
> generalization of Markus Kuhn's utf-8b trick is!  The only thing
> necessary to achieve both of those requirements above is that the
> 'python-escape' error handler is used on the target system .encode() as
> well as on the source system .decode()!
> 
> Well, I'm going to have to let this sink in and maybe write some code to
> see if I really understand it.
> 
> But if this is right, then I can do away with some of the mechanism that
> I've built up, and instead:
> 
> Backport PEP 383 to Python 2.
> 
> And, document the PEP 383 trick in some generic, widely respected format
> such as an Internet Draft so that I can explain to other users of the
> Tahoe data (many of whom use other languages than Python) what they have
> to do if they find invalid utf-8 in the data.  Oh good, I just realized
> that Tahoe emits only utf-8, so all I have to do is point them to the
> utf-8b documents (such as they are) and explain that to read filenames
> produced by Tahoe they have to implement utf-8b.  That's really good
> that they don't have to implement MvL's generalization of that trick to
> other encodings, since utf-8b is already understood by some folks.
> 
> 
> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one.  Is it true that
> srcbytes.decode(srcencoding, 'python-escape').encode('utf-8',
> 'python-escape') will always produce srcbytes?  That is my Requirement
> 2.
> 
No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8',
'python-escape') == srcbytes. The bytes are decoded first and then
encoded, and the encodings on both ends need to be the same.

For example, with different encodings on the two ends the bytes change:

 >>> b'\x80'.decode('windows-1252')
u'\u20ac'
 >>> u'\u20ac'.encode('utf-8')
'\xe2\x82\xac'
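
The same point as a small script, as a sketch only. It assumes Python 3.1
or later, where the handler called 'python-escape' in this thread was
released under the name 'surrogateescape'. The helper name transfer() is
hypothetical and exists only for this illustration:

```python
def transfer(src_bytes, src_encoding, dst_encoding):
    """Decode on the source side, re-encode on the target side.

    Both steps use 'surrogateescape', the released name of the
    handler this thread calls 'python-escape'.
    """
    text = src_bytes.decode(src_encoding, "surrogateescape")
    return text.encode(dst_encoding, "surrogateescape")


# Same encoding on both ends: the undecodable byte survives unchanged.
same = transfer(b"\x80", "utf-8", "utf-8")

# Different encodings: 0x80 is a valid Euro sign in windows-1252, so it
# is transcoded to the UTF-8 form of U+20AC instead of passed through.
different = transfer(b"\x80", "windows-1252", "utf-8")

print(same == b"\x80")   # True
print(different)         # b'\xe2\x82\xac'
```

The second call is Requirement 1 in action: the name was valid in the
source encoding, so the unicode string is preserved, not the bytes.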

Currently:

 >>> b'\x80'.decode('utf-8')

Traceback (most recent call last):
   File "<pyshell#7>", line 1, in <module>
     b'\x80'.decode('utf-8')
   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

But under this PEP:

 >>> b'\x80'.decode('utf-8', 'python-escape')
u'\udc80'
 >>> u'\udc80'.encode('utf-8', 'python-escape')
'\x80'
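
If you still want to *know* whether the fallback was used, as in the
original post, one possible approach is to look for the escape code
points after decoding. This is a sketch under assumptions: it uses the
released handler name 'surrogateescape' (Python 3.1+), it assumes the
escaped bytes 0x80..0xFF map to U+DC80..U+DCFF, and decode_name() is a
hypothetical helper, not something from the PEP or from Tahoe:

```python
def decode_name(raw, encoding="utf-8"):
    """Return (text, escaped).

    escaped is True if at least one byte could not be decoded and was
    smuggled through as a lone surrogate in U+DC80..U+DCFF.
    """
    text = raw.decode(encoding, "surrogateescape")
    escaped = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)
    return text, escaped


# Every single byte value survives the UTF-8 round trip.
for value in range(256):
    raw = bytes([value])
    text, _ = decode_name(raw)
    assert text.encode("utf-8", "surrogateescape") == raw

print(decode_name(b"caf\xc3\xa9")[1])   # False: valid UTF-8
print(decode_name(b"caf\xe9")[1])       # True: 0xE9 alone is not valid UTF-8
```

With UTF-8 the check is unambiguous, because a strict UTF-8 decode never
produces surrogate code points on its own.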