[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Zooko O'Whielacronx zooko at zooko.com
Sat Apr 25 17:29:54 CEST 2009


Thanks for writing this PEP 383, MvL.  I recently ran into this  
problem in Python 2.x in the Tahoe project [1].  The Tahoe project  
should be considered a good use case showing what some people need.   
For example, the assumption that a file will later be written back  
into the same local filesystem (and thus luckily use the same  
encoding) from which it originally came doesn't hold for us, because  
Tahoe is used for file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can  
never use the Python 2.x unicode APIs on Linux, just as we can never  
use the Python 2.x str APIs on Windows [2].  (You mentioned this  
ugliness in your PEP.)  My next conclusion was that the Linux way of  
doing encoding of filenames really sucks compared to, for example,  
the Mac OS X way.  I'm heartened to see that David Wheeler is trying
to persuade the maintainers of Linux filesystems to improve some of  
this: [3].

My final conclusion was that we needed to have two kinds of  
workaround for the Linux suckage: first, if decoding using the  
suggested filesystem encoding fails, then we fall back to mojibake  
[4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not  
sure if it matters and I haven't yet understood if utf-8b offers  
another alternative for this case).  Second, if decoding succeeds  
using the suggested filesystem encoding on Linux, then write down the  
encoding that we used and include that with the filename.  This  
expands the size of our filenames significantly, but it is the only  
way to allow some future programmer to undo the damage of a
falsely-successful decoding.  Here's our whole plan: [5].
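
The two workarounds above can be sketched roughly as follows. This is only an illustration, not the Tahoe code described in [5]: the function name `decode_filename`, its return shape, and the `failed` flag are hypothetical, and it is written against Python 3 bytes/str rather than the Python 2.x APIs discussed in this message. It assumes the filename arrives as raw bytes together with the encoding the system suggests.

```python
def decode_filename(raw, suggested_encoding):
    """Decode a filename given as bytes.

    Returns (text, encoding_used, failed). All three names are
    hypothetical and exist only for this sketch.
    """
    try:
        # Workaround 2: decoding succeeded, so report the encoding that
        # was used, letting the caller store it alongside the filename.
        return raw.decode(suggested_encoding), suggested_encoding, False
    except UnicodeDecodeError:
        # Workaround 1: fall back to mojibake. iso-8859-1 maps every
        # byte value to a code point, so this decode cannot fail and
        # the original bytes can be recovered by re-encoding.
        return raw.decode("iso-8859-1"), "iso-8859-1", True


def recover_bytes(text, encoding_used):
    """Undo a decoding, given the encoding that was recorded with it."""
    return text.encode(encoding_used)
```

Because the encoding is recorded even when decoding "succeeds", a later reader who suspects a falsely-successful decoding can call `recover_bytes` to get the original bytes back and try a different encoding.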

Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html (see the footnote of this message)
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47

