[Python-3000] Unicode and OS strings

James Y Knight foom at fuhm.net
Fri Sep 14 05:41:12 CEST 2007


On Sep 13, 2007, at 12:22 PM, Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?

Here's a suggestion I made on the SBCL dev list a while back, in
response to the same issue. The message quoted below is me replying
to myself: my first suggestion had been to keep all the environmental
gunk in byte arrays rather than strings. That is still a very nice
and simple possibility.

My second inclination was to use a variant of utf8 which can handle  
all bytestrings, instead of utf8 itself: utf-8b. This obviously works  
best when the system encoding is actually utf8.

> On Aug 2, 2007, at 4:55 PM, James Y Knight wrote:
>
>> Yeah -- it's pretty clear the environment isn't _actually_ in the
>> default encoding. It's just binary junk which often but not always
>> contains some text encoded in some arbitrary superset of ASCII. Just
>> like command line arguments (and filenames on linux).
>>
>> The hard part is that users expect command line arguments, filenames,
>> and environment values to be strings (because they normally do
>> contain text-like things), when strictly they cannot be because there
>> is no reliable encoding.
>>
>
> A good alternative to this is for SBCL to use the UTF8b encoding to  
> decode unix environment gunk (filenames, env vars, command line  
> args) which are *probably* in utf8, but might not be. utf8b has the  
> nice property that any arbitrary bytestring can be decoded into  
> unicode, and then round-tripped back to the same bytes. Valid utf8
> sequences turn into the same unicode characters as with the utf8
> codec. Invalid utf8 sequences turn into unpaired surrogate code
> points in the unicode string, one per offending byte.
>
> Thus, SBCL can return strings, and never throw an error. If you  
> actually wanted the random binary, you can losslessly convert the  
> unicode string back to binary. Win win.
>
> Some references:
> Original mail:
> http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
>
> Blog entry:
> http://bsittler.livejournal.com/10381.html
>
> Python implementation: http://hyperreal.org/~est/libutf8b/
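
As a rough sketch of the round trip described above: the snippet below
assumes a Python version that ships the "surrogateescape" error handler
(added in 3.1 by PEP 383, so later than this message). It is not the
libutf8b code linked above, and the helper names decode_os_bytes and
encode_os_string are made up for illustration.

```python
def decode_os_bytes(raw: bytes) -> str:
    """Decode bytes from the OS without ever raising.

    Valid UTF-8 decodes as usual; each undecodable byte 0xNN becomes
    the unpaired surrogate U+DCNN. (Hypothetical helper name.)
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_os_string(text: str) -> bytes:
    """Inverse of decode_os_bytes: escaped surrogates turn back into
    the original raw bytes. (Hypothetical helper name.)
    """
    return text.encode("utf-8", errors="surrogateescape")


# Made-up sample data: valid UTF-8 for "caf\u00e9", then a stray 0xff.
raw = b"caf\xc3\xa9\xff"
text = decode_os_bytes(raw)

# The valid prefix decodes exactly as with the plain utf-8 codec.
assert text[:4] == "caf\u00e9"
# The invalid byte is carried as a lone surrogate instead of an error.
assert text[4] == "\udcff"
# Encoding restores the original bytes losslessly.
assert encode_os_string(text) == raw
```

Note that a string holding such surrogates cannot be encoded with the
strict utf-8 codec, so it has to be passed back through the same
handler before it is handed to the OS again.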

James



