[Python-3000] Unicode and OS strings
James Y Knight
foom at fuhm.net
Fri Sep 14 05:41:12 CEST 2007
On Sep 13, 2007, at 12:22 PM, Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?
Here's a suggestion I made on the SBCL dev list a while back, in
response to the same issues. In the message quoted below I am replying
to my own earlier post, whose first suggestion was to keep all the
environmental gunk in byte arrays rather than strings. That is still a
very nice and simple possibility.
My second inclination was to use a variant of utf8 which can handle
all bytestrings, instead of utf8 itself: utf-8b. This obviously works
best when the system encoding is actually utf8.
> On Aug 2, 2007, at 4:55 PM, James Y Knight wrote:
>
>> Yeah -- it's pretty clear the environment isn't _actually_ in the
>> default encoding. It's just binary junk which often but not always
>> contains some text encoded in some arbitrary superset of ASCII. Just
>> like command line arguments (and filenames on linux).
>>
>> The hard part is that users expect command line arguments, filenames,
>> and environment values to be strings (because they normally do
>> contain text-like things), when strictly they cannot be because there
>> is no reliable encoding.
>>
>
> A good alternative to this is for SBCL to use the UTF8b encoding to
> decode unix environment gunk (filenames, env vars, command line
> args) which are *probably* in utf8, but might not be. utf8b has the
> nice property that any arbitrary bytestring can be decoded into
> unicode, and then round-tripped back to the same bytes. Valid utf8
> sequences turn into the same unicode characters as with the utf8
> codec. Each byte of an invalid utf8 sequence turns into an unpaired
> (and therefore invalid) surrogate in the unicode string.
>
> Thus, SBCL can return strings, and never throw an error. If you
> actually want the random binary, you can losslessly convert the
> unicode string back to binary. Win win.
>
> Some references:
> Original mail:
> http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
>
> Blog entry:
> http://bsittler.livejournal.com/10381.html
>
> Python implementation: http://hyperreal.org/~est/libutf8b/
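
To make the round-trip property concrete, here is a minimal sketch in
Python. It is not the libutf8b code linked above. It assumes Python 3's
built-in "surrogateescape" error handler, which, as far as I know,
applies the same idea to the utf8 codec: each undecodable byte 0xNN
becomes the unpaired surrogate U+DCNN, and encoding reverses that. The
helper names decode_os_bytes and encode_os_string are made up for
illustration.

```python
def decode_os_bytes(raw: bytes) -> str:
    """Decode OS-supplied bytes; never raises on invalid UTF-8.

    Assumption: errors="surrogateescape" stands in for utf-8b here.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_os_string(text: str) -> bytes:
    """Inverse of decode_os_bytes: recover the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 decodes exactly as the plain utf8 codec would.
valid = "caf\u00e9".encode("utf-8")
assert decode_os_bytes(valid) == valid.decode("utf-8")

# A Latin-1 encoded name is not valid UTF-8: 0xE9 becomes U+DCE9.
broken = b"caf\xe9.txt"
decoded = decode_os_bytes(broken)
assert decoded == "caf\udce9.txt"

# The original bytes come back unchanged.
assert encode_os_string(decoded) == broken
```

Note that a string containing such surrogates cannot be encoded with
the strict utf8 codec, so it has to be converted back with the same
error handler before it is handed to the OS again.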
James