[Python-3000] Unicode and OS strings

Guido van Rossum guido at python.org
Thu Sep 13 18:48:47 CEST 2007


Yes, I have noticed this too. Environment variables, command line
arguments, locale properties, TZ names, and so on, are often given as
8-bit strings in who knows what encoding. I'm not sure what the
solution is, but we need one. I'm guessing one thing we need to do is
research how various systems decide what encoding to use. Even on OS X,
I managed to create an environment variable containing bytes that are
neither ASCII nor valid UTF-8.

I believe Tcl/Tk used to have some kind of heuristic where they would
try UTF-8 first and, if that failed, use Latin-1 for the bytes that
aren't valid UTF-8, but I'm not at all sure that's the right solution
in places where Latin-1 is not the customary encoding.
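
To make that heuristic concrete, here is a rough sketch of a decoder that
tries UTF-8 and maps only the offending bytes to Latin-1. It is a guess at
the behaviour described above, not Tcl/Tk's actual code, and the function
name decode_utf8_latin1 is made up for illustration.

```python
def decode_utf8_latin1(data: bytes) -> str:
    """Decode as UTF-8; bytes that are not valid UTF-8 become Latin-1 chars.

    Hypothetical helper: a sketch of the heuristic, not Tcl/Tk's code.
    """
    parts = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            parts.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice data[pos:].
            bad_start = pos + exc.start
            bad_end = pos + exc.end
            # Everything before the error is known to be valid UTF-8.
            parts.append(data[pos:bad_start].decode("utf-8"))
            # Latin-1 maps each byte 0xNN straight to U+00NN.
            parts.append(data[bad_start:bad_end].decode("latin-1"))
            pos = bad_end
    return "".join(parts)
```

Note that the result cannot be turned back into the original bytes
unambiguously: b'\xe9' and b'\xc3\xa9' both decode to the same string.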

--Guido

On 9/13/07, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?
>
> This is an unfortunate side effect of switching to Unicode. It's
> unfortunate because often the data is only passed back to another
> function, and thus lack of round trip is a pure loss caused by
> choosing a Unicode string as the representation of such data.
> I favor Unicode strings nevertheless; Python made the right move.
>
> I once checked what other languages with Unicode strings do, and the
> results were not enlightening: inconsistency, weird errors, damaged or
> truncated data.
>
> Python 3.0a1 mostly fails with weird errors, and fails a bit too early:
>
> [qrczak ~]$ echo $LANG
> pl_PL.UTF-8
>
> [qrczak ~]$ python3.0 - $(printf '\x80')
> Python 3.0a1 (py3k, Sep  8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Fatal Python error: no mem for sys.argv
> zsh: abort      python3.0 - $(printf '\x80')
>
> [qrczak ~]$ FOO=$(printf '\x80') python3.0
> Python 3.0a1 (py3k, Sep  8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> object  : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte')
> type    : UnicodeDecodeError
> refcount: 4
> address : 0xb7a5142c
> lost sys.stderr
> >>>
>
> [qrczak ~]$ mkdir $(printf '\x80')
>
> [qrczak ~]$ cd $(printf '\x80')
>
> [qrczak ~/\M-^@]$ python3.0
> Python 3.0a1 (py3k, Sep  8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> object  : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte')
> type    : UnicodeDecodeError
> refcount: 4
> address : 0xb7a1242c
> lost sys.stderr
> >>>
>
> os.listdir returns undecodable filenames as str8.
>
> I don't know what it should do. Choices:
>
> 1. Fail in a controlled way (without losing sys.stderr), and no earlier
>    than necessary, i.e. fail when the given string is requested, not
>    when a module is imported.
>
> 1a. Guarantee that choosing a different encoding and retrying works,
>     for the rare case when the programmer wishes to handle such strings
>     by explicitly trying latin1.
>
> 2. Return undecodable information as bytes, and accept bytes when it is
>    passed back to similar functions in the other direction.
>
> 3. Have an option to use a modified UTF-8 in these places, where
>    undecodable bytes are e.g. escaped as U+0000 U+00xx.
>
> I will not advocate any choice other than 1, but perhaps someone has
> another idea.
>
> My language Kogut uses 1a (even for things like sys.argv, which look like
> variables), and experimentally offers 3 as an option that can be requested
> either by the program choosing such an encoding or through an environment
> variable.
>
> --
>    __("<         Marcin Kowalczyk
>    \__/       qrczak at knm.org.pl
>     ^^     http://qrnik.knm.org.pl/~qrczak/
>
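
For reference, choice 3 in the quoted message (escaping undecodable bytes
as U+0000 U+00xx) could look roughly like the sketch below. It assumes the
input never contains a NUL byte, which holds for argv, environment
variables and file names on Unix, so U+0000 is free to serve as the escape
character. The names decode_os and encode_os are invented for this
example and are not an existing API.

```python
ESC = "\x00"  # escape character; assumed never to occur in OS strings


def decode_os(data: bytes) -> str:
    """UTF-8 decode, escaping each undecodable byte 0xNN as U+0000 U+00NN."""
    if b"\x00" in data:
        raise ValueError("NUL byte not allowed in OS strings")
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice data[pos:].
            bad_start, bad_end = pos + exc.start, pos + exc.end
            out.append(data[pos:bad_start].decode("utf-8"))
            # Invalid bytes are always >= 0x80, so chr(b) is U+0080..U+00FF.
            out.extend(ESC + chr(b) for b in data[bad_start:bad_end])
            pos = bad_end
    return "".join(out)


def encode_os(text: str) -> bytes:
    """Inverse of decode_os: turn U+0000 U+00NN back into the raw byte 0xNN."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == ESC:
            # The character after the escape carries the raw byte value.
            out.append(ord(next(chars)))
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With this, encode_os(decode_os(b)) gives back b for any NUL-free byte
string, so data that is only passed on to another OS function survives the
round trip.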


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

