[Python-3000] Unicode and OS strings

Thu Sep 13 18:22:12 CEST 2007

What should happen when a command line argument or an environment
variable cannot be decoded using the system encoding? (On Unix, from
the OS point of view, such a value is just an array of bytes.)

This is an unfortunate side effect of switching to Unicode. It's
unfortunate because often the data is only passed back to another
function, so the lack of a round trip is a pure loss caused by
choosing a Unicode string as the representation of such data.
I favor Unicode strings nevertheless; Python took the right step.

I once checked what other languages with Unicode strings do, and the
results were not enlightening: inconsistency, weird errors, damaged or
truncated data.

Python 3.0a1 mostly fails with weird errors, and fails a bit too early:

[qrczak ~]$ echo $LANG
pl_PL.UTF-8

[qrczak ~]$ python3.0 - $(printf '\x80')           
Python 3.0a1 (py3k, Sep  8 2007, 15:57:56) 
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Fatal Python error: no mem for sys.argv
zsh: abort      python3.0 - $(printf '\x80')

[qrczak ~]$ FOO=$(printf '\x80') python3.0
Python 3.0a1 (py3k, Sep  8 2007, 15:57:56) 
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object  : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte')
type    : UnicodeDecodeError
refcount: 4
address : 0xb7a5142c
lost sys.stderr
>>>

[qrczak ~]$ mkdir $(printf '\x80')

[qrczak ~]$ cd $(printf '\x80')

[qrczak ~/\M-^@]$ python3.0
Python 3.0a1 (py3k, Sep  8 2007, 15:57:56) 
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object  : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte')
type    : UnicodeDecodeError
refcount: 4
address : 0xb7a1242c
lost sys.stderr
>>>

os.listdir returns undecodable filenames as str8.

I don't know what it should do. Choices:

1. Fail in a controlled way (without losing sys.stderr), and no earlier
   than necessary, i.e. fail when the given string is requested, not
   when a module is imported.

1a. Guarantee that choosing a different encoding and retrying works,
    for the rare case where the programmer wishes to handle such
    strings by explicitly trying latin1.

2. Return undecodable information as bytes, and accept bytes when it is
   passed back to similar functions in the other direction.

3. Have an option to use a modified UTF-8 in these places, where
   undecodable bytes are e.g. escaped as U+0000 U+00xx.
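
Choice 3 could be sketched roughly as below. This is only an
illustration, not what any Python release does: the names
decode_escaped and encode_escaped are hypothetical, and the sketch
assumes that U+0000 never occurs in a genuine OS string (arguments,
environment variables and filenames cannot contain a NUL byte on Unix),
so it is free to serve as the escape marker.

```python
def decode_escaped(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0xXX becomes U+0000 U+00XX."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice; the bytes before it are valid.
            out.append(raw[pos:pos + err.start].decode("utf-8"))
            bad = raw[pos + err.start]
            out.append("\x00" + chr(bad))
            # Skip only the one offending byte and try again from there.
            pos += err.start + 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped: U+0000 U+00XX turns back into byte 0xXX."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\x00":
            if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
                raise ValueError("dangling or invalid escape")
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


# The working directory from the transcript above survives a round trip.
path = b"/home/users/qrczak/\x80"
text = decode_escaped(path)
assert encode_escaped(text) == path
```

The point of the escape is only that the string can be handed back to
the OS unchanged; displaying it or comparing it with ordinary text still
needs care.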

I will not advocate any choice other than 1, but perhaps someone has
another idea.
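
A minimal sketch of choices 1 and 1a, under stated assumptions: the
class LazyOSString is hypothetical and only shows the shape of the
idea. The raw bytes are kept, decoding happens when the string is
requested, and a failure is an ordinary exception that the program can
catch before retrying with another encoding.

```python
class LazyOSString:
    """Hypothetical holder for bytes received from the OS."""

    def __init__(self, raw: bytes):
        self.raw = raw  # kept untouched, so a retry is always possible

    def decode(self, encoding: str = "utf-8") -> str:
        # Raises UnicodeDecodeError here, at the point of use,
        # not at interpreter startup or module import.
        return self.raw.decode(encoding)


arg = LazyOSString(b"\x80")  # the argument from the first transcript
try:
    text = arg.decode()
except UnicodeDecodeError:
    # Choice 1a: retrying with latin1 is guaranteed to succeed,
    # since every byte value maps to a code point.
    text = arg.decode("latin-1")

assert text.encode("latin-1") == arg.raw
```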

My language Kogut uses 1a (even for things like sys.argv, which look
like variables), and experimentally offers 3 as an option, requested
either by the program choosing such an encoding or through an
environment variable.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/