[Python-3000] Unicode and OS strings

Victor Stinner victor.stinner at haypocalc.com
Wed Sep 19 12:40:33 CEST 2007


Hi,

On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?

On Linux, filenames are *byte* strings, not *character* strings. I always
have this problem with Python 2.x: I convert filenames (argv[x]) to Unicode
so that I can format error messages entirely in Unicode, but that is not
always possible. Linux allows filenames that are invalid UTF-8 even on a
fully UTF-8 installation (Ubuntu); see Marcin's examples.
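
A minimal sketch of the problem, assuming Python 3 (the code elsewhere in
this message is Python 2). The byte string below is acceptable to Linux as a
filename, since the kernel only forbids "/" and NUL, yet it is not valid
UTF-8. The helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# b"a\x80" ends with a lone continuation byte: usable as a filename on
# Linux, but not decodable as UTF-8.
broken = b"a\x80"
valid = "é".encode("utf-8")
print(is_valid_utf8(broken), is_valid_utf8(valid))
```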

So I propose keeping sys.argv as an array of byte strings. If Python tries
to create Unicode strings instead, you will be unable to write a program
that converts a filesystem with "broken" filenames (see the convmv program,
for example) or that opens a file with a "broken" filename (broken: an
invalid byte sequence for the UTF/JIS/Big5/... charset).
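
To illustrate why byte strings matter for such a tool, here is a rough
convmv-style sketch, assuming Python 3. It is not convmv's actual logic, and
the names recode_name and recode_directory are hypothetical. It works only
because the names stay bytes from os.listdir() through os.rename(). Names
that cannot be decoded are left untouched, and name collisions are not
handled.

```python
import os


def recode_name(name: bytes, src: str, dst: str) -> bytes:
    """Re-encode one filename from src to dst; return it unchanged on failure."""
    try:
        return name.decode(src).encode(dst)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return name


def recode_directory(path: bytes, src: str, dst: str) -> None:
    """Rename every entry of path whose name changes when recoded."""
    # Passing bytes to os.listdir() makes it return the names as bytes.
    for old in os.listdir(path):
        new = recode_name(old, src, dst)
        if new != old:
            os.rename(os.path.join(path, old), os.path.join(path, new))
```

For example, recode_name(b"caf\xe9", "latin-1", "utf-8") gives
b"caf\xc3\xa9".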

---

For Python 2.x, my solution is to keep byte strings for I/O and use Unicode
strings for error messages. Here is a function to convert any byte string
(such as a filename) to Unicode:

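# Both helpers come from hachoir_core (see the links below).
from hachoir_core.i18n import getTerminalCharset
from hachoir_core.tools import makePrintable

def unicodeFilename(filename, charset=None):
    # Default to the terminal charset when none is given.
    if not charset:
        charset = getTerminalCharset()
    try:
        return unicode(filename, charset)
    except UnicodeDecodeError:
        # Invalid byte sequence: fall back to an escaped, printable form.
        return makePrintable(filename, charset, to_unicode=True)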

makePrintable() replaces invalid byte sequences with escape strings, for
example:

>>> from hachoir_core.tools import makePrintable
>>> makePrintable("a\x80", "utf8", to_unicode=True)
u'a\\x80'
>>> print makePrintable("a\x80", "utf8", to_unicode=True)
a\x80
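
The escaping shown above can be approximated in Python 3.5 or later with the
standard "backslashreplace" error handler. This is only a sketch under that
assumption, not hachoir's makePrintable(); printable_filename is a
hypothetical name, and it escapes only the bytes that cannot be decoded.

```python
def printable_filename(filename: bytes, charset: str = "utf-8") -> str:
    """Decode a filename, escaping undecodable bytes instead of failing."""
    # Undecodable bytes become \xNN escapes rather than raising
    # UnicodeDecodeError.
    return filename.decode(charset, errors="backslashreplace")


print(printable_filename(b"a\x80"))
```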

Source code of function makePrintable:
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225

Source code of function getTerminalCharset():
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23

Victor Stinner
http://hachoir.org/

