[Python-3000] Unicode and OS strings
Victor Stinner
victor.stinner at haypocalc.com
Wed Sep 19 12:40:33 CEST 2007
Hi,
On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?
On Linux, filenames are *byte* strings, not *character* strings. I always
have this problem with Python 2.x: I convert filenames (argv[x]) to Unicode
so that I can format error messages in full Unicode... but that is not always
possible. Linux allows filenames that are invalid UTF-8 even on a full UTF-8
installation (Ubuntu); see Marcin's examples.
So I propose keeping sys.argv as an array of byte strings. If arguments are
converted to Unicode strings, you will be unable to write a program that
converts a filesystem with "broken" filenames (see the convmv program, for
example) or that opens a file with a "broken" filename (broken: an invalid
byte sequence for the UTF/JIS/Big5/... charset).
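
To illustrate the point, here is a minimal sketch in Python 3 syntax. The
filename b"a\x80" and the helper try_decode() are hypothetical and only serve
as an example: strict decoding fails on such a name, and a lossy decode cannot
be turned back into the original bytes, so the file could no longer be found
on disk.

```python
# Hypothetical filename: the byte 0x80 can never start a UTF-8 sequence.
raw_name = b"a\x80"


def try_decode(raw, charset="utf-8"):
    """Return the decoded text, or None if raw is not valid in charset."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


# Strict decoding fails, so there is no exact Unicode form of this name.
strict_result = try_decode(raw_name)

# A lossy decode succeeds, but re-encoding does not give back the
# original bytes, so the name no longer identifies the file.
lossy_text = raw_name.decode("utf-8", "replace")
round_trips = lossy_text.encode("utf-8") == raw_name
```

The byte string itself remains usable for I/O, which is why keeping the raw
bytes matters for tools like convmv.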
---
For Python 2.x, my solution is to keep byte strings for I/O and use Unicode
strings for error messages. Here is a function that converts any byte string
(such as a filename) to Unicode:
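from hachoir_core.i18n import getTerminalCharset
from hachoir_core.tools import makePrintable

def unicodeFilename(filename, charset=None):
    # Default to the charset of the terminal
    if not charset:
        charset = getTerminalCharset()
    try:
        return unicode(filename, charset)
    except UnicodeDecodeError:
        # Invalid byte sequence: fall back to an escaped, printable form
        return makePrintable(filename, charset, to_unicode=True)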
makePrintable() replaces invalid byte sequences with escape strings. Example:
>>> from hachoir_core.tools import makePrintable
>>> makePrintable("a\x80", "utf8", to_unicode=True)
u'a\\x80'
>>> print makePrintable("a\x80", "utf8", to_unicode=True)
a\x80
Source code of function makePrintable:
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225
Source code of function getTerminalCharset():
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23
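
For comparison, the same fallback idea can be sketched in Python 3 (3.5 or
later) using the standard "backslashreplace" error handler. This is not
hachoir's implementation: printable_filename() is a hypothetical name, and
the default charset is an assumption standing in for getTerminalCharset().

```python
def printable_filename(raw, charset="utf-8"):
    """Decode a byte-string filename for display in messages only.

    Assumption: the charset defaults to UTF-8 here; the original code
    asks the terminal for its charset instead.
    """
    try:
        # Valid names decode exactly.
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Undecodable bytes are shown as \xNN escapes. The result is for
        # display; keep the original bytes for any I/O.
        return raw.decode(charset, "backslashreplace")


message = "Unable to open file: " + printable_filename(b"a\x80")
```

As in the 2.x version, the byte string stays the value passed to open(), and
only the error message uses the decoded text.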
Victor Stinner
http://hachoir.org/