Unicode strings as filenames

What's the correct way to deal with filenames in a Unicode environment? Consider this:

>>> import site
>>> site.encoding
'latin-1'
>>> a = "abc\xe4\xfc\xdf.txt"
>>> u = unicode(a, "latin-1")
>>> uu = u.encode("utf-8")
>>> open(a, "w")
<open file 'abcäüß.txt', mode 'w' at 0x823c2a0>
>>> open(u, "w")
<open file 'abcäüß.txt', mode 'w' at 0x823a1e8>
>>> open(uu, "w")
<open file 'abcäüÃ.txt', mode 'w' at 0x81d6160>

If I change my site's default encoding back to ascii, the second open fails:

>>> import site
>>> site.encoding
'ascii'
>>> a = "abc\xe4\xfc\xdf.txt"
>>> u = unicode(a, "latin-1")
>>> uu = u.encode("utf-8")
>>> open(a, "w")
<open file 'abcäüß.txt', mode 'w' at 0x822b448>
>>> open(u, "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> open(uu, "w")
<open file 'abcäüÃ.txt', mode 'w' at 0x822d260>

as I expect it should. The third open is a problem as well, even though it succeeds with either encoding. (Why doesn't it fail when the default encoding is ascii?)

My thought is that before using a plain string or a unicode string as a filename, it should first be coerced using the default encoding, something like:

import types
import site

if type(fname) == types.StringType:
    fname = unicode(fname, site.encoding)
elif type(fname) == types.UnicodeType:
    fname = fname.encode(site.encoding)
else:
    raise TypeError, ("unrecognized type for filename: %s" % type(fname))

Is that the correct approach? Apparently Python's file object doesn't do this under the covers. Should it?

Thx,

Skip

Skip:
On Windows NT/2K/XP the right thing to do is to use the wide char open functions such as

_CRTIMP FILE * __cdecl _wfopen(const wchar_t *, const wchar_t *);
_CRTIMP int __cdecl _wopen(const wchar_t *, int, ...);

There may also be techniques for doing this on Windows 9x, as the file system stores Unicode file names, but I have never looked into this.

Neil

Skip> What's the correct way to deal with filenames in a Unicode
Skip> environment? Consider this:
Skip> [Attempts to use encoding]

Neil> On Windows NT/2K/XP the right thing to do is to use the wide char
Neil> open function such as
Neil>
Neil> _CRTIMP FILE * __cdecl _wfopen(const wchar_t *, const wchar_t *);
Neil> _CRTIMP int __cdecl _wopen(const wchar_t *, int, ...);
Neil>
Neil> There may also be techniques for doing this on Windows 9x as the
Neil> file system stores Unicode file names but I have never looked into
Neil> this.

How is this exposed (if at all) to Python programmers? I happen to be developing on Linux, but the eventual delivery platform will be Windows. Is there no way to handle this in a cross-platform way?

Skip

Skip:
How is this exposed (if at all) to Python programmers?
Currently not exposed AFAICT except through calldll.
Cross-platform is tricky, as the file systems used on Linux have narrow string file names. Some higher level software (such as the forthcoming version of GTK+/GNOME) assumes file names are encoded in UTF-8, but this is a somewhat dangerous assumption.

The problem on Windows is that there are files you cannot open by performing encoding operations on the Unicode names. They do have narrow generated names, but these are mangled, look like Z8F22~1.HTM, and so are hard to discover.

Neil

I agree. However:

- Mark decided to take a different route, using fopen all the time, but encoding Unicode strings with the "mbcs" encoding, which calls WideCharToMultiByte with CP_ACP. AFAICT, this is correct as well (although it invokes an unneeded conversion of the string, since fopen, eventually, will convert the string back to Unicode - probably inside CreateFileExA - at least on WinNT). In any case, passing Unicode objects to open() works just fine, at least as long as they can be encoded in the ANSI code page. If you want to open a Chinese file name on a Russian Windows installation, you lose.

- Skip was likely asking about a Unix installation, in which case all of this is irrelevant.
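A minimal sketch of Mark's "mbcs" route described above (Windows-only - the "mbcs" codec does not exist on other platforms; the explicit encode merely mimics what open() does internally with a Unicode argument):

u = u"abc\xe4\xfc\xdf.txt"
narrow = u.encode("mbcs")  # narrowed through the ANSI code page (CP_ACP)
f = open(narrow, "w")      # same effect as open(u, "w") on Windows
f.close()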
There may also be techniques for doing this on Windows 9x as the file system stores Unicode file names but I have never looked into this.
To my knowledge, VFAT32 doesn't - only NTFS does (which is not available on W9x). Regards, Martin

Martin:
I want to be able to open all files on my English W2K install, and can with many applications, even if some have Chinese names and some have Russian.

The big advance W2K made over NT was to have only one real version of the OS instead of multiple language versions. There is a system default language as well as local defaults, but with just a few clicks my machine can be used as a Japanese machine - although, as the keyboard keys don't grow Japanese characters, it is a bit harder to use. You do buy localised versions of W2K and XP, but they differ in packaging and defaults - the underlying code is identical, which was not the case for NT or 9x.

Locales are a really poor choice for people who need to operate in multiple languages, and much software is moving to allowing concurrent use of multiple languages through the use of Unicode. The term 'multinationalization' (m17n) is sometimes used in Japan to talk about systems that try to avoid restrictions on character set and language.
I have a file called u"C:\\z\u0439\u0446.html" on my W2K FAT partition which displays correctly in the explorer and can be opened in, for example, notepad. This leads to the interesting situation of being able to see a file using glob but not then use it:
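(The session showing this was not preserved; the following reconstruction is hypothetical, assuming an English-locale system where the Cyrillic characters have no mapping in the ANSI code page, so the narrow directory listing substitutes '?' for them.)

>>> import glob
>>> glob.glob("C:\\z*.html")
['C:\\z??.html']
>>> open(glob.glob("C:\\z*.html")[0])  # no file actually has the mangled name
Traceback (most recent call last):
  ...
IOError: ...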
Neil

I understand all that, but I can't agree with all your conclusions.
On Windows, locales and Unicode don't contradict each other. You can create files through the locale's code page, and they still end up on disk correctly. This is a much better situation than you have on Unix. In any case, there is no alternative. Locales may be good or bad - you must follow system conventions, if you want to write usable software.
Oops, you are right - the long file name is in Unicode. It is only when you do not have a long file name that the short one is interpreted in OEM encoding.
I agree this is unfortunate; patches are welcome. Please notice that the strategy of using wchar_t API on Windows has explicitly been considered and rejected, for the complexity of the code changes involved. So anybody proposing a patch would need to make it both useful, and easy to maintain. With these constraints, the current implementation is the best thing Mark could come up with. Software always has limitations, which are removed only if somebody is bothered so much as to change the software. Regards, Martin

Martin:
Sure, I'm just putting my point of view, which appears to be different from most in that many developers just use a single locale. If I had a larger supply of time then I'd eventually work on this, but there are other tasks that currently look like having more impact.

The system provided scripting languages support wide character file names. In VBScript:

Set fso = CreateObject("Scripting.FileSystemObject")
crlf = chr(13) & chr(10)
For Each f1 in fso.GetFolder("C:\").Files
    if instr(1, f1.name, ".htm") > 0 then
        s = s & f1.Path & crlf
        if left(f1.name, 1) = "z" then
            fo = fso.OpenTextFile(f1.Path).ReadAll()
            s = s & fo & crlf
        end if
    end if
Next
MsgBox s

And Python with the win32 extensions can do the same using the FileSystemObject:

# encode used here just to make things print as a quick demo
import win32com.client
fso = win32com.client.Dispatch("Scripting.FileSystemObject")
s = ""
fol = fso.GetFolder("C:\\")
for f1 in fol.Files:
    if f1.name.find(".htm") > 0:
        s += f1.Path.encode("UTF-8") + "\r\n"
        if f1.name[0] == u"z":
            fo = fso.OpenTextFile(f1.Path).ReadAll()
            s += fo.encode("UTF-8") + "\r\n"
print s

Neil

The system provided scripting languages support wide character file names.
Please understand that Python also supports wide character file names. It just doesn't allow all the possible values that the system would allow.
For Each f1 in fso.GetFolder("C:\").Files
That, of course, is another important difference: here you get the directory contents as wide strings. Changing os.listdir to return Unicode objects would be possible, but would likely introduce a number of incompatibilities. Your script (e.g. the Python variant) is prepared for .Files returning Unicode objects. Making the same change in Python on all functions that return file names (i.e. listdir, glob, etc.) is difficult - most likely, you'll have to make the return type a choice of the application.

Regards,
Martin
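One way to make the return type "a choice of the application", sketched as a hypothetical wrapper (the helper name, the unicode_names flag, and the encoding guess are all made up for illustration):

import os

def listdir_choice(path, unicode_names=0, encoding="mbcs"):
    # Plain strings by default, so existing callers keep working;
    # Unicode only when the application explicitly asks for it.
    names = os.listdir(path)
    if unicode_names:
        # decode the narrow results with a guessed file system encoding
        names = [unicode(n, encoding) for n in names]
    return names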

[Skip wants open() to handle Unicode on all platforms]

As Martin and Neil have already explained, the handling of national characters in file names is not standardized at all across platforms (not even across file systems on one platform, e.g. on Linux). The only option I see to make this situation less painful is to write a filename subsystem which implements two generic APIs:

1. file open using strings and Unicode
2. file listing using either Unicode or strings with a predefined encoding in the output list

Since this subsystem would be fairly complicated, I'd suggest that someone write a PEP on the topic and then the various experts try to come up with implementations which work on at least some systems, and a fallback implementation which gets used if no other implementation fits.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

I think this "pretty much" works in Python 2.2 already. It uses the "mbcs" encoding on Windows, and the locale's encoding on Unix if locale.setlocale has been called (and the C library is good enough). That might be still wrong if the file system expects UTF-8, or a fixed encoding (e.g. on an NTFS or VFAT partition mounted on Linux), but I don't think there is anything that can be done about this: It would be a misconfigured system if then the user doesn't also use an UTF-8 locale.
2. file listing using either Unicode or strings with a predefined encoding in the output list
That is something that certainly needs to be done. Having a PEP on that would be useful. Regards, Martin

"Martin v. Loewis" wrote:
We'd still need to support other OSes as well, though, and I don't think that putting all this code into fileobject.c is a good idea -- after all, opening files is needed by some other parts of Python as well and may also be useful for extensions. I'd suggest implementing something similar to the DLL loading code, which is also implemented as a subsystem in Python.
Yep.

--
Marc-Andre Lemburg, eGenix.com Software GmbH

The stuff isn't in fileobject.c. Py_FileSystemDefaultEncoding is defined in bltinmodule.c. Also, on other OSes: you can pass Unicode objects to open() on all systems. If Py_FileSystemDefaultEncoding is NULL, it will fall back to site.encoding. Of course, if the system has an open function that expects wchar_t*, we might want to use that instead of going through a codec. Offhand, Win32 seems to be the only system where this might work, and even there, it won't work on Win95.
I'd suggest implementing something similar to the DLL loading code, which is also implemented as a subsystem in Python.
I'd say this is over-designed. It is not that there are ten alternative approaches to doing encodings in file names, and we only support two of them, but it is rather that there are only two, and we support all three of them :-) Also, it is more difficult than threads: for threads, there is a fixed set of API features that need to be represented. Doing Py_UNICODE* opening alone is easy, but look at the number of posixmodule functions that all expect file names of some sort. Regards, Martin

Martin v. Loewis wrote:
That's the global, sure but the code using it is scattered across fileobject.c and the posix module. I think it would be a good idea to put all this file naming code into some Python/fileapi.c file which then also provides C APIs for extensions to use. These APIs should then take the file name as PyObject* rather than char* to enable them to handle Unicode directly.
I expect this to become a standard in the next few years.
Doesn't that support the idea of having a small subsystem in Python which exposes the Unicode-aware APIs to Python and its extensions?

--
Marc-Andre Lemburg, eGenix.com Software GmbH

What do you gain by that? Most of the posixmodule functions that take filenames are direct wrappers around the system call. Using another level of indirection is only useful if the fileapi.c functions are used in different places. Notice that each function (open, access, stat, etc.) is used exactly *once* currently, so putting this all into a single place just makes the code more complex. The extension module argument is a red herring: I don't think there are many extension modules out there which want to call access(2) but would like to do so using a PyObject* as the first argument and numbers as the other arguments.
I doubt that. Posix people (including developers of various posixish systems) have frequently rejected that idea in recent years. Even for the most recent system in this respect (OS X), we hear that they still open files with a char*, where char is a byte - the only advancement is that there is a guarantee that those bytes are UTF-8. It turns out that this is all you need: with that guarantee, there is no need for an additional set of APIs. UTF-8 was originally invented precisely to represent file names (it was called FSS-UTF at that time); it is likely that more systems will follow this convention. If so, a global per-system file system encoding is all that's needed. The only problem is that on Windows, MS has already decided that the narrow APIs use the ANSI code page, so they cannot change it to UTF-8 now; that's why Windows will need special casing if people are unhappy with the "mbcs" approach (which some apparently are).
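On a system that guarantees this UTF-8 convention, one encode is indeed all that is needed before calling the ordinary narrow open() (a minimal sketch, assuming such a platform):

name = u"abc\xe4\xfc\xdf.txt"
f = open(name.encode("utf-8"), "w")  # the bytes that reach the OS are UTF-8
f.close()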
No. It is a lot of work, and an additional layer of indirection, with no apparent advantage. Feel free to write a PEP, though. Regards, Martin

Explored the possibility of detecting Unicode arguments to open and using _wfopen on Windows NT. This led to trying to store Unicode strings in the f_name and f_mode fields of the file object, which started to escalate into complexity, making Mark's mbcs choice more understandable.

Another approach is to use utf-8 as the Py_FileSystemDefaultEncoding and then convert to and from it in each file system access function. The core file open function from fileobject.c, changed to work with utf-8, is at the end of this message, with the important lines in the #ifdef MS_WIN32 section. Along with that change goes a change in Py_FileSystemDefaultEncoding to be "utf-8" rather than "mbcs".

This change works for me on Windows 2000 and allows access to all files no matter what the current code page is set to. On Windows 9x (not yet tested), the _wfopen call should fail, causing a fallback to fopen. Possibly the OS should be detected instead, and _wfopen not attempted on 9x. On 9x, mbcs may be a better choice of encoding, although it may also be possible to ask the file system to find the wide character file name and return the mangled short name that can then be used by fopen.

The best approach to me seems to be to make Py_FileSystemDefaultEncoding settable by the user, at least allowing the choice between 'utf-8' and 'mbcs', with a default of 'utf-8' on NT and 'mbcs' on 9x. This approach can be extended to other file system calls with, for example, os.listdir and glob.glob, upon detecting a utf-8 default encoding, using wide character system calls and converting to utf-8.

Please criticise any stylistic or correctness issues in the code as it is my first modification to the Python sources.

Neil

static PyObject *
open_the_file(PyFileObject *f, char *name, char *mode)
{
    assert(f != NULL);
    assert(PyFile_Check(f));
    assert(name != NULL);
    assert(mode != NULL);
    assert(f->f_fp == NULL);

    /* rexec.py can't stop a user from getting the file() constructor --
       all they have to do is get *any* file object f, and then do
       type(f).  Here we prevent them from doing damage with it. */
    if (PyEval_GetRestricted()) {
        PyErr_SetString(PyExc_IOError,
            "file() constructor not accessible in restricted mode");
        return NULL;
    }
    errno = 0;
#ifdef HAVE_FOPENRF
    if (*mode == '*') {
        FILE *fopenRF();
        f->f_fp = fopenRF(name, mode+1);
    }
    else
#endif
    {
        Py_BEGIN_ALLOW_THREADS
#ifdef MS_WIN32
        if (strcmp(Py_FileSystemDefaultEncoding, "utf-8") == 0) {
            PyObject *wname;
            PyObject *wmode;
            wname = PyUnicode_DecodeUTF8(name, strlen(name), "strict");
            wmode = PyUnicode_DecodeUTF8(mode, strlen(mode), "strict");
            if (wname && wmode) {
                f->f_fp = _wfopen(PyUnicode_AS_UNICODE(wname),
                                  PyUnicode_AS_UNICODE(wmode));
            }
            Py_XDECREF(wname);
            Py_XDECREF(wmode);
        }
        if (NULL == f->f_fp) {
            f->f_fp = fopen(name, mode);
        }
#else
        f->f_fp = fopen(name, mode);
#endif
        Py_END_ALLOW_THREADS
    }
    if (f->f_fp == NULL) {
#ifdef NO_FOPEN_ERRNO
        /* Metrowerks only, which does not always set errno */
        if (errno == 0) {
            PyObject *v;
            v = Py_BuildValue("(is)", 0, "Cannot open file");
            if (v != NULL) {
                PyErr_SetObject(PyExc_IOError, v);
                Py_DECREF(v);
            }
            return NULL;
        }
#endif
        if (errno == EINVAL)
            PyErr_Format(PyExc_IOError, "invalid argument: %s", mode);
        else
            PyErr_SetFromErrnoWithFilename(PyExc_IOError, name);
        f = NULL;
    }
    return (PyObject *)f;
}

Now that you have that change, please try to extend it to posixmodule.c. This is where I gave up. Notice that, with changing Py_FileSystemDefaultEncoding and open() alone, you have worsened the situation: os.stat will now fail on files with non-ASCII names on which it works under the mbcs encoding, because Windows won't find the file (correct me if I'm wrong).
It is not just 9x: if you have ten (*) different APIs to open a file, ten different APIs to stat a file, and so on, and have to select some of them at compile time and some of them at run time, it gets messy very quickly.

(*) I'd expect that other systems may also have proprietary system calls to do these things, using either wchar_t* or a proprietary Unicode type.
By the user, or by the application? How can the application make a more educated guess than Python proper? Alternatively, how can the user (or her Administrator) know what value to put in there? On Windows, probably neither is a good idea; if the file system default encoding is used in the future, fixing it at mbcs is the best I can think of.
Please criticise any stylistic or correctness issues in the code as it is my first modification to the Python sources.
The code looks fine. I'd encourage you to continue on that topic; just expect that it will need many more rounds for completion. Regards, Martin

Martin v. Loewis:
Now that you have that change, please try to extend it to posixmodule.c. This is where I gave up.
OK. os.open, os.stat, and os.listdir now work. Placed temporarily at http://pythoncard.sourceforge.net/posixmodule.c

os.stat is ugly because the posix_do_stat function is parameterised over a stat function pointer, but it is always _stati64 on Windows, so the patch just assumes _wstati64 is right.

os.listdir returns Unicode objects rather than strings. This makes glob.glob work as well, so my earlier script that finds the *.html files and opens them works. Unfortunately, I expect most callers of glob() will be expecting narrow strings.
If you give it a file name encoded in the current code page then it may fail where it did not before. Neil
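A hypothetical session showing what the patched posixmodule enables on Windows 2000 (output reconstructed for illustration, using the file named in the earlier message):

>>> import glob
>>> glob.glob(u"C:\\z*.html")
[u'C:\\z\u0439\u0446.html']
>>> f = open(glob.glob(u"C:\\z*.html")[0])  # now succeeds, via _wfopen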

Looks good. The posix_do_stat changes contain an error; you have put Python API calls inside the BEGIN_ALLOW_THREADS block. That is wrong: you must always hold the interpreter lock when calling Python API. Also, when calling _wstati64, you might want to assert that the function pointer is _stati64. Likewise, the code inside posix_open should hold the interpreter lock.
That is not that much of a problem; we could try to define an API where it is the caller's choice. However, the size of your changes is really disturbing here. There used to be already four versions of listing a directory; now you've added a fifth one. And it isn't even clear whether this code works on W9x, is it? There must be a way to fold the different Windows versions into a single one; perhaps it is acceptable to drop Win16 support.

I think three different versions should be offered to the end user:

- path is plain string, result is list of plain strings
- path is Unicode string, result is list of Unicode strings
- path is Unicode string, result is list of plain strings

Perhaps one could argue that the third version isn't really needed: anybody passing Unicode strings to listdir should be expected to get them back also. That would leave us with two functional features on Windows.

I envision a fragment that looks like this:

#ifdef windows
    if (argument is unicode string) {
#define strings wide
#include "listdir_win.h"
#undef strings
    }
    else {
        convert argument to string
#define strings narrow
#include "listdir_win.h"
#undef strings
    }
#endif

If you provide a similar listdir_posix and listdir_os2, it should be possible to get a uniform implementation.
I was actually talking about stat as a function that you haven't touched, yet. Now, os.rename will fail if you pass two Unicode strings referring to non-ASCII file names. posix_1str and posix_2str are like the stat implementation, except that you cannot know statically what the function pointer is. Regards, Martin

Marc-Andre Lemburg:
I started work on this in C++ for my SciTE editor a couple of months ago but the design started to include stuff like 'are these two paths pointing at one file', converting between OpenVMS and Unix paths, and handling URLs (at least using ftp and http). My brain threatened to explode if it got any more complex so it got moved to the 'future niceness' pile. Neil

Neil Hodgson wrote:
I believe that we could do well with the following assumptions:

a) strings passed to open() use whatever encoding is needed by the file system
b) Unicode passed to open() is converted to whatever the file system needs by the open() API

This doesn't cover all the possibilities, but goes a long way. Joining paths between file systems should really be left to the os.path APIs.

--
Marc-Andre Lemburg, eGenix.com Software GmbH
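Written out as code, the two assumptions might look like this (a hypothetical helper; the name and the fs_encoding parameter are illustrative only, not a proposed API):

def open_fs(name, mode="r", fs_encoding="utf-8"):
    # (a) byte strings are taken to be in the file system's encoding already
    # (b) Unicode is converted to that encoding inside the open() wrapper
    if isinstance(name, unicode):
        name = name.encode(fs_encoding)
    return open(name, mode)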

Setting site.encoding is certainly the wrong thing to do. How can you know all users of your system use latin-1?
On my system, the following works fine
On Unix, your best bet for file names is to trust the user's locale settings. If you do that, open will accept Unicode objects. What is your locale?
Is that the correct approach? Apparently Python's file object doesn't do this under the covers. Should it?
No. There is no established convention, on Unix, for how to do non-ASCII file names. If anything, following the user's locale setting is the most reasonable thing to do; this should be in sync with how the user's terminal displays characters. The Python installation's default encoding is almost useless, and shouldn't be changed.

On Windows, things are much better, since there is a notion of Unicode file names in the system.

Regards,
Martin
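Following the user's locale, as recommended here, might look like this on Unix (a sketch; locale.nl_langinfo and CODESET are only available there):

import locale
locale.setlocale(locale.LC_ALL, "")            # adopt the user's settings
encoding = locale.nl_langinfo(locale.CODESET)  # e.g. 'ISO-8859-1' or 'UTF-8'
f = open(u"abc\xe4\xfc\xdf.txt".encode(encoding), "w")
f.close()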

"Martin" == Martin v Loewis <martin@v.loewis.de> writes:
>> What's the correct way to deal with filenames in a Unicode
>> environment? Consider this:
>>
>> >>> import site
>> >>> site.encoding
>> 'latin-1'

Martin> Setting site.encoding is certainly the wrong thing to do. How
Martin> can you know all users of your system use latin-1?

Why is setting site.encoding appropriate to your environment at the time you install Python wrong? I can't know that all users of my system (whatever the definition of "my system" is) will use latin-1. Somewhere along the way I have to make some assumptions, however. On any given computer I assume the people who install Python will set site.encoding appropriate to their environment.

The example I used was latin-1 simply because the folks I'm working with are in Austria and they came up with the example. I assume the best default encoding for them is latin-1. The application writers themselves will have no problem restricting internal filenames to be ascii. I assume if users want to save files of their own, they will choose characters from the Unicode character set they use most frequently. So, my example used latin-1. I could just as easily have chosen something else.

Martin> On my system, the following works fine
Martin>
Martin> >>> import locale ; locale.setlocale(locale.LC_ALL, "")
Martin> 'LC_CTYPE=de_DE;LC_NUMERIC=de_DE;LC_TIME=de_DE;LC_COLLATE=C;LC_MONETARY=de_DE;LC_MESSAGES=de_DE;LC_PAPER=de_DE;LC_NAME=de_DE;LC_ADDRESS=de_DE;LC_TELEPHONE=de_DE;LC_MEASUREMENT=de_DE;LC_IDENTIFICATION=de_DE'
Martin> >>> a = "abc\xe4\xfc\xdf.txt"
Martin> >>> u = unicode(a, "latin-1")
Martin> >>> open(u, "w")
Martin> <open file 'abcäüß.txt', mode 'w' at 0x8173e88>

Martin> On Unix, your best bet for file names is to trust the user's
Martin> locale settings. If you do that, open will accept Unicode
Martin> objects.

Martin> What is your locale?

The above setlocale call prints

'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'

I can't get to the machines in Austria right now to see how their locales are set, though I suspect they haven't fiddled their LC_* environment, because they are having the problems I described.

>> Is that the correct approach? Apparently Python's file object
>> doesn't do this under the covers. Should it?

Martin> No. There is no established convention, on Unix, for how to do
Martin> non-ASCII file names. If anything, following the user's locale
Martin> setting is the most reasonable thing to do; this should be in
Martin> sync with how the user's terminal displays characters. The Python
Martin> installation's default encoding is almost useless, and shouldn't
Martin> be changed.

Martin> On Windows, things are much better, since there is a notion of
Martin> Unicode file names in the system.

This suggests to me that the Python docs need some introductory material on this topic. It appears to me that there are two people in the Python community who live and breathe this stuff: you, Martin, and Marc-André. For most of the rest of us, especially if we've never consciously written code for consumption outside an ascii environment, the whole thing just looks like a quagmire.

Skip

Well, then accept the assumption that almost everybody will use an ASCII superset. That may still be wrong for EBCDIC users, but those are rare on Unix. However, on our typical Unix system, three different encodings are in use: ISO-8859-1 (for tradition), ISO-8859-15 (because it has the Euro), and UTF-8 (because it removes all the limitations). Notice that all of our users speak German, and we still could not set a meaningful site.encoding except 'ascii'.
On any given computer I assume the people who install Python will set site.encoding appropriate to their environment.
That is probably wrong. Most users will install precompiled packages, and thus site.py will have the value that the package held, which will be 'ascii' for most packages.
Well, latin-1 does not have a Euro sign, which may be more and more of a problem.
That is a meaningful assumption. However, it is one that you have to make in your application, not one that you should expect users to make in their Python installations.
The above setlocale call prints
'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'
You may want to extend your system to support the same configuration that your users have, i.e. you might want to install an Austrian locale on your system and set LANG to de_AT. If your system also sets all the LC_ variables for you, I recommend unsetting them - setting LANG is enough (to override all other LC_ variables, setting LC_ALL to de_AT should also work).
Even if they set the environment variables, they'd still have the problem, because your application doesn't call setlocale. I do expect that they have set LANG to de_AT, or de_AT.ISO-8859-1. Perhaps they also have this problem because they use Python 2.1 or earlier.
Well, I'd happily review any introductory material somebody else writes :-) Regards, Martin

participants (4):

- M.-A. Lemburg
- Martin v. Loewis
- Neil Hodgson
- Skip Montanaro