[I18n-sig] Passing unicode strings to file system calls

Bleyer, Michael MBleyer@DEFiNiENS.com
Wed, 17 Jul 2002 18:16:39 +0200

Assume I have a list of unicode strings in UTF-16-le. Reading and parsing
the list all works really fine.

Now I want to create/copy a number of files and I want the file/directory
names to be these unicode strings.
When I give a unicode string to a file system call like
Python converts the unicode string to a "regular" string using the default
site encoding (which usually fails if that encoding is 'ascii').
I can influence this by encode()'ing myself before I pass the string to the
system function call, so far so good.
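
A minimal sketch of the behaviour described above, assuming Python 3 (where
str is the unicode type) and a hypothetical filename. Encoding with 'ascii'
fails, while an explicitly chosen codec that covers the characters succeeds:

```python
# Hypothetical filename containing one non-ASCII character.
name = "caf\u00e9.txt"

# Encoding with 'ascii' fails for this name.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with a codec that covers the characters works.
latin1_bytes = name.encode("latin-1")
```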

However, I do have a problem if my list contains unicode strings whose
characters belong to different, incompatible encodings (e.g. ISO Latin-1 and
some Asian encoding), as I cannot use the same encoding conversion for all
strings: some will fail. I can of course convert to UTF-8, which will always
work, but the filenames turn out to be garbage (because the OS does not
interpret them as UTF-8 but in the local encoding).
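
Both effects can be reproduced in a small sketch. It assumes Python 3, two
hypothetical names, and Latin-1 and Shift_JIS as stand-ins for the "local"
encodings; the helper encodable() is illustrative only:

```python
# Hypothetical names: one Western European, one Japanese.
names = ["gr\u00fc\u00dfe.txt", "\u65e5\u672c\u8a9e.txt"]

def encodable(name, encoding):
    """Return True if name can be encoded with the given codec."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# No single legacy codec covers both names.
latin1_results = [encodable(n, "latin-1") for n in names]
sjis_results = [encodable(n, "shift_jis") for n in names]

# UTF-8 always encodes, but the bytes read as garbage when they are
# interpreted as Latin-1 instead of UTF-8.
utf8_bytes = names[0].encode("utf-8")
misread = utf8_bytes.decode("latin-1")
```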

My question is thus: since modern-day operating systems claim to support
unicode (I assume) in filenames, how do I pass a unicode string directly to
a system function call without having to convert to a "localized" encoding?
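
For comparison, a hedged sketch of how this looks in Python 3, which is an
assumption and not the Python version this post was written against. There,
file system calls accept str directly and the interpreter does the conversion
itself. The names and the create_files() helper are hypothetical, and the
sketch assumes the file system encoding can represent the names (e.g. UTF-8):

```python
import os
import tempfile

def create_files(directory, names):
    """Create an empty file per unicode name; return the sorted listing."""
    for name in names:
        # The str path is passed as-is; there is no manual encode() call.
        with open(os.path.join(directory, name), "w"):
            pass
    return sorted(os.listdir(directory))

names = ["gr\u00fc\u00dfe.txt", "\u65e5\u672c\u8a9e.txt"]
with tempfile.TemporaryDirectory() as tmp:
    listing = create_files(tmp, names)
```

Some file systems normalize unicode names, so the listing may not compare
equal to the input everywhere.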

Alternatively, how can I find out the "proper" or "legal" encoding for a
unicode string just by looking at the string (i.e. not with a brute-force
try-encode-except trial-and-error loop)?

As a side problem: how do I deal with filename length limits, since these
are actually byte limits not character limits?
If I do a u''[:255] followed by an encode I end up with a string
that's at most 255 characters long, but may be longer than 255 bytes after
encoding.
If I do encode followed by ''[:255] I get at most 255 bytes but my string
may be illegal because I cut off in the middle of a 3-byte character.
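
One possible way around this is to encode first, cut at the byte limit, and
then drop any incomplete trailing character. A minimal sketch, assuming
Python 3 and UTF-8; truncate_to_bytes() is a hypothetical helper, not an
existing library function:

```python
def truncate_to_bytes(name, limit=255, encoding="utf-8"):
    """Encode name, keeping at most `limit` bytes and no partial character."""
    data = name.encode(encoding)
    if len(data) <= limit:
        return data
    # Cutting at `limit` can only damage the last character. Decoding with
    # errors="ignore" discards that incomplete tail; re-encoding then yields
    # a valid byte string no longer than `limit`.
    return data[:limit].decode(encoding, errors="ignore").encode(encoding)

# Five 3-byte characters (15 bytes) cut to 10 bytes keep three whole ones.
short = truncate_to_bytes("\u65e5" * 5, limit=10)
```

This keeps code points whole, but it may still separate a combining mark
from its base character.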

Any insights and suggestions greatly appreciated.