[Python-Dev] fun with unicode, part 1

Tim Peters tim_one@email.msn.com
Tue, 2 May 2000 03:20:52 -0400


[Guido asks good questions about how Windows deals w/ Unicode filenames,
 last Thursday, but gets no answers]

> ...
> I'd like to solve this problem, but I have some questions: what *IS*
> the encoding used for filenames on Windows?  This may differ per
> Windows version; perhaps it can differ per drive letter?  Or per
> application or per thread?  On Windows NT, filenames are supposed to
> be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> with a given Unicode string for its name, in a C program?  I suppose
> there's a Win32 API call for that which has a Unicode variant.
>
> On Windows 95/98, the Unicode variants of the Win32 API calls don't
> exist.  So what is the poor Python runtime to do there?
>
> Can Japanese people use Japanese characters in filenames on Windows
> 95/98?  Let's assume they can.  Since the filesystem isn't Unicode
> aware, the filenames must be encoded.  Which encoding is used?  Let's
> assume they use Microsoft's multibyte encoding.  If they put such a
> file on a floppy and ship it to Linköping, what will Fredrik see as
> the filename?  (I.e., is the encoding fixed by the disk volume, or by
> the operating system?)
>
> Once we have a few answers here, we can solve the problem.  Note that
> sometimes we'll have to refuse a Unicode filename because there's no
> mapping for some of the characters it contains in the filename
> encoding used.

I just thought I'd repeat the questions <wink>.  However, I don't think
you'll really want the answers -- Windows is a legacy-encrusted mess, and
there are always many ways to get a thing done in the end.  For example ...

> Question: how does Fredrik create a file with a Euro
> character (u'\u20ac') in its name?

This particular one is shallower than you were hoping:  in many of the
TrueType fonts (e.g., Courier New but not Courier), Windows extended its
Latin-1 encoding (code page 1252) by mapping the Euro symbol to the
"control character" 0x80.  So I can get a Euro symbol into a file name just
by holding Alt and typing 0128 on the numeric keypad.  This works even on US
Win98 (which has no visible Unicode support) -- but was not supported in US
Win95.
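
The 0x80 mapping is easy to check from Python itself.  What follows is only a
sketch, assuming a Python 3 interpreter and its stock "cp1252" and "latin-1"
codecs; the helper name can_encode is made up for illustration and is not part
of any API.  It also shows the "refuse the filename" case Guido mentions, where
the target encoding has no mapping for a character.

```python
EURO = "\u20ac"  # the Euro sign from Guido's question


def can_encode(name, encoding):
    """Hypothetical helper: True if every character of name maps into encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Code page 1252 assigns the Euro sign to the single byte 0x80 ...
euro_bytes = EURO.encode("cp1252")

# ... while plain Latin-1 reads that same byte as the C1 control U+0080.
as_latin1 = b"\x80".decode("latin-1")

print(euro_bytes, repr(as_latin1))
print(can_encode(EURO, "cp1252"), can_encode(EURO, "latin-1"))
```

So a name containing the Euro survives a trip through cp1252 but would have to
be refused if the target encoding were strict Latin-1.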

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop-
    at-work-so-can-verify-ms-sure-got-japanese-characters-into-the-
    filenames-somehow-but-doubt-it's-via-unicode-ly y'rs  - tim