unicode filenames

Just just at xs4all.nl
Sun Feb 16 16:09:41 EST 2003


In article <wzptps13t5.fsf at nono.cs.uu.nl>,
 Piet van Oostrum <piet at cs.uu.nl> wrote:

> >>>>> David Eppstein <eppstein at ics.uci.edu> (DE) wrote:
> 
> DE> Under Mac OS X, the shell displays text (e.g. from cat, or from ls 
> DE> without the -q option) as utf-8 by default, and the Finder (gui file 
> DE> browser) uses utf-8 for accented characters in file names.  So I infer 
> DE> that the correct interpretation of filenames under my OS is utf-8.
> DE> But other unixes may differ...
> 
> On Mac OS X, it is a bit more complicated. First cat will indeed show the
> unicode (utf-8) contents of a file, but ls won't display filenames with
> non-ASCII characters right. At least not in 10.1.5. Maybe 10.2 does it better.
> Like if my filename is "€200", ls will display "???200".

Although Terminal.app supports utf-8 in 10.2, what you describe is 
still true.

> Secondly, the filesystem requires the unicode characters to be normalized,
> which means that accented characters like "é" will be broken up into "e"
> followed by "´". So if the finder has a file with name "é200", the bytes
> used in the filename will be 0x65 followed by 0xCC 0x81 (unicode character
> 0x301). ls will print this as "e??200".

You don't have to worry about that: the file system will _give_ you 
normalized unicode, but it does the right thing if you feed it 
non-normalized unicode.
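
To make the byte sequence Piet mentions concrete, here is a small sketch 
using the standard unicodedata module. It is written for a current 
Python 3, not for 2.3, and it assumes that plain NFD is a close enough 
stand-in for what the file system does (the file system's own 
decomposition may differ in details):

```python
import unicodedata

# "é200" written with the precomposed character U+00E9.
composed = "\u00e9200"

# Canonical decomposition: "é" becomes "e" followed by U+0301
# (combining acute accent). Assumption: NFD approximates the
# normalization the file system applies.
decomposed = unicodedata.normalize("NFD", composed)

# UTF-8 bytes of the decomposed name: 0x65, then 0xCC 0x81, then "200".
utf8_bytes = decomposed.encode("utf-8")
print(utf8_bytes)  # b'e\xcc\x81200'

# Recomposing gives back the original string.
recomposed = unicodedata.normalize("NFC", decomposed)
```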

Btw, in Python 2.3 (current CVS, not 2.3a1), the file system calls fully 
support unicode strings on OS X. I've also got a patch pending that 
makes os.listdir() return unicode strings when appropriate: 
http://python.org/sf/683592. I think this has a fair chance to make it 
in.
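
One practical consequence of getting normalized names back: a name 
returned by os.listdir() may not compare equal to the string you 
originally used, even though both refer to the same file. A minimal 
sketch of a normalization-insensitive lookup follows. It is Python 3, 
and same_filename and find_entry are hypothetical helper names, not part 
of the patch or of the os module:

```python
import os
import unicodedata


def same_filename(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_entry(directory, wanted):
    """Return the entry in `directory` that matches `wanted`, or None.

    Hypothetical helper: the match ignores whether accented characters
    are precomposed or decomposed.
    """
    for name in os.listdir(directory):
        if same_filename(name, wanted):
            return name
    return None
```

With this, find_entry(".", "\u00e9200") would find the file whether the 
directory listing reports the name composed or decomposed.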

Just



