[Python-Dev] Import and unicode: part two

Toshio Kuratomi a.badger at gmail.com
Tue Jan 25 17:35:25 CET 2011


On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> > 
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >  MacOS
> 
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right here you've already broken Python modules on OSX.
>
Others have been saying that Mac OSX's HFS+ uses UTF-8.  But the question is
not whether UTF-16 or UTF-8 is used by HFS+.  It's whether you can sensibly
decide on an encoding from the type of system that is being run on.  This
could be querying the filesystem or a check on sys.platform or some other
method.  I don't know what detection the current code does.

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel so based on people's argument that the convention
is and should be that filenames are utf-8 and anything else is
a misconfigured system -- python should mandate that its module filenames on
Linux are utf-8 rather than using the user's locale settings.
> 
> And as far as I know, Linux software/FS generally use NFC (I've already seen this issue cause trouble)
> 
Linux FS's are bytes with a small blacklist (so you can't use the NULL byte
in a filename, for instance).  Linux software would be free to use any
normal form that they want.  If one software used NFC and another used NFD,
the FS would record two separate files with two separate filenames.  Other
programs might or might not display this correctly.

Example:
<zsh>$ touch cafe
<zsh>$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) 
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
>>> [u'people-etc-changes.txt', u'cafe\u0301', u'cafe', u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
>>> ['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe', 'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D

<zsh>$ ls -al .
drwxrwxr-x.  2 badger badger  4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger  4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger     0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger     0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger     0 Jan 25 07:46 café

<zsh>$ ls -al cafe
-rw-rw-r--.  1 badger badger     0 Jan 25 07:45 cafe
<zsh>$ ls -al cafe?
-rw-rw-r--.  1 badger badger     0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two characters
instead of one.  However, when you view these files in dolphin (the KDE file
manager) you properly see café repeated twice.

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110125/9963a847/attachment.pgp>


More information about the Python-Dev mailing list