[Python-3000] Filename: unicode normalization

Guido van Rossum guido at python.org
Wed Oct 1 01:23:01 CEST 2008


Martin answered a similar question from Jack Jansen in another thread.
OSX doesn't normalize either. It's unlikely to confuse users in
practice.

On Tue, Sep 30, 2008 at 4:11 PM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> Since it's hard to follow the filename thread on two mailing lists, I'm
> starting a new thread only on python-3000 about Unicode normalization of
> filenames.
>
> Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC
> to create a file, you have to reuse NFC to open your file (and the same for
> NFD).
>
> Python2 example to create files in the different forms:
>>>> name=u'xäx'
>>>> from unicodedata import normalize
>>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>>> import os
>>>> os.listdir('.')
> ['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>>> os.listdir(u'.')
> [u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']
>
> Directory listing using Python3:
>>>> import os
>>>> [name.encode('utf-8') for name in os.listdir('.')]
> [b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x',
> b'NFKD-xa\xcc\x88x']
>>>> os.listdir('.')
> ['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']
>
> Same results, correct. Then try to open files:
>>>> open(normalize('NFC', 'NFC-xäx')).close()
>>>> open(normalize('NFD', 'NFC-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>>> open(normalize('NFD', 'NFD-xäx')).close()
>>>> open(normalize('NFC', 'NFD-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFD-xäx'
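
The failures above follow from the two forms being different code point sequences, and therefore different UTF-8 byte strings, even though they render identically. A minimal Python 3 sketch, assuming the same 'xäx' name used in the quoted examples (the variable names are illustrative only):

```python
import unicodedata

# 'x' + U+00E4 (precomposed a-umlaut) + 'x'
name = "x\u00e4x"

nfc = unicodedata.normalize("NFC", name)  # keeps U+00E4
nfd = unicodedata.normalize("NFD", name)  # 'a' followed by U+0308

# The two strings look the same on screen but do not compare equal.
print(nfc == nfd)

# Their UTF-8 encodings differ too; a filesystem that does not
# normalize treats these bytes as two distinct names.
print(nfc.encode("utf-8"))
print(nfd.encode("utf-8"))

# Normalizing both sides to one form makes them comparable again.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

The byte strings printed here match the 'NFC-' and 'NFD-' entries in the os.listdir() output quoted earlier.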
>
> If the user chooses a result from os.listdir(): no problem (if he has good
> eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx'
> (NFC) :-D).
>
> If the user enters the filename using the keyboard (on the command line or a
> GUI dialog), you have to hope that the keyboard input uses the same
> normalization form as the one the filename was created with...
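
One possible way for an application to cope with typed names is to compare under a common normalization form and then open whichever name the directory listing actually reported. This is only a sketch: `find_matching` is a hypothetical helper, not something from the standard library or from this thread, and the sample listing is hard-coded rather than read from disk.

```python
import unicodedata

def find_matching(names, wanted, form="NFC"):
    """Return the entries of `names` equal to `wanted` once both
    are normalized to `form` (hypothetical helper)."""
    target = unicodedata.normalize(form, wanted)
    return [n for n in names if unicodedata.normalize(form, n) == target]

# Stand-in for os.listdir('.'): one NFD name and one NFC name.
listing = ["NFD-xa\u0308x", "NFC-x\u00e4x"]

# The user typed the name with a precomposed U+00E4.
typed = "NFD-x\u00e4x"

matches = find_matching(listing, typed)
# Open matches[0] exactly as listed, not the typed string.
print(matches)
```

In real use the first argument would be os.listdir('.'), and the returned entry is passed to open() unchanged so that the bytes match what is stored on disk.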
>
> --
> Victor Stinner aka haypo
> http://www.haypocalc.com/blog/



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

