python 2.7.12 on Linux behaving differently than on Windows

eryk sun eryksun at gmail.com
Fri Dec 9 06:01:57 EST 2016


On Fri, Dec 9, 2016 at 7:41 AM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> Frankly, I think that Apple HFS+ is the only modern file system that gets
> Unicode right. Not only does it restrict file systems to valid UTF-8
> sequences, but it forces them to a canonical form to avoid the é é gotcha,
> and treats file names as case preserving but case insensitive.

Windows NTFS doesn't normalize names to a canonical form. It also
allows lone surrogate codes, which are invalid in UTF-16.
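
To make those two points concrete without needing an NTFS volume, here
is a minimal sketch using only the standard library. The helper names
same_name_after_normalization and encode_utf16 are made up for this
illustration; they are not part of any Windows or Python API. It only
shows what the strings look like on the Python side, not what NTFS
does internally.

```python
import unicodedata

composed = '\u00e9'      # 'é' as a single code point (NFC form)
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

def same_name_after_normalization(a, b, form='NFC'):
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def encode_utf16(name, allow_lone_surrogates=False):
    """Encode to UTF-16LE; 'surrogatepass' lets lone surrogates through."""
    errors = 'surrogatepass' if allow_lone_surrogates else 'strict'
    return name.encode('utf-16le', errors)

# The two spellings of 'é' are different strings until normalized.
print(composed == decomposed)
print(same_name_after_normalization(composed, decomposed))

# A lone surrogate is rejected by the strict codec but can be forced through.
print(encode_utf16('\ud800', allow_lone_surrogates=True))
```

A file system that stores names without normalizing them sees the two
spellings of 'é' as two different names.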

For case insensitive matches it converts to upper case, but the
conversion table it uses is extremely conservative. Here's a simple
function to convert a string to upper case using NT's runtime library
function RtlUpcaseUnicodeChar:

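    import ctypes

    ntdll = ctypes.WinDLL('ntdll')
    # RtlUpcaseUnicodeChar takes and returns one WCHAR (a UTF-16 code unit).
    ntdll.RtlUpcaseUnicodeChar.argtypes = (ctypes.c_ushort,)
    ntdll.RtlUpcaseUnicodeChar.restype = ctypes.c_ushort

    def upcase(s):
        up = []
        for c in s:
            b = bytearray()
            # Upcase each UTF-16 code unit of the character separately.
            for unit in memoryview(c.encode('utf-16le')).cast('H'):
                unit_up = ntdll.RtlUpcaseUnicodeChar(unit)
                b += unit_up.to_bytes(2, 'little')
            up.append(b.decode('utf-16le'))
        return ''.join(up)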

For example:

    >>> upcase('abcd')
    'ABCD'
    >>> upcase('αβψδ')
    'ΑΒΨΔ'
    >>> upcase('ßẞıİÅσςσ')
    'ßẞıİÅΣςΣ'

Attempting to create two files named 'ßẞıİÅσςσ' and 'ßẞıİÅΣςΣ' in the
same NTFS directory fails, as expected:

    >>> s = 'ßẞıİÅσςσ'
    >>> open(s, 'x').close()
    >>> open(upcase(s), 'x').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileExistsError: [Errno 17] File exists: 'ßẞıİÅΣςΣ'
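
The same check can be wrapped in a small portable helper. This is only
a sketch: names_collide is a hypothetical name, and it tests whatever
file system backs the temporary directory, which may not be NTFS.

```python
import os
import tempfile

def names_collide(first, second):
    """Return True if `second` cannot be created next to `first`.

    Both names are created with mode 'x' in a fresh temporary
    directory, so the result reflects how the file system holding
    that directory compares names.
    """
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, first), 'x').close()
        try:
            open(os.path.join(tmp, second), 'x').close()
        except FileExistsError:
            return True
        return False

print(names_collide('spam', 'spam'))
print(names_collide('spam', 'eggs'))
```

On NTFS, names_collide(s, upcase(s)) should report a collision for the
name used above, while a case-sensitive file system would not.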

Note that Windows treats Python's standard case conversions of this name
as four distinct names:

    >>> open(s.upper(), 'x').close()
    >>> open(s.lower(), 'x').close()
    >>> open(s.casefold(), 'x').close()
    >>> import os
    >>> os.listdir()
    ['ssssıi̇åσσσ', 'SSẞIİÅΣΣΣ', 'ßßıi̇åσςσ', 'ßẞıİÅσςσ']
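
The reason these conversions don't collide can be seen from Python
alone. The sketch below assumes nothing about Windows; it only shows
that str.upper, str.lower and str.casefold apply full case mappings,
which can change a string's length or fold distinct letters together,
something a conversion that maps one 16-bit code unit to one 16-bit
code unit cannot do.

```python
s = 'ßẞıİÅσςσ'

variants = {
    'original': s,
    'upper': s.upper(),
    'lower': s.lower(),
    'casefold': s.casefold(),
}

# Full case mappings are not always one-to-one:
print('ß'.upper() == 'SS')          # one character becomes two
print('İ'.lower() == 'i\u0307')     # 'i' plus COMBINING DOT ABOVE
print('ς'.casefold() == 'σ')        # final sigma folds to regular sigma

# All four variants are different strings, with different lengths.
for label, name in variants.items():
    print(label, len(name), ascii(name))
print(len(set(variants.values())))
```

Since none of these variants is what RtlUpcaseUnicodeChar produces for
the others, NTFS has no reason to consider them the same name.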

