python 2.7.12 on Linux behaving differently than on Windows
eryk sun
eryksun at gmail.com
Fri Dec 9 06:01:57 EST 2016
On Fri, Dec 9, 2016 at 7:41 AM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> Frankly, I think that Apple HFS+ is the only modern file system that gets
> Unicode right. Not only does it restrict file systems to valid UTF-8
> sequences, but it forces them to a canonical form to avoid the é é gotcha,
> and treats file names as case preserving but case insensitive.
Windows NTFS doesn't normalize names to a canonical form. It also
allows lone surrogate codes, which are invalid UTF-16.
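To make those two points concrete, here is a minimal sketch in plain Python 3 that does not touch the file system. It shows the two spellings of "é" that a non-normalizing file system would treat as different names, and a lone surrogate that a strict UTF-16 encoder rejects. The variable names are only for illustration.

```python
import unicodedata

# Two spellings of "é": precomposed (NFC) and decomposed (NFD).
composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

# Without normalization the two strings compare unequal.
print(composed == decomposed)
# Normalizing either one to the other's form makes them match.
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)

# A lone high surrogate is not valid UTF-16, so strict encoding fails.
lone = '\ud800'
try:
    lone.encode('utf-16le')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)

# The 'surrogatepass' error handler encodes the code unit as-is.
raw = lone.encode('utf-16le', 'surrogatepass')
print(raw)
```

A file system that stores whatever 16-bit code units it is given, as NTFS does, will accept both spellings of "é" as separate names and will also accept the lone surrogate.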
For case insensitive matches it converts to upper case, but the
conversion table it uses is extremely conservative. Here's a simple
function to convert a string to upper case using NT's runtime library
function RtlUpcaseUnicodeChar:
    import ctypes

    ntdll = ctypes.WinDLL('ntdll')

    def upcase(s):
        up = []
        for c in s:
            b = bytearray()
            # Upcase each UTF-16 code unit of the character separately.
            for unit in memoryview(c.encode('utf-16le')).cast('H'):
                unit_up = ntdll.RtlUpcaseUnicodeChar(unit)
                b += unit_up.to_bytes(2, 'little')
            up.append(b.decode('utf-16le'))
        return ''.join(up)
For example:
    >>> upcase('abcd')
    'ABCD'
    >>> upcase('αβψδ')
    'ΑΒΨΔ'
    >>> upcase('ßẞıİÅσςσ')
    'ßẞıİÅΣςΣ'
Attempting to create two files named 'ßẞıİÅσςσ' and 'ßẞıİÅΣςΣ' in the
same NTFS directory fails, as expected:
    >>> s = 'ßẞıİÅσςσ'
    >>> open(s, 'x').close()
    >>> open(upcase(s), 'x').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileExistsError: [Errno 17] File exists: 'ßẞıİÅΣςΣ'
Note that Windows treats the names produced by Python's standard case
conversions as all distinct:
    >>> import os
    >>> open(s.upper(), 'x').close()
    >>> open(s.lower(), 'x').close()
    >>> open(s.casefold(), 'x').close()
    >>> os.listdir()
    ['ssssıi̇åσσσ', 'SSẞIİÅΣΣΣ', 'ßßıi̇åσςσ', 'ßẞıİÅσςσ']
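The four names in that listing can be checked on any platform, since they depend only on Python's string methods. This is a small sketch assuming Python 3.3 or later (for str.casefold); the dictionary and its labels are only for illustration.

```python
s = 'ßẞıİÅσςσ'

# The original name plus Python's three standard case conversions.
variants = {
    'original': s,
    'upper': s.upper(),
    'lower': s.lower(),
    'casefold': s.casefold(),
}

# Print each variant in escaped form along with its length; the
# conversions that expand a character (such as ß to SS) give longer
# strings than the original.
for label, name in variants.items():
    print(label, ascii(name), len(name))

# Count how many of the four strings are actually different.
distinct = len(set(variants.values()))
print(distinct)
```

Because these conversions can map one character to several, they can produce names that a one-code-unit-to-one-code-unit upcase table never treats as equal to the original.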