Windows: Remove support of bytes filenames in the os module?
Hi,

Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames.

The rationale is quite simple: the native Windows type for filenames is Unicode, and Windows behaves oddly when you use bytes. For example, os.listdir(b'.') gives you paths which cannot be used with open() for filenames which are not encodable to the ANSI code page. Unencodable characters are replaced with "?". The following issue was opened to document this weird behaviour (but the doc was never completed):

"Document that bytes OS API can returns unusable results on Windows" http://bugs.python.org/issue16700

When the new os.scandir() API was designed, I asked to *not* support bytes filenames since they are "broken by design". https://www.python.org/dev/peps/pep-0471/

Recently, a user complained that os.walk() doesn't work with bytes on Windows anymore:

"Regression: os.walk now using os.scandir() breaks bytes filenames on windows" http://bugs.python.org/issue25911

Serhiy Storchaka just pushed a change to reintroduce bytes support on Windows in os.walk(), but I would prefer to do the *opposite*: drop support for bytes filenames on Windows.

Are we brave enough to force users to use the "right" type for filenames?

--

On Python 2, it wasn't possible to use Unicode for filenames: many functions fail badly with Unicode, especially when you mix bytes and Unicode.

On Python 3, Unicode is the "natural" type, most Python functions prefer Unicode, and PEP 383 (surrogateescape) makes it possible to safely use Unicode on UNIX even with undecodable filenames (invalid bytes are stored as Unicode surrogate characters).

Victor
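The "?" replacement described above can be imitated on any platform by doing the encoding by hand. This is only a sketch under assumptions: cp1252 stands in for "the ANSI code page", and the file name is made up for illustration. The real conversion happens inside the Windows bytes APIs, not in Python code like this.

```python
# Imitate what a bytes listing does to a name that the ANSI code page
# cannot represent. cp1252 is only a stand-in for "the ANSI code page".
name = "\u6587\u4ef6.txt"          # hypothetical file name (two CJK characters)

as_ansi = name.encode("cp1252", errors="replace")
print(as_ansi)                      # b'??.txt'

# Decoding the bytes again does not give the original name back, so a
# path built from them would point at a different (nonexistent) file.
round_tripped = as_ansi.decode("cp1252")
print(round_tripped == name)        # False
```

The information is lost at encoding time, which is why no later call can recover the real name from the bytes result.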
2016-02-08 15:32 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames. (...) Recently, a user complained that os.walk() doesn't work with bytes on Windows anymore: (...)
It's also sad to see that deprecation warnings are completely ignored. Python 3.3 was released in 2012, more than 3 years ago. I would prefer to show deprecation warnings by default. But I know that it's an old debate: developers vs users :-) I like to see my users as potential developers ;-) Victor
On Feb 8, 2016, at 06:40, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-08 15:32 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames. (...) Recently, a user complained that os.walk() doesn't work with bytes on Windows anymore: (...)
It's also sad to see that deprecation warnings are completely ignored. Python 3.3 was released in 2012, more than 3 years ago.
I would prefer to show deprecation warnings by default. But I know that it's an old debate: developers vs users :-) I like to see my users as potential developers ;-)
This is tracked in http://bugs.python.org/issue24294: "DeprecationWarnings should be visible by default in the interactive REPL".

IPython has enabled them only if they come from __main__. From totally subjective experience, that has already pushed a few libraries to update their code to new APIs [1].

-- M

[1] or sometimes to wrap code in ignore warnings...
Victor _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/bussonniermatthias%40gmai...
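For readers who have never seen one of these warnings, here is a minimal sketch of how the default filter hides them and how to opt in. The function `old_api` is hypothetical and exists only for this illustration; the `warnings` calls are from the standard library.

```python
import warnings

def old_api():
    # Hypothetical deprecated function, for illustration only.
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)
    return 42

# Record warnings instead of printing them, and switch DeprecationWarning
# from "ignored" to "shown" for the duration of the block.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("default", DeprecationWarning)
    result = old_api()

print(result, [str(w.message) for w in caught])
```

For a whole program, running `python -W default::DeprecationWarning script.py` has a similar effect without touching the code.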
On Mon, 8 Feb 2016 at 06:33 Victor Stinner <victor.stinner@gmail.com> wrote:
Hi,
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames.
The rationale is quite simple: the native Windows type for filenames is Unicode, and Windows behaves oddly when you use bytes. For example, os.listdir(b'.') gives you paths which cannot be used with open() for filenames which are not encodable to the ANSI code page. Unencodable characters are replaced with "?". The following issue was opened to document this weird behaviour (but the doc was never completed):
"Document that bytes OS API can returns unusable results on Windows" http://bugs.python.org/issue16700
When the new os.scandir() API was designed, I asked to *not* support bytes filenames since they are "broken by design". https://www.python.org/dev/peps/pep-0471/
Recently, a user complained that os.walk() doesn't work with bytes on Windows anymore:
"Regression: os.walk now using os.scandir() breaks bytes filenames on windows" http://bugs.python.org/issue25911
Serhiy Storchaka just pushed a change to reintroduce bytes support on Windows in os.walk(), but I would prefer to do the *opposite*: drop support for bytes filenames on Windows.
Are we brave enough to force users to use the "right" type for filenames?
--
On Python 2, it wasn't possible to use Unicode for filenames, many functions fail badly with Unicode, especially when you mix bytes and Unicode.
On Python 3, Unicode is the "natural" type, most Python functions prefer Unicode, and PEP 383 (surrogateescape) makes it possible to safely use Unicode on UNIX even with undecodable filenames (invalid bytes are stored as Unicode surrogate characters).
If Unicode strings don't work in Python 2 then what is Python 2/3 code to do as a cross-platform solution if we completely remove bytes support in Python 3? Wouldn't that mean there is no common type between Python 2 & 3 that one can use which will work with the os module except native strings (which are difficult to get right)?
On 2/8/2016 12:02, Brett Cannon wrote:
If Unicode strings don't work in Python 2 then what is Python 2/3 code to do as a cross-platform solution if we completely remove bytes support in Python 3? Wouldn't that mean there is no common type between Python 2 & 3 that one can use which will work with the os module except native strings (which are difficult to get right)?
The only solution then would be to do `if not PY3: arg = arg.encode(...); os.SOMEFUNC(arg)`, pardon my pseudocode. It's annoying, but at least it's not a language syntax change, which means it isn't intractable, just an annoying roadblock. If I had my druthers it would be put off until after 2.x is well and truly dead.
On Monday, February 8, 2016 9:11 AM, Alexander Walters <tritium-list@sdamon.com> wrote:
On 2/8/2016 12:02, Brett Cannon wrote:
If Unicode strings don't work in Python 2 then what is Python 2/3 code to do as a cross-platform solution if we completely remove bytes support in Python 3? Wouldn't that mean there is no common type between Python 2 & 3 that one can use which will work with the os module except native strings (which are difficult to get right)?
The only solution then would be to do `if not PY3: arg = arg.encode(...); os.SOMEFUNC(arg)`, pardon my pseudocode.
That's exactly what you _don't_ want to do.

More generally, the assumption here is wrong. It's not true that you can't use Unicode for Windows filenames on Python 2. What is true is that you have to be a lot more careful about using Unicode _consistently_. And that Python 2 gives you very little help in doing so. And some third-party modules may make it harder on you. But if you always use unicode, `os.listdir(u'.')` calls FindFirstFileW instead of FindFirstFileA and gives you back unicode filenames, os.stat or open call _wstat or _wopen with those unicode filenames, etc.

The problem is that on POSIX, you're often better off using str everywhere, because Python 2.7 doesn't do surrogate escape. And once you're using str on one platform/unicode on the other for filenames, it gets very easy to mix str and unicode in other places (like strings you want to print out for the user or store in a database), and then you're in mojibake hell.

The io module, the pathlib backport, and six can help a bit (at the cost of performance and/or simplicity), but there's no easy answer--if there _were_ an easy answer, we wouldn't have Python 3 in the first place, right?
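The "use one type consistently" advice can be sketched as follows. This is written in Python 3 terms, since os.fsdecode and os.fsencode do not exist on Python 2; the helper name `as_text` is hypothetical, and the point is only that the type you pass in decides the type you get back, so conversion belongs at a single boundary.

```python
import os
import tempfile

def as_text(path):
    """Return path as str, decoding bytes with the filesystem encoding."""
    if isinstance(path, bytes):
        return os.fsdecode(path)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listings below have something to return.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    # A str argument yields str names; a bytes argument yields bytes names.
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))

    # Normalizing once at the boundary keeps the rest of the program in str.
    normalized = [as_text(n) for n in bytes_names]

print(str_names, bytes_names, normalized)
```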
Hi, 2016-02-08 18:02 GMT+01:00 Brett Cannon <brett@python.org>:
If Unicode strings don't work in Python 2 then what is Python 2/3 code to do as a cross-platform solution if we completely remove bytes support in Python 3? Wouldn't that mean there is no common type between Python 2 & 3 that one can use which will work with the os module except native strings (which are difficult to get right)?
IMHO we have to draw a line somewhere between Python 2 and Python 3. For some specific use cases, there is no good solution which works on both Python versions.

For filenames, there is no simple design on Python 2. bytes is the natural choice on UNIX, whereas Unicode is preferred on Windows. But it's difficult to handle two types in the same code base. As a consequence, most users use bytes on Python 2, which is a bad choice for Windows...

On Python 3, it's much simpler: always use Unicode. Again, PEP 383 helps on UNIX.

I wrote a PoC for Mercurial to always use Unicode, but the idea was rejected since Mercurial must support undecodable filenames on UNIX. It's possible on Python 3 (str+PEP 383), not on Python 2.

I tried to port Mercurial to Python 3 and use Unicode for filenames in the same change. It's probably better to do that in two steps: first port to Python 3, then use Unicode. I guess that the final change is to drop Python 2? I don't know if it's feasible for Mercurial.

Victor
On 9 February 2016 at 10:13, Victor Stinner <victor.stinner@gmail.com> wrote:
IMHO we have to draw a line somewhere between Python 2 and Python 3. For some specific use cases, there is no good solution which works on both Python versions.
For filenames, there is no simple design on Python 2. bytes is the natural choice on UNIX, whereas Unicode is preferred on Windows. But it's difficult to handle two types in the same code base. As a consequence, most users use bytes on Python 2, which is a bad choice for Windows...
On Python 3, it's much simpler: always use Unicode. Again, the PEP 383 helps on UNIX.
So if you were proposing "drop the bytes APIs everywhere" that might be acceptable (for Python 3). But of course it makes porting harder, so it's probably not a good idea until Python 2 is no longer relevant. Paul
On 8 February 2016 at 14:32, Victor Stinner <victor.stinner@gmail.com> wrote:
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames.
Everywhere? Or just on Windows? I can't tell from your email and I don't have a Unix system to hand to check.
The rationale is quite simple: the native Windows type for filenames is Unicode, and Windows behaves oddly when you use bytes. For example, os.listdir(b'.') gives you paths which cannot be used with open() for filenames which are not encodable to the ANSI code page. Unencodable characters are replaced with "?". The following issue was opened to document this weird behaviour (but the doc was never completed):
"Document that bytes OS API can returns unusable results on Windows" http://bugs.python.org/issue16700
OK, that seems fine, but obviously of limited interest to Unix users who aren't worried about cross-platform portability :-)
When the new os.scandir() API was designed, I asked to *not* support bytes filenames since they are "broken by design". https://www.python.org/dev/peps/pep-0471/
Recently, a user complained that os.walk() doesn't work with bytes on Windows anymore:
"Regression: os.walk now using os.scandir() breaks bytes filenames on windows" http://bugs.python.org/issue25911
Serhiy Storchaka just pushed a change to reintroduce bytes support on Windows in os.walk(), but I would prefer to do the *opposite*: drop support for bytes filenames on Windows.
But leave those APIs as Unix only? That seems like a regression, too (sure, the bytes APIs are problematic on Windows, but only for certain characters AIUI). Windows users currently using programs written using the bytes API (presumably originally intended for Unix where the bytes API was a deliberate choice), who don't hit any encoding issues currently, will see those programs broken for no reason other than "users using different character sets than you may have been hitting issues before". That seems like a weird justification to me...
Are we brave enough to force users to use the "right" type for filenames?
If it were *all* users I'd say it's worth considering. But practicality beats purity here IMO, and I feel that allowing people's code to be "portable by default" is a more important goal than enforcing encoding purity on a single platform. Paul
2016-02-08 19:26 GMT+01:00 Paul Moore <p.f.moore@gmail.com>:
On 8 February 2016 at 14:32, Victor Stinner <victor.stinner@gmail.com> wrote:
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames.
Everywhere? Or just on Windows? I can't tell from your email and I don't have a Unix system to hand to check.
I propose to only drop support for bytes filenames on Windows. Victor
Could we perhaps redefine bytes paths on Windows as utf8 and use Unicode everywhere internally?

I really don't like the idea of not being able to use bytes in cross platform code. Unless it's become feasible to use Unicode for lossless filenames on Linux - last I heard it wasn't.

Top-posted from my Windows Phone

-----Original Message-----
From: "Victor Stinner" <victor.stinner@gmail.com>
Sent: 2/9/2016 5:05
To: "Paul Moore" <p.f.moore@gmail.com>
Cc: "Python Dev" <Python-Dev@python.org>
Subject: Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?

2016-02-08 19:26 GMT+01:00 Paul Moore <p.f.moore@gmail.com>:
On 8 February 2016 at 14:32, Victor Stinner <victor.stinner@gmail.com> wrote:
Since 3.3, functions of the os module started to emit DeprecationWarning when called with bytes filenames.
Everywhere? Or just on Windows? I can't tell from your email and I don't have a Unix system to hand to check.
I propose to only drop support for bytes filenames on Windows. Victor
On Wed, Feb 10, 2016 at 12:37 PM, Steve Dower <python@stevedower.id.au> wrote:
I really don't like the idea of not being able to use bytes in cross platform code. Unless it's become feasible to use Unicode for lossless filenames on Linux - last I heard it wasn't.
It has, but only in Python 3 - anyone who needs to support 2.7 and arbitrary bytes in filenames can't use Unicode strings. ChrisA
On Wed, Feb 10, 2016 at 12:41:08PM +1100, Chris Angelico wrote:
On Wed, Feb 10, 2016 at 12:37 PM, Steve Dower <python@stevedower.id.au> wrote:
I really don't like the idea of not being able to use bytes in cross platform code. Unless it's become feasible to use Unicode for lossless filenames on Linux - last I heard it wasn't.
It has, but only in Python 3 - anyone who needs to support 2.7 and arbitrary bytes in filenames can't use Unicode strings.
Are you sure? Unless I'm confused, which I may be, I don't think you can specify file names with arbitrary bytes in Python 3.

Writing, and reading, filenames including odd bytes works in Python 2.7:

[steve@ando ~]$ python -c 'open("/tmp/abc\xD8\x01", "w").write("Hello World\n")'
[steve@ando ~]$ ls /tmp/abc*
/tmp/abc??
[steve@ando ~]$ python -c 'print open("/tmp/abc\xD8\x01", "r").read()'
Hello World

[steve@ando ~]$

And I can read the file using bytes in Python 3:

[steve@ando ~]$ python3.3 -c 'print(open(b"/tmp/abc\xD8\x01", "r").read())'
Hello World

[steve@ando ~]$

But Unicode fails:

[steve@ando ~]$ python3.3 -c 'print(open("/tmp/abc\xD8\x01", "r").read())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/abcØ\x01'

What Unicode string does one need to give in order to open file b"/tmp/abc\xD8\x01"?

I think one would need to find a valid unicode string which, when encoded to UTF-8, gives the byte sequence \xD8\x01, but since that's half of a surrogate pair it is an illegal UTF-8 byte sequence. So I don't think it can be done. Am I mistaken?

-- Steve
2016-02-10 11:18 GMT+01:00 Steven D'Aprano <steve@pearwood.info>:
[steve@ando ~]$ python3.3 -c 'print(open(b"/tmp/abc\xD8\x01", "r").read())'
Hello World
[steve@ando ~]$ python3.3 -c 'print(open("/tmp/abc\xD8\x01", "r").read())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/abcØ\x01'
What Unicode string does one need to give in order to open file b"/tmp/abc\xD8\x01"?
Use os.fsdecode(b"/tmp/abc\xD8\x01") to get the filename as a Unicode string; it will work.

Removing the 'b' in front of a byte string is not enough to convert an arbitrary byte string to Unicode :-D Encodings are more complex than that... See http://unicodebook.readthedocs.org/

The problem on Python 2 is that the UTF-8 encoder encodes surrogate characters, which is wrong. You cannot use an error handler to choose how to handle these surrogate characters.

On Python 3, you have a wide choice of builtin error handlers, and you can even write your own error handlers. Example with Python 3.6 and its new "namereplace" error handler.
>>> import os
>>> def format_filename(filename, encoding='ascii', errors='backslashreplace'):
...     return filename.encode(encoding, errors).decode(encoding)
...
>>> print(format_filename(os.fsdecode(b'abc\xff')))
abc\udcff
>>> print(format_filename(os.fsdecode(b'abc\xff'), errors='replace'))
abc?
>>> print(format_filename(os.fsdecode(b'abc\xff'), errors='ignore'))
abc
>>> print(format_filename(os.fsdecode(b'abc\xff') + "é", errors='namereplace'))
abc\udcff\N{LATIN SMALL LETTER E WITH ACUTE}
My locale encoding is UTF-8. Victor
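To make the answer to the "which Unicode string?" question concrete, here is a small sketch of the round trip. It spells out the codec and error handler explicitly, on the assumption of a POSIX system with a UTF-8 locale; on such a system os.fsdecode() and os.fsencode() apply the same pair, but the explicit form behaves the same everywhere.

```python
raw = b"/tmp/abc\xd8\x01"           # the bytes filename from the example above

# PEP 383: the undecodable byte 0xD8 becomes the lone surrogate U+DCD8.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))                  # '/tmp/abc\udcd8\x01'

# Encoding with the same error handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)              # True

# Simply dropping the b prefix names a different file: U+00D8 is a real
# character, and UTF-8 encodes it as two bytes.
naive = "/tmp/abc\xd8\x01"
print(naive.encode("utf-8"))        # b'/tmp/abc\xc3\x98\x01'
```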
On Feb 9, 2016, at 17:37, Steve Dower <python@stevedower.id.au> wrote:
Could we perhaps redefine bytes paths on Windows as utf8 and use Unicode everywhere internally?
When you receive bytes from argv, stdin, a text file, a GUI, a named pipe, etc., and then use them as a path, Python treating them as UTF-8 would break everything. Plus, the problem only exists in Python 2, and Python is not going to fix Unicode support in Python 2, both because it's too late for such a major change in Python 2, and because it's probably impossible* (which is why we have Python 3 in the first place).
I really don't like the idea of not being able to use bytes in cross platform code. Unless it's become feasible to use Unicode for lossless filenames on Linux - last I heard it wasn't.
It is, and has been for years. Surrogate escaping solved the Linux problem. That doesn't help for Python 2, but again, it's too late for Python 2.

* Well, maybe in the future, some Linux distros will bite the same bullet OS X did and mandate that filesystem drivers must expose UTF-8, doing whatever transcoding or other munging is necessary under the covers, to be valid. But I'm guessing any such distros will be all-Python-3 long before then, and the people using Python 2 will also be using old versions or conservative distros.
On 09Feb2016 1801, Andrew Barnert wrote:
On Feb 9, 2016, at 17:37, Steve Dower <python@stevedower.id.au <mailto:python@stevedower.id.au>> wrote:
Could we perhaps redefine bytes paths on Windows as utf8 and use Unicode everywhere internally?
When you receive bytes from argv, stdin, a text file, a GUI, a named pipe, etc., and then use them as a path, Python treating them as UTF-8 would break everything.
Sure, but that's already broken today if you're communicating bytes via some protocol without manually managing the encoding, in which case you should be decoding it (and potentially re-encoding to sys.getfilesystemencoding()).

The problem here is the protocol that Python uses to return bytes paths, and that protocol is inconsistent between APIs and information is lost. It really requires going through all the OS calls and either (a) making them consistently decode bytes to str using the declared FS encoding (currently 'mbcs', but I see no reason we can't make it 'utf_8'), or (b) making them consistently use the user's current system locale setting by always using the *A Win32 APIs rather than the *W ones.
I really don't like the idea of not being able to use bytes in cross platform code. Unless it's become feasible to use Unicode for lossless filenames on Linux - last I heard it wasn't.
It is, and has been for years. Surrogate escaping solved the linux problem. That doesn't help for Python 2, but again, it's too late for Python 2.
Okay, guess I was operating out of old information. Thanks (and thanks Chris for the same answer).
Steve Dower writes:
On 09Feb2016 1801, Andrew Barnert wrote:
On Feb 9, 2016, at 17:37, Steve Dower <python@stevedower.id.au <mailto:python@stevedower.id.au>> wrote:
Could we perhaps redefine bytes paths on Windows as utf8 and use Unicode everywhere internally?
When you receive bytes from argv, stdin, a text file, a GUI, a named pipe, etc., and then use them as a path, Python treating them as UTF-8 would break everything.
Sure, but that's already broken today if you're communicating bytes via some protocol without manually managing the encoding, in which case you should be decoding it (and potentially re-encoding to sys.getfilesystemencoding()).
The problem is that treating them as UTF-8 in Python will raise errors on any file name that isn't valid UTF-8, or corrupt the filename if you use one of the handlers available in Python 2. If you use Latin-1, that (1) handles all 256 bytes, and (2) roundtrips to Unicode. Although semantically useless to users, if it's just read from a directory, then used to manipulate file contents, no problem. Of course if you then edit a multibyte file name as Unicode it is likely that all hell will break loose. But you can take that sentence and s/Unicode/bytes/, too. :-/
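The Latin-1 property mentioned here is easy to check. A minimal sketch, with a made-up Shift JIS name as the "semantically useless" case: every byte value maps to exactly one code point and back, whereas strict UTF-8 rejects the same input.

```python
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so any byte string round-trips.
as_text = all_bytes.decode("latin-1")
latin1_roundtrips = as_text.encode("latin-1") == all_bytes

# Strict UTF-8 raises on arbitrary bytes instead.
try:
    all_bytes.decode("utf-8")
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False

print(latin1_roundtrips, utf8_accepts)    # True False

# "Semantically useless": a Shift JIS name read as Latin-1 is not the text
# the user typed, but the original bytes are still fully recoverable.
sjis_bytes = "\u65e5\u672c".encode("shift_jis")   # hypothetical file name
shown = sjis_bytes.decode("latin-1")
print(shown == "\u65e5\u672c", shown.encode("latin-1") == sjis_bytes)  # False True
```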
The problem here is the protocol that Python uses to return bytes paths, and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always available. Not even today: think removable media, especially archival content. Also network file systems: I don't know if it still happens, but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory, and sometimes two of those in the *same path*. (Don't ask me how non-malicious users managed to do the latter!)
It really requires going through all the OS calls and either (a) making them consistently decode bytes to str using the declared FS encoding (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no fan of Windows[1], but it's obvious that Microsoft has devoted enormous amounts of brainpower to the problem of encoding rationalization since the early 90s. I don't think they would have missed this idea.

Footnotes:
[1] Its complete inability to DTRT for mixed English and Japanese was what drove me to Unix-like OSes in the early 90s.
On 09Feb2016 2017, Stephen J. Turnbull wrote:
The problem here is the protocol that Python uses to return bytes paths, and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always available. Not even today: think removable media, especially archival content. Also network file systems: I don't know if it still happens, but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory, and sometimes two of those in the *same path*. (Don't ask me how non-malicious users managed to do the latter!)
But if we return bytes paths and the user passes them back in unchanged, that should be irrelevant. The earlier issue was that that doesn't work (e.g. a bytes path from os.scandir couldn't be passed back into open()).
It really requires going through all the OS calls and either (a) making them consistently decode bytes to str using the declared FS encoding (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no fan of Windows[1], but it's obvious that Microsoft has devoted enormous amounts of brainpower to the problem of encoding rationalization since the early 90s. I don't think they would have missed this idea.
I meant with Python's calls into the API. Anywhere Python does the conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance it'll be wrong. Your earlier comments (regarding encoding/decoding to/from Unicode, which I didn't have anything valuable to add to) basically reflect the fact that developers need to treat bytes paths as blobs on all platforms and the core Python runtime needs to obtain and use them consistently. Which means *always* using the Win32 *A APIs and never doing a conversion ourselves. Microsoft's solution here is the user's active code page, much like *nix's solution as I understand it, except that where *nix will convert _to_ the encoding as a normalized form, Windows will convert _from_ the encoding to its UTF-16 "normalized" form. Back-compat concerns have prevented any significant changes being made here, otherwise there wouldn't be a 'bytes' interface at all. (Or more likely, everything would be UTF-8 based, but back-compat is king in Windows-land.) Cheers, Steve
Executive summary: Code pages and POSIX locales aren't solutions, they're the Original Sin.

Steve Dower writes:
On 09Feb2016 2017, Stephen J. Turnbull wrote:
The problem here is the protocol that Python uses to return bytes paths, and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always available.
But if we return bytes paths and the user passes them back in unchanged, that should be irrelevant.
Yes. That's pretty much exactly the semantics of using the latin-1 codec. UTF-8 can't do that without surrogateescape, which Python 2 lacks.
The earlier issue was that that doesn't work (e.g. a bytes path from os.scandir couldn't be passed back into open()).
My purely-from-the-user-side take is that that's just a bug in os.scandir that should be fixed, and that even though the complexity that occasions such bugs is an undesirable aspect of Python (v2) programming, it's not a bug because it *can't* be fixed -- you have to fix the world, not Python. Or switch to Python 3. I don't know enough to have an opinion on whether "fixing" os.scandir could cause other problems.
I meant with Python's calls into the API. Anywhere Python does the conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance it'll be wrong.
Indeed. That's why converting the bytes is often the wrong thing to do *period*. The reasons that Python might be wrong apply to every agent that might decide the conversion -- except the user; the user is never wrong about these things.
Microsoft's solution here is the user's active code page, much like *nix's solution as I understand it, except that where *nix will convert _to_ the encoding as a normalized form, Windows will convert _from_ the encoding to its UTF-16 "normalized" form.
Not quite accurate. Unix by original design doesn't *have* a normalized form.[1] Bytez-iz-bytez-R-Us, that's Unix. Recently everybody (except for a few nationalist lunatics and the unteachables in some legislatures) has learned that some form of Unicode is the way to go internally. But that's "best practice", not POSIX requirement, and tons of software continues to operate[2] based on the assumption that users are monolingual with a canonical one-byte encoding, so it doesn't matter as long as *no conversion is ever done*, and the input methods and fonts are consistent with each other. Code pages just try to *enforce* that constraint (and as I already mentioned, that pissed me off so much in 1990 that I'm still a Windows refusenik today).
Back-compat concerns have prevented any significant changes being made here, otherwise there wouldn't be a 'bytes' interface at all.
It's not just back-compat, it's absolutely necessary in a code-page-based world because you just can't be sure what encoding your content is in until the user tells you the crap you've spewed on her screen might be Klingon, but it's not any of the 7 human languages she knows. "Toto! I don't think we're in Kansas any more...."

The fact is that code-page-based content continues to be produced in significant quantities, despite the universal availability and absolute superiority (except for workstation reconfiguration costs) of Unicode.

Footnotes:
[1] The POSIX locale selects encodings for console input and output. File I/O is just bytes, both the content and the file name. The code page also defines the file name encoding as I understand it.
[2] I would hope that nobody is *writing* software like that any more, but I live in Japan. That hope is years in the future for me.
On 10 February 2016 at 08:00, Stephen J. Turnbull <stephen@xemacs.org> wrote:
The earlier issue was that that doesn't work (e.g. a bytes path from os.scandir couldn't be passed back into open()).
My purely-from-the-user-side take is that that's just a bug in os.scandir that should be fixed, and that even though the complexity that occasions such bugs is an undesirable aspect of Python (v2) programming, it's not a bug because it *can't* be fixed -- you have to fix the world, not Python. Or switch to Python 3.
I don't know enough to have an opinion on whether "fixing" os.scandir could cause other problems.
The original os.scandir issue was encountered on Python 3. And I do agree with Victor that the correct answer was to point out to the user that they should be using unicode/surrogateescape. What I disagree with is mandating that (by removing the bytes interface) on anything other than all platforms at once, because that doesn't remove the problem (of coders using the wrong approach on Python 3) it just makes the code such users write non-portable. Whether removing the bytes interface is feasible, given that there's then no way that works across Python 2 and 3 of writing code that manipulates the sort of bytes-that-use-multiple-encodings data that you mention, is a separate issue. Paul
2016-02-10 9:30 GMT+01:00 Paul Moore <p.f.moore@gmail.com>:
Whether removing the bytes interface is feasible, given that there's then no way that works across Python 2 and 3 of writing code that manipulates the sort of bytes-that-use-multiple-encodings data that you mention, is a separate issue.
It's annoying that 8 years after the release of Python 3.0, Python 3 is still stuck by Python 2 :-( Victor
On 10 February 2016 at 08:45, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-10 9:30 GMT+01:00 Paul Moore <p.f.moore@gmail.com>:
Whether removing the bytes interface is feasible, given that there's then no way that works across Python 2 and 3 of writing code that manipulates the sort of bytes-that-use-multiple-encodings data that you mention, is a separate issue.
It's annoying that 8 years after the release of Python 3.0, Python 3 is still stuck by Python 2 :-(
Agreed. Of course personally, I'm in favour of going Python 3/Unicode everywhere, it's the Unix guys with their legacy distros and Python installations and bytes-based filesystems that get in the way of that :-) And I don't think we're brave enough to force *Unix* users to use the right type for filenames :-) Paul
On Wednesday, February 10, 2016 12:47 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-10 9:30 GMT+01:00 Paul Moore <p.f.moore@gmail.com>: Whether removing the bytes interface is feasible, given that there's then no way that works across Python 2 and 3 of writing code that manipulates the sort of bytes-that-use-multiple-encodings data that you mention, is a separate issue.
Well, there's a surrogate-escape backport on PyPI (I think there's a standalone one, and one in python-future), so you _could_ do everything the same as in 3.x. Depending on what you're doing, you may also need to use the io module instead of file (which may just mean "from io import open", but could mean more work), wrap the stdio streams explicitly, manually decode argv, etc. But someone could write a six-like module (or add it to six) that does all of that. It may be a little slower and more memory-intensive in 2.7 than in 3.x, but for most apps, that doesn't matter. The big problem would be third-party libraries (and stdlib modules like csv) that want to use bytes in 2.x; convincing them all to support full-on-unicode in 2.x might be more trouble than it's worth. Still, if I were feeling the pain of maintaining lots of linux-bytes-Windows-unicode-2.7 code, I'd try it and see how far I get.
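The surrogate-escape mechanism this message relies on can be sketched in a few lines. This is Python 3 only (the 2.x backports mentioned above are not shown), and the file name is a made-up example:

```python
# A file name as raw bytes that is NOT valid UTF-8
# (hypothetical example: "cafe" with a Latin-1 encoded e-acute).
raw = b"caf\xe9.txt"

# With errors="surrogateescape", each undecodable byte is smuggled into a
# lone surrogate code point (0xE9 becomes U+DCE9) instead of raising an
# error or being replaced by "?".
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'caf\udce9.txt'

# Encoding with the same error handler restores the original bytes exactly,
# so the str can be passed around and still name the same file.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

On POSIX systems, os.fsdecode() and os.fsencode() apply this error handler with the file system encoding, which is why str file names can carry undecodable bytes there.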
It's annoying that 8 years after the release of Python 3.0, Python 3 is still stuck by Python 2 :-(
I understand the frustration, but... time already goes too fast at my age; don't skip me ahead almost a whole year to December 2016. :) Also, unless you're the one guy who actually abandoned 2.6 for 3.0, it's probably more useful to count from 2.7, 3.2, or the no-2.8 declaration, which are all about 5 years ago.
Victor Stinner writes:
It's annoying that 8 years after the release of Python 3.0, Python 3 is still stuck by Python 2 :-(
I prefer to think of it as the irritant that reminds me that I am very much alive, and so is Python, vibrantly so.
On Feb 9, 2016, at 20:17, Stephen J. Turnbull <stephen@xemacs.org> wrote:
It really requires going through all the OS calls and either (a) making them consistently decode bytes to str using the declared FS encoding (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no fan of Windows[1], but it's obvious that Microsoft has devoted enormous amounts of brainpower to the problem of encoding rationalization since the early 90s. I don't think they would have missed this idea.
Microsoft spent a lot of time and effort on the idea that UTF-16 (or, originally, UCS-2) everywhere was the answer. Never call the A functions (or the msvcrt functions that emulate the C and POSIX stdlib), and there's never a problem. What if you read filenames out of a text file? No problem; text files are UTF-16-BOM. Over a socket? All network protocols are also UTF-16. What if you have to read a file written in Unix? Come on, nobody's ever created a useful file without Windows. What about Windows 3.1? Uh... that's a problem. Also, what happens when Unicode goes over 64k characters? And so on. So their grand project failed. That doesn't mean the problem can't be solved. Apple solved their equivalent problem, albeit by sacrificing backward compatibility in a way Microsoft can't get away with. I haven't seen a MacRoman or Shift-JIS filename since they broke the last holdout (the low-level AppleEvent interface) in 10.7--and most of the apps I was using back then don't run on 10.10 without an update. So Python 2 works great on Macs, whether you use bytes or unicode. But that doesn't help us on Windows, where you can't use bytes, or Linux, where you can't use Unicode (without surrogate escape or some other mechanism that Python 2 doesn't have).
Andrew Barnert via Python-Dev writes:
That doesn't mean the problem can't be solved. Apple solved their equivalent problem, albeit by sacrificing backward compatibility in a way Microsoft can't get away with. I haven't seen a MacRoman or Shift-JIS filename since they broke the last holdout
If you lived where I do, you'd still be seeing both, because you wouldn't be able to escape archival files on CD and removable media (typically written on Windows boxen). They still work, sort of == same as always, and as far as I know, that's because Apple has *not* sacrificed backward compatibility: under the hood, Darwin is still a POSIX kernel which thinks of file names and everything else outside of memory as bytestreams. One place they *fail very badly* is Shift JIS filenames in zipfiles, which nothing provided by Apple can deal with safely, and InfoZip breaks too (at least in MacPorts). Yes, I know that is specifically disallowed. Feel free to tell 1_0000_0000 Japanese Windows users. Thank heaven for Python there! A three-line hack and I'm free!
So Python 2 works great on Macs, whether you use bytes or unicode. But that doesn't help us on Windows, where you can't use bytes, or Linux, where you can't use Unicode (without surrogate escape or some other mechanism that Python 2 doesn't have).
You contradict yourself! ;-)
On Wednesday, February 10, 2016 6:50 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Andrew Barnert via Python-Dev writes:
That doesn't mean the problem can't be solved. Apple solved their equivalent problem, albeit by sacrificing backward compatibility in a way Microsoft can't get away with. I haven't seen a MacRoman or Shift-JIS filename since they broke the last holdout
If you lived where I do, you'd still be seeing both, because you wouldn't be able to escape archival files on CD and removable media (typically written on Windows boxen). They still work, sort of == same as always, and as far as I know, that's because Apple has *not* sacrificed backward compatibility: under the hood, Darwin is still a POSIX kernel which thinks of file names and everything else outside of memory as bytestreams.
Sure, but the Darwin kernel can't read CDs; that's up to the CD filesystem driver. Anyway, Windows CDs can't cause this problem. Windows CDs use the Joliet filesystem,[^1] which stores everything in UCS2.[^2] When you call CreateFileA or fopen or _open with bytes, Windows decodes those bytes and stores them as UCS2. The filesystem drivers on POSIX platforms have to encode that UCS2 to _something_ (POSIX APIs make it very hard for you to deal with filename strings like "A\0B\0C\0.\0T\0X\0T\0\0\0"...). The linux driver uses a mount option to decide how to encode; the OS X driver always uses UTF-8. And every valid UCS2 string can be encoded as UTF-8, so you can use unicode everywhere, even in Python 2. Of course you can have mojibake problems, but that's a different issue,[^3] and no worse with unicode than with bytes.[^4] The same thing is true with NTFS external drives, VFAT USB drives, etc. Generally, it's usually not Windows media on *nix systems that break Python 2 unicode; it's native *nix filesystems where users mix locales.
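A small sketch of the transcoding step described above, assuming the byte string quoted in the message is read as little-endian UTF-16 (the byte order is taken from that example only, not from any filesystem specification). The second name is a made-up illustration:

```python
# The UTF-16 bytes quoted in the message for "ABC.TXT" (without the
# terminating NUL pair).
stored = b"A\x00B\x00C\x00.\x00T\x00X\x00T\x00"

# Every other byte is NUL, so these bytes cannot be used as a C string
# path through the POSIX APIs.
print(b"\x00" in stored)  # True

# A driver therefore has to transcode; UTF-8 can represent every
# well-formed UTF-16 string, and never contains NUL for non-NUL characters.
name = stored.decode("utf-16-le")
posix_bytes = name.encode("utf-8")
print(name, posix_bytes)  # ABC.TXT b'ABC.TXT'

# The same holds for a non-ASCII name (hypothetical example).
japanese = "ハロー.txt".encode("utf-16-le")
japanese_utf8 = japanese.decode("utf-16-le").encode("utf-8")
print(b"\x00" in japanese, b"\x00" in japanese_utf8)  # True False
```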
One place they *fail very badly* is Shift JIS filenames in zipfiles, which nothing provided by Apple can deal with safely, and InfoZip breaks too (at least in MacPorts). Yes, I know that is specifically disallowed. Feel free to tell 1_0000_0000 Japanese Windows users.
The good news is, as far as I can tell, it's not disallowed anymore.[^5] So we just have to tell them that they shouldn't have been doing it in the past. :) Anyway, zipfiles are data files as far as the OS is concerned; the fact that they contain filenames is no more relevant to the kernel (or filesystem driver or userland) than the fact that "List of PDFs to Read This Weekend.txt" contains filenames. PS, everything Apple provides is already using Info-ZIP.
So Python 2 works great on Macs, whether you use bytes or unicode. But that doesn't help us on Windows, where you can't use bytes, or Linux, where you can't use Unicode (without surrogate escape or some other mechanism that Python 2 doesn't have).
You contradict yourself! ;-)
Yes, as I later realized, sometimes, you _can_ (or at least ought to be able to--I haven't actually tried) use Python 2 with unicode everywhere to write cross-platform software that actually works on linux, by using backports of surrogate-escape and pathlib, and the io module instead of the file type, as long as you only need stdlib and third-party modules that support unicode filenames. If that does work for at least some apps, then I'm perfectly happy to have been wrong earlier. And if catching myself before someone else did makes me a flip-flopper, well, I'm not running for president. :P
[^1]: Except when Vista and 7 mistakenly think your CD is a DVD and use UDF instead of ISO9660--but in that case, the encoding is stored in the filesystem header, so it's also not a problem.
[^2]: Actually, despite Microsoft's spec, later versions of Windows store UTF-16, even if there are surrogate pairs, or BMP-but-post-UCS2 code points. But that doesn't matter here; the linux, Mac, etc. drivers all assume UTF-16, which works either way.
[^3]: Say you write a program that assumes it will only be run on Shift-JIS systems, and you use CreateFileA to create a file named "ハローワールド". The actual bytes you're sending are cp436 for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, "ânâìü[âÅü[âïâh". So of course the Mac driver encodes that to UTF-8 b"ânâìü[âÅü[âïâh". You won't have any problems opening what you readdir, or what you copy from a UTF-8 terminal or a UTF-16 Cocoa app like Finder, etc. But of course you will have trouble getting your user to recognize that name as meaningful, unless you can figure out or guess or prompt the user to guess that it needs to be passed through s.encode('cp436').decode('shift-jis').
[^4]: Your locale is always UTF-8 on Mac. So the only significant difference is that if you're using bytes, you need b.decode('utf-8').encode('cp436').decode('shift-jis') to fix the problem.
[^5]: Zipfiles using the Unicode extension can store a UTF-8 transcoding along with the local bytes, in which case the local bytes do not have to be in the header-declared encoding, because they will be ignored. And I think everything Microsoft ships now handles this properly. And Info-ZIP, and therefore all of Apple's tools, also handle it properly--so, not only is it legal, it even works.
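The repair described in footnotes 3 and 4 can be tried out directly. This is a minimal sketch that uses Python's "cp437" codec, since a follow-up in this thread corrects "cp436" to 437 and Python ships no cp436 codec. The variable names are illustrative only:

```python
intended = "ハローワールド"

# What the program actually handed to the system: Shift JIS bytes.
sent = intended.encode("shift_jis")

# A system that believes those bytes are cp437 ends up with this name.
stored = sent.decode("cp437")
print(stored)  # ânâìü[âÅü[âïâh

# str repair (footnote 3): undo the wrong decode, then decode correctly.
# This is lossless because cp437 assigns a character to every byte value.
repaired = stored.encode("cp437").decode("shift_jis")

# bytes repair (footnote 4): the name arrives as UTF-8 bytes, so one more
# decode step is needed in front.
as_bytes = stored.encode("utf-8")
repaired_from_bytes = (
    as_bytes.decode("utf-8").encode("cp437").decode("shift_jis")
)

print(repaired == intended, repaired_from_bytes == intended)  # True True
```

Nothing here guesses the encodings automatically; as the footnote says, someone still has to know or guess that the name needs this treatment.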
On Wed, Feb 10, 2016 at 2:30 PM, Andrew Barnert via Python-Dev <python-dev@python.org> wrote:
[^3]: Say you write a program that assumes it will only be run on Shift-JIS systems, and you use CreateFileA to create a file named "ハローワールド". The actual bytes you're sending are cp436 for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, "ânâìü[âÅü[âïâh".
Unless the system default was changed or the program called SetFileApisToOEM, CreateFileA would decode using the ANSI codepage 1252, not the OEM codepage 437 (not 436), i.e. "ƒnƒ\x8d\x81[ƒ\x8f\x81[ƒ‹ƒh". Otherwise the example is right. But the transcoding strategy won't work in general. For example, if the tables are turned such that the ANSI codepage is 932 and the program passes a bytes name from codepage 1252, the user on the other end won't be able to transcode without error if the original bytes contained invalid DBCS sequences that were mapped to the default character, U+30FB. This transcodes as the meaningless string "\x81E". The user can replace that string with "--" and enjoy a nice game of hang man.
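One practical wrinkle when reproducing the 1252 variant in Python: Python's "cp1252" codec rejects the bytes 0x81, 0x8D and 0x8F, whereas the string shown above has them mapped to the C1 control characters with the same values. The helper functions below are hypothetical, written only to imitate that mapping for this example. They are a sketch, not a model of the Windows conversion tables:

```python
def decode_1252_with_c1(data):
    """Decode like cp1252, mapping undefined bytes to the same code point."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(value))  # e.g. 0x81 -> U+0081
    return "".join(chars)


def encode_1252_with_c1(text):
    """Inverse of decode_1252_with_c1 for the strings it produces."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")  # C1 controls back to single bytes
    return bytes(out)


intended = "ハローワールド"
sent = intended.encode("shift_jis")

# The stock codec refuses the name outright.
try:
    sent.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

stored = decode_1252_with_c1(sent)
print(ascii(stored))

# As long as nothing was replaced by a default character, the name is
# still recoverable by reversing the steps.
recovered = encode_1252_with_c1(stored).decode("shift_jis")
print(recovered == intended)  # True
```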
On Feb 10, 2016, at 15:11, eryk sun <eryksun@gmail.com> wrote:
On Wed, Feb 10, 2016 at 2:30 PM, Andrew Barnert via Python-Dev <python-dev@python.org> wrote:
[^3]: Say you write a program that assumes it will only be run on Shift-JIS systems, and you use CreateFileA to create a file named "ハローワールド". The actual bytes you're sending are cp436 for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, "ânâìü[âÅü[âïâh".
Unless the system default was changed or the program called SetFileApisToOEM, CreateFileA would decode using the ANSI codepage 1252, not the OEM codepage 437 (not 436), i.e. "ƒnƒ\x8d\x81[ƒ\x8f\x81[ƒ‹ƒh". Otherwise the example is right. But the transcoding strategy won't work in general. For example, if the tables are turned such that the ANSI codepage is 932 and the program passes a bytes name from codepage 1252, the user on the other end won't be able to transcode without error if the original bytes contained invalid DBCS sequences that were mapped to the default character, U+30FB. This transcodes as the meaningless string "\x81E". The user can replace that string with "--" and enjoy a nice game of hang man.
Of course there's no way to recover the actual intended filenames if that information was thrown out instead of being stored, but that's no different from the situation where the user mashed the keyboard instead of typing what they intended. The point remains: the Mac strategy (which is also the linux strategy for filesystems that are inherently UTF-16) always generates valid UTF-8, and doesn't try to magically cure mojibake but doesn't get in the way of the user manually curing it. When the Unicode encoding is lossy, of course the user can't cure that, but UTF-8 isn't making it any harder.
Executive summary: My experience is that having bytes APIs in the os module is very useful. But perhaps higher-level functions like os.scandir can do without (I present no arguments either way on that, just acknowledge it).
Andrew Barnert writes:
Anyway, Windows CDs can't cause this problem.
My bad. I meant archival Mac CDs (or perhaps they were taken from a network filesystem) which is where I see MacRoman, and Windows (ie, FAT-formatted) USB drives, which is where I see Shift JIS. The point here is not what is technically possible or even standard, it's that though what I see in practice may not *require* bytes APIs, it's *very convenient* to have them (especially interactively).
The same thing is true with NTFS external drives, VFAT USB drives, etc. Generally, it's usually not Windows media on *nix systems that break Python 2 unicode; it's native *nix filesystems where users mix locales.
IMHO, Python 2 unicode is not breakable, let alone broken. ;-) Mailman 2 has managed to almost get to a state where you can't get it to raise a Unicode exception (except where deliberately used as EAFP), let alone one that is not handled (before the catch-all "except Exception" that keeps the daemon running). And that's in an application whose original encoding support assumed standard conformance by design in a realm where spammers and junior high school hackers regularly violate the most ancient of RFCs (the restriction to ASCII in headers goes back to a 6xx RFC at the latest!) Python 2 Unicode turns out to have been an excellent compromise between the needs of backward compatibility with uniformly encoded bytestreams for Europe, and the forward-looking needs of a globalizing Internet. (But you knew that! :-) As I wrote earlier, the world is broken, or at least Japan. The world "got bettah", thus Python 3. And most of the time Python 3 is wonderful in Japan (specifically, it's trivial to get recalcitrant students to use best I18N practice). My point is that *where I live* the experience is very different. There are *no* Japanese who use *nix (other than Mac OS X) for paperwork in my neighborhood. Shift JIS filenames *are* from Windows media recently written, though probably not by Microsoft-provided software. Bytes APIs are a very useful tool in dealing with these issues, at least in the hands of someone who has become expert in dealing with them. I suspect the same is true of China, except that like their business partner Apple they are in a position to legislate uniformity, and do. (Unfortunately that's GB18030, not Unicode.) So maybe they're better off than a place that coined the phrase "politics that can't decide". I admit I've not yet used os.scandir, let alone its bytes API. 
Perhaps we can, and perhaps we should, restrict the bytes API in the os module to a few basic functions, and require that the environment be sane for cases where we want to use higher-level or optimized functions.
You contradict yourself! ;-)
I'm perfectly happy to have been wrong earlier. And if catching myself before someone else did makes me a flip-flopper, well, I'm not running for president. :P
I consider that the most important qualification for President, especially if your name is Trump or Sanders. That's one of the things I respect most about Python: with a few (negligible) exceptions, minds change to fit the facts. And, BTW, EAFP applies here, too. Make mistakes on the mailing lists before you commit them to code. Please!<wink/>
On Mon, Feb 8, 2016 at 6:32 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Windows native type for filenames is Unicode, and the Windows has a weird behaviour when you use bytes.
Just to clarify -- what does it currently do for bytes? IIUC, Windows uses UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is it assuming some Windows ANSI-compatible encoding? (and what does it return?)
Are we brave enough to force users to use the "right" type for filenames?
I think so :-)
On Python 2, it wasn't possible to use Unicode for filenames, many functions fail badly with Unicode,
I've had fine success using Unicode filenames with py2 on Windows -- in fact, as soon as my users have non-ansi characters in their names I'm pretty sure I have no choice....
especially when you mix bytes and Unicode.
well yes, that sure does get ugly!
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
On Mon, Feb 8, 2016 at 2:41 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Just to clarify -- what does it currently do for bytes? IIUC, Windows uses UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is it assuming some Windows ANSI-compatible encoding? (and what does it return?)
UTF-16 is used in the [W]ide-character API. Bytes paths use the [A]NSI codepage. For a single-byte codepage, the ANSI API roundtrips, i.e. a bytes path that's passed to CreateFileA matches the listing from FindFirstFileA. But for a DBCS codepage arbitrary bytes paths do not roundtrip. Invalid byte sequences map to the default character. Note that an ASCII question mark is not always the default character. It depends on the codepage.
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
All invalid sequences get mapped to '・', which roundtrips as b'\x81\x45', so you can't reliably create and open files with arbitrary bytes paths in this locale.
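The lossy round trip described above can be imitated on any platform with a rough model. The function below is a hypothetical stand-in for the ANSI-to-Unicode conversion, built only from the rules stated in the message (lead-byte ranges, default character U+30FB). It is not the actual Windows conversion code and may differ from it in other cases:

```python
DEFAULT_CHAR = "\u30fb"  # Katakana middle dot, per the message above


def is_lead_byte(value):
    """Lead-byte ranges for code page 932 as given in the message."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def simulate_ansi_decode(raw, codepage="cp932"):
    """Decode bytes, turning each invalid sequence into DEFAULT_CHAR."""
    chars = []
    i = 0
    while i < len(raw):
        width = 2 if is_lead_byte(raw[i]) else 1
        chunk = raw[i:i + width]
        try:
            chars.append(chunk.decode(codepage))
        except UnicodeDecodeError:
            chars.append(DEFAULT_CHAR)  # the original bytes are lost here
        i += width
    return "".join(chars)


# The name the caller asked for, as bytes.
requested = b"\xe05"

# What the file system ends up storing (as Unicode) under this model.
created = simulate_ansi_decode(requested)

# What a later bytes listing would report for that file.
listed = created.encode("cp932")

print(created, listed, listed == requested)  # ・ b'\x81E' False
```

Because many different invalid sequences collapse to the same default character, the bytes that come back from the listing no longer identify the bytes that were used to create the file.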
All I can say is "ouch". Hard to call it a regression to no longer allow this mess... CHB
On Feb 8, 2016, at 4:37 PM, eryk sun <eryksun@gmail.com> wrote:
On Mon, Feb 8, 2016 at 2:41 PM, Chris Barker <chris.barker@noaa.gov> wrote: Just to clarify -- what does it currently do for bytes? IIUC, Windows uses UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is it assuming some Windows ANSI-compatible encoding? (and what does it return?)
UTF-16 is used in the [W]ide-character API. Bytes paths use the [A]NSI codepage. For a single-byte codepage, the ANSI API roundtrips, i.e. a bytes path that's passed to CreateFileA matches the listing from FindFirstFileA. But for a DBCS codepage arbitrary bytes paths do not roundtrip. Invalid byte sequences map to the default character. Note that an ASCII question mark is not always the default character. It depends on the codepage.
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
All invalid sequences get mapped to '・', which roundtrips as b'\x81\x45', so you can't reliably create and open files with arbitrary bytes paths in this locale.
On 9 February 2016 at 01:57, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
All I can say is "ouch". Hard to call it a regression to no longer allow this mess..
OTOH, it's a major regression for someone using an 8-bit codepage that doesn't have these problems. Code that worked fine for them now doesn't. I dislike "works for some people" solutions as much as anyone, but breaking code that does the job that people need it to is not something we should do lightly (if at all). Paul
Chris Barker - NOAA Federal writes:
All I can say is "ouch". Hard to call it a regression to no longer allow this mess...
We can't "disallow" the mess, it's embedded in the lunatic computing environment (which I happen to live in). We can't even stop people from using existing Python programs abusing bytes-oriented APIs. All we can do is make it harder for people to port to Python 3, and that would be bad because it's much easier to refactor once you're in Python 3. And as Paul points out, it works fine in ASCII-compatible one-byte environments (and probably in ISO-2022-compatible 8-bit multibyte environments, too -- the big problems are the abominations known as Shift JIS and Big5). Please, let's leave it alone.
2016-02-09 1:37 GMT+01:00 eryk sun <eryksun@gmail.com>:
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
Hum, I'm not sure that I understand your example. Can you pass the result of os.listdir(str) to open() on Python 3? Are you able to open the file? Same question for os.listdir(bytes). Victor
On Tue, Feb 9, 2016 at 3:21 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-09 1:37 GMT+01:00 eryk sun <eryksun@gmail.com>:
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
Hum, I'm not sure that I understand your example.
Say I create a sequence of files with the names "file_à[N].txt" encoded in Latin-1, where N is 0-2. They all map to the same file in a Japanese system locale:
    >>> open(b'file_\xe00.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> open(b'file_\xe01.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> open(b'file_\xe02.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> os.listdir(b'.')
    [b'file_\x81E.txt']
This isn't a problem with a single-byte codepage such as 1251. For example, codepage 1251 doesn't map b"\x98" to any character, but harmlessly maps it to "\x98" (SOS in the C1 Controls block). Single-byte code pages still have the problem that when a filename is created using the wide-character API, listing it as bytes may use either an approximate mapping (e.g. "à" => "a" in 1251) or the codepage default character (e.g. "\xd7" => "?" in 1251).
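The single-byte case can be approximated with Python's own codecs, with two caveats that are assumptions of this sketch: Python's "cp1251" codec does not implement the approximate ("best fit") mappings mentioned above, and errors="replace" is used only as a stand-in for the code page default character "?". The file names are made up for illustration:

```python
# Names as they might exist on disk after being created through the
# wide-character API (hypothetical examples).
names = ["\u00e0.txt", "\u00d7.txt", "\u0436.txt"]

listing = {}
for name in names:
    # What a bytes listing could report if unencodable characters are
    # replaced by "?".
    as_bytes = name.encode("cp1251", errors="replace")
    listing[name] = as_bytes
    status = "round-trips" if as_bytes.decode("cp1251") == name else "lossy"
    print(ascii(name), "->", as_bytes, status)

# Two distinct files now share one bytes name, and that bytes name no
# longer refers to either of them.
collision = listing["\u00e0.txt"] == listing["\u00d7.txt"]
print(collision)  # True

# Unlike the behaviour described above for 0x98, Python's strict codec
# raises on that byte rather than mapping it to U+0098.
try:
    b"\x98".decode("cp1251")
    undefined_byte_ok = True
except UnicodeDecodeError:
    undefined_byte_ok = False
print(undefined_byte_ok)
```

Only the Cyrillic name survives the trip through bytes; the other two come back as "?.txt".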
2016-02-09 1:37 GMT+01:00 eryk sun <eryksun@gmail.com>:
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
All invalid sequences get mapped to '・', which roundtrips as b'\x81\x45', so you can't reliably create and open files with arbitrary bytes paths in this locale.
Oh, and I forgot to ask: what is your filesystem? Is it the same behaviour for NTFS, FAT32, network shared directories, etc.? Victor
On Tue, Feb 9, 2016 at 3:22 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-09 1:37 GMT+01:00 eryk sun <eryksun@gmail.com>:
For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']
All invalid sequences get mapped to '・', which roundtrips as b'\x81\x45', so you can't reliably create and open files with arbitrary bytes paths in this locale.
Oh, and I forgot to ask: what is your filesystem? Is it the same behaviour for NTFS, FAT32, network shared directories, etc.?
That was tested using NTFS, but the same would apply to FAT32, exFAT, and UDF since they all use Unicode [1]. CreateFile[A|W] wraps the NtCreateFile system call. The NT executive is Unicode, so the system call receives the filename using a Unicode-only OBJECT_ATTRIBUTES [2] record. I can't say what an arbitrary non-Microsoft filesystem will do with the U+30FB character when it processes the IRP_MJ_CREATE. I was only concerned with ANSI<=>Unicode conversion that's implemented in the ntdll.dll runtime library. [1]: https://msdn.microsoft.com/en-us/library/ee681827 [2]: https://msdn.microsoft.com/en-us/library/ff557749
On 08.02.16 16:32, Victor Stinner wrote:
On Python 2, it wasn't possible to use Unicode for filenames, many functions fail badly with Unicode, especially when you mix bytes and Unicode.
Even not all os functions support Unicode. See http://bugs.python.org/issue18695.
participants (14)
- Alexander Walters
- Andrew Barnert
- Brett Cannon
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- eryk sun
- Matthias Bussonnier
- Paul Moore
- Serhiy Storchaka
- Stephen J. Turnbull
- Steve Dower
- Steven D'Aprano
- Victor Stinner