Fix default encodings on Windows
I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.

Some background: within the Windows API, the preferred encoding is UTF-16. This is a 16-bit format that is typed as wchar_t in the APIs that use it. These APIs are generally referred to as the *W APIs (because they have a W suffix). There are also (broadly deprecated) APIs that use an 8-bit format (char), where the encoding is assumed to be "the user's active code page". These are *A APIs. AFAIK, there are no cases where a *A API should be preferred over a *W API, and many newer APIs are *W only. In general, Python passes byte strings into the *A APIs and text strings into the *W APIs.

Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which translates to "the system's active code page". As this encoding generally cannot represent all paths on Windows, it is deprecated and Unicode strings are recommended instead. This, however, means you need to write significantly different code between POSIX (use bytes) and Windows (use text).

ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c; likely similar code in other places) to decode incoming byte strings would allow us to undeprecate byte strings and add the requirement that they *must* be encoded with sys.getfilesystemencoding(). I assume that this would allow cross-platform code to handle paths similarly by encoding to whatever the sys module says they should and using bytes consistently (starting this thread is meant to validate/refute my assumption).

(Yes, I know that people on POSIX should just change to using Unicode and surrogateescape. Unfortunately, rather than doing that they complain about Windows and drop support for the platform. If you want to keep hitting them with the stick, go ahead, but I'm inclined to think the carrot is more valuable here.)

Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason. The one exception is as a default encoding for opening files when no other information is available (e.g. a Unicode BOM or explicit encoding argument). BOMs are very common on Windows, since the default assumption is nearly always a bad idea.

Making open()'s default encoding detect a BOM before falling back to locale.getpreferredencoding() would resolve many issues, but I'm also inclined towards making the fallback utf-8, leaving locale.getpreferredencoding() solely as a way to get the active system codepage (with suitable warnings about it only being useful for back-compat). This would match the behavior that the .NET Framework has used for many years - effectively, utf_8_sig on read and utf_8 on write.

Finally, the encoding of stdin, stdout and stderr is currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).
It is also possible for Python to change the current console encoding to be UTF-8 on initialize and change it back on finalize. (This would leave the console in an unexpected state if Python segfaults, but console encoding is probably the least of anyone's worries at that point.) So I'm proposing actively changing the current console to be Unicode while Python is running, and hence sys.std[in|out|err] will default to utf-8.

So that's a broad range of changes, and I have little hope of figuring out all the possible issues, back-compat risks, and flow-on effects on my own. Please let me know (either on-list or off-list) how a change like this would affect your projects, either positively or negatively, and whether you have any specific experience with these changes/fixes and think they should be approached differently.

To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):

* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize

So what are your concerns? Suggestions?

Thanks,
Steve
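For concreteness, the kind of cross-platform pattern the proposal is aiming to enable looks roughly like this (an illustrative sketch only; the filename is arbitrary, and os.fsencode()/os.fsdecode() already exist today):

    import os
    import sys

    # The encoding the platform says to use for paths: currently 'mbcs' on
    # Windows and (usually) 'utf-8' elsewhere. Under the proposal it would
    # be 'utf-8' everywhere.
    print(sys.getfilesystemencoding())

    # os.fsencode()/os.fsdecode() convert between str and bytes paths using
    # that encoding, so a library can normalise to bytes and pass the same
    # value to os functions on both POSIX and Windows.
    raw = os.fsencode('example.txt')
    print(os.path.join(os.fsencode('.'), raw))
    print(os.listdir(b'.'))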
On 10 August 2016 at 19:10, Steve Dower <steve.dower@python.org> wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize
So what are your concerns? Suggestions?
I presume you'd be targeting 3.7 for this change. Broadly, I'm +1 on all of this.

Personally, I'm moving to UTF-8 everywhere, so it seems OK to me, but I suspect defaulting open() to UTF-8 in the absence of a BOM might cause issues for people. Most text editors still (AFAIK) use the ANSI codepage by default, and it's the one place where an identifying BOM isn't possible. So your alternative may be a safer choice. On the other hand, files from Unix (via say github) would typically be UTF-8 without BOM, so it becomes a question of choosing the best compromise. I'm inclined to go for cross-platform and UTF-8 and clearly document the change.

We might want a more convenient short form for open(filename, "r", encoding=sys.getpreferredencoding()), though, to ease the transition... We'd also need to consider how the new default encoding would interact with PYTHONIOENCODING.

For the console, does this mean that the win_unicode_console module will no longer be needed when these changes go in?

Sorry, not much in the way of direct experience or information I can add, but a strong +1 on the change (and I'd be happy to help where needed).

Paul
On 10Aug2016 1144, Paul Moore wrote:
I presume you'd be targeting 3.7 for this change.
Does 3.6 seem too aggressive? I think I have time to implement the changes before beta 1, as it's mostly changing default values and mopping up resulting breaks. (Doing something like reimplementing files using the Win32 API rather than the CRT would be too big a task for 3.6.)
Most text editors still (AFAIK) use the ANSI codepage by default, and it's the one place where an identifying BOM isn't possible. So your alternative may be a safer choice. On the other hand, files from Unix (via say github) would typically be UTF-8 without BOM, so it becomes a question of choosing the best compromise. I'm inclined to go for cross-platform and UTF-8 and clearly document the change.
That last point was my thinking. Notepad's default is just as bad as Python's default right now, but basically everyone acknowledges that it's bad. I don't think we should prevent Python from behaving better because one Windows tool doesn't.
We might want a more convenient short form for open(filename, "r", encoding=sys.getpreferredencoding()), though, to ease the transition... We'd also need to consider how the new default encoding would interact with PYTHONIOENCODING.
PYTHONIOENCODING doesn't affect locale.getpreferredencoding() (but it does affect sys.std*.encoding).
For the console, does this mean that the win_unicode_console module will no longer be needed when these changes go in?
That's the hope, though that module approaches the solution differently and may still have its uses. An alternative way for us to fix this whole thing would be to bring win_unicode_console into the standard library and use it by default (or probably whenever PYTHONIOENCODING is not specified).
Sorry, not much in the way of direct experience or information I can add, but a strong +1 on the change (and I'd be happy to help where needed).
Testing with obscure filenames and strings is where help will be needed most :) Cheers, Steve
On 10 August 2016 at 20:08, Steve Dower <steve.dower@python.org> wrote:
On 10Aug2016 1144, Paul Moore wrote:
I presume you'd be targeting 3.7 for this change.
Does 3.6 seem too aggressive? I think I have time to implement the changes before beta 1, as it's mostly changing default values and mopping up resulting breaks. (Doing something like reimplementing files using the Win32 API rather than the CRT would be too big a task for 3.6.)
I guess I just assumed it was a bigger change than that. I don't object to it going into 3.6 as such, but it might need longer for any debates to die down. I guess that comes down to how big this thread gets, though. Personally, I'd be OK with it being in 3.6, we'll see if others think it's too aggressive :-)
Testing with obscure filenames and strings is where help will be needed most :)
I'm happy to invent hard cases for you, but I'm in the UK. For real use, the Euro symbol's about as obscure as we get around here ;-) Paul
On 10Aug2016 1226, Random832 wrote:
On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
Testing with obscure filenames and strings is where help will be needed most :)
How about filenames with invalid surrogates? For added fun, consider that the file system encoding is normally used with surrogateescape.
This is where it gets extra fun, since surrogateescape is not normally used on Windows because we receive paths as Unicode text and pass them back as Unicode text without ever encoding or decoding them.

Currently a broken filename (such as '\udee1.txt') can be correctly seen with os.listdir('.') but not os.listdir(b'.') (because Windows will return it as '?.txt'). It can be passed to open(), but encoding the name to utf-8 or utf-16 fails, and I doubt there's any encoding that is going to succeed. As far as I can tell, if you get a weird name in bytes today you are broken, and there is no way to be unbroken without doing the actual right thing and converting paths on POSIX into Unicode with surrogateescape.

So our official advice has to stay the same - treating paths as text with smuggled bytes is the *only* way to be truly correct. But unless we also deprecate byte paths on POSIX, we'll never get there. (Now there's a dangerous idea ;) )

Cheers,
Steve
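To make the "smuggled bytes" point concrete, here is the standard POSIX-style round trip (existing Python 3 behaviour, shown with made-up names):

    # A filename byte that is not valid UTF-8 is smuggled through str as a
    # lone surrogate by the surrogateescape error handler...
    raw = b'caf\xff.txt'
    name = raw.decode('utf-8', 'surrogateescape')
    assert name == 'caf\udcff.txt'

    # ...and encoding with surrogateescape restores the original bytes exactly.
    assert name.encode('utf-8', 'surrogateescape') == raw

    # A broken name like '\udee1.txt' cannot be encoded as strict UTF-8 at
    # all, which is why byte-oriented code cannot represent it.
    try:
        '\udee1.txt'.encode('utf-8')
    except UnicodeEncodeError:
        pass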
On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
That's the hope, though that module approaches the solution differently and may still have its uses. An alternative way for us to fix this whole thing would be to bring win_unicode_console into the standard library and use it by default (or probably whenever PYTHONIOENCODING is not specified).
I have concerns about win_unicode_console:

- For the "text_transcoded" streams, stdout.encoding is utf-8. For the "text" streams, it is utf-16.
- There is no object, as far as I can find, which can be used as an unbuffered unicode I/O object.
- raw output streams silently drop the last byte if an odd number of bytes are written.
- The sys.stdout obtained via streams.enable does not support .buffer / .buffer.raw / .detach
- All of these objects provide a fileno() interface.
- When using os.read/write for data that represents text, the data still should be encoded in the console encoding and not in utf-8 or utf-16.

How important is it to preserve the validity of the conventional advice for "putting stdin/stdout in binary mode" using .buffer or .detach? I suspect this is mainly used for programs intended to have their output redirected, but today it 'kind of works' to run such a program on the console and inspect its output.

How important is it for os.read/write(stdxxx.fileno()) to be consistent with stdxxx.encoding?

Should errors='surrogatepass' be used? It's unlikely, but not impossible, to paste an invalid surrogate into the console. With win_unicode_console, this results in a UnicodeDecodeError and, if this happened during a readline, disables the readline hook. Is it possible to break this by typing a valid surrogate pair that falls across a buffer boundary?
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
Why? What's the use case?
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
For reading, I assume. When opened for writing, it should probably be utf-8-sig [if it's not mbcs] to match what Notepad does. What about files opened for appending or updating? In theory it could ingest the whole file to see if it's valid UTF-8, but that has a time cost.

Notepad, if there's no BOM, checks the first 256 bytes of the file for whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK], and can get it wrong for certain very short files [i.e. the infamous "this app can break"]

What to do on opening a pipe or device? [Is os.fstat able to detect these cases?]

Maybe the BOM detection phase should be deferred until the first read. What should encoding be at that point if this is done? Is there a "utf-any" encoding that can handle all five BOMs? If not, should there be? How are "utf-16" and "utf-32" files opened for appending or updating handled today?
* force the console encoding to UTF-8 on initialize and revert on finalize
Why not implement a true unicode console? What if sys.stdin/stdout are pipes (or non-console devices such as a serial port)?
On 10Aug2016 1146, Random832 wrote:
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
Why? What's the use case?
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
For reading, I assume. When opened for writing, it should probably be utf-8-sig [if it's not mbcs] to match what Notepad does. What about files opened for appending or updating? In theory it could ingest the whole file to see if it's valid UTF-8, but that has a time cost.
Writing out the BOM automatically basically makes your files incompatible with other platforms, which rarely expect a BOM. By omitting it but writing and reading UTF-8 we ensure that Python can handle its own files on any platform, while potentially upsetting some older applications on Windows or platforms that don't assume UTF-8 as a default.
Notepad, if there's no BOM, checks the first 256 bytes of the file for whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK], and can get it wrong for certain very short files [i.e. the infamous "this app can break"]
Yeah, this is a pretty horrible idea :) I don't want to go there by default, but people can install chardet if they want the functionality.
What to do on opening a pipe or device? [Is os.fstat able to detect these cases?]
We should be able to detect them, but why treat them any differently from a file? Right now they're just as broken as they will be after the change if you aren't specifying 'b' or an encoding - probably more broken, since at least you'll get fewer encoding errors when the encoding is UTF-8.
Maybe the BOM detection phase should be deferred until the first read. What should encoding be at that point if this is done? Is there a "utf-any" encoding that can handle all five BOMs? If not, should there be? how are "utf-16" and "utf-32" files opened for appending or updating handled today?
Yes, I think it would be. I suspect we'd have to leave the encoding unknown until the first read, and perhaps force it to utf-8-sig if someone asks before we start. I don't *think* this is any less predictable than the current behaviour, given it only applies when you've left out any encoding specification, but maybe it is. It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
* force the console encoding to UTF-8 on initialize and revert on finalize
Why not implement a true unicode console? What if sys.stdin/stdout are pipes (or non-console devices such as a serial port)?
Mostly because it's much more work. As I mentioned in my other post, an alternative would be to bring win_unicode_console into the stdlib and enable it by default (which considering the package was largely developed on bugs.p.o is probably okay, but we'd probably need to rewrite it in C, which is basically implementing a true Unicode console).

You're right that changing the console encoding after launching Python is probably going to mess with pipes. We can detect whether the streams are interactive or not and adjust accordingly, but that's going to get messy if you're only piping in/out and stdin/out end up with different encodings. I'll put some more thought into this part.

Thanks,
Steve
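The interactivity check being described is essentially isatty() per stream; a minimal sketch, assuming the standard streams have not been replaced:

    import sys

    # Each standard stream reports separately whether it is attached to a
    # console and which encoding Python picked for it, so stdin and stdout
    # can legitimately end up different when only one side is piped.
    for stream in (sys.stdin, sys.stdout, sys.stderr):
        print(stream.name, stream.isatty(), stream.encoding)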
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Why? What's the use case? [byte paths]
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8? What about only doing the deprecation warning if non-ascii bytes are present in the value?
For reading, I assume. When opened for writing, it should probably be utf-8-sig [if it's not mbcs] to match what Notepad does. What about files opened for appending or updating? In theory it could ingest the whole file to see if it's valid UTF-8, but that has a time cost.
Writing out the BOM automatically basically makes your files incompatible with other platforms, which rarely expect a BOM.
Yes but you're not running on other platforms, you're running on the platform you're running on. If files need to be moved between platforms, converting files with a BOM to without ought to be the responsibility of the same tool that converts CRLF line endings to LF.
By omitting it but writing and reading UTF-8 we ensure that Python can handle its own files on any platform, while potentially upsetting some older applications on Windows or platforms that don't assume UTF-8 as a default.
Okay, you haven't addressed updating and appending. I realized after posting that updating should be in binary, but that leaves appending. Should we detect BOMs and/or attempt to detect the encoding by other means in those cases?
Notepad, if there's no BOM, checks the first 256 bytes of the file for whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK], and can get it wrong for certain very short files [i.e. the infamous "this app can break"]
Yeah, this is a pretty horrible idea :)
Eh, maybe the utf-16 because it can give some hilariously bad results, but using it to differentiate between utf-8 and mbcs might not be so bad. But what to do if all we see is ascii?
What to do on opening a pipe or device? [Is os.fstat able to detect these cases?]
We should be able to detect them, but why treat them any differently from a file?
Eh, I was mainly concerned about if the first few bytes aren't a BOM? What about blocking waiting for them? But if this is delayed until the first read then it's fine.
It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
The only thing O_TEXT does rather than O_BINARY is convert CRLF line endings (and maybe end on ^Z), and I don't think we even expose the constants for the CRT's unicode modes.
On Thu, Aug 11, 2016 at 6:09 AM, Random832 <random832@fastmail.com> wrote:
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Why? What's the use case? [byte paths]
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8?
AIUI, the data flow would be: Python bytes object -> decode to Unicode text -> encode to UTF-16 -> Windows API. If you do the first transformation using mbcs, you're guaranteed *some* result (all Windows codepages have definitions for all byte values, if I'm not mistaken), but a hard-to-predict one - and worse, one that can change based on system settings. Also, if someone naively types "bytepath.decode()", Python will default to UTF-8, *not* to the system codepage. I'd rather a single consistent default encoding.
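The difference is easy to see with the codecs themselves (the 'mbcs' line is Windows-only and depends on the active codepage, hence left commented out):

    data = b'caf\xc3\xa9'            # 'café' encoded as UTF-8

    # bytes.decode() defaults to UTF-8 on every platform:
    print(data.decode())             # café

    # Decoding the same bytes as the ANSI codepage gives a different,
    # settings-dependent answer (with codepage 1252 it comes out as 'cafÃ©').
    # print(data.decode('mbcs'))     # Windows-only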
What about only doing the deprecation warning if non-ascii bytes are present in the value?
-1. Data-dependent warnings just serve to strengthen the feeling that "weird characters" keep breaking your programs, instead of writing your program to cope with all characters equally. It's like being racist against non-ASCII characters :)

On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve.dower@python.org> wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
+1 on these.
* make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default? An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
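The ambiguity is real; the same four bytes decode cleanly both ways (standard codec behaviour today):

    ambiguous = b'\xff\xfe\x00\x00'

    # As UTF-16 this is the little-endian BOM followed by U+0000...
    print(repr(ambiguous.decode('utf-16')))   # '\x00'

    # ...but as UTF-32 it is exactly the little-endian BOM and nothing else.
    print(repr(ambiguous.decode('utf-32')))   # ''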
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
-1. Same concerns as the above, plus I'd rather use the saner default.
* force the console encoding to UTF-8 on initialize and revert on finalize
-0 for Python itself; +1 for Python's interactive interpreter. Programs that mess with console settings get annoying when they crash out and don't revert properly. Unless there is *no way* that you could externally kill the process without also bringing the terminal down, there's the distinct possibility of messing everything up.

Would it be possible to have a "sys.setconsoleutf8()" that changes the console encoding and slaps in an atexit() to revert? That would at least leave it in the hands of the app.

Overall I'm +1 on shifting from eight-bit encodings to UTF-8. Don't be held back by what Notepad does.

ChrisA
On 10Aug2016 1431, Chris Angelico wrote:
On Thu, Aug 11, 2016 at 6:09 AM, Random832 <random832@fastmail.com> wrote:
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Why? What's the use case? [byte paths]
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8?
AIUI, the data flow would be: Python bytes object -> decode to Unicode text -> encode to UTF-16 -> Windows API. If you do the first transformation using mbcs, you're guaranteed *some* result (all Windows codepages have definitions for all byte values, if I'm not mistaken), but a hard-to-predict one - and worse, one that can change based on system settings. Also, if someone naively types "bytepath.decode()", Python will default to UTF-8, *not* to the system codepage.
I'd rather a single consistent default encoding.
I'm proposing to make that single consistent default encoding utf-8. It sounds like we're in agreement?
What about only doing the deprecation warning if non-ascii bytes are present in the value?
-1. Data-dependent warnings just serve to strengthen the feeling that "weird characters" keep breaking your programs, instead of writing your program to cope with all characters equally. It's like being racist against non-ASCII characters :)
Agreed. This won't happen.
On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve.dower@python.org> wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
+1 on these.
* make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default? An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The BOM exists solely for data-based detection, and the UTF-8 BOM is different from the UTF-16 and UTF-32 ones. So we either find an exact BOM (which IIRC decodes as a no-op spacing character, though I have a feeling some version of Unicode redefined it exclusively for being the marker) or we use utf-8.

But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:

* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)

This provides the best compatibility when reading/writing files without making any guesses. We could reasonably extend this to read utf-16 and utf-32 if they have a BOM, but that's an extension and not necessary for the main change.
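What the utf-8-sig/utf-8 split means in practice, shown with bytes rather than files (this is how the existing codecs already behave):

    with_bom = b'\xef\xbb\xbfhello'

    # utf-8-sig strips a leading BOM if present...
    assert with_bom.decode('utf-8-sig') == 'hello'
    # ...and is equally happy with data that has no BOM at all.
    assert b'hello'.decode('utf-8-sig') == 'hello'

    # Plain utf-8 keeps the BOM as a visible U+FEFF character:
    assert with_bom.decode('utf-8') == '\ufeffhello'

    # And encoding with plain utf-8 writes no BOM:
    assert 'hello'.encode('utf-8') == b'hello'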
* force the console encoding to UTF-8 on initialize and revert on finalize
-0 for Python itself; +1 for Python's interactive interpreter. Programs that mess with console settings get annoying when they crash out and don't revert properly. Unless there is *no way* that you could externally kill the process without also bringing the terminal down, there's the distinct possibility of messing everything up.
The main problem here is that if the console is not forced to UTF-8 then it won't render any of the characters correctly.
Would it be possible to have a "sys.setconsoleutf8()" that changes the console encoding and slaps in an atexit() to revert? That would at least leave it in the hands of the app.
Yes, but if the app is going to opt-in then I'd suggest the win_unicode_console package, which won't require any particular changes.

It sounds like we'll have to look into effectively merging that package into the core. I'm afraid that'll come with a much longer tail of bugs (and will quite likely break code that expects to use file descriptors to access stdin/out), but it's the least impactful way to do it.

Cheers,
Steve
On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower <steve.dower@python.org> wrote:
On 10Aug2016 1431, Chris Angelico wrote:
I'd rather a single consistent default encoding.
I'm proposing to make that single consistent default encoding utf-8. It sounds like we're in agreement?
Yes, we are. I was disagreeing with Random's suggestion that mbcs would also serve. Defaulting to UTF-8 everywhere is (a) consistent on all systems, regardless of settings; and (b) consistent with bytes.decode() and str.encode(), both of which default to UTF-8.
-0.5. Is there any precedent for this kind of data-based detection being the default? An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The BOM exists solely for data-based detection, and the UTF-8 BOM is different from the UTF-16 and UTF-32 ones. So we either find an exact BOM (which IIRC decodes as a no-op spacing character, though I have a feeling some version of Unicode redefined it exclusively for being the marker) or we use utf-8.
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)
This provides the best compatibility when reading/writing files without making any guesses. We could reasonably extend this to read utf-16 and utf-32 if they have a BOM, but that's an extension and not necessary for the main change.
AIUI the utf-8-sig encoding is happy to decode something that doesn't have a signature, right? If so, then yes, I would definitely support that mild mismatch in defaults. Chew up that UTF-8 aBOMination and just use UTF-8 as is. I've almost never seen files stored in UTF-32 (even UTF-16 isn't all that common compared to UTF-8), so I wouldn't stress too much about that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth doing, but it could easily be retrofitted (that byte sequence won't decode as UTF-8).
* force the console encoding to UTF-8 on initialize and revert on finalize
-0 for Python itself; +1 for Python's interactive interpreter. Programs that mess with console settings get annoying when they crash out and don't revert properly. Unless there is *no way* that you could externally kill the process without also bringing the terminal down, there's the distinct possibility of messing everything up.
The main problem here is that if the console is not forced to UTF-8 then it won't render any of the characters correctly.
Ehh, that's annoying. Is there a way to guarantee, at the process level, that the console will be returned to "normal state" when Python exits? If not, there's the risk that people run a Python program and then the *next* program gets into trouble. But if that happens only on abnormal termination ("I killed Python from Task Manager, and it left stuff messed up so I had to close the console"), it's probably an acceptable risk. And the benefit sounds well worthwhile. Revising my recommendation to +0.9. ChrisA
On 11 August 2016 at 01:41, Chris Angelico <rosuav@gmail.com> wrote:
I've almost never seen files stored in UTF-32 (even UTF-16 isn't all that common compared to UTF-8), so I wouldn't stress too much about that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth doing, but it could easily be retrofitted (that byte sequence won't decode as UTF-8).
I see UTF-16 relatively often as a result of redirecting stdout in Powershell and forgetting that it defaults (stupidly, IMO) to UTF-16.
The main problem here is that if the console is not forced to UTF-8 then it won't render any of the characters correctly.
Ehh, that's annoying. Is there a way to guarantee, at the process level, that the console will be returned to "normal state" when Python exits? If not, there's the risk that people run a Python program and then the *next* program gets into trouble.
There's also the risk that Python programs using subprocess.Popen start the subprocess with the console in a non-standard state. Should we be temporarily restoring the console codepage in that case? How does the following work?

    <start>
    set codepage to UTF-8
    ...
    set codepage back
    spawn subprocess X, but don't wait for it
    set codepage to UTF-8
    ...
    ... At this point what codepage does Python see? What codepage does
        process X see? (Note that they are both sharing the same console).
    ...
    <end>
    restore codepage

Paul
On Thu, Aug 11, 2016 at 9:07 AM, Paul Moore <p.f.moore@gmail.com> wrote:
    set codepage to UTF-8
    ...
    set codepage back
    spawn subprocess X, but don't wait for it
    set codepage to UTF-8
    ...
    ... At this point what codepage does Python see? What codepage does
        process X see? (Note that they are both sharing the same console).
The input and output codepages are global data in conhost.exe. They aren't tracked for each attached process (unlike input history and aliases). That's how chcp.com works in the first place. Otherwise its calls to SetConsoleCP and SetConsoleOutputCP would be pointless.

But IMHO all talk of using codepage 65001 is a waste of time. I think the trailing garbage output with this codepage in Windows 7 is unacceptable. And getting EOF for non-ASCII input is a show stopper. The problem occurs in conhost. All you get is the EOF result from ReadFile/ReadConsoleA, so it can't be worked around. This kills the REPL and raises EOFError for input(). ISTM the only people who think codepage 65001 actually works are those using Windows 8+ who occasionally need to print non-OEM text and never enter (or paste) anything but ASCII text.
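The global nature of the console codepages can be poked at from Python via ctypes (Windows-only sketch; these are the same Win32 calls chcp.com uses):

    import ctypes

    kernel32 = ctypes.windll.kernel32

    # The input and output codepages belong to the console (conhost), not to
    # any one attached process.
    print(kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP())

    # Switching to UTF-8 (codepage 65001) would therefore affect every process
    # attached to this console, which is exactly the concern raised above.
    # kernel32.SetConsoleCP(65001)
    # kernel32.SetConsoleOutputCP(65001)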
I was thinking we would end up using the console API for input but stick with the standard handles for output, mostly to minimize the amount of magic switching we have to do. But since we can just switch the entire stream object in __std*__ once at startup if nothing is redirected it probably isn't that much of a simplification. I have some airport/aeroplane time today where I can experiment.

Top-posted from my Windows Phone
On 12 August 2016 at 13:38, eryk sun <eryksun@gmail.com> wrote:
... At this point what codepage does Python see? What codepage does process X see? (Note that they are both sharing the same console).
The input and output codepages are global data in conhost.exe. They aren't tracked for each attached process (unlike input history and aliases). That's how chcp.com works in the first place. Otherwise its calls to SetConsoleCP and SetConsoleOutputCP would be pointless.
That's what I expected, but hadn't had time to confirm (your point about chcp didn't occur to me). Thanks.
But IMHO all talk of using codepage 65001 is a waste of time. I think the trailing garbage output with this codepage in Windows 7 is unacceptable. And getting EOF for non-ASCII input is a show stopper. The problem occurs in conhost. All you get is the EOF result from ReadFile/ReadConsoleA, so it can't be worked around. This kills the REPL and raises EOFError for input(). ISTM the only people who think codepage 65001 actually works are those using Windows 8+ who occasionally need to print non-OEM text and never enter (or paste) anything but ASCII text.
Agreed, mucking with global state that subprocesses need was sufficient for me, but the other issues you mention seem conclusive. I understand Steve's point about being an improvement over 100% wrong, but we've lived with the current state of affairs long enough that I think we should take whatever time is needed to do it right, rather than briefly postponing the inevitable with a partial solution.

Paul

PS I've spent the last week on a different project trying to "save time" with partial solutions to precisely this issue, so apologies if I'm in a particularly unforgiving mood about it right now :-(
On Fri, Aug 12, 2016 at 6:41 AM, Paul Moore <p.f.moore@gmail.com> wrote:
I understand Steve's point about being an improvement over 100% wrong, but we've lived with the current state of affairs long enough that I think we should take whatever time is needed to do it right,
Sure -- but this is such a freakin' mess that there may well not BE a "right" solution. In which case, something IS better than nothing.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
On 12 August 2016 at 18:05, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Aug 12, 2016 at 6:41 AM, Paul Moore <p.f.moore@gmail.com> wrote:
I understand Steve's point about being an improvement over 100% wrong, but we've lived with the current state of affairs long enough that I think we should take whatever time is needed to do it right,
Sure -- but his is such a freakin' mess that there may well not BE a "right" solution.
In which case, something IS better than nothing.
Using Unicode APIs for console IO *is* better. Powershell does it, and it works there. All I'm saying is that we should focus on that as our "improved solution", rather than looking at CP_UTF8 as a "quick and dirty" solution, as there's no evidence that people need "quick and dirty" (they have win_unicode_console if the current state of affairs isn't sufficient for them).

I'm not arguing that we do nothing. Are you saying we should use CP_UTF8 *in preference* to wide character APIs? Or that we should implement CP_UTF8 first and then wide chars later? Or are we in violent agreement that we should implement wide chars?

Paul
On Fri, Aug 12, 2016 at 10:19 AM, Paul Moore <p.f.moore@gmail.com> wrote:
In which case, something IS better than nothing.
I'm not arguing that we do nothing. Are you saying we should use CP_UTF8 *in preference* to wide character APIs? Or that we should implement CP_UTF8 first and then wide chars later?
Honestly, I don't understand the details enough to argue either way.
Or are we in violent agreement that we should implement wide chars?
probably -- to the extent I understand the issues :-)

But I am arguing that anything that makes it "better" that actually gets implemented is better than a "right" solution that no one has the time to make happen, or that we can't agree on anyway.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
On 12 August 2016, Paul Moore <p.f.moore@gmail.com> wrote:
Agreed, mucking with global state that subprocesses need was sufficient for me, but the other issues you mention seem conclusive. I understand Steve's point about being an improvement over 100% wrong, but we've lived with the current state of affairs long enough that I think we should take whatever time is needed to do it right, rather than briefly postponing the inevitable with a partial solution.
For the love of all that is holy and good, ignore that sentiment. We need ANY AND ALL improvements to this miserable console experience.
On Wed, Aug 10, 2016 at 11:40 PM, Steve Dower <steve.dower@python.org> wrote:
It sounds like we'll have to look into effectively merging that package into the core. I'm afraid that'll come with a much longer tail of bugs (and will quite likely break code that expects to use file descriptors to access stdin/out), but it's the least impactful way to do it.
Programs that use sys.std*.encoding but use the file descriptor seem like a weird case to me. Do you have an example?
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
On 10Aug2016 1431, Chris Angelico wrote:
* make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default?
There is precedent: the Python interpreter will accept a BOM instead of an encoding cookie when importing .py files. [Chris]
An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000?
BOMs are a heuristic, nothing more. If you're reading arbitrary files that could start with anything, then of course they can guess wrong. But then if I dumped a bunch of arbitrary Unicode codepoints in your lap and asked you to guess the language, you would likely get it wrong too :-) [Chris]
Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The way I have done auto-detection based on BOMs is you start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.) Compare those first four bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs (big-endian and little-endian). Then check for UTF-8, and if you're really keen, UTF-7 and UTF-1.

    def bom2enc(bom, default=None):
        """Return encoding name from a four-byte BOM."""
        if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
            return 'utf_32'
        elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
            return 'utf_16'
        elif bom.startswith(b'\xEF\xBB\xBF'):
            return 'utf_8_sig'
        elif bom.startswith(b'\x2B\x2F\x76'):
            # The fourth byte (index 3) distinguishes a real UTF-7 BOM
            # from other sequences starting with '+/v'.
            if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
                return 'utf_7'
        elif bom.startswith(b'\xF7\x64\x4C'):
            return 'utf_1'
        elif default is None:
            raise ValueError('no recognisable BOM signature')
        else:
            return default

[Steve Dower]
The BOM exists solely for data-based detection, and the UTF-8 BOM is different from the UTF-16 and UTF-32 ones. So we either find an exact BOM (which IIRC decodes as a no-op spacing character, though I have a feeling some version of Unicode redefined it exclusively for being the marker) or we use utf-8.
The Byte Order Mark is always U+FEFF encoded into whatever bytes your encoding uses. You should never use U+FEFF except as a BOM, but of course arbitrary Unicode strings might include it in the middle of the string Just Because. In that case, it may be interpreted as a legacy "ZERO WIDTH NON-BREAKING SPACE" character. But new content should never do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF inside the body of your file or string as an unsupported character. http://www.unicode.org/faq/utf_bom.html#BOM [Steve]
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)
Sounds reasonable to me. Rather than hard-coding that behaviour, can we have a new encoding that does that? "utf-8-readsig" perhaps. [Steve]
This provides the best compatibility when reading/writing files without making any guesses. We could reasonably extend this to read utf-16 and utf-32 if they have a BOM, but that's an extension and not necessary for the main change.
The use of a BOM is always a guess :-) Maybe I just happen to have a Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with "Ôªø". Either case will be wrongly detected as UTF-8. That's the risk you take when using a heuristic. And if you don't want to use that heuristic, then you must specify the actual encoding in use.

--
Steven D'Aprano
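The false positive is easy to reproduce, since the UTF-8 BOM bytes are perfectly ordinary Latin-1 text:

    bom = b'\xef\xbb\xbf'

    # Decoded as Latin-1 these are just three printable characters, so a
    # Latin-1 file that happens to start with them would be misidentified
    # as UTF-8 by any BOM-based heuristic.
    print(bom.decode('latin-1'))    # ï»¿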
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
On 10Aug2016 1431, Chris Angelico wrote:
* make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default?
There is precedent: the Python interpreter will accept a BOM instead of an encoding cookie when importing .py files.
Okay, that's good enough for me.
[Chris]
An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000?
BOMs are a heuristic, nothing more. If you're reading arbitrary files could start with anything, then of course they can guess wrong. But then if I dumped a bunch of arbitrary Unicode codepoints in your lap and asked you to guess the language, you would likely get it wrong too :-)
I have my own mental heuristics, but I can't recognize one Cyrillic language from another. And some Slavic languages can be written with either Latin or Cyrillic letters, just to further confuse matters. Of course, "arbitrary Unicode codepoints" might not all come from one language, and might not be any language at all. (Do you wanna build a U+2603?)
[Chris]
Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The way I have done auto-detection based on BOMs is you start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.)
Interesting. Are you assuming that a text file cannot be empty? Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with less than one character in them?
Compare those first four bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* (otherwise UTF-16 will shadow UFT-32). Note that there are two BOMs (big-endian and little-endian). Then check for UTF-8, and if you're really keen, UTF-7 and UTF-1.
For a default file-open encoding detection, I would minimize the number of options. The UTF-7 BOM could be the beginning of a file containing Base 64 data encoded in ASCII, which is a very real possibility.
    elif bom.startswith(b'\x2B\x2F\x76'):
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are rare. Even UTF-32 doesn't necessarily have to be included. When was the last time you saw a UTF-32LE-BOM file?
[Steve]
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)
Sounds reasonable to me.
Rather than hard-coding that behaviour, can we have a new encoding that does that? "utf-8-readsig" perhaps.
+1. Makes the documentation easier by having the default value for encoding not depend on the value for mode. ChrisA
On Thu, Aug 11, 2016 at 02:09:00PM +1000, Chris Angelico wrote:
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
The way I have done auto-detection based on BOMs is you start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.)
Interesting. Are you assuming that a text file cannot be empty?
Hmmm... not consciously, but I guess I was. If the file is empty, how do you know it's text?
Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with less than one character in them?
I'll have to think about it some more :-)
For a default file-open encoding detection, I would minimize the number of options. The UTF-7 BOM could be the beginning of a file containing Base 64 data encoded in ASCII, which is a very real possibility.
I'm coming from the assumption that you're reading unformatted text in an unknown encoding, rather than some structured format. But we're getting off topic here. In context of Steve's suggestion, we should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, skip it, otherwise treat the file as UTF-8.
When was the last time you saw a UTF-32LE-BOM file?
Two minutes ago, when I looked at my test suite :-P -- Steve
On Thu, Aug 11, 2016, at 10:25, Steven D'Aprano wrote:
Interesting. Are you assuming that a text file cannot be empty?
Hmmm... not consciously, but I guess I was.
If the file is empty, how do you know it's text?
Heh. That's the *other* thing that Notepad does wrong in the opinion of people coming from the Unix world - a Windows text file does not need to end with a [CR]LF, and normally will not.
But we're getting off topic here. In context of Steve's suggestion, we should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, skip it, otherwise treat the file as UTF-8.
I think there's still room for UTF-16. It's two of the four encodings supported by Notepad, after all.
Unless someone else does the implementation, I'd rather add a utf8-readsig encoding that initially only skips a utf8 BOM - notably, you always get the same encoding, it just sometimes skips the first three bytes. I think we can change this later to detect and switch to utf16 without it being disastrous, though we've made it this far without it and frankly there are good reasons to "encourage" utf8 over utf16.

My big concern is the console... I think that change is inevitably going to have to break someone, but I need to map out the possibilities first to figure out just how bad it'll be.

Top-posted from my Windows Phone
On Fri, Aug 12, 2016 at 1:31 AM, Steve Dower <steve.dower@python.org> wrote:
My big concern is the console... I think that change is inevitably going to have to break someone, but I need to map out the possibilities first to figure out just how bad it'll be.
Obligatory XKCD: https://xkcd.com/1172/

Subprocess invocation has been mentioned. What about logging? Will there be issues with something that attempts to log to both console and file?

ChrisA
On Wed, Aug 10, 2016, at 17:31, Chris Angelico wrote:
AIUI, the data flow would be: Python bytes object
Nothing _starts_ as a Python bytes object. It has to be read from somewhere or encoded in the source code as a literal. The scenario is very different for "defined internally within the program" (how are these not gonna be ASCII) vs "user input" (user input how? from the console? from tkinter? how'd that get converted to bytes?) vs "from a network or something like a tar file where it represents a path on some other system" (in which case it's in whatever encoding that system used, or *maybe* an encoding defined as part of the network protocol or file format). The use case has not been described adequately enough to answer my question.
On Wed, Aug 10, 2016 at 8:09 PM, Random832 <random832@fastmail.com> wrote:
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8?
Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932. If a sequence is invalid, it gets passed to the filesystem as the default Unicode character, so it won't successfully roundtrip. In the following example b"\x81\xad", which isn't defined in CP932, gets mapped to the codepage's default Unicode character, Katakana middle dot, which encodes back as b"\x81E":

    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\x81\xad', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> unicodedata.name(os.listdir('.')[0])
    'KATAKANA MIDDLE DOT'
    >>> '・'.encode('932')
    b'\x81E'

This isn't a problem for single-byte codepages, since every byte value uniquely maps to a Unicode code point, even if it's simply b'\x81' => u"\x81". Obviously there's still the general problem of dealing with arbitrary Unicode filenames created by other programs, since the ANSI API can only return a best-fit encoding of the filename, which is useless for actually accessing the file.
It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
The only thing O_TEXT does rather than O_BINARY is convert CRLF line endings (and maybe end on ^Z), and I don't think we even expose the constants for the CRT's unicode modes.
Python 3 uses O_BINARY when opening files, unless you explicitly call os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags if the platform defines it.

The Windows CRT reads the BOM for the Unicode modes O_WTEXT, O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires opening the file twice, the first time with read access. See configure_text_mode() in "Windows Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".

Python doesn't expose or use these Unicode text-mode constants. That's for the best because in Unicode mode the CRT invokes the invalid parameter handler when a buffer doesn't have an even number of bytes, i.e. a multiple of sizeof(wchar_t). Python could copy how configure_text_mode() handles the BOM, except it shouldn't write a BOM for new UTF-8 files.
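A sketch of the flag handling described above (the helper name is made up; os.O_BINARY only exists on Windows, hence the getattr fallback):

    import os

    def open_binary_fd(path, flags=os.O_RDONLY):
        # Mirror what FileIO.__init__ does: add O_BINARY when the platform
        # defines it, so the CRT never applies CRLF or ^Z translation.
        return os.open(path, flags | getattr(os, 'O_BINARY', 0))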
On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932.
Er... utf-8 doesn't work reliably with arbitrary bytes paths either, unless you intend to use surrogateescape (which you could also do with mbcs). Is there any particular reason to expect all bytes paths in this scenario to be valid UTF-8?
Python 3 uses O_BINARY when opening files, unless you explicitly call os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags if the platform defines it.
Fair enough. I wasn't sure, particularly considering that python does expose O_BINARY, O_TEXT, and msvcrt.setmode. I'm not sure I approve of os.open not also adding it (or perhaps adding it only if O_TEXT is not explicitly added), but... meh.
Python could copy how configure_text_mode() handles the BOM, except it shouldn't write a BOM for new UTF-8 files.
I disagree. I think that *on windows* it should, just like *on windows* it should write CR-LF for line endings.
On 10Aug2016 1630, Random832 wrote:
On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932.
Er... utf-8 doesn't work reliably with arbitrary bytes paths either, unless you intend to use surrogateescape (which you could also do with mbcs).
Is there any particular reason to expect all bytes paths in this scenario to be valid UTF-8?
On Windows, all paths are effectively UCS-2 (they are defined as UTF-16, but surrogate pairs don't seem to be validated, which IIUC means it's really UCS-2), so while the majority can be encoded as valid UTF-8, there are some paths which cannot. (These paths are going to break many other tools though, such as PowerShell, so we won't be in bad company if we can't handle them properly in edge cases.) surrogateescape is irrelevant because it's only for decoding from bytes.

An alternative approach would be to replace mbcs with a ucs-2 encoding that is basically just a blob of the path that was returned from Windows (using the Unicode APIs). None of the manipulation functions would work on this though, since nearly every second character would be \x00, but it's the only way (besides using str) to maintain full fidelity for every possible path name.

Compromising on UTF-8 is going to increase consistency across platforms and across different Windows installations without increasing the rate of errors above what we currently see (given that invalid characters are currently replaced with '?'). It's not a 100% solution, but it's a 99% solution where the 1% is not handled well by anyone.

Cheers,
Steve
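To make the unpaired-surrogate case concrete (the filename is hypothetical; on a real system such a name would have to come back from the *W APIs):

    >>> name = 'file_\udc80.txt'            # lone surrogate, legal in a Windows filename
    >>> name.encode('utf-8', 'surrogatepass')
    b'file_\xed\xb2\x80.txt'
    >>> name.encode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 5: surrogates not allowed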
On Wed, Aug 10, 2016 at 11:30 PM, Random832 <random832@fastmail.com> wrote:
Er... utf-8 doesn't work reliably with arbitrary bytes paths either, unless you intend to use surrogateescape (which you could also do with mbcs).
Is there any particular reason to expect all bytes paths in this scenario to be valid UTF-8?
The problem is more that data is lost without an error when using the legacy ANSI API. If the path is invalid UTF-8, Python will at least raise an exception when decoding it. To work around this, the developers may decide they need to just bite the bullet and use Unicode, or maybe there could be legacy Latin-1 and ANSI modes enabled by an environment variable or sys flag.
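For example, with the CP932 byte sequence from earlier in the thread:

    >>> b'\x81\xad'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte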
On 11 August 2016 at 00:30, Random832 <random832@fastmail.com> wrote:
Python could copy how configure_text_mode() handles the BOM, except it shouldn't write a BOM for new UTF-8 files.
I disagree. I think that *on windows* it should, just like *on windows* it should write CR-LF for line endings.
Tools like git and hg, and cross platform text editors, handle transparently managing the differences between line endings for you. But nothing much handles BOM stripping/adding automatically. So while in theory the two cases are similar, in practice lack of tool support means that if we start adding BOMs on Windows (and requiring them so that we can detect UTF8) then we'll be setting up new interoperability problems for Python users, for little benefit. Paul
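A small illustration of the interoperability concern: a BOM written by one tool leaks into the data read by another unless the reader also knows to strip it.

    >>> with open('data.txt', 'w', encoding='utf-8-sig') as f:
    ...     f.write('text')
    ...
    4
    >>> open('data.txt', encoding='utf-8').read()       # BOM not stripped
    '\ufefftext'
    >>> open('data.txt', encoding='utf-8-sig').read()   # BOM stripped
    'text'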
On Wed, Aug 10, 2016 at 6:10 PM, Steve Dower <steve.dower@python.org> wrote:
Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason. The one exception is as a default encoding for opening files when no other information is available (e.g. a Unicode BOM or explicit encoding argument). BOMs are very common on Windows, since the default assumption is nearly always a bad idea.
The CRT doesn't allow UTF-8 as a locale encoding because Windows itself doesn't allow this. So locale.getpreferredencoding() can't change, but in practice it can be ignored.

Speaking of locale, Windows Python should call setlocale(LC_CTYPE, "") in pylifecycle.c in order to work around an inconsistency between LC_TIME and LC_CTYPE in the default "C" locale. The former is ANSI while the latter is effectively Latin-1, which leads to mojibake in time.tzname and elsewhere. Calling setlocale(LC_CTYPE, "") is already done on most Unix systems, so this would actually improve cross-platform consistency.
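At the Python level, the suggested workaround amounts to roughly the following (a sketch; the proposal is to make the call in C during interpreter startup, and the exact locale string returned depends on the machine):

    >>> import locale
    >>> locale.setlocale(locale.LC_CTYPE)       # the startup default
    'C'
    >>> locale.setlocale(locale.LC_CTYPE, '')   # switch LC_CTYPE to the user's locale
    'English_United States.1252'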
Finally, the encoding of stdin, stdout and stderr are currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).
Using codepage 65001 for output is broken prior to Windows 8 because WriteFile/WriteConsoleA returns (as an output parameter) the number of decoded UTF-16 codepoints instead of the number of bytes written, which makes a buffered writer repeatedly write garbage at the end of each write in proportion to the number of non-ASCII characters. This can be worked around by decoding to get the UTF-16 size before each write, or by just blindly assuming that a console write always succeeds in writing the entire buffer. In this case the console should be detected by GetConsoleMode(). isatty() isn't right for this since it's true for all character devices, which includes NUL among others.

Codepage 65001 is broken for non-ASCII input (via ReadFile/ReadConsoleA) in all versions of Windows that I've tested, including Windows 10. By attaching a debugger to conhost.exe you can see how it fails in WideCharToMultiByte because it assumes one byte per character. If you try to read 10 bytes, it assumes you're trying to read 10 UTF-16 'characters' into a 10 byte buffer, which fails for UTF-8 when even a single non-ASCII character is read. The ReadFile/ReadConsoleA call returns that it successfully read 0 bytes, which is interpreted as EOF. This cannot be worked around. The only way to read the full range of Unicode from the console is via the wide-character APIs ReadConsoleW and ReadConsoleInputW.

IMO, Python needs a C implementation of the win_unicode_console module, using the wide-character APIs ReadConsoleW and WriteConsoleW. Note that this sets sys.std*.encoding as UTF-8 and transcodes, so Python code never has to work directly with UTF-16 encoded text.
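For a sense of what the wide-character approach looks like, here is a rough ctypes sketch of output through WriteConsoleW (an illustration only, not the proposed C implementation; it assumes a real console is attached and text within the BMP, since WriteConsoleW counts UTF-16 code units):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = (wintypes.HANDLE, wintypes.LPCWSTR,
                                       wintypes.DWORD,
                                       ctypes.POINTER(wintypes.DWORD),
                                       wintypes.LPVOID)

    STD_OUTPUT_HANDLE = -11

    def console_write(text):
        # Hand UTF-16 text straight to the console; the console codepage
        # (chcp) no longer matters for what gets displayed.
        handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        written = wintypes.DWORD()
        kernel32.WriteConsoleW(handle, text, len(text),
                               ctypes.byref(written), None)
        return written.value

    console_write('αβγ spam\n')   # renders correctly regardless of the codepage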
On 10 August 2016 at 21:16, eryk sun <eryksun@gmail.com> wrote:
IMO, Python needs a C implementation of the win_unicode_console module, using the wide-character APIs ReadConsoleW and WriteConsoleW. Note that this sets sys.std*.encoding as UTF-8 and transcodes, so Python code never has to work directly with UTF-16 encoded text.
+1 on this (and if this means we need to wait till 3.7, so be it). I'd originally thought this was what Steve was proposing. Paul
On Wed, 10 Aug 2016 at 11:16 Steve Dower <steve.dower@python.org> wrote:
[SNIP]
Finally, the encoding of stdin, stdout and stderr are currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).
It is also possible for Python to change the current console encoding to be UTF-8 on initialize and change it back on finalize. (This would leave the console in an unexpected state if Python segfaults, but console encoding is probably the least of anyone's worries at that point.) So I'm proposing actively changing the current console to be Unicode while Python is running, and hence sys.std[in|out|err] will default to utf-8.
So that's a broad range of changes, and I have little hope of figuring out all the possible issues, back-compat risks, and flow-on effects on my own. Please let me know (either on-list or off-list) how a change like this would affect your projects, either positively or negatively, and whether you have any specific experience with these changes/fixes and think they should be approached differently.
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
[SNIP] * force the console encoding to UTF-8 on initialize and revert on finalize
Don't have enough Windows experience to comment on the other parts of this proposal, but for the console encoding I am a hearty +1 as I'm tired of Unicode characters failing to show up in the REPL.
On 11 August 2016 at 04:10, Steve Dower <steve.dower@python.org> wrote:
I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.
My main reaction would be that if Drekin (Adam Bartoš) agrees the changes natively solve the problems that https://pypi.python.org/pypi/win_unicode_console works around, it's probably a good idea. The status quo is also sufficiently broken from both a native Windows perspective and a cross-platform compatibility perspective that your proposals are highly unlikely to make things *worse* :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 11 August 2016 at 13:26, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 11 August 2016 at 04:10, Steve Dower <steve.dower@python.org> wrote:
I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.
My main reaction would be that if Drekin (Adam Bartoš) agrees the changes natively solve the problems that https://pypi.python.org/pypi/win_unicode_console works around, it's probably a good idea.
Also, a reminder that Adam has a couple of proposals on the tracker aimed at getting CPython to use a UTF-16-LE console on Windows: http://bugs.python.org/issue22555#msg242943 (last two issue references in that comment) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
* force the console encoding to UTF-8 on initialize and revert on finalize
So what are your concerns? Suggestions?
As far as I know, the single biggest problem caused by the status quo for console encoding is "some string containing characters not in the console codepage is printed out; unhandled UnicodeEncodeError". Is there any particular reason not to use errors='replace'? Is there any particular reason for the REPL, when printing the repr of a returned object, not to replace characters not in the stdout encoding with backslash sequences? Does Python provide any mechanism to access the built-in "best fit" mappings for windows codepages (which mostly consist of removing accents from latin letters)?
On Fri, Aug 12, 2016 at 2:20 PM, Random832 <random832@fastmail.com> wrote:
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
* force the console encoding to UTF-8 on initialize and revert on finalize
So what are your concerns? Suggestions?
As far as I know, the single biggest problem caused by the status quo for console encoding is "some string containing characters not in the console codepage is printed out; unhandled UnicodeEncodeError". Is there any particular reason not to use errors='replace'?
If that's all you want then you can set PYTHONIOENCODING=:replace. Prepare to be inundated with question marks. Python's 'cp*' encodings are cross-platform, so they don't call Windows NLS APIs. If you want a best-fit encoding, then 'mbcs' is the only choice. Use chcp.com to switch to your system's ANSI codepage and set PYTHONIOENCODING=mbcs:replace. An 'oem' encoding could be added, but I'm no fan of these best-fit encodings. Writing question marks at least hints that the output is wrong.
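What that looks like in practice (cp437 stands in for whatever the console codepage happens to be):

    >>> 'ऄअआ spam'.encode('cp437', errors='replace')
    b'??? spam'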
Is there any particular reason for the REPL, when printing the repr of a returned object, not to replace characters not in the stdout encoding with backslash sequences?
sys.displayhook already does this. It falls back on sys_displayhook_unencodable if printing the repr raises a UnicodeEncodeError.
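Roughly what that fallback amounts to at the Python level (a sketch, not the actual displayhook code):

    import sys

    def display_fallback(value):
        # If repr(value) can't be encoded for stdout, escape the
        # unencodable characters instead of raising.
        text = repr(value)
        data = text.encode(sys.stdout.encoding, 'backslashreplace')
        sys.stdout.buffer.write(data + b'\n')

    display_fallback('αβγ')   # escaped on a console that can't encode Greek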
Does Python provide any mechanism to access the built-in "best fit" mappings for windows codepages (which mostly consist of removing accents from latin letters)?
As mentioned above, for output this is only available with 'mbcs'. For reading input via ReadFile or ReadConsoleA (and thus also C _read, fread, and fgets), the console already encodes its UTF-16 input buffer using a best-fit encoding to the input codepage. So there's no error in the following example, even though the result is wrong:

    >>> sys.stdin.encoding
    'cp437'
    >>> s = 'Ā'
    >>> s, ord(s)
    ('A', 65)

Jumping back to the codepage 65001 discussion, here's a function to simulate the bad output that Windows Vista and 7 users see:

    def write(text):
        writes = []
        buffer = text.replace('\n', '\r\n').encode('utf-8')
        while buffer:
            decoded = buffer.decode('utf-8', 'replace')
            # Simulate the console bug: advance by the number of decoded
            # characters (standing in for UTF-16 codes) instead of the
            # number of bytes actually written.
            buffer = buffer[len(decoded):]
            writes.append(decoded.replace('\r', '\n'))
        return ''.join(writes)

For example:

    >>> greek = 'αβγδεζηθι\n'
    >>> write(greek)
    'αβγδεζηθι\n\n�ηθι\n\n�\n\n'

It gets worse with characters that require 3 bytes in UTF-8:

    >>> devanagari = 'ऄअआइईउऊऋऌ\n'
    >>> write(devanagari)
    'ऄअआइईउऊऋऌ\n\n�ईउऊऋऌ\n\n��ऋऌ\n\n��\n\n'

This problem doesn't exist in Windows 8+ because the old LPC-based communication (LPC is an undocumented protocol that's used extensively for IPC between Windows subsystems) with the console was rewritten to use a kernel driver (condrv.sys). Now it works like any other device by calling NtReadFile, NtWriteFile, and NtDeviceIoControlFile. Apparently in the rewrite someone fixed the fact that the conhost code that handles WriteFile and WriteConsoleA was incorrectly returning the number of UTF-16 codes written instead of the number of bytes.

Unfortunately the rewrite also broke Ctrl+C handling because ReadFile no longer sets the last error to ERROR_OPERATION_ABORTED when a console read is interrupted by Ctrl+C. I'm surprised so few Windows users have noticed or cared that Ctrl+C kills the REPL and misbehaves with input() in the Windows 8/10 console. The source of the Ctrl+C bug is an incorrect NTSTATUS code STATUS_ALERTED, which should be STATUS_CANCELLED. The console has always done this wrong, but before the rewrite there was common code for ReadFile and ReadConsole that handled STATUS_ALERTED specially. It's still there in ReadConsole, so Ctrl+C handling works fine in Unicode programs that use ReadConsoleW (e.g. cmd.exe, powershell.exe). It also works fine if win_unicode_console is enabled.

Finally, here's a ctypes example in Windows 10.0.10586 that shows the unsolvable problem with non-ASCII input when using codepage 65001:

    import ctypes, msvcrt

    conin = open(r'\\.\CONIN$', 'r+')
    hConin = msvcrt.get_osfhandle(conin.fileno())
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    nread = (ctypes.c_uint * 1)()

ASCII-only input works:

    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spam
    1
    >>> nread[0], buf.value
    (6, b'spam\r\n')

But it returns EOF if "a" is replaced by Greek "α":

    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spαm
    1
    >>> nread[0], buf.value
    (0, b'')

Notice that the read is successful but nread is 0. That signifies EOF. So the REPL will just silently quit as if you entered Ctrl+Z, and input() will raise EOFError. This can't be worked around. The problem is in conhost.exe, which assumes a request for N bytes wants N UTF-16 codes from the input buffer. This can only work with ASCII in UTF-8.
Hello,

I'm on holiday and I'm writing on a phone, so sorry in advance for the short answer.

In short: we should drop support for the bytes API. Just use Unicode on all platforms, especially for filenames.

Sorry, but most of these changes look like very bad ideas. Or maybe I misunderstood something. The Windows bytes APIs are broken in different ways; in short, your proposal is to put another layer on top of them to try to work around the issues. Unicode is complex. Unicode issues are hard to debug. Adding a new layer makes debugging even harder. Is the bug in the input data? In the layer? In the final Windows function?

In my experience on UNIX, the most important part is interoperability with other applications. I understand that Python 2 will speak the ANSI code page but Python 3 will speak UTF-8. I don't understand how that can work. Almost all Windows applications speak the ANSI code page (I'm talking about stdin, stdout, pipes, ...). Do you propose to first try to decode from UTF-8 and fall back on decoding from the ANSI code page? What about encoding? Always encode to UTF-8?

About BOMs: I hate them. Many applications don't understand them. Again, think about Python 2. I vaguely recall that the Unicode standard suggests not using a BOM (I have to check). I recall a bug in gettext: the tool doesn't understand the BOM. When I opened the file in vim, the BOM was invisible (hidden). I had to use hexdump to understand the issue! BOMs introduce issues that are very difficult to debug :-/ I also think they go in the wrong direction in terms of interoperability.

For the Windows console: I played with all the Windows functions, tried all fonts and many code pages. I also read technical blog articles by Microsoft employees. I gave up on this issue. It doesn't seem possible to fully support Unicode in the Windows console (at least the last time I checked). By the way, it seems like the Windows functions have bugs, and code page 65001 fixes a few issues but introduces new ones...

Victor
On 10 August 2016 at 20:16, Steve Dower <steve.dower@python.org> wrote:
So what are your concerns? Suggestions?
Add a new option specific to Windows to switch to UTF-8 everywhere, use BOM, whatever you want, *but* don't change the defaults. IMO the mbcs encoding is the least bad choice for the default. I have an idea of a similar option for UNIX: ignore the user preference (LC_ALL, LC_CTYPE, LANG environment variables) and force UTF-8. It's a common request on UNIX, where UTF-8 is now the encoding of almost all systems, whereas the C library continues to use ASCII when the POSIX locale is used (which occurs in many cases). Perl already has such a utf8 option. Victor
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c;
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and similar, configuration files, all typically contain filenames. Zipfiles (see below). Python is frequently used as a glue language, so presumably receives such file name information as (more or less opaque) bytes objects over IPC channels. These just aren't under OS control, so the assumption will fail.

Supporting Windows users in Japan means dealing with lots of crap produced by standard-oblivious software. Eg, Shift JIS filenames in zipfiles. AFAICT Windows itself never does that, but the majority of zipfiles I get from colleagues have Shift JIS in the directory (and it's the great majority if you assume that people who use ASCII transliterations are doing so because they know that non-Windows-users can't handle Shift JIS file names in zipfiles). So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.

UTF-8 may have penetration in the rest of the world, but the great majority of my Windows-using colleagues in Japan still habitually and by preference use Shift JIS in text files. I suppose that includes files that are used by programs, and thus file names, and probably extends to most Windows users here. I suspect a similar situation holds in China, where AIUI "GB is not just a good idea, it's the law,"[1] and possibly Taiwan (Big 5) and Korea (KSC) as those standards have always provided the benefits of (nearly) universal repertoires[2].
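A concrete illustration of the zipfile case (a sketch; 'archive.zip' stands in for one of those archives): the zip format only flags UTF-8 names, so CPython's zipfile decodes everything else as cp437, and bytes-oriented code ends up re-encoding and guessing:

    import zipfile

    with zipfile.ZipFile('archive.zip') as zf:
        for info in zf.infolist():
            if info.flag_bits & 0x800:      # name is flagged as UTF-8
                name = info.filename
            else:                           # zipfile decoded it as cp437; undo and guess
                name = info.filename.encode('cp437').decode('shift_jis', 'replace')
            print(name)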
and add the requirement that [bytes file names] *must* be encoded with sys.getfilesystemencoding().
To the extent that this *can* work, it *already* works. Trying to enforce a particular encoding will simply break working code that depends on sys.getfilesystemencoding() matching the encoding that other programs use. You have no carrot. These changes enforce an encoding on bytes for Windows APIs but can't do so for data, and so will make file-names- are-just-bytes programmers less happy with Python, not more happy. The exception is the proposed console changes, because there you *do* perform all I/O with OS APIs. But I don't know anything about the Windows console except that nobody seems happy with it.
Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason.
This is even less supportable, because it breaks much code that used to work without specifying an encoding. Refusing to respect the locale preferred encoding would force most Japanese scripters to specify encodings where they currently accept the system default, I suspect. On those occasions when my Windows-using colleagues deliver text files, they are *always* encoded in Shift JIS. University databases that deliver CSV files allow selecting Shift JIS or UTF-8, and most people choose Shift JIS. And so on. In Japan, Shift JIS remains pervasive on Windows. I don't think Japan is special in this, except in the pervasiveness of Shift JIS. For everybody, I think this will impose more loss than benefit.
BOMs are very common on Windows, since the default assumption is nearly always a bad idea.
I agree (since 1990!) that Shift JIS by default is a bad idea, but there's no question that it is still overwhelmingly popular. I suspect UTF-8 signatures are uncommon, too, as most UTF-8 originates on Mac or *nix platforms.
This would match the behavior that the .NET Framework has used for many years - effectively, utf_8_sig on read and utf_8 on write.
But .NET is a framework. It expects to be the world in which programs exist, no? Python is very frequently used as a glue language, and I suspect the analogy fails due to that distinction. Footnotes: [1] Strictly speaking, certain programs must support GB 18030. I don't think it's legally required to be the default encoding. [2] For example, the most restricted Japanese standard, JIS X 0208, includes not only "full-width" versions of ASCII characters, but the full Greek and Cyrillic alphabets, many math symbols, a full line drawing set, and much more besides the native syllabary and Han ideographs. The elderly Chinese GB 2312 not only includes Greek and Cyrillic, and the various symbols, but also the Japanese syllabaries. (And the more recent GB 18030 swallowed Unicode whole.)
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c;
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and similar, configuration files, all typically contain filenames. Zipfiles (see below).
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OSX? This problem isn't unique to Windows.
Python is frequently used as a glue language, so presumably receives such file name information as (more or less opaque) bytes objects over IPC channels.
They *can't* be opaque. Someone has to decide what they mean, and you as the application developer might well have to step up and *be that someone*. If you don't, someone else will decide for you.
These just aren't under OS control, so the assumption will fail.
So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.
The only way to deal with data representing filenames and destined for the filesystem on windows is to convert it, somehow, ultimately to UTF-16-LE. Not doing so is impossible, it's only a question of what layer it happens in. If you convert it using the wrong encoding, you lose. The only way to deal with it on Mac OS X is to convert it to UTF-8. If you don't, you lose. If you convert it using the wrong encoding, you lose.

This proposal embodies an assumption that bytes from unknown sources used as filenames are more likely to be UTF-8 than in the locale ACP (i.e. "mbcs" in pythonspeak, and Shift-JIS in Japan).

Personally, I think the whole edifice is rotten, and choosing one encoding over another isn't a solution; the only solution is to require the application to make a considered decision about what the bytes mean and pass its best effort at converting to a Unicode string to the API. This is true on Windows, it's true on OSX, and I would argue it's pretty close to being true on Linux except in a few very niche cases.

So I think for the filesystem encoding we should stay the course, continuing to print a DeprecationWarning and maybe, just maybe, eventually actually deprecating it. On Windows and OSX, this "glue language" business of shoveling bytes from one place to another without caring what they mean can only last as long as they don't touch the filesystem.
You have no carrot. These changes enforce an encoding on bytes for Windows APIs but can't do so for data, and so will make file-names- are-just-bytes programmers less happy with Python, not more happy.
I think the use case that the proposal has in mind is a file-names-are-just- bytes program (or set of programs) that reads from the filesystem, converts to bytes for a file/network, and then eventually does the reverse - either end may be on windows. Using UTF-8 will allow those to make the round trip (strictly speaking, you may need surrogatepass, and OSX does its weird normalization thing), using any other encoding (except for perhaps GB18030) will not.
Just a heads-up that I've assigned http://bugs.python.org/issue1602 to myself and started a patch for the console changes. Let's move the console discussion back over there. Hopefully it will show up in 3.6.0b1, but if you're prepared to apply a patch and test on Windows, feel free to grab my work so far. There's a lot of "making sure other things aren't broken" left to do. Cheers, Steve
On 13Aug2016 0523, Random832 wrote:
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c;
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and similar, configuration files, all typically contain filenames. Zipfiles (see below).
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OSX? This problem isn't unique to Windows.
Yeah, this is basically my view too. If your path bytes don't come from the filesystem, you need to know the encoding regardless. But it's very reasonable to be able to round-trip. Currently, the following two lines of code can have different behaviour on Windows (i.e. the latter fails to open the file):
    open(os.listdir('.')[-1])
    open(os.listdir(b'.')[-1])
On Windows, the filesystem encoding is inherently Unicode, which means you can't reliably round-trip filenames through the current code page. Changing all of Python to use the Unicode APIs internally and making the bytes encoding utf-8 (or utf-16-le, which would save a conversion) resolves this and doesn't really affect
These just aren't under OS control, so the assumption will fail.
So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.
Even on Japanese Windows, non-UTF-8 file names must be encodable with UTF-16 or they cannot exist on the file system. This moves the encoding boundary into the application, which is where it needed to be anyway for robust software - "correct" path handling still requires decoding to text, and if you know that your source is encoded with the active code page then byte_path.decode('mbcs', 'surrogateescape') is still valid. Cheers, Steve
Random832 writes:
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OSX?
Off topic. See Subject: field.
This proposal embodies an assumption that bytes from unknown sources used as filenames are more likely to be UTF-8 than in the locale ACP
Then it's irrelevant: most bytes are not from "unknown sources", they're from correspondents (or from yourself!) -- and for most users most of the time, those correspondents share the locale encoding with them. At least where I live, they use that encoding frequently.
the only solution is to require the application to make a considered decision
That's not a solution. Code is not written with every decision considered, and it never will be. The (long-run) solution is a la Henry Ford: "you can encode text any way you want, as long as it's UTF-8". Then it won't matter if people ever make considered decisions about encoding! But trying to enforce that instead of letting it evolve naturally (as it is doing) will cause unnecessary pain for Python programmers, and I believe quite a lot of pain. I used to be in the "make them speak UTF-8" camp. But in the 15 years since PEP 263, experience has shown me that mostly it doesn't matter, and that when it does matter, you have to deal with the large variety of encodings anyway -- assuming UTF-8 is not a win. For use cases that can be encoding-agnostic because all cooperating participants share a locale encoding, making them explicitly specify the locale encoding is just a matter of "misery loves company". Please, let's not do things for that reason.
I think the use case that the proposal has in mind is a file-names-are-just-bytes program (or set of programs) that reads from the filesystem, converts to bytes for a file/network, and then eventually does the reverse - either end may be on windows.
You have misspoken somewhere. The programs under discussion do not "convert" input to bytes; they *receive* bytes, either from POSIX APIs or from Windows *A APIs, and use them as is. Unless I am greatly mistaken, Steve simply wants that to work as well on Windows as on POSIX platforms, so that POSIX programmers who do encoding-agnostic programming have one less barrier to supporting their software on Windows. But you'll have to ask Steve to rule on that. Steve
The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes.
The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes.
You get troubles when the filename comes from a file, another application, a registry key, ... which is encoded to CP_ACP. Do you plan to transcode all of this data (decode from CP_ACP, encode back to UTF-8)?
I plan to use only Unicode to interact with the OS and then utf8 within Python if the caller wants bytes. Currently we effectively use Unicode to interact with the OS and then CP_ACP if the caller wants bytes. All the *A APIs just decode strings and call the *W APIs, and encode the return values. I'm proposing that we move the decoding and encoding into Python and make it (nearly) lossless.

In practice, this means all *A APIs are banned within the CPython source, and if we get/need bytes we have to convert to text first using the FS encoding, which will be utf8.
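At the Python level the boundary helpers already exist; under the proposal they would, on Windows, be backed by utf-8 instead of the ANSI code page (a sketch; the bytes are illustrative and whether they round-trip today depends on the active code page):

    import os

    byte_arg = b'caf\xc3\xa9.txt'     # bytes handed to us by a caller
    text = os.fsdecode(byte_arg)      # decode with sys.getfilesystemencoding()
    back = os.fsencode(text)          # ...and encode again at the boundary
    print(text, back == byte_arg)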
Participants (12): Brett Cannon, Chris Angelico, Chris Barker, eryk sun, Nick Coghlan, Paul Moore, Random832, Stephen J. Turnbull, Steve Dower, Steven D'Aprano, tritium-list@sdamon.com, Victor Stinner