On Thu, Aug 11, 2016 at 6:09 AM, Random832 email@example.com wrote:
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Why? What's the use case? [byte paths]
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8?
AIUI, the data flow would be: Python bytes object -> decode to Unicode text -> encode to UTF-16 -> Windows API. If you do the first transformation using mbcs, you're guaranteed *some* result (all Windows codepages have definitions for all byte values, if I'm not mistaken), but a hard-to-predict one - and worse, one that can change based on system settings. Also, if someone naively types "bytepath.decode()", Python will default to UTF-8, *not* to the system codepage.
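A quick illustration of how the same bytes diverge under the two decodings (using cp1252 as a stand-in for a Windows ANSI codepage, since the "mbcs" codec only exists on Windows):

```python
raw = b'\xc3\xa9'  # the UTF-8 encoding of 'é'

print(raw.decode('utf-8'))    # 'é' - round-trips correctly
print(raw.decode('cp1252'))   # 'Ã©' - *some* result, but mojibake, and it
                              # would differ again under cp932, cp1251, ...
print(raw.decode())           # Python's default decode is UTF-8,
                              # not the system codepage
```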
I'd rather have a single consistent default encoding.
What about only issuing the deprecation warning if non-ASCII bytes are present in the value?
-1. Data-dependent warnings just strengthen the feeling that "weird characters" keep breaking your programs, instead of encouraging you to write your program to cope with all characters equally. It's like being racist against non-ASCII characters :)
On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower firstname.lastname@example.org wrote:
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
- change sys.getfilesystemencoding() to return 'utf-8'
- automatically decode byte paths assuming they are utf-8
- remove the deprecation warning on byte paths
+1 on these.
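For reference, the round-trip these proposals affect is the os.fsencode()/os.fsdecode() pair, both of which consult sys.getfilesystemencoding() (plus its associated error handler), so changing that return value on Windows changes what byte paths mean:

```python
import os
import sys

path = 'caf\u00e9.txt'

# fsencode/fsdecode use sys.getfilesystemencoding(); today that is
# 'mbcs' on Windows, and the proposal makes it 'utf-8' instead.
encoded = os.fsencode(path)
assert os.fsdecode(encoded) == path

print(sys.getfilesystemencoding(), encoded)
```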
- make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default? An explicit "utf-sig"-style codec could do full BOM detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE text that starts with U+0000? Do you say "UTF-32 is rare, so we'll assume UTF-16", or do you say "files starting with U+0000 are rare, so we'll assume UTF-32"?
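Concretely, the ambiguity is that the UTF-32LE BOM *begins with* the UTF-16LE BOM, so the same four bytes decode cleanly either way:

```python
data = b'\xff\xfe\x00\x00'

# Read as UTF-32, these four bytes are just a BOM: an empty file.
print(repr(data.decode('utf-32')))   # ''

# Read as UTF-16, they are a BOM followed by U+0000.
print(repr(data.decode('utf-16')))   # '\x00'
```

No BOM sniffer can tell these apart without a policy decision.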
- [ALTERNATIVE] make the default open() encoding check for a BOM or else use
-1. Same concerns as the above, plus I'd rather use the saner default.
- force the console encoding to UTF-8 on initialize and revert on finalize
-0 for Python itself; +1 for Python's interactive interpreter. Programs that mess with console settings get annoying when they crash out and don't revert properly. Unless there is *no way* that you could externally kill the process without also bringing the terminal down, there's the distinct possibility of messing everything up.
Would it be possible to have a "sys.setconsoleutf8()" that changes the console encoding and slaps in an atexit() to revert? That would at least leave it in the hands of the app.
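A rough sketch of what that helper could look like (the name set_console_utf8 is hypothetical; the Win32 calls GetConsoleCP/SetConsoleCP/GetConsoleOutputCP/SetConsoleOutputCP are real kernel32 functions, reached here via ctypes):

```python
import atexit
import ctypes
import sys

def set_console_utf8():
    """Hypothetical helper: switch the Windows console to UTF-8
    (code page 65001) and register an atexit hook that restores
    the original code pages. Returns True if the switch happened,
    False on non-Windows platforms."""
    if sys.platform != 'win32':
        return False
    kernel32 = ctypes.windll.kernel32
    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()

    def restore():
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)

    kernel32.SetConsoleCP(65001)
    kernel32.SetConsoleOutputCP(65001)
    atexit.register(restore)
    return True
```

The atexit hook covers normal interpreter shutdown, though as noted above it can't help if the process is killed outright.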
Overall I'm +1 on shifting from eight-bit encodings to UTF-8. Don't be held back by what Notepad does.