Re: [Python-ideas] Fix default encodings on Windows

12 Aug 2016

Hello,

I'm on holiday and writing on a phone, so sorry in advance for the
short answer.

In short: we should drop support for the bytes API. Just use Unicode on all
platforms, especially for filenames.

Sorry, but most of these changes look like very bad ideas. Or maybe I
misunderstood something. The Windows bytes APIs are broken in different
ways. In short, your proposal is to put another layer on top of them to
try to work around the issues.

Unicode is complex. Unicode issues are hard to debug. Adding a new layer
makes debugging even harder. Is the bug in the input data? In the layer? In
the final Windows function?

In my experience on UNIX, the most important part is interoperability
with other applications. I understand that Python 2 will speak the ANSI
code page but Python 3 will speak UTF-8. I don't understand how that can
work. Almost all Windows applications speak the ANSI code page (I'm talking
about stdin, stdout, pipes, ...).

Do you propose to first try to decode from UTF-8 and then fall back to
decoding from the ANSI code page? What about encoding? Always encode to
UTF-8?
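
A minimal sketch of the "UTF-8 first, then fall back" decoding asked about here. The function name decode_with_fallback is hypothetical, and cp1252 is only an assumed stand-in for the ANSI code page. The second case shows why this is fragile: some legacy byte strings are also valid UTF-8, so they decode without error but to different text.

```python
def decode_with_fallback(data, fallback="cp1252"):
    """Try strict UTF-8 first, then fall back to a legacy code page.

    Hypothetical helper; cp1252 is an assumed stand-in for the ANSI code page.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


# "cafe" with an e-acute, as written by a cp1252 application.
# 0xE9 on its own is not valid UTF-8, so the fallback is used.
legacy = b"caf\xe9"
decoded_legacy = decode_with_fallback(legacy)

# Ambiguous input: in cp1252 these two bytes are A-tilde followed by a
# copyright sign, but they are also the valid UTF-8 encoding of e-acute.
# UTF-8 wins silently, so the fallback never gets a chance.
ambiguous = b"\xc3\xa9"
decoded_ambiguous = decode_with_fallback(ambiguous)
```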

About BOMs: I hate them. Many applications don't understand them. Again,
think about Python 2. I vaguely recall that the Unicode standard suggests
not using a BOM (I have to check).

I recall a bug in gettext. The tool doesn't understand a BOM. When I opened
the file in vim, the BOM was invisible (hidden). I had to use hexdump to
understand the issue!

BOMs introduce issues that are very difficult to debug :-/ I also think
that they go in the wrong direction in terms of interoperability.
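
To illustrate why a BOM is easy to miss, here is a small sketch using only the standard codecs module. The sample bytes are made up for illustration. Decoding with plain "utf-8" keeps the BOM as an invisible U+FEFF character at the start of the text, while "utf-8-sig" strips it.

```python
import codecs

# Made-up file content: a UTF-8 BOM followed by one line of text.
raw = codecs.BOM_UTF8 + b'msgid "hello"\n'

# Plain UTF-8 decoding keeps the BOM as U+FEFF, which most editors hide.
as_utf8 = raw.decode("utf-8")

# The utf-8-sig codec removes a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")

# Checking the raw bytes is the reliable way to see it.
has_bom = raw.startswith(codecs.BOM_UTF8)
first_bytes = raw[:3]  # the three bytes EF BB BF
```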

For the Windows console: I played with all the Windows functions, tried all
the fonts and many code pages. I also read technical blog articles by
Microsoft employees. I gave up on this issue. It doesn't seem possible to
fully support Unicode in the Windows console (at least the last time I
checked). By the way, it seems that the Windows functions have bugs, and
code page 65001 fixes a few issues but introduces new ones...

Victor

On 10 Aug 2016 20:16, "Steve Dower" wrote:
...
I suspect there's a lot of discussion to be had around this topic, so I
want to get it started. There are some fairly drastic ideas here and I need
help figuring out whether the impact outweighs the value.
Some background: within the Windows API, the preferred encoding is UTF-16.
This is a 16-bit format that is typed as wchar_t in the APIs that use it.
These APIs are generally referred to as the *W APIs (because they have a W
suffix).
There are also (broadly deprecated) APIs that use an 8-bit format (char),
where the encoding is assumed to be "the user's active code page". These
are *A APIs. AFAIK, there are no cases where a *A API should be preferred
over a *W API, and many newer APIs are *W only.
In general, Python passes byte strings into the *A APIs and text strings
into the *W APIs.
Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which
translates to "the system's active code page". As this encoding generally
cannot represent all paths on Windows, it is deprecated and Unicode strings
are recommended instead. This, however, means you need to write
significantly different code between POSIX (use bytes) and Windows (use
text).
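
As a rough illustration of the bytes/text split described above, the standard os.fsencode() and os.fsdecode() helpers convert between the two forms using sys.getfilesystemencoding(). The file name below is a made-up ASCII example, chosen so that it round-trips under any filesystem encoding.

```python
import os
import sys

# "mbcs" on Windows at the time of this thread; usually "utf-8" on POSIX.
fs_encoding = sys.getfilesystemencoding()

name = "report.txt"                    # made-up ASCII name, safe everywhere
as_bytes = os.fsencode(name)           # text -> bytes via the fs encoding
round_tripped = os.fsdecode(as_bytes)  # bytes -> text via the fs encoding

# The type of the argument decides the type of the results.
text_entries = os.listdir(os.curdir)
bytes_entries = os.listdir(os.fsencode(os.curdir))
```
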
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and
updating path_converter() (Python/posixmodule.c; likely similar code in
other places) to decode incoming byte strings would allow us to undeprecate
byte strings and add the requirement that they *must* be encoded with
sys.getfilesystemencoding(). I assume that this would allow cross-platform
code to handle paths similarly by encoding to whatever the sys module says
they should and using bytes consistently (starting this thread is meant to
validate/refute my assumption).
(Yes, I know that people on POSIX should just change to using Unicode and
surrogateescape. Unfortunately, rather than doing that they complain about
Windows and drop support for the platform. If you want to keep hitting them
with the stick, go ahead, but I'm inclined to think the carrot is more
valuable here.)
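
For readers unfamiliar with surrogateescape, here is a short sketch of the round trip it provides. The byte string is a made-up file name that is not valid UTF-8.

```python
# Made-up file name in Latin-1; 0xE9 alone is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9) instead of
# raising UnicodeDecodeError.
text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```
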
Similarly, locale.getpreferredencoding() on Windows returns a legacy value
- the user's active code page - which should generally not be used for any
reason. The one exception is as a default encoding for opening files when
no other information is available (e.g. a Unicode BOM or explicit encoding
argument). BOMs are very common on Windows, since the default assumption is
nearly always a bad idea.
Making open()'s default encoding detect a BOM before falling back to
locale.getpreferredencoding() would resolve many issues, but I'm also
inclined towards making the fallback utf-8, leaving
locale.getpreferredencoding() solely as a way to get the active system
codepage (with suitable warnings about it only being useful for
back-compat). This would match the behavior that the .NET Framework has
used for many years - effectively, utf_8_sig on read and utf_8 on write.
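
A minimal sketch of the "check for a BOM, else fall back" behaviour described above. This is not how open() is implemented; detect_encoding and read_text are hypothetical helper names, and the UTF-8 fallback is the assumption being proposed in this message.

```python
import codecs
import os
import tempfile


def detect_encoding(path, fallback="utf-8"):
    """Return a codec name based on a leading BOM, else the fallback."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM starts with the
    # UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return fallback


def read_text(path, fallback="utf-8"):
    """Read a whole text file, honouring a BOM if one is present."""
    with open(path, encoding=detect_encoding(path, fallback)) as f:
        return f.read()


# Usage: write a file with a UTF-8 BOM and read it back without the BOM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write("hello")
    detected = detect_encoding(path)
    content = read_text(path)
```

The "utf-16" and "utf-32" codecs read the BOM themselves to pick the byte order, so the helper only needs to return the family name.
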
Finally, the encodings of stdin, stdout and stderr are currently
(correctly) inferred from the encoding of the console window that Python is
attached to. However, this is typically a codepage that is different from
the system codepage (i.e. it's not mbcs) and is almost certainly not
Unicode. If users are starting Python from a console, they can use "chcp
65001" first to switch to UTF-8, and then *most* functionality works
(input() has some issues, but those can be fixed with a slight rewrite and
possibly breaking readline hooks).
It is also possible for Python to change the current console encoding to
be UTF-8 on initialize and change it back on finalize. (This would leave
the console in an unexpected state if Python segfaults, but console
encoding is probably the least of anyone's worries at that point.) So I'm
proposing actively changing the current console to be Unicode while Python
is running, and hence sys.std[in|out|err] will default to utf-8.
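
The "switch on initialize, restore on finalize" idea can be sketched from Python with ctypes and the Win32 console code page functions (GetConsoleCP, SetConsoleCP, GetConsoleOutputCP, SetConsoleOutputCP). This is only an illustration under assumptions, not the proposed interpreter change: utf8_console is a hypothetical name, it does nothing off Windows or when no console is attached, and it does not re-wrap sys.stdin/stdout/stderr.

```python
import contextlib
import sys

CP_UTF8 = 65001  # the code page selected by "chcp 65001"


@contextlib.contextmanager
def utf8_console():
    """Temporarily switch the attached console to code page 65001.

    Hypothetical helper. Yields the saved (input, output) code pages,
    or None when there is nothing to switch.
    """
    if sys.platform != "win32":
        yield None  # not on Windows: nothing to do
        return

    import ctypes

    kernel32 = ctypes.windll.kernel32
    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if not old_in or not old_out:
        yield None  # 0 means the call failed, e.g. no console attached
        return

    kernel32.SetConsoleCP(CP_UTF8)
    kernel32.SetConsoleOutputCP(CP_UTF8)
    try:
        yield (old_in, old_out)
    finally:
        # Restore the previous code pages, even after an exception.
        # A hard crash would skip this, as noted above.
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)


with utf8_console() as saved:
    inside = True  # console I/O would happen here
```
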
So that's a broad range of changes, and I have little hope of figuring out
all the possible issues, back-compat risks, and flow-on effects on my own.
Please let me know (either on-list or off-list) how a change like this
would affect your projects, either positively or negatively, and whether
you have any specific experience with these changes/fixes and think they
should be approached differently.
To summarise the proposals (remembering that these would only affect
Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else
use locale.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize
So what are your concerns? Suggestions?
Thanks,
Steve

Re: [Python-ideas] Fix default encodings on Windows

Victor Stinner