Support WHATWG versions of legacy encodings
Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings.

There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.

You can see the character table for this encoding at: https://encoding.spec.whatwg.org/index-windows-1252.txt

For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined:
>>> b'\x90'.decode('windows-1252')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.

This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:

- It's compatible with windows-1252
- Any sequence of bytes can be round-tripped through it without losing information

It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874.

Support for these encodings matters to me, in part, because I maintain a Unicode data-cleaning library, "ftfy". One thing it does is to detect and undo encoding/decoding errors that cause mojibake, as long as they're detectable and reversible. Looking at real-world examples of text that has been damaged by mojibake, it's clear that lots of text is transferred through what I'm calling the "web-1252" encoding, in a way that's incompatible with Python's "windows-1252".

In order to be able to work with and fix this kind of text, ftfy registers new codecs -- and I implemented this even before I knew that they were standardized in Web browsers. When ftfy is imported, you can decode text as "sloppy-windows-1252" (the name I chose for this encoding), for example.

ftfy can tell people a sequence of steps that they can use in the future to fix text that's like the text they provided. Very often, these steps require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which means the steps only work with ftfy imported, even for people who are not using the features of ftfy.

Support for these encodings also seems highly relevant to people who use Python for web scraping, as it would be desirable to maximize compatibility with what a Web browser would do.

This really seems like it belongs in the standard library instead of being an incidental feature of my library. I know that code in the standard library has "one foot in the grave". I _want_ these legacy encodings to have one foot in the grave. But some of them are extremely common, and Python code should be able to deal with them.

Adding these encodings to Python would be straightforward to implement. Does this require a PEP, a pull request, or further discussion?
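For concreteness, a minimal sketch of what such a codec could look like: it fills the five bytes that Python's cp1252 table leaves undefined with the C1 control characters of the same number and registers the result under the placeholder name "web-1252" used in this discussion (not an existing stdlib codec; the registration is purely illustrative).

import codecs
from encodings import cp1252

# Replace the undefined slots (marked with U+FFFE) in Python's cp1252
# decoding table with the control character whose codepoint equals the byte.
decoding_table = ''.join(
    chr(i) if ch == '\ufffe' else ch
    for i, ch in enumerate(cp1252.decoding_table)
)
encoding_table = codecs.charmap_build(decoding_table)

def decode(data, errors='strict'):
    return codecs.charmap_decode(data, errors, decoding_table)

def encode(text, errors='strict'):
    return codecs.charmap_encode(text, errors, encoding_table)

def search(name):
    if name in ('web-1252', 'web_1252'):
        return codecs.CodecInfo(encode, decode, name='web-1252')
    return None

codecs.register(search)

assert b'\x90'.decode('web-1252') == '\u0090'
# every possible byte sequence round-trips without loss
assert bytes(range(256)).decode('web-1252').encode('web-1252') == bytes(range(256))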
First of all, many thanks for such an excellently written letter. It was a real pleasure to read.

On 10.01.2018 0:15, Rob Speer wrote:

In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.

According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same: "According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx> maps these to the corresponding C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."

And in ISO-8859-1, the same handling is done for unused code points even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ): "ISO-8859-1 is the IANA <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> preferred name for this standard when supplemented with the C0 and C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>."

And what would you think -- these "C1 control codes" are also the corresponding Unicode code points! ( https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )

Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones.
-- Regards, Ivan
Oh that's interesting. So it seems to be Python that's the exception here. Would we really be able to add entries to character mappings that haven't changed since Python 2.0? On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < python-ideas@python.org> wrote:
On 10 January 2018 at 09:56, Rob Speer <rspeer@luminoso.com> wrote:
Oh that's interesting. So it seems to be Python that's the exception here.
Would we really be able to add entries to character mappings that haven't changed since Python 2.0?
Changing things that used to cause an exception into operations that produce a useful result is generally OK - it's going the other way (dubious output -> exception) that's always problematic.

So as long as the Windows specialists give it a +1, updating the existing codecs to match the MultiByteToWideChar behaviour seems like a better option to me than offering multiple versions of the codecs (and that could then be done as a tracker enhancement request along the lines of "Make the windows-* text encodings match MultiByteToWideChar").

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
One other thing I've noticed that's related to the WHATWG encoding list: in Python, the encoding name "windows-874" seems to be missing. The _encoding_ is there, as "cp874", but "windows-874" doesn't work as an alias for it the way that "windows-1252" works as an alias for "cp1252". That alias should be added, right? On Tue, 9 Jan 2018 at 21:46 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 10 January 2018 at 13:56, Rob Speer <rspeer@luminoso.com> wrote:
One other thing I've noticed that's related to the WHATWG encoding list: in Python, the encoding name "windows-874" seems to be missing. The _encoding_ is there, as "cp874", but "windows-874" doesn't work as an alias for it the way that "windows-1252" works as an alias for "cp1252". That alias should be added, right?
Aye, that would make sense. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 10 January 2018 at 04:16, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 10 January 2018 at 13:56, Rob Speer <rspeer@luminoso.com> wrote:
One other thing I've noticed that's related to the WHATWG encoding list: in Python, the encoding name "windows-874" seems to be missing. The _encoding_ is there, as "cp874", but "windows-874" doesn't work as an alias for it the way that "windows-1252" works as an alias for "cp1252". That alias should be added, right?
Aye, that would make sense.
Agreed - extending the encodings and adding the alias both sound like reasonable enhancements to me. Paul
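A small illustration of the gap, assuming a Python version from around the time of this thread (before any such alias was added):

import codecs
from encodings.aliases import aliases

print(codecs.lookup('windows-1252').name)  # cp1252 -- the alias resolves
print(aliases.get('windows_1252'))         # 'cp1252'

# The cp874 codec exists, but nothing maps "windows-874" to it, so looking
# up the Windows-style name fails on versions without the alias.
print(codecs.lookup('cp874').name)         # cp874
print(aliases.get('windows_874'))          # None
# codecs.lookup('windows-874')             # LookupError: unknown encoding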
On 10.01.2018 00:56, Rob Speer wrote:
Oh that's interesting. So it seems to be Python that's the exception here.
Would we really be able to add entries to character mappings that haven't changed since Python 2.0?
The Windows mappings in Python come directly from the Unicode Consortium mapping files. If the Consortium changes the mappings, we can update them.

If not, then we have a problem, since consumers are not only the win32 APIs, but also other tools out there running on completely different platforms, e.g. Java tools or web servers providing downloads using the Windows code page encodings.

Allowing such mappings in the existing codecs would then result in failures when the "other" sides see the decoded Unicode version and try to encode back into the original encoding - you'd move the problem from the Python side to the "other" side of the integration.

I had a look on the Unicode FTP site and they have since added a new directory with mapping files they call "best fit":

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme....

WideCharToMultiByte() defaults to best fit, but also offers a mode where it operates in standards compliant mode:

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...

See flag WC_NO_BEST_FIT_CHARS.

Unicode TR#22 is also clear on this:

https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned

It allows such best fit mappings to make encodings round-trip safe, but requires keeping these separate from the original standard mappings:

"""
It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the use of hex escape sequences (NCRs) to preserve round-trip integrity; use of fallback characters in that case corrupts the data.
"""

If you read the above section in TR#22 you quickly get reminded of what the Unicode error handlers do (we basically implement the three modes it mentions... raise, ignore, replace).

Now, for unmapped sequences an error handler can opt for using a fallback sequence instead. So in addition to adding best fit codecs, there's also the option to add an error handler for best fit resolution of unmapped sequences.

Given the above, I don't think we ought to change the existing standards compliant mappings, but use one of two solutions:

a) add "best fit" encodings (see the Unicode FTP site for a list)

b) add a "bestfit" error handler which implements the fallback modes for the encodings in question
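Option b) would essentially be a decode-time fallback registered with codecs.register_error. The sketch below only shows the mechanism, using a made-up handler name ("c1fallback") and the simplest possible fallback (the C1 control character with the same number as the offending byte); a real "bestfit" handler would instead consult the Unicode best-fit mapping files.

import codecs

def c1_fallback(error):
    if isinstance(error, UnicodeDecodeError):
        byte = error.object[error.start]
        # fall back to the control character with the same codepoint number
        return chr(byte), error.start + 1
    raise error

codecs.register_error('c1fallback', c1_fallback)

assert b'\x90'.decode('windows-1252', errors='c1fallback') == '\u0090'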
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 10 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding.

"Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also does things such as mapping Cyrillic letters to Latin letters that look like them.

This is not what I'm interested in implementing. I just want there to be encodings that match the WHATWG encodings exactly. If they have to be given a different name, that's fine with me.

On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <mal@egenix.com> wrote:
On 10.01.2018 19:36, Rob Speer wrote:
I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding.

"Best fit" is what the Windows API is implementing.

I don't believe it's a good strategy to create the confusion that WHATWG is introducing by using the same names for non-standard encodings.

Python uses the Unicode Consortium standard encodings or otherwise internationally standardized ones for the stdlib. If someone wants to use different encodings, it's easily possible to pip install these as necessary.

For the stdlib, I think we should stick to standards and not go for spreading non-standard ones. So -1 on adding WHATWG encodings to the stdlib.

We could add encodings from the Unicode Best Fit mappings and call them e.g. "bestfit1252" as is done by the Unicode Consortium. They may not be the same as what the WHATWG defines, but serve a very similar purpose and match what is implemented by the Windows API.

Adding valid new aliases is a different matter. As long as the aliases do map to the same encodings, those are perfectly fine to add.

Thanks.
"Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also does things such as mapping Cyrillic letters to Latin letters that look like them.
This is not what I'm interested in implementing. I just want there to be encodings that match the WHATWG encodings exactly. If they have to be given a different name, that's fine with me.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 10 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
On 11 January 2018 at 05:04, M.-A. Lemburg <mal@egenix.com> wrote:
For the stdlib, I think we should stick to standards and not go for spreading non-standard ones.
So -1 on adding WHATWG encodings to the stdlib.
We already support HTML5 in the standard library, and saying "We'll accept WHATWG's definition of HTML, but not their associated text encodings" seems like a strange place to draw a line when it comes to standards support.

I do think your observation constitutes a compelling reason to leave the existing codecs alone though, and treat the web codecs as a distinct set of mappings. Given that, I think Rob's original suggestion of using "web-1252" et al is a good one.

We can also separate them out in the documentation, such that we have three tables:

* https://docs.python.org/3/library/codecs.html#standard-encodings (Unicode Consortium)
* https://docs.python.org/3/library/codecs.html#python-specific-encodings (python-dev/PSF)
* a new table for WHATWG encodings

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 11.01.2018 01:22, Nick Coghlan wrote:
On 11 January 2018 at 05:04, M.-A. Lemburg <mal@egenix.com> wrote:
For the stdlib, I think we should stick to standards and not go for spreading non-standard ones.
So -1 on adding WHATWG encodings to the stdlib.
We already support HTML5 in the standard library, and saying "We'll accept WHATWG's definition of HTML, but not their associated text encodings" seems like a strange place to draw a line when it comes to standards support.
There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data. Do you really think it's a good idea to support this natively in Python?

The other problem is that WHATWG considers its documents "living standards", i.e. they are subject to change and don't come with a version number (apart from a date). This makes sense when you look at their mostly decoding-only nature, but, again for encoding, creates an interoperability problem.
I do think your observation constitutes a compelling reason to leave the existing codecs alone though, and treat the web codecs as a distinct set of mappings. Given that, I think Rob's original suggestion of using "web-1252" et al is a good one.
We can also separate them out in the documentation, such that we have three tables:
* https://docs.python.org/3/library/codecs.html#standard-encodings (Unicode Consortium) * https://docs.python.org/3/library/codecs.html#python-specific-encodings (python-dev/PSF) * a new table for WHATWG encodings
Cheers, Nick.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 11 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg <mal@egenix.com> wrote:
There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data.
Do you really think it's a good idea to support this natively in Python?
The other problem is that WHATWG considers its documents "living standards", i.e. they are subject to change and don't come with a version number (apart from a date).
This makes sense when you look at their mostly decoding-only nature, but, again for encoding, creates an interoperability problem.
Would it be viable to have them in the stdlib for decoding only? To have them simply not work for encoding? ChrisA
On 11.01.2018 10:01, Chris Angelico wrote:
Would it be viable to have them in the stdlib for decoding only? To have them simply not work for encoding?
That would be possible and resolve the above issues I have with the encodings. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 11 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data.
Is it really corrupted?
Do you really think it's a good idea to support this natively in Python?
The problem is, that's ignoring the very real fact that this is, and has always been*, the behavior of the native encodings built in to Windows. My opinion is that Microsoft, for whatever reason, misrepresented their encodings when they submitted them to Unicode. The native APIs for text conversion have mechanisms for error reporting, and these supposedly undefined characters do not trigger them as they do for e.g. CP932 0xA0.

Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings being discussed here). If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still returns U+0081.

As far as the actual encoding implemented in Windows is concerned, CP1252's 0x81 -> U+0081 mapping is a wholly valid one (though undocumented), and not in any way a fallback or a "best fit" or an invalid character.

* except for the addition of the Euro sign to each encoding, typically at 0x80, circa 1998.

** It's worth mentioning that our cp932 returns U+F8F0, even with errors='strict', despite this not being present in the published Unicode mapping. It has done this at least since the CJKCodecs change in 2004. I can't determine where (or if) it was implemented at all before that.
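A rough, Windows-only sketch of how this can be checked from Python with ctypes; the flag value and call follow the Win32 documentation, and the results noted in the comments are the ones described above rather than guaranteed output:

import ctypes

MB_ERR_INVALID_CHARS = 0x08  # fail on invalid bytes instead of best-fitting

def probe(codepage, raw):
    kernel32 = ctypes.windll.kernel32
    buf = ctypes.create_unicode_buffer(len(raw) + 1)
    n = kernel32.MultiByteToWideChar(codepage, MB_ERR_INVALID_CHARS,
                                     raw, len(raw), buf, len(buf))
    return buf[:n] if n else None  # None means the conversion was rejected

print(ascii(probe(1252, b'\x81')))  # expected per the above: '\x81' (U+0081)
print(ascii(probe(932, b'\xa0')))   # expected per the above: None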
The question is rather: how often does web-XXX mojibake happen?
Very often. Particularly web-1252 mixed up with UTF-8.

My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-8 and decoded as codepage 1252. In Python's official windows-1252, this would at best be "â€�", using the 'replace' error handler. In web-1252, this would be "â€\x9d". The web-1252 version is more common.

Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "â€�" is your code crashing.

On Thu, 11 Jan 2018 at 12:20 Random832 <random832@fastmail.com> wrote:
On 2018-01-11 19:42, Rob Speer wrote:
The question is rather: how often does web-XXX mojibake happen?
Very often. Particularly web-1252 mixed up with UTF-8.
My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-8 and decoded as codepage 1252. In Python's official windows-1252, this would at best be "â€�", using the 'replace' error handler. In web-1252, this would be "â€\x9d". The web-1252 version is more common.
Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "�" is your code crashing.
FWIW, I've occasionally seen that kind of mojibake on the news ticker of the BBC News channel. :-( [snip]
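(For concreteness, the curly-quote example above as runnable Python. The 'sloppy-windows-1252' name assumes ftfy is installed; per the earlier description, importing ftfy registers it -- it is not a stdlib codec today.)

data = '\u201d'.encode('utf-8')                 # RIGHT DOUBLE QUOTATION MARK as UTF-8: b'\xe2\x80\x9d'

try:
    data.decode('windows-1252')                 # 0x9d is undefined in Python's cp1252
except UnicodeDecodeError as exc:
    print(exc)

print(repr(data.decode('windows-1252', 'replace')))   # 'â€\ufffd', i.e. the "â€�" form

import ftfy  # noqa: F401 -- importing registers the sloppy-* codecs
print(repr(data.decode('sloppy-windows-1252')))        # 'â€\x9d', the web-1252 form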
On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg <mal@egenix.com> wrote:
I don't believe it's a good strategy to create the confusion that WHATWG is introducing by using the same names for non-standard encodings.
agreed.
Python uses the Unicode Consortium standard encodings or otherwise internationally standardized ones for the stdlib.
If someone wants to use different encodings, it's easily possible to pip install these as necessary.
For the stdlib, I think we should stick to standards and not go for spreading non-standard ones.
So -1 on adding WHATWG encodings to the stdlib.
If the OP is right that it is one of the most widely used encodings in the world, it's kinda hard to call it "non-standard". I think practicality beats purity here -- if the WHATWG encoding(s) are clearly defined, widely used, and the names don't conflict with other standard encodings, then it seems like a very good addition to the stdlib.
So +1 -- provided that the proposed encoding(s) is "clearly defined, widely used, and the name doesn't conflict with other standard encodings".
-CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, 10 Jan 2018 16:24:33 -0800 Chris Barker <chris.barker@noaa.gov> wrote:
If the OP is right that it is one of the most widely used encodings in the world, it's kinda hard to call it "non-standard"
Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: how often does web-XXX mojibake happen? We're well in the 2010s now and you'd hope that mojibake doesn't happen as often as it used to in, e.g., 1998. Regards Antoine.
I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python.
Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze.
I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web.
I'm really not interested in best-fit mappings that turn infinity into "8" and square roots into "v". Making weird mappings like that sounds like a job for the "unidecode" library, not the stdlib.
On Wed, 10 Jan 2018 at 13:36 Rob Speer <rspeer@luminoso.com> wrote:
I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding.
"Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also does things such as mapping Cyrillic letters to Latin letters that look like them.
This is not what I'm interested in implementing. I just want there to be encodings that match the WHATWG encodings exactly. If they have to be given a different name, that's fine with me.
On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <mal@egenix.com> wrote:
On 10.01.2018 00:56, Rob Speer wrote:
Oh that's interesting. So it seems to be Python that's the exception here.
Would we really be able to add entries to character mappings that haven't changed since Python 2.0?
The Windows mappings in Python come directly from the Unicode Consortium mapping files.
If the Consortium changes the mappings, we can update them.
If not, then we have a problem, since consumers are not only the win32 APIs, but also other tools out there running on completely different platforms, e.g. Java tools or web servers providing downloads using the Windows code page encodings.
Allowing such mappings in the existing codecs would then result in failures when the "other" sides see the decoded Unicode version and try to encode back into the original encoding - you'd move the problem from the Python side to the "other" side of the integration.
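(Concretely, the failure mode being described; latin-1 stands in here for a web-1252-style decoder, since it also maps byte 0x90 to U+0090, and the "other side" is a strict windows-1252 encoder.)

text = b'price \x90 100'.decode('latin-1')   # U+0090 survives on the Python side
text.encode('windows-1252')                   # raises UnicodeEncodeError on the "other" side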
I had a look on the Unicode FTP site and they have since added a new directory with mapping files they call "best fit":
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme....
The WideCharToMultiByte() defaults to best fit, but also offers a mode where it operates in standards compliant mode:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...
See flag WC_NO_BEST_FIT_CHARS.
Unicode TR#22 is also clear on this:
https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
It allows such best fit mappings to make encodings round-trip safe, but requires to keep these separate from the original standard mappings:
""" It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the use of hex escape sequences (NCRs) to preserve round-trip integrity; use of fallback characters in that case corrupts the data. """
If you read the above section in TR#22 you quickly get reminded of what the Unicode error handlers do (we basically implement the three modes it mentions... raise, ignore, replace).
Now, for unmapped sequences an error handler can opt for using a fallback sequence instead.
So in addition to adding best fit codecs, there's also the option to add an error handler for best fit resolution of unmapped sequences.
Given the above, I don't think we ought to change the existing standards compliant mappings, but use one of two solutions:
a) add "best fit" encodings (see the Unicode FTP site for a list)
b) add an error handler "bestfit" which implements the fallback modes for the encodings in question
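(A rough sketch of what option (b) could look like for the decoding direction only; the handler name is made up, and an encoding-side "bestfit" handler would additionally need the separate best-fit tables.)

import codecs

def c1_fallback(exc):
    # Map each unmappable byte to the code point with the same number,
    # i.e. the corresponding C1 control character (decode only).
    if isinstance(exc, UnicodeDecodeError):
        return ''.join(chr(b) for b in exc.object[exc.start:exc.end]), exc.end
    raise exc

codecs.register_error('fallback-c1', c1_fallback)

print(repr(b'\x90 curly \x9d'.decode('windows-1252', errors='fallback-c1')))  # '\x90 curly \x9d'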
On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < python-ideas@python.org> wrote:
First of all, many thanks for such an excellently written letter. It was a real pleasure to read. On 10.01.2018 0:15, Rob Speer wrote:
Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings.
There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at: https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined:
> b'\x90'.decode('windows-1252') UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same:
"According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar < http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%2...
maps these to the corresponding C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>." And in ISO-8859-1, the same handling is done for unused code points even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) :
"*ISO-8859-1* is the IANA <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> preferred name for this standard when supplemented with the C0 and C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>" And what would you think -- these "C1 control codes" are also the corresponding Unicode points! ( https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones.
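(Python's own iso-8859-1 codec already behaves the way the IANA definition quoted above describes: every byte, including the C1 range, decodes to the code point with the same number.)

>>> b'\x90'.decode('iso-8859-1')
'\x90'
>>> bytes(range(256)).decode('iso-8859-1') == ''.join(map(chr, range(256)))
True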
-- Regards, Ivan
On 10/01/2018 19:13, Rob Speer wrote:
I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python.
Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze.
I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web.
Please note that the WHATWG standard describes Windows-1252 as a "Legacy Single Byte Encoding", and to me the name suggests it is expected to be implemented on Windows platforms and for Windows-specific Web pages. THE encoding, i.e. the standard that all browsers and other web applications are expected to adhere to, is UTF-8. I am somewhat confused because, according to https://encoding.spec.whatwg.org/index-windows-1252.txt, 0x90 (one of the original examples) is undefined, as the table only runs to 127, i.e. 0x7F.
-- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.
On Thu, Jan 11, 2018 at 6:44 AM, Steve Barnes <gadgetsteve@live.co.uk> wrote:
I am somewhat confused because according to https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the original examples) is undefined as the table only runs to 127 i.e. 0x7F.
AIUI the table in that file assumes that the first 128 bytes are interpreted as per ASCII. So you're looking at the *next* 128 bytes, and line 16 is the one that handles byte 0x90. ChrisA
On Wed, Jan 10, 2018, at 14:44, Steve Barnes wrote:
I am somewhat confused because according to https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the original examples) is undefined as the table only runs to 127 i.e. 0x7F.
The spec referenced in the comments says "Let code point be the index code point for byte − 0x80 in index single-byte."
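(A sketch of how that index maps to a full 256-entry decode table, following the "byte - 0x80" rule quoted above. It assumes the index file's whitespace-separated "pointer code-point" lines with '#' comments, and leaves fetching the file out.)

def build_decode_table(index_lines):
    # Bytes 0x00-0x7F are ASCII; byte b >= 0x80 uses the index entry for
    # pointer b - 0x80 (the windows-1252 index defines all 128 pointers).
    high = {}
    for line in index_lines:
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        pointer, codepoint = line.split()[:2]
        high[int(pointer)] = chr(int(codepoint, 16))
    return ''.join(chr(b) if b < 0x80 else high[b - 0x80] for b in range(0x100))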
On 10.01.2018 20:13, Rob Speer wrote:
I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python.
Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze.
You can have a function in the package which registers the codecs. That way you do have a call into the library and intuition is restored :-) (and flake8 should be happy as well):
import mycodecs
mycodecs.register()
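(A minimal sketch of such a module, with made-up names, registering a charmap codec under the non-conflicting name "web-1252". The table is built from Python's own cp1252 table with the five undefined bytes mapped to the matching C1 code points; charmap_build is the same undocumented helper the stdlib charmap codecs use.)

import codecs

_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
_decode_table = ''.join(
    chr(b) if b in _UNDEFINED else bytes([b]).decode('cp1252')
    for b in range(256)
)
_encode_table = codecs.charmap_build(_decode_table)

def _search(name):
    # Only answer for the hypothetical "web-1252" name.
    if name.replace('_', '-') != 'web-1252':
        return None
    return codecs.CodecInfo(
        name='web-1252',
        encode=lambda text, errors='strict': codecs.charmap_encode(text, errors, _encode_table),
        decode=lambda data, errors='strict': codecs.charmap_decode(data, errors, _decode_table),
    )

def register():
    codecs.register(_search)

After register() is called, b'\x90'.decode('web-1252') gives '\u0090', and '\u0090'.encode('web-1252') gives b'\x90' back.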
I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web.
I don't really want to get into a discussion here. Browsers use these modified encodings to cope with mojibake or web content which isn't quite standard compliant. That's a valid use case, but promoting such wrong use by having work-around encodings in the stdlib and having Python produce non-standard output doesn't strike me as a good way forward. We do have error handlers for dealing with partially corrupted data. I think that's good enough.
I'm really not interested in best-fit mappings that turn infinity into "8" and square roots into "v". Making weird mappings like that sounds like a job for the "unidecode" library, not the stdlib.
Well, one of your main arguments was that the Windows API follows these best fit encodings. I agree that best fit may not necessarily be best fit for everyone :-)
Well, one of your main arguments was that the Windows API follows these best fit encodings.
No, that wasn't me, that was Ivan. My argument has been based on compatibility with Web technologies; I wanted these encodings before I knew what Windows did (and now what Windows does kind of horrifies me).
Calling a register() function makes flake8 happy, at the cost of convenience, but it still has no clear connection to the place where you use the registered encodings.
On 10.01.2018 20:13, Rob Speer wrote:
I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python.
Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze.
You can have a function in the package which registers the codecs. That way you do have a call into the library and intuition is restored :-) (and flake should be happy as well):
import mycodecs mycodecs.register()
I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web.
I don't really want to get into a discussion here. Browsers use these modified encodings to cope with mojibake or web content which isn't quite standard compliant. That's a valid use case, but promoting such wrong use by having work-around encodings in the stdlib and having Python produce non-standard output doesn't strike me as a good way forward. We do have error handlers for dealing with partially corrupted data. I think that's good enough.
I'm really not interested in best-fit mappings that turn infinity into "8" and square roots into "v". Making weird mappings like that sounds like a job for the "unidecode" library, not the stdlib.
Well, one of your main arguments was that the Windows API follows these best fit encodings.
I agree that best fit may not necessarily be best fit for everyone :-)
> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > Python Database Interfaces ... http://products.egenix.com/ > Plone/Zope Database Interfaces ... http://zope.egenix.com/
On Wed, 10 Jan 2018 at 13:36 Rob Speer <rspeer@luminoso.com> wrote:
I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding.
"Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also does things such as mapping Cyrillic letters to Latin letters that look like them.
This is not what I'm interested in implementing. I just want there to be encodings that match the WHATWG encodings exactly. If they have to be given a different name, that's fine with me.
On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <mal@egenix.com> wrote:
On 10.01.2018 00:56, Rob Speer wrote:
Oh that's interesting. So it seems to be Python that's the exception here.
Would we really be able to add entries to character mappings that haven't changed since Python 2.0?
The Windows mappings in Python come directly from the Unicode Consortium mapping files.
If the Consortium changes the mappings, we can update them.
If not, then we have a problem, since consumers are not only the win32 APIs, but also other tools out there running on completely different platforms, e.g. Java tools or web servers providing downloads using the Windows code page encodings.
Allowing such mappings in the existing codecs would then result failures when the "other" sides see the decoded Unicode version and try to encode back into the original encoding - you'd move the problem from the Python side to the "other" side of the integration.
I had a look on the Unicode FTP site and they have since added a new directory with mapping files they call "best fit":
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme....
The WideCharToMultiByte() defaults to best fit, but also offers a mode where it operates in standards compliant mode:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...
See flag WC_NO_BEST_FIT_CHARS.
Unicode TR#22 is also clear on this:
https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
It allows such best fit mappings to make encodings round-trip safe, but requires to keep these separate from the original standard mappings:
""" It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the
of hex escape sequences (NCRs) to preserve round-trip integrity; use of fallback characters in that case corrupts the data. """
If you read the above section in TR#22 you quickly get reminded of what the Unicode error handlers do (we basically implement the three modes it mentions... raise, ignore, replace).
Now, for unmapped sequences an error handler can opt for using a fallback sequence instead.
So in addition to adding best fit codecs, there's also the option to add an error handler for best fit resolution of unmapped sequences.
Given the above, I don't think we ought to change the existing standards compliant mappings, but use one of two solutions:
a) add "best fit" encodings (see the Unicode FTP site for a list)
b) add an error handlers "bestfit" which implements the fallback modes for the encodings in question
On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < python-ideas@python.org> wrote:
First of all, many thanks for such a excellently writen letter. It was a real pleasure to read. On 10.01.2018 0:15, Rob Speer wrote:
Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings.
There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent
if
you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at: https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes
are
undefined:
>>> b'\x90'.decode('windows-1252') UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in
0:
character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is,
use that that position the
Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same:
"According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar <
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%2...
maps these to the corresponding C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>." And in ISO-8859-1, the same handling is done for unused code points
even
by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) :
"*ISO-8859-1* is the IANA <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> preferred name for this standard when supplemented with the C0 and C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes
from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>" And what would you think -- these "C1 control codes" are also the corresponding Unicode points! ( https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones.
-- Regards, Ivan
On Wed, Jan 10, 2018 at 07:13:39PM +0000, Rob Speer wrote: [...]
Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze.
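For context, registering such a codec means something along these lines has to run at import time. This is a rough sketch in the spirit of what ftfy does, not its actual code; the table construction and the name "sloppy-windows-1252" are used here purely for illustration:

    import codecs

    def _build_decoding_table():
        # cp1252, with its five undefined bytes filled in by the matching
        # C1 control characters (the WHATWG behaviour).
        table = []
        for i in range(256):
            try:
                table.append(bytes([i]).decode("cp1252"))
            except UnicodeDecodeError:
                table.append(chr(i))
        return "".join(table)

    _DEC_TABLE = _build_decoding_table()
    _ENC_TABLE = codecs.charmap_build(_DEC_TABLE)

    def _search(name):
        if name != "sloppy-windows-1252":
            return None
        return codecs.CodecInfo(
            name="sloppy-windows-1252",
            encode=lambda s, errors="strict": codecs.charmap_encode(s, errors, _ENC_TABLE),
            decode=lambda b, errors="strict": codecs.charmap_decode(b, errors, _DEC_TABLE),
        )

    # The side effect under discussion: nothing works until this has run,
    # which is why the package must be imported even if never called directly.
    codecs.register(_search)

    print(repr(b"\x90".decode("sloppy-windows-1252")))   # '\x90'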
Breaks whose intuition? You don't speak for me on that matter -- while I don't like modules which operate by side-effect on import, I know that they are possible. In the stdlib, we have rlcompleter which operates like that. Whether such a design is good or bad (I think bad), nevertheless registering codecs by side-effect at import time should be an obvious possibility to any reasonably experienced developer.

But regardless, I don't think that "the existing codec library has a poor API, and flake8 complains about it" is a good reason for adding the codecs to the stdlib. We don't necessarily add functionality to the stdlib just because existing third-party solutions are awkward to use.

Having said that, I'm not actually against adding this, although I lean slightly towards "add". I think the case for adding is unclear, and needs a PEP to discuss the issues fully.

I think we've come to a consensus on the following question:

- Should we change the behaviour of the existing codecs to match the WHATWG encodings? No.

but there are others that do not have a consensus:

- Are existing stdlib solutions satisfactory to meet the WHATWG standard?
- If not, should the WHATWG encodings be added to the stdlib?
- If so, should they be built-in codecs, or should we import a library to register them?
- Or use the error handler mechanism?
- If codecs, should we offer both encode and decode support, or just decoding?
- What about the Unicode best-fit encodings?

Regarding that first undecided question, I'm particularly interested to see your response to Stephen Turnbull's statements here: https://mail.python.org/pipermail/python-ideas/2018-January/048628.html
I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web.
I wonder what the W3C would say about that last statement.
I'm really not interested in best-fit mappings that turn infinity into "8" and square roots into "v". Making weird mappings like that sounds like a job for the "unidecode" library, not the stdlib.
Frankly, the idea that browsers should ignore the HTML's declared encoding in favour of some other hybrid encoding which never existed outside of broken web pages in order to be called "standards compliant" seems weird if not broken to me. Possibly even more weird than mapping ∞ to 8 and √ to v. (I really wish the Unicode Consortium would do a better job of explaining the reasoning behind some of their more unintuitive or flat out strange-seeming decisions. But that's a rant for another day.) I know that web browsers aren't quite the same as programming languages, and "Practicality beats purity", but still, "In the face of ambiguity, resist the temptation to guess". The WHATWG standard strikes me as "Do What You Guess I Mean". -- Steve
Can someone explain to me why this is such a controversial issue? It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?) -- --Guido van Rossum (python.org/~guido)
On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido@python.org> wrote:
Can someone explain to me why this is such a controversial issue?
I guess practicality versus purity is always controversial :-)
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
Someone did discover that Microsoft's current implementations of the windows-* encodings match the WHATWG spec, rather than the Unicode spec that Microsoft originally wrote. So there is some argument that Python's existing encodings are simply out of date, and changing them would be a bugfix. (And standards aside, it is surely going to be somewhat error-prone if Python's windows-1252 doesn't match everyone else's implementations of windows-1252.) But yeah, AFAICT the original requesters would be happy either way; they just want it available under some name. -n -- Nathaniel J. Smith -- https://vorpus.org
On 19.01.2018 05:38, Nathaniel Smith wrote:
On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido@python.org> wrote:
Can someone explain to me why this is such a controversial issue?
I guess practicality versus purity is always controversial :-)
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity".
There are a few issues here:

* WHATWG encodings are mostly for decoding content in order to show it in the browser, accepting broken encoding data. Python already has support for this by using one of the available error handlers, or adding new ones to suit the needs. If we'd add the encodings, people will start creating more broken data, since this is what the WHATWG codecs output when encoding Unicode. As discussed, this could be addressed by making the WHATWG codecs decode-only.

* The use case seems limited to implementing browsers or headless implementations working like browsers. That's not really general enough to warrant adding lots of new codecs to the stdlib. A PyPI package is better suited for this.

* The WHATWG codecs do not only cover simple mapping codecs, but also many multi-byte ones for e.g. Asian languages. I doubt that we'd want to maintain such codecs in the stdlib, since this will increase the download sizes of the installers and also require people knowledgeable about these variants to work on them and fix any issues.

Overall, I think either pointing people to error handlers or perhaps adding a new one specifically for the case of dealing with control character mappings would provide a better maintenance / usefulness ratio than adding lots of new legacy codecs to the stdlib.

BTW: WHATWG pushes for always using UTF-8 as far as I can tell from their website.
(Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
Someone did discover that Microsoft's current implementations of the windows-* encodings matches the WHAT-WG spec, rather than the Unicode spec that Microsoft originally wrote.
No, MS implements something called "best fit encodings", and these are different from what WHATWG uses. Unlike the WHATWG encodings, these are documented as vendor encodings on the Unicode site, which is what we normally use as reference for our stdlib codecs. However, whether these are actually a good idea is open to discussion as well, since they sometimes go a bit far with "best fit", e.g. mapping the infinity symbol to 8. Again, using the error handlers we have for dealing with situations which require non-standard encoding behavior is the better approach: https://docs.python.org/3.7/library/codecs.html#error-handlers Adding new ones is possible as well.
So there is some argument that Python's existing encodings are simply out of date, and changing them would be a bugfix. (And standards aside, it is surely going to be somewhat error-prone if Python's windows-1252 doesn't match everyone else's implementations of windows-1252.) But yeah, AFAICT the original requesters would be happy either way; they just want it available under some name.
The encodings are not out of date. I don't know where you got that impression from.

The Windows API WideCharToMultiByte which was quoted in the discussion:

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...

unfortunately uses the above-mentioned best fit encodings, but this can and should be switched off by specifying the WC_NO_BEST_FIT_CHARS flag for anything that requires validation or needs to be interoperable:

"""
For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.
"""

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal@egenix.com> wrote:
On 19.01.2018 05:38, Nathaniel Smith wrote:
On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido@python.org> wrote:
Can someone explain to me why this is such a controversial issue?
I guess practicality versus purity is always controversial :-)
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity".
There are a few issues here:
* WHATWG encodings are mostly for decoding content in order to show it in the browser, accepting broken encoding data.
And sometimes Python apps that pull data from the web.
Python already has support for this by using one of the available error handlers, or adding new ones to suit the needs.
This seems cumbersome though.
If we'd add the encodings, people will start creating more broken data, since this is what the WHATWG codecs output when encoding Unicode.
That's FUD. Only apps that specifically use the new WHATWG encodings would be able to consume that data. And surely the practice of web browsers will have a much bigger effect than Python's choice.
As discussed, this could be addressed by making the WHATWG codecs decode-only.
But that would defeat the point of roundtripping, right?
* The use case seems limited to implementing browsers or headless implementations working like browsers.
That's not really general enough to warrant adding lots of new codecs to the stdlib. A PyPI package is better suited for this.
Perhaps, but such a package already exists and its author (who surely has read a lot of bug reports from its users) says that this is cumbersome.
* The WHATWG codecs do not only cover simple mapping codecs, but also many multi-byte ones for e.g. Asian languages.
I doubt that we'd want to maintain such codecs in the stdlib, since this will increase the download sizes of the installers and also require people knowledgeable about these variants to work on them and fix any issues.
Really? Why is adding a bunch of codecs so much effort? Surely the translation tables contain data that compresses well? And surely we don't need a separate dedicated piece of C code for each new codec?
Overall, I think either pointing people to error handlers or perhaps adding a new one specifically for the case of dealing with control character mappings would provide a better maintenance / usefulness ratio than adding lots of new legacy codecs to the stdlib.
Wouldn't error handlers be much slower? And to me it seems a new error handler is a much *bigger* deal than some new encodings -- error handlers must work for *all* encodings.
BTW: WHATWG pushes for always using UTF-8 as far as I can tell from their website.
As does Python. But apparently it will take decades more to get there.
-- --Guido van Rossum (python.org/~guido)
On 19.01.2018 17:20, Guido van Rossum wrote:
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal@egenix.com <mailto:mal@egenix.com>> wrote:
There are a few issues here:
* WHATWG encodings are mostly for decoding content in order to show it in the browser, accepting broken encoding data.
And sometimes Python apps that pull data from the web.
Python already has support for this by using one of the available error handlers, or adding new ones to suit the needs.
This seems cumbersome though.

Why is that?
Python 3 uses such error handlers for most of the I/O that's done with the OS already and for very similar reasons: dealing with broken data or broken configurations.
If we'd add the encodings, people will start creating more broken data, since this is what the WHATWG codecs output when encoding Unicode.
That's FUD. Only apps that specifically use the new WHATWG encodings would be able to consume that data. And surely the practice of web browsers will have a much bigger effect than Python's choice.
It's not FUD. I don't think we ought to encourage having Python create more broken data. The purpose of the WHATWG encodings is to help browsers deal with decoding broken data in a uniform way. It's not to generate more such data. That may be regarded as a purist's view, but it also has a very practical meaning. The output of the codecs will only be readable by browsers implementing the WHATWG encodings. Other tools receiving the data will run into the same decoding problems. Once you have Unicode, it's better to stay there and use UTF-8 for encoding to avoid any such issues.
As discussed, this could be addressed by making the WHATWG codecs decode-only.
But that would defeat the point of roundtripping, right?
Yes, intentionally. Once you have Unicode, the data should be encoded correctly back into UTF-8 or whatever legacy encoding is needed, fixing any issues while in Unicode. As always, it's better to explicitly address such problems than to simply punt on them and write back broken data.
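(In code, the workflow being described is just the ordinary stdlib one -- a trivial sketch:)

    # Decode the legacy bytes once, do any fixing while the data is str,
    # and write UTF-8 on the way out -- never re-emit the legacy encoding.
    legacy_bytes = b"na\xefve text"               # windows-1252 / latin-1 input
    text = legacy_bytes.decode("windows-1252")    # 'naïve text'
    # ... fix whatever needs fixing here, in Unicode ...
    output = text.encode("utf-8")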
* The use case seems limited to implementing browsers or headless implementations working like browsers.
That's not really general enough to warrant adding lots of new codecs to the stdlib. A PyPI package is better suited for this.
Perhaps, but such a package already exists and its author (who surely has read a lot of bug reports from its users) says that this is cumbersome.
The only critique I read was that registering the codecs is not explicit enough, but that's really only a nit, since you can easily have the codec package expose a register function which you then call explicitly in the code using the codecs.
* The WHATWG codecs do not only cover simple mapping codecs, but also many multi-byte ones for e.g. Asian languages.
I doubt that we'd want to maintain such codecs in the stdlib, since this will increase the download sizes of the installers and also require people knowledgeable about these variants to work on them and fix any issues.
Really? Why is adding a bunch of codecs so much effort? Surely the translation tables contain data that compresses well? And surely we don't need a separate dedicated piece of C code for each new codec?
For the simple charmap style codecs that's true. Not so for the Asian ones and the latter also do require dedicated C code (see Modules/cjkcodecs).
Overall, I think either pointing people to error handlers or perhaps adding a new one specifically for the case of dealing with control character mappings would provide a better maintenance / usefulness ratio than adding lots of new legacy codecs to the stdlib.
Wouldn't error handlers be much slower? And to me it seems a new error handler is a much *bigger* deal than some new encodings -- error handlers must work for *all* encodings.
Error handlers have a standard interface and so they will work for all codecs. Some codecs limit the number of handlers that can be used, but most accept all registered handlers. If a handler is too slow in Python, it can be coded in C for speed.
BTW: WHATWG pushes for always using UTF-8 as far as I can tell from their website.
As does Python. But apparently it will take decades more to get there.
Yes indeed, so let's not add even more confusion by adding more variants of the legacy encodings. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
Error handlers are quite orthogonal to this problem. If you try to solve this problem with an error handler, you will have a different problem.

Suppose you made "c1-control-passthrough" or whatever into an error handler, similar to "replace" or "ignore", and then you encounter an unassigned character that's *not* in the range 0x80 to 0x9f. (Many encodings have these.) Do you replace it? Do you ignore it? You don't know, because you just replaced the error handler with something that's not about error handling.

I will also repeat that having these encodings (in both directions) will provide more ways for Python to *reduce* the amount of mojibake that exists. If acknowledging that mojibake exists offends your sense of purity, and you'd rather just destroy all mojibake at the source... that's great, and please get back to me after you've fixed Microsoft Excel.

I hope to make a pull request shortly that implements these mappings as new encodings that work just like the other ones.
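For reference, such an encoding can follow the pattern of the existing single-byte modules under Lib/encodings (e.g. cp1252.py). The sketch below is illustrative only; the codec name "web-1252" and the idea of deriving the table from cp1252's decoding_table are assumptions, not the contents of the actual pull request:

    import codecs
    import encodings.cp1252 as cp1252

    # cp1252's table marks its undefined positions with U+FFFE; fill those
    # slots with the same-numbered C1 control characters instead.
    decoding_table = "".join(
        chr(i) if ch == "\ufffe" else ch
        for i, ch in enumerate(cp1252.decoding_table)
    )
    encoding_table = codecs.charmap_build(decoding_table)

    class Codec(codecs.Codec):
        def encode(self, input, errors="strict"):
            return codecs.charmap_encode(input, errors, encoding_table)

        def decode(self, input, errors="strict"):
            return codecs.charmap_decode(input, errors, decoding_table)

    def getregentry():
        # A real encodings module would also define the incremental and
        # stream classes; this is just the minimal registry entry.
        return codecs.CodecInfo(
            name="web-1252",
            encode=Codec().encode,
            decode=Codec().decode,
        )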
OK, I will tune out this conversation. It is clearly not going anywhere.
-- --Guido van Rossum (python.org/~guido)
On 19.01.2018 18:12, Rob Speer wrote:
Error handlers are quite orthogonal to this problem. If you try to solve this problem with an error handler, you will have a different problem.
Suppose you made "c1-control-passthrough" or whatever into an error handler, similar to "replace" or "ignore", and then you encounter an unassigned character that's *not* in the range 0x80 to 0x9f. (Many encodings have these.) Do you replace it? Do you ignore it? You don't know because you just replaced the error handler with something that's not about error handling.
It depends on what you want to achieve. You may want to fail, assign a code point from a private area, or use a surrogate escape approach. Based on the context, it may also make sense to escape the input data using a different syntax, e.g. XML escapes, backslash notations, HTML numeric entities, etc.

You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is.

The Unicode error handlers are pretty flexible when it comes to providing a solution: https://www.python.org/dev/peps/pep-0293/

You can even have the handler "patch" an encoding, since it also gets the encoding name as input. You could probably create an error handler which implements most of their workarounds in a single "whatwg" handler.
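A rough sketch of what such a single handler could look like; the name "whatwg-c1" and the policy (pass 0x80-0x9F through as C1 controls, fail on everything else) are assumptions for illustration, not an agreed design:

    import codecs

    def whatwg_c1(exc):
        # Only handle decoding errors where every offending byte is in the
        # C1 range; anything else is a genuine error and is re-raised.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            if all(0x80 <= b <= 0x9F for b in bad):
                return "".join(chr(b) for b in bad), exc.end
        raise exc

    codecs.register_error("whatwg-c1", whatwg_c1)

    print(repr(b"\x90\x9d".decode("windows-1252", errors="whatwg-c1")))  # '\x90\x9d'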
I will also repeat that having these encodings (in both directions) will provide more ways for Python to *reduce* the amount of mojibake that exists. If acknowledging that mojibake exists offends your sense of purity, and you'd rather just destroy all mojibake at the source... that's great, and please get back to me after you've fixed Microsoft Excel.
I acknowledge that we have different views on this :-) Note that I'm not saying that the encodings are a bad idea, or should not be used. I just don't want to have people start using "web-1252" as the encoding simply because they are writing out text for a web application - they should use "utf-8" instead. The extra hurdle of pip-installing a package for this feels like the right way to turn this into a more conscious decision, and who knows... perhaps it'll even help fix Excel once they have decided on including Python as a scripting language: https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-applicat...
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach.
And the way to express that is with errors='replace', errors='surrogateescape', or whatever, which Python already does. We do not need an explosion of error handlers. This problem can be very straightforwardly solved with encodings, and error handlers can keep doing their usual job on top of encodings.

You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is.

Nobody asked for this.
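(For reference, a quick illustration of the surrogateescape round-trip referred to above -- existing stdlib behaviour, nothing new:)

    data = b"caf\xe9 \x90"
    text = data.decode("windows-1252", errors="surrogateescape")
    print(repr(text))                                   # 'café \udc90'
    assert text.encode("windows-1252", errors="surrogateescape") == data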
I just don't want to have people start using "web-1252" as the encoding simply because they are writing out text for a web application - they should use "utf-8" instead.
I did ask for input on the name. If the problem is that you think my working name for the encoding is misleading, you could help with that instead of constantly trying to replace the proposal with something different.

Guido had some very sensible feedback just a moment ago. I am wondering now if we lost Guido because I broke python-ideas etiquette (is a pull request not the next step, for example? I never got a good answer on the process), or because this thread is just constantly being derailed.
Rob: I think I was very clear very early in the thread that I'm opposed to adding a complete set of new encodings to the stdlib which only slightly alter many existing ones. Ever since, I've been trying to give you suggestions on how we can solve the issue you're trying to address with the encodings in different ways which achieve much of the same, but with the existing code base.

I've also tried to understand the issue with WideCharToMultiByte() et al. apparently using different encodings than the ones which MS itself published to the Unicode Consortium, to see whether there's an issue we may need to resolve. That's a different topic, which is why I changed the subject line.

If you call that derailing, I cannot help it, but won't engage any further in this discussion.

Thanks,

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach.
And the way to express that is with errors='replace', errors='surrogateescape', or whatever, which Python already does. We do not need an explosion of error handlers. This problem can be very straightforwardly solved with encodings, and error handlers can keep doing their usual job on top of encodings.
You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is.
Nobody asked for this.
I just don't want to have people start using "web-1252" as an encoding simply because they are writing out text for a web application - they should use "utf-8" instead.
I did ask for input on the name. If the problem is that you think my working name for the encoding is misleading, you could help with that instead of constantly trying to replace the proposal with something different.
Guido had some very sensible feedback just a moment ago. I am wondering now if we lost Guido because I broke python-ideas etiquette (is a pull request not the next step, for example? I never got a good answer on the process), or because this thread is just constantly being derailed.
On Fri, 19 Jan 2018 at 13:14 M.-A. Lemburg <mal@egenix.com> wrote:
On 19.01.2018 18:12, Rob Speer wrote: > Error handlers are quite orthogonal to this problem. If you try to solve > this problem with an error handler, you will have a different problem. > > Suppose you made "c1-control-passthrough" or whatever into an error > handler, similar to "replace" or "ignore", and then you encounter an > unassigned character that's *not* in the range 0x80 to 0x9f. (Many > encodings have these.) Do you replace it? Do you ignore it? You don't > know because you just replaced the error handler with something that's > not about error handling.
It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach. Based on the context it may also make sense to escape the input data using a different syntax, e.g. XML escapes, backslash notations, HTML numeric entities, etc.
You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is.
The Unicode error handlers are pretty flexible when it comes to providing a solution:
https://www.python.org/dev/peps/pep-0293/
You can even have the handler "patch" an encoding, since it also gets the encoding name as input.
You could probably create an error handler which implements most of their workarounds into a single "whatwg" handler.
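As a rough sketch of the kind of handler being suggested here (decode-only, and whether a blanket Latin-1 passthrough actually matches the WHATWG tables is exactly what is disputed further down in the thread), something like this would already work today:

    import codecs

    def latin1replace(exc):
        # Decode-only: map each undecodable byte to the Unicode code point
        # with the same number, the way Latin-1 would.
        if isinstance(exc, UnicodeDecodeError):
            passthrough = exc.object[exc.start:exc.end].decode("latin-1")
            return (passthrough, exc.end)
        raise exc

    codecs.register_error("latin1replace", latin1replace)

    # b'\x90' is undefined in Python's windows-1252, but decodes as U+0090 here:
    print(repr(b"\x90smart \x93quotes\x94".decode("windows-1252", "latin1replace")))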
> I will also repeat that having these encodings (in both directions) will > provide more ways for Python to *reduce* the amount of mojibake that > exists. If acknowledging that mojibake exists offends your sense of > purity, and you'd rather just destroy all mojibake at the source... > that's great, and please get back to me after you've fixed Microsoft Excel.
I acknowledge that we have different views on this :-)
Note that I'm not saying that the encodings are bad idea, or should not be used.
I just don't want to have people start using "web-1252" as an encoding simply because they are writing out text for a web application - they should use "utf-8" instead.
The extra hurdle to pip-install a package for this feels like the right way to turn this into a more conscious decision, and who knows... perhaps it'll even help fix Excel once they have decided on including Python as a scripting language:
https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-applicat...
-- Marc-Andre Lemburg, eGenix.com
On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote:
It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach.
And the way to express that is with errors='replace', errors='surrogateescape', or whatever, which Python already does. We do not need an explosion of error handlers. This problem can be very straightforwardly solved with encodings, and error handlers can keep doing their usual job on top of encodings.
You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is.
Nobody asked for this.
Actually, Soni L. seems to have suggested a similar idea in the thread titled "Chaining coders" (codecs).

But what does it matter whether someone asked for it? Until this thread, nobody had asked for support for WHATWG encodings either.

The question to my mind is whether or not this "latin1replace" handler, in conjunction with existing codecs, will do the same thing as the WHATWG codecs. If I have understood you correctly, I think it will. Have I missed something?
I just don't want to have people start using "web-1252" as an encoding simply because they are writing out text for a web application - they should use "utf-8" instead.
I did ask for input on the name. If the problem is that you think my working name for the encoding is misleading, you could help with that instead of constantly trying to replace the proposal with something different.
Rob, you've come here with a proposal based on an actual problem (web pages with mojibake and broken encodings), an existing solution (a third party library) you dislike, and a suggested new solution you will like (move the encodings into the std lib). That's great, and we need more suggestions like this: concrete use-cases and concrete solutions.

But you cannot expect that we're going to automatically agree that:

- the problem is something that Python the language has to solve (it seems to be a *browser* problem, not a general programming problem);
- the existing solution is not sufficient; and
- your proposal is the right solution.

All of these things need to be justified, and counter-proposals are part of that.

When we make a non-trivial proposal on Python-Ideas, it is very rare that they are so clearly the right solution for the right problem that they get instant approval and you can go straight to the PR. Often there are legitimate questions about all three steps. That's why I suggested earlier that (in my opinion) there needs to be a PEP to summarise the issue, justify the proposal, and counter the arguments against it.

(Even if the proposal is agreed upon by everyone, if it is sufficiently non-trivial, we sometimes require a PEP summarising the issue for future reference.)

As the author of one PEP myself, I know how frustrating this process can seem when you think that this is a bloody obvious proposal with no downside that all right-thinking people ought to instantly recognise as a great idea *wink* but nevertheless, in *my opinion* (I don't speak for anyone else) I think a PEP would be a good idea.
Guido had some very sensible feedback just a moment ago. I am wondering now if we lost Guido because I broke python-ideas etiquette (is a pull request not the next step, for example? I never got a good answer on the process), or because this thread is just constantly being derailed.
I don't speak for Guido, but it might simply be he isn't invested enough in *this specific issue* to spend the time wading through a long thread. (That's another reason why a PEP is sometimes valuable.) Perhaps he's still on holiday and only has limited time to spend on this.

If I were in your position, my next step would be to write a new post summarising the thread so far:

- a brief summary of the nature of the problem;
- why you think a solution (whatever that solution turns out to be) should be in the stdlib rather than a third-party library;
- what you think the solution should be;
- and give a fair critique of the alternatives suggested so far and why you think that they aren't suitable.

That's the same sort of information given in a PEP, but without having to go through the formal PEP process. That might be enough to gain consensus on what happens next -- and maybe even agreement that a formal and more detailed PEP is not needed.

Oh, and in case you're thinking this is all a great PITA, it might help if you read these to get an understanding of why things are as they are:

https://www.curiousefficiency.org/posts/2011/02/status-quo-wins-stalemate.ht...
https://www.curiousefficiency.org/posts/2011/04/musings-on-culture-of-python...

Good luck!

-- Steve
The question to my mind is whether or not this "latin1replace" handler, in conjunction with existing codecs, will do the same thing as the WHATWG codecs. If I have understood you correctly, I think it will. Have I missed something?
It won't do the same thing, and neither will the "chaining coders" proposal. It's easy to miss details like this in all the counterproposals.

The difference between WHATWG encodings and the ones in Python is, in all but one case, *only* in the C1 control character range (0x80 to 0x9F), a range of Unicode characters that has historically evaded standardization because they never had a clear purpose even before Unicode. Filling in all the gaps with Latin-1 would do the right thing for, I think, 3 of the encodings, and the wrong thing in the other 5 cases. (In the anomalous case of Windows-1255, it would do a more explicitly wrong thing.)

Let's take Windows-1253 (Greek) as an example. Windows-1253 has a bunch of gaps in the 0x80 to 0x9F range, like most of the others. It also has gaps for 0xAA, 0xD2, and 0xFF. WHATWG does _not_ recommend decoding these as the letters "ª", "Ò", and "ÿ", the characters in the equivalent positions in Latin-1. They are simply unassigned. Other software sometimes maps them to the Private Use Area, but this is not standardized at all, and it seems clear that Python should handle them with its usual error handler for unassigned bytes. (Which is one of the reasons not to replace the error handler with something different: we still need the error handler.)

Of course, you could define an encoding that's Windows-1253 plus the letters "ª", "Ò", and "ÿ", filling in all the gaps with Latin-1. It would be weird and new (who ever heard of an encoding that has a mapping for "Ò" but not "ò"?). One point I hope to have agreement on is that we do not want to create _new_ legacy encodings that are not used anywhere else.

The reason I was proposing to move ahead with a PR was not that I thought it would be automatically accepted -- it was to have a point of reference for exactly what I'm proposing, so we can discuss exactly what the functional difference is between this and counterproposals without getting lost. But I can see how writing the point of reference in PEP form instead of PR form can be the right way to focus discussion. Thanks for the recommendation there, and I'd like a little extra information -- I don't know _mechanically_ how to write a PEP. (Where do I submit it to, for example?)

-- Rob Speer
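For concreteness, this is what those Windows-1253 gaps look like from Python today (assuming the stdlib cp1253 codec follows the Unicode CP1253.TXT table, which leaves 0xAA, 0xD2 and 0xFF undefined):

    # Latin-1 maps every byte by construction; windows-1253 does not.
    for byte in (b"\xaa", b"\xd2", b"\xff"):
        try:
            print(byte, "->", repr(byte.decode("windows-1253")))
        except UnicodeDecodeError:
            print(byte, "-> undefined in windows-1253; latin-1 would give",
                  repr(byte.decode("latin-1")))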
On Mon, Jan 22, 2018 at 3:36 AM, Rob Speer <rspeer@luminoso.com> wrote:
Thanks for the recommendation there, and I'd like a little extra information -- I don't know _mechanically_ how to write a PEP. (Where do I submit it to, for example?)
I can help you with that side of things. Start by checking out PEP 1: https://www.python.org/dev/peps/pep-0001/ Feel free to ping me off-list if you have difficulties, or if you need a hand getting the formatting tidy. ChrisA
I don't expect to change your mind about the "right" way to deal with this, but this is a more explicit description of what those of us who advocate error handlers are thinking about. It may be useful in writing your PEP (PEPs describe rejected counterproposals and amendments along with adopted proposals and rationale in either case). Rob Speer writes:
The question to my mind is whether or not this "latin1replace" handler, in conjunction with existing codecs, will do the same thing as the WHATWG codecs. If I have understood you correctly, I think it will. Have I missed something?
It won't do the same thing, and neither will the "chaining coders" proposal.
The "chaining coders" proposal isn't well-enough specified to be sure. However, for practical purposes you may think of a Python *codec* as a "whole array" decoder/encoder, and an *error handler* as a "token-by- token" decoder/encoder. The distinction in type is for efficiency, of course. Codecs can't be "chained" (I think, but I didn't think very hard), but handlers can, in the sense that each handler can handle some input values and delegate anything it can't deal with to the next handler in the chain (under the hood handler implementationss are just Python functions with a particular signature, so this is just "loop until non-None").
It's easy to miss details like this in all the counterproposals.
I see no reason why a 'whatwgreplace' error handler with the logic

    # I am assuming decoding, and single-byte encodings. Encoding
    # with 'html' error mode would insert format("&#%d;", ord(unicode)).
    # Multibyte is a little harder.

    # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
    # and Big5.
    assert the_byte >= 0x80
    # Handle C1 control characters.
    if the_byte < 0xA0:
        append_to_output(chr(the_byte))
    # Handle extended repertoire with a dict.
    # This condition will depend on the particular codec.
    elif the_byte in additional_code_points:
        append_to_output(additional_code_points[the_byte])
    # Implement WHATWG error modes.
    elif whatwg_error_mode is replacement:
        append_to_output("\uFFFD")
    else:
        raise

doesn't have the effect you want. This can be done in pure Python. (Note: The actions in the pseudocode are not accurate. IIRC real handlers take a UnicodeError as argument, and return a tuple of the text to append to output and number of input tokens to skip, or return None to indicate an unhandled error, rather than doing the appending and raising themselves.)

The main objection to doing it this way would be efficiency. To be honest, I personally don't think that's an important objection since this handler is frequently invoked only if the source text is badly broken. (Remember, you'll already be greatly expanding the repertoire of at least ASCII and ISO 8859/1 by promoting to windows-1252.) And it would surely be "fast enough" if written in C.

Caveat: I'm not sure I agree with MAL about windows-1255. I think it's arguable that the WHAT-WG index is a better approximation to reality, and I'd like to hear Hebrew speakers argue about that (I'm not one).
The difference between WHATWG encodings and the ones in Python is, in all but one case, *only* in the C1 control character range (0x80 to 0x9F),
Also in Japanese, where "corporate characters" have been added (frequently twice, preventing round-tripping ... yuck) to the JIS standard. I haven't checked the Chinese and Korean tables for similar damage, but they're not quite as wacky about this stuff as the JISC is, so they're probably OK (and of course Big5 was "corporate" from the get-go).
a range of Unicode characters that has historically evaded standardization because they never had a clear purpose even before Unicode. Filling in all the gaps with Latin-1
That's wrong, as you explain:
[Eg, in Greek, some code points] are simply unassigned. Other software sometimes maps them to the Private Use Area, but this is not standardized at all, and it seems clear that Python should handle them with its usual error handler for unassigned bytes. (Which is one of the reasons not to replace the error handler with something different: we still need the error handler.)
The logic above handles all this. As mentioned, a stdlib error handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG conformance, or 'surrogateescape' for the Pythonic equivalent of mapping to the private area) could be chained if desired, and the defaults could be changed and the names aliased to the WHAT-WG terms. This could be automated with a factory function that takes a list of predefined handlers and composes them, although that would add another layer of inefficiency (the composition would presumably be done in a loop, and possibly using try although I think the error handler convention is to return the text to insert if handled, and None if the error can't be handled).
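For reference, a runnable rendering of the pseudocode above as an actual registered handler (the handler name, the fallback-by-lookup chaining, and the single-byte assumption are choices made for this sketch, not part of any standard):

    import codecs

    def make_whatwg_replace(extra_map=None, fallback="strict"):
        # extra_map: optional {byte_value: str} additions for a particular code page.
        # fallback: name of an already-registered handler to delegate everything else to.
        # Assumes single-byte source encodings, so errors are handled one byte at a time.
        extra_map = extra_map or {}
        fallback_handler = codecs.lookup_error(fallback)

        def handler(exc):
            if isinstance(exc, UnicodeDecodeError):
                byte = exc.object[exc.start]
                if 0x80 <= byte < 0xA0:            # pass C1 controls through
                    return (chr(byte), exc.start + 1)
                if byte in extra_map:              # per-codepage additions
                    return (extra_map[byte], exc.start + 1)
            return fallback_handler(exc)           # "chain" to the next handler

        return handler

    codecs.register_error("whatwgreplace", make_whatwg_replace(fallback="replace"))

    print(repr(b"\x90\xaa".decode("windows-1253", "whatwgreplace")))  # '\x90\ufffd'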
I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying. You quoted the part where I said "Filling in all the gaps with Latin-1", cut out the part where I said "is wrong", and replied with "that's wrong". I guess I'm glad we're in agreement, but this has been a strange bit of discourse.

In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like? Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?

My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases.

I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point. If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment.

-- Rob Speer
Sorry for the long delay. I had a lot on my plate at work, and was spending 14 hours a day sleeping because of the flu. "It got better." Rob Speer writes:
I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying.
Sure, but you're not my entire audience: the part I care most about is the committers. I've seen proposals to "fill in" seriously made in other contexts, I wanted to agree that's wrong for Python.
In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like?
There isn't any Python code that calls it. It's an error handler, like 'strict' or 'surrogateescape', and all the functions that call it are in C.
Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?
It would be the latter. I haven't thought about it carefully, but what I would likely do is define a factory function taking an encoding name (str), an error handler name, and a bytes-str mapping for the exceptional cases like windows-1255 where WHAT-WG enhances the graphic repertoire, and returns a name like "whatwg-windows-1255-fatal". Internally it would

1. Check if the error handler name is 'fatal' or 'strict', or 'html' or 'xmlcharrefreplace' ('strict' and 'xmlcharrefreplace' would be used internally to the factory function, the registered name would be 'fatal' or 'html'). 'replace' has the same semantics in Python and in WHAT-WG, and other error handlers 'backslashreplace', 'ignore', and 'surrogateescape' would be up to the programmer to use or avoid. They'd go by their Python names. Alternatively we could follow the strict WHAT-WG standard and not allow those, or provide another argument to allow "lax" checking of the handler argument.

2. Check if the name is already registered. If so, return it.

3. Otherwise, def a function that takes a Unicode error and a mapping that defaults to the one passed to the factory, and
   a. passes C0 and C1 control characters through, else
   b. returns the mapped value if present, else
   c. passes the Unicode error to the named error handler and returns what that returns

4. Register the new handler with that name, and return the name.

You would use it like

    handler = factory('windows-1255', 'html', [(b'0x00', '\Udeadbeef')])
    b'deadbeef'.decode('windows-1255', errors=handler)

The mapping would default to [], and the remaining question would be what the default for the error handler should be. I guess that would be 'strict' (the problem is that the WHAT-WG defaults differ for decoding and encoding). (The choice of a list of tuples for the mapping is due to JIS, where the map is not 1-1, and a specific reverse mapping is defined.)
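A rough sketch of that factory, with the step numbers above marked in comments (all names are placeholders, the extra mapping entry is made up, and the 'html' mode only matters when encoding):

    import codecs

    def whatwg_handler(encoding, whatwg_mode="fatal", extra=()):
        # Step 1: map WHATWG error-mode names onto existing Python handlers.
        base = {"fatal": "strict", "replace": "replace",
                "html": "xmlcharrefreplace"}[whatwg_mode]
        name = "whatwg-%s-%s" % (encoding, whatwg_mode)
        try:
            codecs.lookup_error(name)              # step 2: already registered?
            return name
        except LookupError:
            pass

        extra_map = dict(extra)                    # per-codepage additions
        base_handler = codecs.lookup_error(base)

        def handler(exc):                          # step 3
            if isinstance(exc, UnicodeDecodeError):
                byte = exc.object[exc.start]
                if byte < 0xA0:                    # 3a: pass C0/C1 controls through
                    return (chr(byte), exc.start + 1)
                if byte in extra_map:              # 3b: mapped value if present
                    return (extra_map[byte], exc.start + 1)
            return base_handler(exc)               # 3c: defer to the named handler

        codecs.register_error(name, handler)       # step 4
        return name

    # Usage, roughly as described above (the mapping entry is just an example):
    name = whatwg_handler("windows-1255", "fatal", [(0xCA, "\u05BA")])
    print(repr(b"\x90\xca".decode("windows-1255", errors=name)))  # '\x90\u05ba'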
My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases.
There wouldn't be extra *arguments*, but there would be additional handler names to use as values. We'd want three standard handlers for everything but windows-1255 and JIS (AFAIK). One would be mainly for validating XML, and the name would be 'whatwg-any-fatal'. (Note that the name of the encoding is actually only used in the name of the handler, and that only to identify auxiliary mappings, such as that for windows-1255.) The others would be for everyday HTML (and maybe for XHTML form input?). They would be named 'whatwg-any-replace' and 'whatwg-any-html'. I'm not sure whether to have a separate suite for windows-1255, or let the programmer take care of that. Also, since 'replace' is a pretty simplistic handler, I suspect a lot of programmers would like to use surrogateescape, but since WHAT-WG explicitly restricts error modes to fatal, replace, and html, that's on the programmer to define, at least until it's clear there's overwhelming demand for it.
I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point.
I wonder what Microsoft's representatives to Unicode and WHAT-WG would say about that. I think it should definitely be handled somehow. I find adding it to the stdlib 1255 codec attractive, and I think the chance that Microsoft would sign off on that is nonzero. If they didn't, it would go into 1255-specific handlers.
If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment.
Addressed by the factory function, which is needed anyway as discussed above.

Footnotes:

[1] I had this wrong. It's not the number of tokens to skip, it's the position to restart reading the input.

[2] The actual handlers are all in C, and return 0 if they don't know what to do. I haven't had time to figure out what actually happens here (None is an actual object and I'm sure it doesn't live at 0x0). I'm guessing that a pure Python handler would return None, but perhaps it should reraise. That doesn't affect the ability to construct a chaining handler, only what such a handler would do if it "knows" the input is *bad* and decides to stop rather than delegate.
On Sun, Jan 21, 2018 at 2:43 AM, Steven D'Aprano <steve@pearwood.info> wrote:

> On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote:
> > Guido had some very sensible feedback just a moment ago. I am wondering now if we lost Guido because I broke python-ideas etiquette (is a pull request not the next step, for example? I never got a good answer on the process), or because this thread is just constantly being derailed.
>
> I don't speak for Guido, but it might simply be he isn't invested enough in *this specific issue* to spend the time wading through a long thread. (That's another reason why a PEP is sometimes valuable.) Perhaps he's still on holiday and only has limited time to spend on this.
Actually my reason to withdraw is that the sides seem to be about as well dug in as the sides during WW1. There's not much I can do in such case (except point out that the status quo wins). -- --Guido van Rossum (python.org/~guido)
On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
Someone did discover that Microsoft's current implementations of the windows-* encodings matches the WHAT-WG spec, rather than the Unicode spec that Microsoft originally wrote.
No, MS implements something called "best fit encodings" and these are different from what WHATWG uses.
NO. I made this absolutely clear in my previous message. Best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags (the mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction - but is treated as a regular mapping for encoding purposes), and the mapping of 0x81 to U+0081 in cp1252 etc. is NOT a best fit mapping or in any way different from the rest of the mappings.

We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...
unfortunately uses the above mentioned best fit encodings, but this can and should be switched off by specifying the WC_NO_BEST_FIT_CHARS for anything that requires validation or needs to be interoperable:
Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing.
On 19.01.2018 17:24, Random832 wrote:
On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
Someone did discover that Microsoft's current implementations of the windows-* encodings matches the WHAT-WG spec, rather than the Unicode spec that Microsoft originally wrote.
No, MS implements something called "best fit encodings" and these are different from what WHATWG uses.
NO. I made this absolutely clear in my previous message, best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags (the mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction - but is treated as a regular mapping for encoding purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit mapping or in any way different from the rest of the mappings.
We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode.
I only know the best fit encoding maps that are available on the Unicode site.

If I read your comment correctly, you are saying that MS has moved away from the standard code pages towards something else - perhaps even something other than the best fit encodings listed on the Unicode site?

Do you have some references for this?

Note that the Windows code page codecs implemented in Python are all based on the Unicode mapping files and those were created by MS.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...
unfortunately uses the above mentioned best fit encodings, but this can and should be switched off by specifying the WC_NO_BEST_FIT_CHARS for anything that requires validation or needs to be interoperable:
Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing.
Interesting. The CP1252 mapping clearly defines 0x81 to map to undefined, whereas the bestfit1252 maps it to 0x0081:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit...

Same for the example you gave for CP932:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit...

So at least following the documentation you'd expect the function to implement the regular mappings.

-- Marc-Andre Lemburg, eGenix.com
Hi Steve,

do you know of a definite resource for Windows code pages on MSDN or another official MS website? I tried to find some links, but only got these ancient ones:

https://msdn.microsoft.com/en-us/library/cc195054.aspx

(this version of cp1252 doesn't even have the euro sign yet)

Thanks,
-- Marc-Andre Lemburg, eGenix.com
On 20Jan2018 0518, M.-A. Lemburg wrote:
do you know of a definite resource for Windows code pages on MSDN or another official MS website ?
I don't know of anything sorry, and my quick search didn't turn up anything public. But I can at least confirm that the internal table for cp1252 has the same undefined characters as on unicode.org, so presumably if MultiByteToWideChar is mapping those to "best fit" characters it's only because the flag has been passed. As far as I can tell, Microsoft has not been secretly redefining any encodings. Cheers, Steve
On Sat, Jan 20, 2018, at 02:01, Steve Dower wrote:
On 20Jan2018 0518, M.-A. Lemburg wrote:
do you know of a definite resource for Windows code pages on MSDN or another official MS website ?
I don't know what happened to this page, but I was able to find better-looking codepage tables at http://web.archive.org/web/20160314211032/https://msdn.microsoft.com/en-us/g... Older versions at: web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.asp web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.mspx See also, still live: https://www.microsoft.com/typography/unicode/cscp.htm (this has 0xCA in the graphical table for cp1255, the other does not)
I don't know of anything sorry, and my quick search didn't turn up anything public. But I can at least confirm that the internal table for cp1252 has the same undefined characters as on unicode.org , so presumably if MultiByteToWideChar is mapping those to "best fit" characters it's only because the flag has been passed.
I'm passing MB_ERR_INVALID_CHARS. And is this just as true for cp1255 0xCA as for the control characters? MultiByteToWideChar doesn't even *have* a flag for "best fit". I was not able to identify any combination of flags that can be passed to either function on Windows 7 that would cause e.g. 0x81 in cp1252 to be treated any differently from any other character.

The C_1252.NLS file appears to consist of:

28 bytes of header

512 bytes WCHAR[256] of mappings e.g.

    0000010c: 7800 7900 7a00 7b00 7c00 7d00 7e00 7f00 x.y.z.{.|.}.~...
    0000011c: ac20 8100 1a20 9201 1e20 2620 2020 2120 . ... ... & !
    0000012c: c602 3020 6001 3920 5201 8d00 7d01 8f00 ..0 `.9 R...}...
    0000013c: 9000 1820 1920 1c20 1d20 2220 1320 1420 ... . . . " . .
    0000014c: dc02 2221 6101 3a20 5301 9d00 7e01 7801 .."!a.: S...~.x.
    0000015c: a000 a100 a200 a300 a400 a500 a600 a700 ................

Six zero bytes

BYTE[65536] apparently of the best fit mappings, e.g.

    000002a2: 3f81 3f3f 3f3f 3f3f 3f3f 3f3f 3f8d 3f8f ?.???????????.?.
    000002b2: 903f 3f3f 3f3f 3f3f 3f3f 3f3f 3f9d 3f3f .????????????.??
    00000312: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff ................
    00000322: 4161 4161 4161 4363 4363 4363 4363 4464 AaAaAaCcCcCcCcDd

I don't see where the file format even has room to identify characters as invalid (or how WideCharToMultiByte disables the best fit mappings, unless it's by checking the result against the WCHAR[256] table), though CP1253 and CP1255 seem to manage it. The ones in those codepages that do return an error are mapped (if the flag is not passed in, and in the NLS file tables) to private use characters U+F8xx.
As far as I can tell, Microsoft has not been secretly redefining any encodings.
Not so much redefining as holding back these characters from the published definition. I was being a bit overly dramatic with the 'for some unknown reason' bit; it seems obvious the reason is they wanted to reserve the ability to add new characters in the future, as they did for the Euro sign. And there's nothing wrong with that, per se, though it's unfortunate that their own conversion functions can't treat these bytes as errors.

Looking at the actual files, it looks like the ones in the "best fit" directory are in a format used internally by Microsoft (at a glance, they seem to contain enough information to generate the .NLS files, including stuff like the question marks in the header and the structure of DBCS tables), and the ones in the other mappings directory are sanitized and converted to more or less the same format as the other mappings.

(As for 1255 0xCA, the comment in the best fit file suggests that it was unclear what Hebrew vowel point it was meant to be)
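For anyone who wants to poke at this themselves, here is a rough Windows-only probe of the same behaviour via ctypes (the constant value, buffer size and helper name are choices made for this sketch; what it prints on a given Windows build is exactly the question under discussion):

    import ctypes

    MB_ERR_INVALID_CHARS = 0x0008
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    def decode_with_win32(codepage, raw, flags=MB_ERR_INVALID_CHARS):
        # Ask MultiByteToWideChar directly how the given code page maps raw bytes.
        buf = ctypes.create_unicode_buffer(len(raw) + 1)
        n = kernel32.MultiByteToWideChar(codepage, flags, raw, len(raw), buf, len(buf))
        if n == 0:
            raise OSError("MultiByteToWideChar failed, error %d" % ctypes.get_last_error())
        return buf[:n]

    # cp1252 bytes that Unicode's CP1252.TXT leaves undefined:
    print([hex(ord(c)) for c in decode_with_win32(1252, b"\x81\x8d\x9d")])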
On 20.01.2018 08:01, Steve Dower wrote:
On 20Jan2018 0518, M.-A. Lemburg wrote:
do you know of a definite resource for Windows code pages on MSDN or another official MS website ?
I don't know of anything sorry, and my quick search didn't turn up anything public. But I can at least confirm that the internal table for cp1252 has the same undefined characters as on unicode.org, so presumably if MultiByteToWideChar is mapping those to "best fit" characters it's only because the flag has been passed. As far as I can tell, Microsoft has not been secretly redefining any encodings.
Thanks for confirming, Steve.

-- Marc-Andre Lemburg, eGenix.com
19.01.18 05:51, Guido van Rossum wrote:
Can someone explain to me why this is such a controversial issue?
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
In any case you need to change your code. If we add a new error handler -- you need to change the decoding code to use this error handler:

    text = data.decode(encoding, 'whatwgreplace')

If we add new encodings -- you need to support an alias table that maps standard encoding names to the corresponding names of WHATWG encodings:

    aliases = {'windows_1252': 'windows-1252-whatwg',
               'windows_1251': 'windows-1251-whatwg',
               'utf_8': 'utf-8-whatwg',  # utf-8 + surrogatepass
               ...
               }
    ...
    text = data.decode(aliases.get(normalize_encoding(encoding), encoding))

I don't see an advantage of the second approach for the end user. And of course it is more costly for maintainers, because we will need to implement around 20 new encodings, and it adds a cognitive burden for new Python users, who now have more tables of encodings in the documentation.
On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them. There's a reason to prefer new encoding names (maybe augmented with alias table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name. -- --Guido van Rossum (python.org/~guido)
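To make Guido's point about name-only APIs concrete, here is a small illustrative sketch (the function and the config key are invented for this example; 'whatwgreplace' is the hypothetical handler from the earlier message):

    # The encoding often travels alone, e.g. in a config file or an HTTP
    # header; there is no slot for an error handler to travel with it.
    config = {"payload_encoding": "windows-1252"}

    def read_payload(raw: bytes, cfg: dict) -> str:
        # With a dedicated codec name such as 'windows-1252-whatwg', the one
        # string is all the information needed.  With the error-handler
        # approach, every layer that eventually calls .decode() would also
        # have to know to pass errors='whatwgreplace'.
        return raw.decode(cfg["payload_encoding"])

    print(read_payload(b"caf\xe9", config))   # café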
On 31.01.2018 17:36, Guido van Rossum wrote:
On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka@gmail.com <mailto:storchaka@gmail.com>> wrote:
On 19.01.18 05:51, Guido van Rossum wrote:
Can someone explain to me why this is such a controversial issue?
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
In any case you need to change your code. If we add a new error handler, you need to change the decoding code to use this error handler:

    text = data.decode(encoding, 'whatwgreplace')

If we add new encodings, you need to maintain an alias table that maps standard encoding names to the corresponding names of the WHATWG encodings:

    aliases = {'windows_1252': 'windows-1252-whatwg',
               'windows_1251': 'windows-1251-whatwg',
               'utf_8': 'utf-8-whatwg',  # utf-8 + surrogatepass
               ...
               }
    ...
    text = data.decode(aliases.get(normalize_encoding(encoding), encoding))

I don't see an advantage of the second approach for the end user. And of course it is more costly for maintainers, because we will need to implement around 20 new encodings, and it adds a cognitive burden for new Python users, who now have more tables of encodings in the documentation.
Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them.
There's a reason to prefer new encoding names (maybe augmented with alias table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name.
I already mentioned several reasons why I don't believe it's a good idea to add these encodings to the stdlib as opposed to keeping them on PyPI for those who need them, so won't repeat. One detail I did not mention is that these encodings do not have standard names. WHATWG uses the same names as the original encodings from which they derive - which makes sense for their intended purpose to interpret data coming from web servers, essentially in a decoding only way, but cannot be used for Python since our encodings follow the Unicode standard and don't generate mojibake when encoding. Whatever name would be used in the stdlib would neither be compatible to WHATWG nor to IANA. No other tool outside Python would be able to interpret the encoded data using those names. Given all those issues, I don't see what the benefit would be to add these encodings to the stdlib over leaving them on PyPI for the special use case of reading broken web server data. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 31 2018)
On 31.01.18 18:36, Guido van Rossum wrote:
On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka@gmail.com <mailto:storchaka@gmail.com>> wrote:
On 19.01.18 05:51, Guido van Rossum wrote:
Can someone explain to me why this is such a controversial issue?
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
In any case you need to change your code. If we add a new error handler, you need to change the decoding code to use this error handler:

    text = data.decode(encoding, 'whatwgreplace')

If we add new encodings, you need to maintain an alias table that maps standard encoding names to the corresponding names of the WHATWG encodings:

    aliases = {'windows_1252': 'windows-1252-whatwg',
               'windows_1251': 'windows-1251-whatwg',
               'utf_8': 'utf-8-whatwg',  # utf-8 + surrogatepass
               ...
               }
    ...
    text = data.decode(aliases.get(normalize_encoding(encoding), encoding))

I don't see an advantage of the second approach for the end user. And of course it is more costly for maintainers, because we will need to implement around 20 new encodings, and it adds a cognitive burden for new Python users, who now have more tables of encodings in the documentation.
Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them.
The codecs module documentation contains several tables of encodings: standard encodings, Python-specific text encodings, binary transforms and text transforms (a single one). This will add yet another large table. A user who is learning Python will need to learn how these encodings differ from the other encodings and how to use them correctly. A new user doesn't know what is important for them and what they can ignore until they need it (or how to know that they need it).
There's a reason to prefer new encoding names (maybe augmented with alias table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name.
The encoding passed around differs from the name of the new Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If we just change the existing encoding, this can break other code that expects the standard 'windows-1252'. Thus every time 'windows-1252-whatwg' rather than 'windows-1252' is needed for the text you received, you have to map encoding names. How does this differ from using a special error handler?

Yet another problem is that we actually need two error handlers. WHATWG specifies two behaviors for unmapped codes outside of the C0-C1 range: replacing with a special character, or an error. This corresponds to the standard Python handlers 'replace' and 'strict'. Thus we need either to add two new error handlers, 'whatwgreplace' and 'whatwgstrict', or to add *two* sets of new encodings (more than 70 encodings in total!).
OK, I am no longer interested in this topic. If you can't reach agreement, so be it, and then the status quo prevails. I am going to mute this thread. There's no need to explain to me why I am wrong. On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 31.01.18 18:36, Guido van Rossum wrote:
On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka@gmail.com
<mailto:storchaka@gmail.com>> wrote:
On 19.01.18 05:51, Guido van Rossum wrote:
Can someone explain to me why this is such a controversial issue?
It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
In any case you need to change your code. If we add a new error handler, you need to change the decoding code to use this error handler:

    text = data.decode(encoding, 'whatwgreplace')

If we add new encodings, you need to maintain an alias table that maps standard encoding names to the corresponding names of the WHATWG encodings:

    aliases = {'windows_1252': 'windows-1252-whatwg',
               'windows_1251': 'windows-1251-whatwg',
               'utf_8': 'utf-8-whatwg',  # utf-8 + surrogatepass
               ...
               }
    ...
    text = data.decode(aliases.get(normalize_encoding(encoding), encoding))

I don't see an advantage of the second approach for the end user. And of course it is more costly for maintainers, because we will need to implement around 20 new encodings, and it adds a cognitive burden for new Python users, who now have more tables of encodings in the documentation.
Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them.
The codecs module documentation contains several tables of encodings: standard encodings, Python-specific text encodings, binary transforms and text transforms (a single one). This will add yet another large table. A user who is learning Python will need to learn how these encodings differ from the other encodings and how to use them correctly. A new user doesn't know what is important for them and what they can ignore until they need it (or how to know that they need it).
There's a reason to prefer new encoding names (maybe augmented with alias
table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name.
The encoding passed around differs from the name of the new Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If we just change the existing encoding, this can break other code that expects the standard 'windows-1252'. Thus every time 'windows-1252-whatwg' rather than 'windows-1252' is needed for the text you received, you have to map encoding names. How does this differ from using a special error handler?
Yet another problem is that we actually need two error handlers. WHATWG specifies two behaviors for unmapped codes outside of the C0-C1 range: replacing with a special character, or an error. This corresponds to the standard Python handlers 'replace' and 'strict'. Thus we need either to add two new error handlers, 'whatwgreplace' and 'whatwgstrict', or to add *two* sets of new encodings (more than 70 encodings in total!).
-- --Guido van Rossum (python.org/~guido)
On Wed, 31 Jan 2018 at 12:50 Serhiy Storchaka <storchaka@gmail.com> wrote:
The encoding passed around differs from the name of the new Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If we just change the existing encoding, this can break other code that expects the standard 'windows-1252'. Thus every time 'windows-1252-whatwg' rather than 'windows-1252' is needed for the text you received, you have to map encoding names. How does this differ from using a special error handler?
How is that the *same* as using a special error handler? This is not at all what error handlers are for. Mapping Python encoding names to the WHATWG standard (which, incidentally, is now also the W3C standard) is currently addressed by the "webencodings" package. That package currently doesn't return the correct encodings (because they don't exist), but it does at least return windows-1252 when a Web page says it's in "iso-8859-1", because that's what the Web standard says to do. … Yet another problem is that we actually need two error handlers. WHATWG
specifies two behaviors for unmapped codes outside of the C0-C1 range: replacing with a special character, or an error. This corresponds to the standard Python handlers 'replace' and 'strict'. Thus we need either to add two new error handlers, 'whatwgreplace' and 'whatwgstrict', or to add *two* sets of new encodings (more than 70 encodings in total!).
What?! This is going way off the rails. There are 8 new encodings. Not 70. Those 8 encodings would use the error handlers that already exist in Python. Why are you even talking about the C0 range? The C0 range is in ASCII. The ridiculous complexity of some of these counter-proposals has largely come from trying to use an error handler to do an encoding's job; now you're proposing to also use more encodings to do the error handler's job. I don't think it's a serious proposal, it's just so you could say "now you need 70 encodings lol". Maybe you just like to torpedo things? The "whatwg error handler" thing will not happen. It is a terrible design, a misunderstanding of what error handlers are for, and it attempts to be an overly-general solution to a _problem that does not generalize_. Even if this task could be sensibly implemented with error handlers, there are no other instances where these error handlers would ever be useful.
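For readers following along, this is roughly what the webencodings behaviour Rob refers to looks like -- a sketch assuming the third-party webencodings package is installed, using only its documented lookup() function:

    import webencodings

    # The WHATWG label table treats "iso-8859-1" and "latin1" as labels for
    # windows-1252, and webencodings follows that table:
    print(webencodings.lookup('iso-8859-1').name)   # windows-1252

    # But the codec behind that name is still Python's windows-1252, so the
    # bytes that only the WHATWG flavour defines still fail to decode:
    try:
        b'\x90'.decode('windows-1252')
    except UnicodeDecodeError as exc:
        print(exc)   # byte 0x90 is undefined in Python's mapping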
On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Hm. As a user, unless I run into problems with a specific encoding, I
never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them.
The codecs module documentation contains several tables of encodings: standard encodings, Python-specific text encodings, binary transforms and text transforms (a single one). This will add yet another large table. A user who is learning Python will need to learn how these encodings differ from the other encodings and how to use them correctly. A new user doesn't know what is important for them and what they can ignore until they need it (or how to know that they need it).
no new user to Python is going to study the entire set of built-in encodings in Python to decide what is useful to them -- no one! New (and experienced) users take the opposite approach -- they need an encoding for one reason or another (they are providing data to a service that requires a particular encoding, or they are reading data in another particular encoding). They then look at the built-in encodings to see if the one they want is there. A slightly larger set to look through is a very small burden, particularly if it's properly documented and has all the common synonyms. I still have no idea why there is such resistance to this -- yes, it's a fairly small benefit over a package on PyPI, but there is also virtually no downside. (I'm assuming the OP (or someone) will do all the actual work of coding and updating docs....) Practicality Beats Purity -- and this is a practical solution. sigh. -CHB -- Christopher Barker, Ph.D. Oceanographer, NOAA/NOS/OR&R, Chris.Barker@noaa.gov
On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker <chris.barker@noaa.gov> wrote:
I still have no idea why there is such resistance to this -- yes, it's a fairly small benefit over a package on PyPI, but there is also virtually no downside.
I don't understand it either. Aside from maybe bikeshedding the *name* of the encoding, this seems like a pretty straight-forward addition. ChrisA
On 01.02.2018 00:40, Chris Angelico wrote:
On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker <chris.barker@noaa.gov> wrote:
I still have no idea why there is such resistance to this -- yes, it's a fairly small benefit over a package on PyPI, but there is also virtually no downside.
I don't understand it either. Aside from maybe bikeshedding the *name* of the encoding, this seems like a pretty straight-forward addition.
I guess many of you are not aware of how we have treated such encoding additions in the past 1.5 decades. In general, we have only added new encodings when there was an encoding missing which a lot of people were actively using. We asked for official documentation defining the mappings, references showing usage and IANA or similar standard names to use for the encoding itself and its aliases. In recent years, we had only very few such requests, mainly because the set we have in Python is already fairly complete. Now the OP comes proposing to add a whole set of encodings which only differ slightly from our existing ones. Backing is their use and definition by WHATWG, a consortium of browser vendors who are interested in showing web pages to users in a consistent way. WHATWG decided to simply override the standard names for encodings with new mappings under their control. Again, their motivation is clear: browsers get documents with advertised encoding which don't always match the standard ones, so they have to make some choices on how to display those documents. The easiest way for them is to define all special cases in a set of new mappings for each standard encoding name. This is all fine, but it's also a very limited use case: that of wanting to display web pages in a browser. It's certainly needed for applications implementing browser interfaces and probably also for ones which do web scraping, but otherwise, the need should rarely arise. What WHATWG uses as workarounds may also not necessarily be what actual users would like to have. Such workarounds are always trade-offs and they can change over time - which WHATWG addresses by making the encodings "living standards". They are a solution, but not a one fits all way of dealing with broken data. We also have the naming issue, since WHATWG chose to use the same names as the standard mappings. Anything we'd define will neither match WHATWG nor any other encoding standard name, so we'd be creating a new set of encoding names - which is really not what the world is after, including WHATWG itself. People would start creating encoded text using these new encoding names, resulting in even more mojibake out there instead of fixing the errors in the data and using Unicode or UTF-8 for interchange. As I mentioned before, we could disable encoding in the new mappings to resolve this concern, but the OP wasn't interested in such an approach. As alternative approach we proposed error handlers, which are the normal technology to use when dealing with encoding errors. Again, the OP wasn't interested. Please also note that once we start adding, say "whatwg-<original name>" encodings (or rather decodings :-), going for the simple charmap encodings first, someone will eventually also request addition of the more complex Asian encodings which WHATWG defines. Maintaining these is hard, since they require writing C code for performance reasons and to keep the mapping tables small. I probably forgot a few aspects, but the above is how I would summarize the discussion from the perspective of the people who have dealt with such discussions in the past. There are quite a few downsides to consider and since the OP is not interested in going for a compromise as described above, I don't see a way forward. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2018)
On Thu, Feb 01, 2018 at 10:20:00AM +0100, M.-A. Lemburg wrote:
In general, we have only added new encodings when there was an encoding missing which a lot of people were actively using. We asked for official documentation defining the mappings, references showing usage and IANA or similar standard names to use for the encoding itself and its aliases. [...] Now the OP comes proposing to add a whole set of encodings which only differ slightly from our existing ones. Backing is their use and definition by WHATWG, a consortium of browser vendors who are interested in showing web pages to users in a consistent way.
That gives us a defined mapping, references showing usage, but (alas) not standard names, due to the WHATWG's (foolish and arrogantly obnoxious, in my opinion) decision to re-use the standard names for the non-standard usages. Two out of three seems like a reasonable start to me. But one thing we haven't really discussed is, why is this an issue for Python? Everything I've seen so far suggests that these standards are only for browsers and/or web scrapers. That seems fairly niche to me. If you're writing a browser in Python, surely it isn't too much to ask that you import a set of codecs from a third party library? If I've missed something, please say so.
We also have the naming issue, since WHATWG chose to use the same names as the standard mappings. Anything we'd define will neither match WHATWG nor any other encoding standard name, so we'd be creating a new set of encoding names - which is really not what the world is after, including WHATWG itself.
I hear you, but I think this is a comparatively minor objection. I don't think it is a major problem for usability if we were to call these encodings "spam-whatwg" instead of "spam". It isn't difficult for browser authors to write:

    encoding = get_document_encoding()
    if config.USE_WHATWG_ENCODINGS:
        encoding += '-whatwg'

or otherwise look the encodings up in a mapping. We could even provide that mapping in the codecs module:

    encoding = codecs.whatwg_mapping.get(encoding, encoding)

So the naming issue shouldn't be more than a minor nuisance, and one we can entirely place in the lap of the WHATWG for misusing standard names. Documentation-wise, I'd argue for placing these in a separate sub-section of the codecs docs, with a strong notice that they should only be used for decoding web documents and not for creating new documents (except for testing purposes).
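Concretely, such a table might look something like this -- a minimal sketch only, since codecs.whatwg_mapping does not exist today and the names on the right are hypothetical stdlib spellings:

    # Hypothetical label-to-codec table for the suggestion above.
    whatwg_mapping = {
        'iso-8859-1':   'windows-1252-whatwg',   # WHATWG folds latin-1 into 1252
        'windows-1252': 'windows-1252-whatwg',
        'windows-1251': 'windows-1251-whatwg',
        'windows-874':  'windows-874-whatwg',
        # ... one entry per WHATWG single-byte encoding
    }

    def resolve(label: str) -> str:
        # Fall back to the label unchanged when no WHATWG variant exists.
        return whatwg_mapping.get(label.strip().lower(), label)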
People would start creating encoded text using these new encoding names, resulting in even more mojibake out there instead of fixing the errors in the data and using Unicode or UTF-8 for interchange.
We can't stop people from doing that: so long as the encodings exist as a third-party package, people who really insist on creating such abominable documents can do so. Just as they currently can accidentally create mojibake in their own documents by misunderstanding encodings, or as they can create new documents using legacy encodings like MacRoman instead of UTF-8 like they should. (And very occasionally, they might even have a good reason for doing so -- while we can and should *discourage* such uses, we cannot and should not expect to prohibit them.) If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
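As a rough sketch of how that warning behaviour could be prototyped outside the stdlib (the codec name is hypothetical, and for brevity it simply reuses Python's cp1252 tables rather than the real WHATWG mapping):

    import codecs
    import warnings

    _base = codecs.lookup('cp1252')

    def _warn_encode(input, errors='strict'):
        # Warn, then defer to the underlying codec's encoder.
        warnings.warn("encoding with a WHATWG codec; these are intended "
                      "for decoding web data, not for producing it",
                      stacklevel=2)
        return _base.encode(input, errors)

    def _search(name):
        # The registry may normalize hyphens to underscores, so accept both.
        if name.replace('_', '-') == 'windows-1252-whatwg':
            return codecs.CodecInfo(_warn_encode, _base.decode,
                                    name='windows-1252-whatwg')
        return None

    codecs.register(_search)

    '€'.encode('windows-1252-whatwg')   # returns b'\x80', but emits a warning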
As I mentioned before, we could disable encoding in the new mappings to resolve this concern, but the OP wasn't interested in such an approach. As alternative approach we proposed error handlers, which are the normal technology to use when dealing with encoding errors. Again, the OP wasn't interested.
Be fair: it isn't that the OP (Rob Speer) merely isn't interested, he does make some reasonable arguments that error handlers are the wrong solution. He's convinced me that an error handler isn't the right way to do this. He *hasn't* convinced me that the stdlib needs to solve this problem, but if it does, I think some new encodings are the right way to do it.
Please also note that once we start adding, say "whatwg-<original name>" encodings (or rather decodings :-), going for the simple charmap encodings first, someone will eventually also request addition of the more complex Asian encodings which WHATWG defines. Maintaining these is hard, since they require writing C code for performance reasons and to keep the mapping tables small.
YAGNI -- we can deal with that when and if it gets requested. This is not the camel's nose: adding a handful of 8-bit WHATWG encodings does not oblige us to add more. [...]
There are quite a few downsides to consider
Indeed -- this isn't a "no-brainer". That's why I'm still hoping to see a fair and balanced PEP.
and since the OP is not interested in going for a compromise as described above, I don't see a way forward.
Status quo wins a stalemate. Sometimes that's better than a broken solution that won't satisfy anyone. -- Steve
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode-based interface rather than a binary one. So I think the *use case* for the WHATWG encodings has been pretty well established. What hasn't been established is whether our answer to "How do I handle the WHATWG encodings?" is going to be:

* "Here they are in the standard library (for 3.8+)!"; or
* "These are available as part of the 'ftfy' library on PyPI, which also helps fix various other problems in decoded text"

Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread - when it comes to consuming arbitrary data from the internet and cleaning up decoding issues, ftfy's data-introspection-based approach is likely to be far easier to start with than characterising the common errors for specific data sources and applying them individually, and if you're already using ftfy to figure out which fixes are needed, then it shouldn't be a big deal to keep it around for the more relaxed codecs that it provides. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
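A concrete instance of that "undo the wrong decode" workflow, as a sketch assuming ftfy is installed and that importing it registers its sloppy-windows-1252 codec, as described earlier in the thread:

    import ftfy  # registering the 'sloppy-*' codecs is a side effect of import

    # '“quoted”' in UTF-8, but some upstream tool decoded it as (WHATWG-style)
    # windows-1252; note the \x9d, which Python's own cp1252 cannot round-trip.
    garbled = 'â€œquotedâ€\x9d'

    fixed = garbled.encode('sloppy-windows-1252').decode('utf-8')
    print(fixed)   # “quoted”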
On 05.02.18 05:01, Nick Coghlan wrote:
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode based interface rather than a binary one.
Wasn't the "surrogateescape" error handler designed for this purpose? WHATWG encodings solve the same problem as "surrogateescape", but 1) they use a different range for representing unmapped bytes, and 2) not all unmapped bytes can be decoded, thus a decoding is lossy and a round-trip does not always work.
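To spell out the difference in ranges using only codecs that exist today (no WHATWG codec is involved; latin-1 merely shows the code point a WHATWG-style decoder would produce):

    # surrogateescape smuggles the undecodable byte 0x90 through as the lone
    # surrogate U+DC90, which round-trips but is not a valid character:
    print(ascii(b'\x90'.decode('windows-1252', 'surrogateescape')))  # '\udc90'

    # A WHATWG-style decoder would instead yield the ordinary C1 control
    # character U+0090:
    print(ascii(b'\x90'.decode('latin-1')))                          # '\x90'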
On 5 February 2018 at 06:40, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 05.02.18 05:01, Nick Coghlan wrote:
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode based interface rather than a binary one.
Wasn't the "surrogateescape" error handler designed for this purpose?
WHATWG encodings solve the same problem as "surrogateescape", but
1) they use a different range for representing unmapped bytes, and 2) not all unmapped bytes can be decoded, thus a decoding is lossy and a round-trip does not always work.
Surrogateescape is for when the source of the Unicode data is also Python. The WHATWG encodings (AIUI) can be used by any tool to attempt to decode data. If that "I think this is what it is" data is passed as Unicode to Python, and the Python code determines that the guess was wrong, then re-encoding it using the WHATWG encoding lets you try again to decode it properly. The result would be lossy, yes. Whether this is a problem, I can't say, as I've never encountered the sorts of use cases being discussed here. I assume that the people advocating for this have, and consider this option, even if it's lossy, to be the best approach. For a non-stdlib based solution, I see no problem with this. If the codecs are to go into the stdlib, then I do think we should be able to document clearly what the use case is for these encodings, and why a user reading the codecs docs should pick these encodings over another one. That's where I think the proposal currently falls down - not in the usefulness of the codecs, nor in the naming (both of which seem to me to have been covered) but in providing a good enough explanation *to non-specialists* of why these codecs exist, how they should be used, and what the caveats are. Something that we'd be comfortable including in the docs. Paul
On 05.02.2018 04:01, Nick Coghlan wrote:
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode based interface rather than a binary one.
So the use case for encoding into WHATWG is to undo the WHATWG mappings by then decoding using the standard mappings and using an error handler to deal with decoding issues ? This strikes me as a rather unrealistic use case, esp. since it's likely that the original decoding was also done in Python, so the much more intuitive approach to fix this problem would be to not use WHATWG encodings for the initial decoding in the first place.
So I think the *use case* for the WHATWG encodings has been pretty well established. What hasn't been established is whether our answer to "How do I handle the WHATWG encodings?" is going to be:
* "Here they are in the standard library (for 3.8+)!"; or * "These are available as part of the 'ftfy' library on PyPI, which also helps fixes various other problems in decoded text"
Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread - when it comes to consuming arbitrary data from the internet and cleaning up decoding issues, ftfy's data introspection based approach is likely to be far easier to start with than characterising the common errors for specific data sources and applying them individually, and if you're already using ftfy to figure out which fixes are needed, then it shouldn't be a big deal to keep it around for the more relaxed codecs that it provides. I think we've been going around in circles long enough.
Let's leave things as they are and perhaps add a section to the codecs documentation, as you suggest, pointing to other encodings which a user might want to use and to tools that help with fixing encoding or decoding errors. Here's a random list from PyPI with some packages: https://pypi.python.org/pypi/ebcdic/ https://pypi.python.org/pypi/latexcodec/ https://pypi.python.org/pypi/mysql-latin1-codec/ https://pypi.python.org/pypi/cbmcodecs/ Perhaps fun variants such as: https://pypi.python.org/pypi/emoji-encoding/ -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 05 2018)
On 05.02.18 12:52, M.-A. Lemburg wrote:
Let's leave things as they are and perhaps add a section to the codecs documentation, as you suggest, pointing to other encodings which a user might want to use and to tools that help with fixing encoding or decoding errors.
Here's a random list from PyPI with some packages: https://pypi.python.org/pypi/ebcdic/ https://pypi.python.org/pypi/latexcodec/ https://pypi.python.org/pypi/mysql-latin1-codec/ https://pypi.python.org/pypi/cbmcodecs/
Perhaps fun variants such as: https://pypi.python.org/pypi/emoji-encoding/
But before we add references to third-party packages we should examine them: check that they are compatible with recent versions of Python, that they do what they claim, and that they don't contain malicious code.
On 05.02.2018 12:39, Serhiy Storchaka wrote:
On 05.02.18 12:52, M.-A. Lemburg wrote:
Let's leave things as they are and perhaps add a section to the codecs documentation, as you suggest, pointing to other encodings which a user might want to use and to tools that help with fixing encoding or decoding errors.
Here's a random list from PyPI with some packages: https://pypi.python.org/pypi/ebcdic/ https://pypi.python.org/pypi/latexcodec/ https://pypi.python.org/pypi/mysql-latin1-codec/ https://pypi.python.org/pypi/cbmcodecs/
Perhaps fun variants such as: https://pypi.python.org/pypi/emoji-encoding/
But before we add references to third-party packages we should examine them: check that they are compatible with recent versions of Python, that they do what they claim, and that they don't contain malicious code.
Sure. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 05 2018)
Nick Coghlan writes:
Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread
Yes please. The more I hear about purported use cases (with the exception of Nathaniel's "don't crash when I manipulate the DOM" case, which is best handled by errors='surrogateescape'), the less I see anything "standard" about them.
By now, it sounds right to me that I should implement these codecs in a package. I accept that I've established the use case, but not sufficiently established why it belongs in Python. The package can easily be ftfy -- although I should point out that what's in ftfy at the moment isn't quite right! "ftfy.bad_codecs" implements the "fall back on Latin-1" idea that many people here have intuitively suggested, because I was implementing it just based on the evidence of text I saw; I didn't know at the time that there was an actual standard involved. The result differs subtly from what Web browsers do in cases outside the C1 range. But of course I can work on re-implementing the encodings correctly based on what I've learned. I think it would be best if these encodings were actually implemented in the "webencodings" package, or in a package that both ftfy and webencodings could use. I have certainly encountered cases in web scraping where, because webencodings doesn't use the same Windows-1252 as the actual web does, I have had to decode the text even more incorrectly using Latin-1 and _then_ run it through ftfy -- in effect, adding a layer of mojibake so I can fix two layers of mojibake. That's kind of absurd and it's why I thought this belonged in Python itself. But I'll talk to the webencodings author instead. On Tue, 6 Feb 2018 at 05:12 Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Nick Coghlan writes:
Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread
Yes please. The more I hear about purported use cases (with the exception of Nathaniel's "don't crash when I manipulate the DOM" case, which is best handled by errors='surrogateescape'), the less I see anything "standard" about them.
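A sketch of the double-decoding workaround Rob describes above, assuming ftfy is installed (fix_text is its documented entry point; the exact output depends on ftfy's heuristics):

    import ftfy

    raw = b'\xe2\x80\x9cquoted\xe2\x80\x9d'   # really UTF-8

    # Python's (and webencodings') windows-1252 refuses 0x9d, so decode with
    # latin-1 -- deliberately "more wrong" -- and let ftfy untangle the result.
    text = raw.decode('latin-1')
    print(ftfy.fix_text(text))                # should recover: “quoted”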
On 1/31/2018 6:15 PM, Chris Barker wrote:
I still have no idea why there is such resistance to this
Every proposal should be resisted to the extent of requiring clarity, consideration of alternatives, and sufficient justification.
yes, it's a fairly small benefit over a package on PyPI,
So why move *this* code? The clash with flake8 is an issue between the package and flake8 and is irrelevant to adding it to the stdlib. Every feature on PyPi would be more convenient for at least a few people if moved. Why specifically this package, more than a couple hundred others? Our current position is that most anything on PyPI should stay there.
but there is also virtually no downside.
All changes, and especially feature additions, have a downside, as has been explained by Steven D'Aprano more than once. M.-A. Lemburg already summarized his view of the specifics for this issue. And see below.
(I'm assuming the OP (or someone) will do all the actual work of coding and updating docs....)
At least one core developer has to *volunteer* to review, likely edit or request edits, merge, and *take responsibility* for the consequences of the PR. At minimum, there is the opportunity cost of the core developer not making some other improvement, which some might see as more valuable.
Practicality Beats Purity -- and this is a practical solution.
It is an ugly hack, which also has practical problems. Here is the full applicable quote from Tim's Zen: Special cases aren't special enough to break the rules. Although practicality beats purity. I take this to mean that normal special cases are not special enough but some special special cases are. The meta meaning is that decisions are not mechanical and require tradeoffs, and that people will honestly disagree in close cases. -- Terry Jan Reedy
On 01/02/18 21:34, Terry Reedy wrote:
On 1/31/2018 6:15 PM, Chris Barker wrote:
I still have no idea why there is such resistance to this
Every proposal should be resisted to the extent of requiring clarity, consideration of alternatives, and sufficient justification.
yes, it's a fairly small benefit over a package on PyPI,
So why move *this* code? The clash with flake8 is an issue between the package and flake8 and is irrelevant to adding it to the stdlib. Every feature on PyPi would be more convenient for at least a few people if moved. Why specifically this package, more than a couple hundred others? Our current position is that most anything on PyPI should stay there.
but there is also virtually no downside.
All changes, and especially feature additions, have a downside, as has been explained by Steven D'Aprano more than once. M.-A. Lemburg already summarized his view of the specifics for this issue. And see below.
(I'm assuming the OP (or someone) will do all the actual work of coding and updating docs....)
At least one core developer has to *volunteer* to review, likely edit or request edits, merge, and *take responsibility* for the consequences of the PR. At minimum, there is the opportunity cost of the core developer not making some other improvement, which some might see as more valuable.
Practicality Beats Purity -- and this is a practical solution.
It is an ugly hack, which also has practical problems.
Here is the full applicable quote from Tim's Zen:
Special cases aren't special enough to break the rules. Although practicality beats purity.
I take this to mean that normal special cases are not special enough but some special special cases are. The meta meaning is that decisions are not mechanical and require tradeoffs, and that people will honestly disagree in close cases.
I now see this entire thread as Status Quo 1, Proposal -1, so can we please move on? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On Thu, Feb 1, 2018 at 1:34 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 1/31/2018 6:15 PM, Chris Barker wrote:
I still have no idea why there is such resistance to this
M.-A. Lemburg already summarized his view of the specifics for this issue. And see below.
Thanks for that, I know I phrased it in a not-very-open-for-discussion way, but that was what I was looking for. Frankly, I disagree with much of that, but it's been clearly laid out, which is what is needed to make a decision.
(I'm assuming the OP (or someone) will do all the actual work of coding
and updating docs....)
At least one core developer has to *volunteer* to review, likely edit or request edits, merge, and *take responsibility* for the consequences of the PR.
Fair enough -- it would be quite reasonable to say that this (or anything) won't get included unless a core dev decides it's worth his/her time to bring it in -- but that is different from saying it won't be brought in regardless.
I take this to mean that normal special cases are not special enough but some special special cases are. The meta meaning is that decisions are not mechanical and require tradeoffs, and that people will honestly disagree in close cases.
yup -- there seems to be much resistance, and not much support -- so I guess we're done. -CHB -- Christopher Barker, Ph.D. Oceanographer, NOAA/NOS/OR&R
On 09.01.18 23:15, Rob Speer wrote:
There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at: https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding.. Python's windows-1252 has bytes that are undefined:
b'\x90'.decode('windows-1252') UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:
- It's compatible with windows-1252 - Any sequence of bytes can be round-tripped through it without losing information
It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/ <https://encoding..spec.whatwg.org/>) contains modified versions of windows-1250 through windows-1258 and windows-874.
The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding.
    b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape')
    '\udc90‘’“'
    '\udc90‘’“'.encode('windows-1252', 'surrogateescape')
    b'\x90\x91\x92\x93'
If you want to map unassigned bytes to other characters, you should just create a new error handler. There are caveats, since such characters are not distinguished from correctly decoded characters. The same problem exists with the UTF-8 encoding. WHATWG allows encoding and decoding surrogate characters in the range U+d800-U+dcff. This is contrary to the Unicode Standard and raises an error by default in Python. But you can allow encoding and decoding of surrogate characters by explicitly specifying the "surrogatepass" error handler.
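For completeness, this is what "just create a new error handler" looks like in practice -- a minimal sketch (the handler name is made up) that maps only undefined bytes in the C1 range to the control characters with the same numbers, which is what the WHATWG table does for windows-1252:

    import codecs

    def whatwg_c1(exc):
        # For an undecodable byte in 0x80-0x9F, return the control character
        # with the same code point and resume after that byte.
        if isinstance(exc, UnicodeDecodeError):
            byte = exc.object[exc.start]
            if 0x80 <= byte <= 0x9F:
                return chr(byte), exc.start + 1
        raise exc

    codecs.register_error('whatwg_c1', whatwg_c1)

    print(ascii(b'\x90hello'.decode('windows-1252', 'whatwg_c1')))
    # '\x90hello'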
On 11 Jan 2018 10:56, "Serhiy Storchaka" <storchaka@gmail.com> wrote: On 09.01.18 23:15, Rob Speer wrote:
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252",
I'd suggest naming it then "whatwg-windows-1252", and in general "whatwg-" + WHATWG's name of the encoding. Stephan but notice that it's subtly different from Python's "windows-1252" encoding..
Python's windows-1252 has bytes that are undefined:
b'\x90'.decode('windows-1252') UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:
- It's compatible with windows-1252 - Any sequence of bytes can be round-tripped through it without losing information
It's not just this one encoding. WHATWG's encoding standard ( https://encoding.spec.whatwg.org/ <https://encoding..spec.whatwg.org/>) contains modified versions of windows-1250 through windows-1258 and windows-874.
The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding.
    b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape')
    '\udc90‘’“'
    '\udc90‘’“'.encode('windows-1252', 'surrogateescape')
    b'\x90\x91\x92\x93'
If you want to map unassigned bytes to other characters, you should just create a new error handler. There are caveats, since such characters are not distinguished from correctly decoded characters. The same problem exists with the UTF-8 encoding. WHATWG allows encoding and decoding surrogate characters in the range U+d800-U+dcff. This is contrary to the Unicode Standard and raises an error by default in Python. But you can allow encoding and decoding of surrogate characters by explicitly specifying the "surrogatepass" error handler.
On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding.
Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described?
On Thu, 11 Jan 2018 at 11:43 Random832 <random832@fastmail.com> wrote:
Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described?
(accidentally replied individually instead of replying all) There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this. Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
On Thu, Jan 11, 2018, at 14:55, Rob Speer wrote:
There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this.
It appears in the best fit mapping (with a comment suggesting it is unclear what vowel point it is actually meant to be) but not the normal mapping.
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
This is, for the record, also consistent with the results of my test program - 0xCA is treated as a perfectly ordinary mapping that goes to U+05BA, whereas 0xFF returns an error. In permissive mode it maps to U+F896. 0xCA U+05BA appears (with no glyph, though) in the code chart Microsoft published with https://www.microsoft.com/typography/unicode/cscp.htm, but not in the corresponding mapping list. It also does not appear in https://msdn.microsoft.com/en-us/library/cc195057.aspx.
Rob Speer writes:
There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this.
In the past Microsoft has changed windows-125x coded character sets in Windows without updating the IANA registry. It's not clear to me how to deal with these nonstandards. I suspect that Microsoft will follow WHAT-WG in this in the end. Given that in practice Windows encodings are nonstandards not even followed by their defining authority, it seems reasonable to me that Python could update to following WHAT-WG, as long as it's a superset of the current codec (in a 3.x release, not a 3.x.y release); at least the way the encoding standard is presented they're pretty good at this, and likely more reliable going forward than Microsoft itself is on the legacy encodings.
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level. Probably there's no harm in a browser that does visual presentation, but in other contexts where there is text mixed with control codes we cannot predict what will happen since there is no standard interpretation in common (cross-platform) use AFAIK. And even in visual representation, out-of-channel codes can be problematic. I once crashed a Prime minicomputer by forwarding some ASCII art tuned for a VT-220 back to its author, who had stolen the very nice Prime console terminal and was using it for email. Hilarity ensued (for me, all my deadlines were weeks off). Programs are generally more robust today, but in most cases it would be a lot safer to use xmlcharrefreplace or backslashreplace, or surrogateescape to ensure that paranoid Unicode processes would reject it. Especially since there are real hostiles out there.
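For illustration, here are the "safer" handlers mentioned above applied to text carrying stray C1 control characters (plain stdlib, nothing WHATWG-specific):

    s = 'price \u0093100\u0094'          # text with smuggled C1 controls

    print(s.encode('ascii', 'backslashreplace'))    # b'price \\x93100\\x94'
    print(s.encode('ascii', 'xmlcharrefreplace'))   # b'price &#147;100&#148;'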
On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote:
Rob Speer writes:
There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this.
In the past Microsoft has changed windows-125x coded character sets in Windows without updating the IANA registry. It's not clear to me how to deal with these nonstandards. I suspect that Microsoft will follow WHAT-WG in this in the end.
Given that in practice Windows encodings are nonstandards not even followed by their defining authority, it seems reasonable to me that Python could update to following WHAT-WG, as long as it's a superset of the current codec (in a 3.x release, not a 3.x.y release); at least the way the encoding standard is presented they're pretty good at this, and likely more reliable going forward than Microsoft itself is on the legacy encodings.
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level.
This is surprising to me because I always took those encodings to have those fallbacks. It's pretty wild to think someone wouldn't want them.
Soni L. writes:
This is surprising to me because I always took those encodings to have those fallbacks [to raw control characters].
ISO-8859-1 implementations do, for historical reasons AFAICT. And they frequently produce mojibake and occasionally wilder behavior. Most legacy encodings don't, and their standards documents frequently leave the behavior undefined for control character codes (which means you can error on them) and define use of unassigned codes as an error.
It's pretty wild to think someone wouldn't want them.
In what context? WHATWG's encoding standard is *all about browsers*. If a codec is feeding text into a process that renders it all as glyphs for a human to look at, that's one thing. The codec doesn't want to raise a fatal error there, and the likely fallback glyph is something from the control-glyphs block if even windows-125x doesn't have a glyph there. I guess it sort of makes sense.
If you're feeding a program (as with JSON data, which I believe is "supposed" to be UTF-8, but many developers use the legacy charsets they're used to and which are often embedded in the underlying databases and so on; ditto XML), the codec has no idea when or how that's going to get interpreted. In one application I've maintained, an editor, it has to deal with whatever characters are sent to it, but we preferred to take charset designations seriously because users could flexibly change them if they wanted to. So the error handler is some form of replacement with a human-readable representation (not pass-through), except for the usual HT, CR, LF, FF, and DEL (and ESC in encodings using ISO 2022 extensions). Mostly users would use the editor to remove or replace invalid codes, although of course they could just leave them in (and they would be converted from display form back to the original codes on output).
In another, a mailing list manager, codes outside the defined repertoires were a recurring nightmare that crashed server processes and blocked queues. It took a decade before we sealed the last known "leak", and I am not confident there are no leaks left.
So I don't actually have experience of a use case for control character pass-through, and I wouldn't even automate the superset substitutions if I could avoid it. (In the editor case, I would provide a dialog saying "This is supposed to be iso-8859-1, but I'm seeing C1 control codes. Would you like me to try windows-1252, which uses those codes for graphic characters?")
So to my mind, the use case here is relatively restricted (writing user display interfaces) and does not need to be in the stdlib, and would constitute an attractive nuisance there (developers would say "these users will stop complaining about inability to process their dirty data if I use a WHATWG version of a codec, then they don't have to clean up"). I don't have an objection to supporting even that use case, but I don't see why that support needs to be available in the stdlib.
On 2018-01-17 03:30 AM, Stephen J. Turnbull wrote:
So I don't actually have experience of a use case for control character pass-through, and I wouldn't even automate the superset substitutions if I could avoid it.
We use control characters as formatting/control characters on IRC all the time. ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, IIRC. Windows codepages implicitly define control characters in that range, but they're still technically defined. It's a de-facto standard for those encodings. I think Python should follow the (de-facto) standard. This is it.
Soni L. writes:
ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, IIRC.
You recall incorrectly. You're probably thinking of RFC 1345. But I've never seen that cited except in the IANA registry.
All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429 primary and supplementary control sets as good choices. (Unicode goes so far as to use ISO 6429's names for the supplementary set for C1 code points while explicitly denying them *any* semantics.) But none specifies a default, and as far as I know there is no widespread agreement on what control codes are good for, except for a handful of "whitespace" characters in C0, and a couple of C1 controls that are used by (and reserved to) ISO 2022. In fact, Python ISO-8859 codecs do pass them through (both C0 and C1), and the UTF-8 codec passes through C0 and allows encoding and decoding of C1 code points.
On the other hand, the ISO standards forbid use of unassigned graphic code points as characters (graphic or control), and codecs quite reasonably treat unassigned graphic code points as errors. In Python, that practice is extended to the windows-* sets, which seems reasonable to me. But the windows-* encodings do not support C1 controls. Instead the entire right half of the code page is graphic (per Microsoft's IANA registrations), and that, I suppose, is why Python does not allow fallthrough of unassigned code points 0x80-0x9F in windows-* codecs.
I think Python should follow the (de-facto) standard. This is it.
The WHATWG encoding standard isn't a "de facto" standard; it's a published standard by a recognized (though forked) standards body. However, different standards are designed for different contexts, and WHATWG's encoding standard is clearly specifically aimed at browsers. It may also be useful for more specialized UI applications such as your IRC client, although IMO that's asking for trouble. Note also that the WHATWG standard is in a peculiar limbo between informative and normative. The standard encoding is UTF-8, end of story. What we're talking about here is best practices for UIs that are faced with non-conformant "legacy" documents, and want to display something anyway.
But Python is a general-purpose programming language, and should cleave to the most generally accepted, well-defined standards, which are the ISO standards themselves in the case of ISO-defined coded character sets. Aliasing the ISO character sets (and ASCII! oh, my aching RFC 822 header!) to the corresponding windows-* as a *general* practice is pretty abominable, though it makes some sense in the case of browsers. For the windows-* character sets, ISTM that the WHATWG repertoires of graphic characters are improvements over Microsoft's (assuming that WHATWG versions its standards).
Applications can do what they want, of course, and I'm all for a PyPI package to make it easier to do that, whether by providing additional codecs, additional error handlers, or by post-processing surrogate-escaped bytes. I still don't think the WHATWG approach is a good fit for most use cases, nor should it be included in the stdlib. Most of the use cases I've seen proposed so far are well served by existing Python features like errors='surrogateescape'.
Steve
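To make that last point concrete, a quick interpreter session of my own (not part of the thread): surrogateescape already round-trips the bytes that windows-1252 leaves unassigned, while refusing to let them leak into clean UTF-8 output.

>>> s = b'na\x90ve'.decode('windows-1252', errors='surrogateescape')
>>> s
'na\udc90ve'
>>> s.encode('windows-1252', errors='surrogateescape')
b'na\x90ve'
>>> s.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc90' in position 2: surrogates not allowed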
On 2018-01-18 04:12 PM, Stephen J. Turnbull wrote:
Most of the use cases I've seen proposed so far are well served by existing Python features like errors='surrogateescape'.
I'm just glad I *always* use bytestrings when dealing with network protocols, I guess. It's the only reasonable option.
On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote:
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level.
There are plenty of standard encodings that do have actual representations of the control characters. It's not clear why you consider it more dangerous for the "windows-1252" encoding to be able to return '\x81' for b'\x81' than for "latin-1" to do the same, or for "utf-8" to return it for b'\xc2\x81'. These characters exist. Supporting them in encodings that contain them in the real world, regardless of what was submitted to the Unicode Consortium, doesn't add any new attack surface.
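For reference, a quick interpreter check of my own (not from the thread) of how the stock codecs treat that byte today:

>>> b'\x81'.decode('latin-1')
'\x81'
>>> b'\xc2\x81'.decode('utf-8')
'\x81'
>>> b'\x81'.decode('windows-1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>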
Random832 writes:
There are plenty of standard encodings that do have actual representations of the control characters.
My complaint was not about coded character sets that don't conform to ISO 2022's conventions about control vs. graphic blocks, especially in the C1 block. It was about promoting *unassigned* codes to the Unicode scalars with the same integer values. These codes don't correspond to characters. They are undefined as far as codecs are concerned. In the case of windows-125x charsets, even though they are IANA registered, Microsoft reserves the right to change and even ignore the published repertoire without updating it. There I think it's reasonable to use WHATWG graphic character repertoires even in Python's stdlib codecs, and I wouldn't be surprised if Microsoft was willing to delegate definition of those repertoires to the WG in the end.
On 12Jan2018 0342, Random832 wrote:
On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding.
Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described?
+1 on this being an error handler (if possible). I suspect the semantics will be more complex than suggested above, but as this is really about handling normally un[en/de]codable characters, using an error handler to return something more sensible best represents what is going on. Call it something like 'web' or 'relaxed' or 'whatwg'.
I don't know if error handlers have enough context for this, though. If not, we should ensure they can have it. I'd much rather explain one new error handler to most people (and a more complex API for implementing them to the few people who do it) than explain a whole suite of new encodings.
Cheers,
Steve
On 12 January 2018 at 14:55, Steve Dower <steve.dower@python.org> wrote:
I don't know if error handlers have enough context for this though. If not, we should ensure they can have it. I'd much rather explain one new error handler to most people (and a more complex API for implementing them to the few people who do it) than explain a whole suite of new encodings.
+1 from me, which shifts my position to be:
1. If we can make a decoding-only error handler that does the desired thing in combination with our existing codecs, let's do that (perhaps using a name like "controlpass", since the intent is to pass through otherwise unassigned latin-1 control characters, similar to the way "surrogatepass" allows lone surrogates).
2. Only if 1 fails for some reason would we look at adding the extra decode-only codec variants.
Given the power of error handlers, though, I expect the surrogatepass-style error handler approach will work (see https://docs.python.org/3/library/codecs.html#codecs.register_error and https://docs.python.org/3/library/exceptions.html#UnicodeError for an overview of the information they're given and what they can do about it).
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
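Sketching the idea (my own rough illustration, not an agreed design; "controlpass" is just the name floated above), such a decode-only handler could look something like this:

import codecs

def controlpass(exc):
    # Decode-only: map otherwise-unassigned bytes in 0x80-0x9F to the C1
    # control characters U+0080-U+009F, the way the WHATWG mappings do;
    # anything outside that range remains an error.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            return ''.join(chr(b) for b in bad), exc.end
    raise exc

codecs.register_error('controlpass', controlpass)

# b'\x90' is unassigned in Python's windows-1252; WHATWG decodes it as U+0090.
print(b'\x90'.decode('windows-1252', errors='controlpass') == '\u0090')  # True

One caveat with this approach: the handler only fires for bytes the base codec rejects, so it reproduces the WHATWG tables only where they differ from Python's purely by those unassigned-byte fall-throughs (which, per the earlier discussion, is everything except the windows-1255 b'\xca' case).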
participants (21)
- Antoine Pitrou
- Chris Angelico
- Chris Barker
- Guido van Rossum
- Ivan Pozdeev
- M.-A. Lemburg
- Mark Lawrence
- MRAB
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Random832
- Rob Speer
- Serhiy Storchaka
- Soni L.
- Stephan Houben
- Stephen J. Turnbull
- Steve Barnes
- Steve Dower
- Steven D'Aprano
- Terry Reedy