[Python-ideas] Support WHATWG versions of legacy encodings

M.-A. Lemburg mal at egenix.com
Wed Jan 10 14:56:55 EST 2018


On 10.01.2018 20:13, Rob Speer wrote:
> I was originally proposing these encodings under different names, and
> that's what I think they should have. Indeed, that helps because a pip
> installable library can backport the new encodings to previous versions of
> Python.
> 
> Having a pip installable library as the _only_ way to use these encodings
> is the status quo that I am very familiar with. It's awkward. To use a
> package that registers new codecs, you have to import something from that
> package, even if you never call anything from what you imported, and that
> makes flake8 complain. The idea that an encoding name may or may not be
> registered, based on what has been imported, breaks our intuition about
> reading Python code and is very hard to statically analyze.

You can have a function in the package which registers the
codecs. That way you do have an explicit call into the library and
intuition is restored :-) (and flake8 should be happy as well):

import mycodecs
mycodecs.register()

> I disagree with calling the WHATWG encodings that are implemented in every
> Web browser "non-standard". WHATWG may not have a typical origin story as a
> standards organization, but it _is_ the standards organization for the Web.

I don't really want to get into a discussion here. Browsers
use these modified encodings to cope with mojibake or with web
content which isn't quite standards-compliant. That's a valid use
case, but promoting such workarounds by shipping the modified
encodings in the stdlib and having Python produce non-standard
output doesn't strike me as a good way forward. We do have error
handlers for dealing with partially corrupted data. I think
that's good enough.
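
For illustration, the existing error handlers already give a few ways
of dealing with such bytes, without any new codec:

```python
raw = b'caf\xe9 \x90'  # 0xE9 is defined in windows-1252, 0x90 is not

# The default 'strict' handler raises on the undefined byte:
try:
    raw.decode('windows-1252')
except UnicodeDecodeError as exc:
    print(exc.reason)  # character maps to <undefined>

# 'replace' substitutes U+FFFD; 'backslashreplace' keeps the raw byte
# visible as an escape (supported for decoding since Python 3.5):
print(raw.decode('windows-1252', errors='replace'))
print(raw.decode('windows-1252', errors='backslashreplace'))
```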

> I'm really not interested in best-fit mappings that turn infinity into "8"
> and square roots into "v". Making weird mappings like that sounds like a
> job for the "unidecode" library, not the stdlib.

Well, one of your main arguments was that the Windows API follows
these best fit encodings.

I agree that best fit may not necessarily be best fit for
everyone :-)


> On Wed, 10 Jan 2018 at 13:36 Rob Speer <rspeer at luminoso.com> wrote:
> 
>> I'm looking at the documentation of "best fit" mappings, and that seems to
>> be a different matter. It appears that best-fit mappings are designed to be
>> many-to-one mappings used only for encoding.
>>
>> "Examples of best fit are converting fullwidth letters to their
>> counterparts when converting to single byte code pages, and mapping the
>> Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also
>> does things such as mapping Cyrillic letters to Latin letters that look
>> like them.
>>
>> This is not what I'm interested in implementing. I just want there to be
>> encodings that match the WHATWG encodings exactly. If they have to be given
>> a different name, that's fine with me.
>>
>> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <mal at egenix.com> wrote:
>>
>>> On 10.01.2018 00:56, Rob Speer wrote:
>>>> Oh that's interesting. So it seems to be Python that's the exception
>>>> here.
>>>>
>>>> Would we really be able to add entries to character mappings that
>>>> haven't changed since Python 2.0?
>>>
>>> The Windows mappings in Python come directly from the Unicode
>>> Consortium mapping files.
>>>
>>> If the Consortium changes the mappings, we can update them.
>>>
>>> If not, then we have a problem, since consumers are not only
>>> the win32 APIs, but also other tools out there running on
>>> completely different platforms, e.g. Java tools or web servers
>>> providing downloads using the Windows code page encodings.
>>>
>>> Allowing such mappings in the existing codecs would then result in
>>> failures when the "other" sides see the decoded Unicode version and
>>> try to encode back into the original encoding - you'd move the
>>> problem from the Python side to the "other" side of the
>>> integration.
>>>
>>> I had a look on the Unicode FTP site and they have since added
>>> a new directory with mapping files they call "best fit":
>>>
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
>>>
>>> WideCharToMultiByte() defaults to best fit, but also offers
>>> a standards-compliant mode:
>>>
>>>
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>>
>>> See flag WC_NO_BEST_FIT_CHARS.
>>>
>>> Unicode TR#22 is also clear on this:
>>>
>>> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
>>>
>>> It allows such best fit mappings to make encodings round-trip
>>> safe, but requires keeping these separate from the original
>>> standard mappings:
>>>
>>> """
>>> It is very important that systems be able to distinguish between the
>>> fallback mappings and regular mappings. Systems like XML require the use
>>> of hex escape sequences (NCRs) to preserve round-trip integrity; use of
>>> fallback characters in that case corrupts the data.
>>> """
>>>
>>> If you read the above section in TR#22 you quickly get reminded
>>> of what the Unicode error handlers do (we basically implement
>>> the three modes it mentions... raise, ignore, replace).
>>>
>>> Now, for unmapped sequences an error handler can opt for
>>> using a fallback sequence instead.
>>>
>>> So in addition to adding best fit codecs, there's also the
>>> option to add an error handler for best fit resolution of
>>> unmapped sequences.
>>>
>>> Given the above, I don't think we ought to change the existing
>>> standards compliant mappings, but use one of two solutions:
>>>
>>> a) add "best fit" encodings (see the Unicode FTP site for
>>>    a list)
>>>
>>> b) add an error handler "bestfit" which implements the
>>>    fallback modes for the encodings in question
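
Option b) could be prototyped today with codecs.register_error(). The
handler name "bestfit-c1" and its fallback rule (decode an unmapped
byte to the same-numbered C1 control character, as the WHATWG tables
do) are illustrative assumptions, not the Microsoft best fit tables:

```python
import codecs

def bestfit_c1(error):
    # Decode-side fallback: map each unmapped byte to the code point
    # with the same number (i.e. the C1 control characters).
    if isinstance(error, UnicodeDecodeError):
        replacement = ''.join(chr(b) for b in error.object[error.start:error.end])
        return replacement, error.end
    raise error  # leave encode errors to the other handlers

codecs.register_error('bestfit-c1', bestfit_c1)

# 0x93/0x94 are defined in windows-1252; 0x90 falls back to U+0090:
text = b'\x93quoted\x94 \x90'.decode('windows-1252', errors='bestfit-c1')
```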
>>>
>>>
>>>> On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <
>>>> python-ideas at python.org> wrote:
>>>>
>>>>> First of all, many thanks for such an excellently written letter.
>>>>> It was a real pleasure to read.
>>>>> On 10.01.2018 0:15, Rob Speer wrote:
>>>>>
>>>>> Hi! I joined this list because I'm interested in filling a gap in
>>>>> Python's standard library, relating to text encodings.
>>>>>
>>>>> There is an encoding with no name of its own. It's supported by every
>>>>> current web browser and standardized by WHATWG. It's so prevalent that
>>>>> if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you
>>>>> will get this encoding _instead_. It is probably the second or third
>>>>> most common text encoding in the world. And Python doesn't quite
>>>>> support it.
>>>>>
>>>>> You can see the character table for this encoding at:
>>>>> https://encoding.spec.whatwg.org/index-windows-1252.txt
>>>>>
>>>>> For the sake of discussion, let's call this encoding "web-1252".
>>>>> WHATWG calls it "windows-1252", but notice that it's subtly different
>>>>> from Python's "windows-1252" encoding. Python's windows-1252 has bytes
>>>>> that are undefined:
>>>>>
>>>>>>>> b'\x90'.decode('windows-1252')
>>>>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in
>>>>> position 0: character maps to <undefined>
>>>>>
>>>>> In web-1252, the bytes that are undefined according to windows-1252
>>>>> map to the control characters in those positions in iso-8859-1 -- that
>>>>> is, the Unicode code points with the same number as the byte. In
>>>>> web-1252, b'\x90' would decode as '\u0090'.
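
The discrepancy described in the quoted proposal is easy to check
against the stdlib: windows-1252 rejects the byte, while latin-1
(whose C1 behaviour the proposed web-1252 borrows for the five
undefined bytes) decodes it to the same-numbered code point:

```python
# Python's standards-compliant windows-1252 table rejects byte 0x90:
try:
    b'\x90'.decode('windows-1252')
except UnicodeDecodeError as exc:
    print(exc.reason)  # character maps to <undefined>

# latin-1 applies exactly the C1 fallback the proposal describes:
assert b'\x90'.decode('latin-1') == '\u0090'
```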
>>>>>
>>>>> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows
>>>>> does the same:
>>>>>
>>>>>     "According to the information on Microsoft's and the Unicode
>>>>> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
>>>>> however, the Windows API MultiByteToWideChar
>>>>> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>
>>>>> maps these to the corresponding C1 control codes
>>>>> <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
>>>>> And in ISO-8859-1, the same handling is done for unused code points
>>>>> even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ):
>>>>>
>>>>>     "*ISO-8859-1* is the IANA
>>>>> <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority>
>>>>> preferred name for this standard when supplemented with the C0 and C1
>>>>> control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>
>>>>> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
>>>>> And what would you think -- these "C1 control codes" are also the
>>>>> corresponding Unicode code points! (
>>>>> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
>>>>>
>>>>> Since Windows is pretty much the reference implementation for
>>>>> "windows-xxxx" encodings, it even makes sense to alter the existing
>>>>> encodings rather than add new ones.
>>>>>
>>>>>
>>>>> This may seem like a silly encoding that encourages doing horrible
>>>>> things with text. That's pretty much the case. But there's a reason
>>>>> every Web browser implements it:
>>>>>
>>>>> - It's compatible with windows-1252
>>>>> - Any sequence of bytes can be round-tripped through it without losing
>>>>> information
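
The round-trip property in the quoted list can be checked against
latin-1, the stdlib codec that already has it: every byte value
decodes, and re-encoding reproduces the input. The windows-125x
codecs lack it only because of their undefined bytes:

```python
# latin-1 maps byte i to code point i, so any byte string survives a
# decode/encode round trip unchanged:
data = bytes(range(256))
assert data.decode('latin-1').encode('latin-1') == data

# windows-1252, by contrast, cannot even decode five of the 256 bytes:
bad = []
for i in range(256):
    try:
        bytes([i]).decode('windows-1252')
    except UnicodeDecodeError:
        bad.append(i)
print([hex(b) for b in bad])  # the five undefined positions
```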
>>>>>
>>>>> It's not just this one encoding. WHATWG's encoding standard (
>>>>> https://encoding.spec.whatwg.org/) contains modified versions of
>>>>> windows-1250 through windows-1258 and windows-874.
>>>>>
>>>>> Support for these encodings matters to me, in part, because I
>>>>> maintain a Unicode data-cleaning library, "ftfy". One thing it does is
>>>>> to detect and undo encoding/decoding errors that cause mojibake, as
>>>>> long as they're detectable and reversible. Looking at real-world
>>>>> examples of text that has been damaged by mojibake, it's clear that
>>>>> lots of text is transferred through what I'm calling the "web-1252"
>>>>> encoding, in a way that's incompatible with Python's "windows-1252".
>>>>>
>>>>> In order to be able to work with and fix this kind of text, ftfy
>>>>> registers new codecs -- and I implemented this even before I knew that
>>>>> they were standardized in Web browsers. When ftfy is imported, you can
>>>>> decode text as "sloppy-windows-1252" (the name I chose for this
>>>>> encoding), for example.
>>>>>
>>>>> ftfy can tell people a sequence of steps that they can use in the
>>>>> future to fix text that's like the text they provided. Very often,
>>>>> these steps require the sloppy-windows-1252 or sloppy-windows-1251
>>>>> encoding, which means the steps only work with ftfy imported, even for
>>>>> people who are not using the features of ftfy.
>>>>>
>>>>> Support for these encodings also seems highly relevant to people who
>>>>> use Python for web scraping, as it would be desirable to maximize
>>>>> compatibility with what a Web browser would do.
>>>>>
>>>>> This really seems like it belongs in the standard library instead of
>>>>> being an incidental feature of my library. I know that code in the
>>>>> standard library has "one foot in the grave". I _want_ these legacy
>>>>> encodings to have one foot in the grave. But some of them are
>>>>> extremely common, and Python code should be able to deal with them.
>>>>>
>>>>> Adding these encodings to Python would be straightforward to implement.
>>>>> Does this require a PEP, a pull request, or further discussion?
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Python-ideas mailing list
>>>>> Python-ideas at python.org
>>>>> https://mail.python.org/mailman/listinfo/python-ideas
>>>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ivan
>>>>>
>>>

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 10 2018)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/


