[Python-ideas] Windows Best Fit Encodings (was: Support WHATWG versions of legacy encodings)

M.-A. Lemburg mal at egenix.com
Fri Jan 19 12:17:48 EST 2018


On 19.01.2018 17:24, Random832 wrote:
> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>> Someone did discover that Microsoft's current implementations of the
>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>> spec that Microsoft originally wrote.
>>
>> No, MS implements something called "best fit encodings"
>> and these are different from what WHATWG uses.
> 
> NO. I made this absolutely clear in my previous message: best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags. (The mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction, but is treated as a regular mapping for encoding purposes.) The mapping of 0x81 to U+0081 in cp1252 etc. is NOT a best fit mapping or in any way different from the rest of the mappings.
> 
> We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode.

I only know the best fit encoding maps that are available
on the Unicode site.

If I read your comment correctly, you are saying that MS has
moved away from the standard code pages towards something
else - perhaps even something other than the best fit encodings
listed on the Unicode site?

Do you have some references for this?

Note that the Windows code page codecs implemented in Python
are all based on the Unicode mapping files and those were
created by MS.
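
To illustrate what "based on the Unicode mapping files" means in practice, here is a small sketch, assuming CPython's bundled encodings.cp1252 module and its convention of using U+FFFE as the placeholder for bytes that the mapping file lists as #UNDEFINED:

```python
import encodings.cp1252 as cp1252

# decoding_table is a 256-character string indexed by byte value.
# Bytes marked #UNDEFINED in CP1252.TXT appear there as U+FFFE.
undefined = [hex(b) for b, ch in enumerate(cp1252.decoding_table)
             if ch == "\ufffe"]
print(undefined)
```

The bytes it reports are the ones the standard codec refuses to decode in strict mode.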

>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>
>> unfortunately uses the above mentioned best fit encodings,
>> but this can and should be switched off by specifying the
>> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
>> or needs to be interoperable:
> 
> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing.

Interesting. The CP1252 mapping clearly defines 0x81 to map
to undefined, whereas the bestfit1252 maps it to 0x0081:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

Same for the example you gave for CP932:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

So at least following the documentation you'd expect the function
to implement the regular mappings.
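
For comparison, this is how Python's own cp1252 codec, which follows the regular mapping file, treats the bytes in question. It is only a sketch of the Python side, not of the Windows API behavior; try_decode is a hypothetical helper introduced for the illustration:

```python
def try_decode(byte_value, encoding="cp1252"):
    """Decode a single byte strictly; return None if it is unmapped."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None

euro = try_decode(0x80)               # defined in CP1252.TXT
undef = try_decode(0x81)              # #UNDEFINED in CP1252.TXT
latin1 = try_decode(0x81, "latin-1")  # Latin-1 maps 0x81 to U+0081

print(repr(euro), repr(undef), repr(latin1))
```

A codec that mapped 0x81 to U+0081, as discussed in this thread, would return the same character as the Latin-1 call for that byte instead of failing.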

-- 
Marc-Andre Lemburg
eGenix.com



