[Python-ideas] Windows Best Fit Encodings
M.-A. Lemburg
mal at egenix.com
Fri Jan 19 13:18:06 EST 2018
Hi Steve,
do you know of a definitive resource for Windows code pages
on MSDN or another official MS website?
I tried to find some links, but only got these ancient
ones:
https://msdn.microsoft.com/en-us/library/cc195054.aspx
(this version of cp1252 doesn't even have the euro sign yet)
Thanks,
--
Marc-Andre Lemburg
eGenix.com
http://www.malemburg.com/
On 19.01.2018 18:17, M.-A. Lemburg wrote:
> On 19.01.2018 17:24, Random832 wrote:
>> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>>> Someone did discover that Microsoft's current implementations of the
>>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>>> spec that Microsoft originally wrote.
>>>
>>> No, MS implements something called "best fit encodings"
>>> and these are different than what WHATWG uses.
>>
>> NO. I made this absolutely clear in my previous message: best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags. The mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction but is treated as a regular mapping for encoding purposes. The mapping of 0x81 to U+0081 in cp1252 etc. is NOT a best fit mapping or in any way different from the rest of the mappings.
>>
>> We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode.
>
> I only know the best fit encoding maps that are available
> on the Unicode site.
>
> If I read your comment correctly, you are saying that MS has
> moved away from the standard code pages towards something
> else - perhaps even something other than the best fit encodings
> listed on the Unicode site?
>
> Do you have some references for this?
>
> Note that the Windows code page codecs implemented in Python
> are all based on the Unicode mapping files and those were
> created by MS.
>
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>>
>>> unfortunately uses the above-mentioned best fit encodings,
>>> but this can and should be switched off by specifying the
>>> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
>>> or needs to be interoperable:
>>
>> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing.
>
> Interesting. The CP1252 mapping clearly defines 0x81 as
> undefined, whereas bestfit1252 maps it to 0x0081:
>
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
>
> Same for the example you gave for CP932:
>
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
>
> So at least following the documentation you'd expect the function
> to implement the regular mappings.
>
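For anyone who wants to check the regular (non-best-fit) mapping discussed in the quoted text without opening the mapping files, here is a minimal sketch that probes Python's built-in cp1252 codec. It relies on the point made above that Python's Windows code page codecs are based on the Unicode mapping files; the helper name probe() is made up for this illustration and is not part of any library.

```python
# Sketch: probe Python's built-in cp1252 codec one byte at a time.
# Assumption (from the thread): the codec follows the Unicode-published
# CP1252.TXT mapping, not the best-fit table.

def probe(byte_value, encoding="cp1252"):
    """Return the character a single byte decodes to, or None if undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        # The codec has no mapping for this byte.
        return None

# Bytes in the 0x80-0x9F range that the codec leaves undefined.
undefined = [hex(b) for b in range(0x80, 0xA0) if probe(b) is None]

if __name__ == "__main__":
    print("0x80 ->", repr(probe(0x80)))
    print("0x81 ->", repr(probe(0x81)))
    print("undefined bytes:", undefined)
```

With the stock codec, 0x80 decodes to the euro sign while 0x81 raises UnicodeDecodeError, which is the behavior you would expect from the regular mapping rather than from a table that maps 0x81 to U+0081.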