[Python-ideas] Windows Best Fit Encodings

Sat Jan 20 04:21:07 EST 2018

On Sat, Jan 20, 2018, at 02:01, Steve Dower wrote:
> On 20Jan2018 0518, M.-A. Lemburg wrote:
> > do you know of a definite resource for Windows code pages
> > on MSDN or another official MS website ?

I don't know what happened to this page, but I was able to find better-looking codepage tables at
http://web.archive.org/web/20160314211032/https://msdn.microsoft.com/en-us/goglobal/bb964654

Older versions at:
web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.asp
web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.mspx

See also, still live:
https://www.microsoft.com/typography/unicode/cscp.htm
(this has 0xCA in the graphical table for cp1255, the other does not)

> 
> I don't know of anything sorry, and my quick search didn't turn up 
> anything public. But I can at least confirm that the internal table for 
> cp1252 has the same undefined characters as on unicode.org, so 
> presumably if MultiByteToWideChar is mapping those to "best fit" 
> characters it's only because the flag has been passed.

I'm passing MB_ERR_INVALID_CHARS. And is this just as true for cp1255 0xCA as for the control characters? MultiByteToWideChar doesn't even *have* a flag for "best fit".

I was not able to identify any combination of flags that can be passed to either function (MultiByteToWideChar or WideCharToMultiByte) on Windows 7 that would cause e.g. 0x81 in cp1252 to be treated any differently from any other character.
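
For comparison, here is a small Python sketch of how Python's own cp1252 codec (which follows the unicode.org table) treats the same bytes. It does not call MultiByteToWideChar, so it shows only the Python side of the discrepancy; the helper name undefined_bytes is made up for illustration.

```python
def undefined_bytes(encoding):
    """Return the byte values that a single-byte codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The codec's table has no mapping for this byte.
            bad.append(value)
    return bad


print([hex(b) for b in undefined_bytes("cp1252")])
# → ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

These are the same five positions that the WCHAR[256] table in C_1252.NLS fills with the matching C1 control code values (8100, 8d00, 8f00, 9000, 9d00 in the dump below).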

The C_1252.NLS file appears to consist of:

28 bytes of header
512 bytes WCHAR[256] of mappings e.g.
0000010c: 7800 7900 7a00 7b00 7c00 7d00 7e00 7f00  x.y.z.{.|.}.~...
0000011c: ac20 8100 1a20 9201 1e20 2620 2020 2120  . ... ... &   !
0000012c: c602 3020 6001 3920 5201 8d00 7d01 8f00  ..0 `.9 R...}...
0000013c: 9000 1820 1920 1c20 1d20 2220 1320 1420  ... . . . " . .
0000014c: dc02 2221 6101 3a20 5301 9d00 7e01 7801  .."!a.: S...~.x.
0000015c: a000 a100 a200 a300 a400 a500 a600 a700  ................
Six zero bytes
BYTE[65536] apparently of the best fit mappings, e.g.
000002a2: 3f81 3f3f 3f3f 3f3f 3f3f 3f3f 3f8d 3f8f  ?.???????????.?.
000002b2: 903f 3f3f 3f3f 3f3f 3f3f 3f3f 3f9d 3f3f  .????????????.??
00000312: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff  ................
00000322: 4161 4161 4161 4363 4363 4363 4363 4464  AaAaAaCcCcCcCcDd
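
The offsets in the dump are consistent with that layout. A minimal Python sketch, assuming the layout really is header, WCHAR[256], six zero bytes, BYTE[65536] as guessed above (the constant and function names are mine, not anything from Microsoft):

```python
import struct

HEADER_SIZE = 28          # bytes of header, as guessed above
MB_TABLE_SIZE = 256 * 2   # WCHAR[256]: one little-endian 16-bit value per byte
PADDING = 6               # the six zero bytes
BEST_FIT_START = HEADER_SIZE + MB_TABLE_SIZE + PADDING  # 0x222


def mb_table_offset(byte_value):
    """File offset of the WCHAR entry for one code page byte."""
    return HEADER_SIZE + 2 * byte_value


def best_fit_offset(code_point):
    """File offset of the best fit byte for one BMP code point."""
    return BEST_FIT_START + code_point


# Row copied from the dump at offset 0x11c (cp1252 bytes 0x80..0x87).
row = bytes.fromhex("ac20 8100 1a20 9201 1e20 2620 2020 2120")
code_points = struct.unpack("<8H", row)

print(hex(mb_table_offset(0x80)))              # 0x11c
print(hex(best_fit_offset(0x80)))              # 0x2a2
print(hex(best_fit_offset(0x100)))             # 0x322
print(["U+%04X" % cp for cp in code_points])
# ['U+20AC', 'U+0081', 'U+201A', 'U+0192', 'U+201E', 'U+2026', 'U+2020', 'U+2021']
```

So the row at 0x2a2 is the best fit output for U+0080 onward, and the row at 0x322 is the output for U+0100 onward (A-macron, a-macron, A-breve, ... mapping to plain A, a, A, ...).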

I don't see where the file format even has room to mark characters as invalid, or how WideCharToMultiByte disables the best fit mappings (unless it does so by checking the result against the WCHAR[256] table), though CP1253 and CP1255 seem to manage it. In those codepages, the bytes that do return an error are mapped to private use characters U+F8xx, both in the NLS file tables and in the conversion result when the flag is not passed in.
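
To make the "checking the result against the WCHAR[256] table" guess concrete, here is a Python sketch of such a round-trip check. This is only my hypothesis about how best fit could be filtered out, not a description of what WideCharToMultiByte actually does. The two dicts are tiny excerpts: the 0x80, 0x81, U+0081, U+0100 and U+0101 entries come from the dump above, and the ASCII and U+20AC entries are assumed.

```python
# Byte -> code point (excerpt of the WCHAR[256] table).
MB_TO_WC = {0x3F: 0x003F, 0x41: 0x0041, 0x61: 0x0061, 0x80: 0x20AC, 0x81: 0x0081}

# Code point -> byte (excerpt of the BYTE[65536] table, best fit included).
WC_TO_MB = {
    0x003F: 0x3F,
    0x0041: 0x41,
    0x0061: 0x61,
    0x0081: 0x81,
    0x0100: 0x41,  # A-macron best-fits to 'A'
    0x0101: 0x61,  # a-macron best-fits to 'a'
    0x20AC: 0x80,
}


def encode_strict(code_point):
    """Return the byte for code_point, or None if only a best fit exists."""
    byte = WC_TO_MB.get(code_point, 0x3F)  # unmapped code points give '?'
    if MB_TO_WC.get(byte) != code_point:
        return None  # does not round-trip, so it was a best fit or '?'
    return byte


print(encode_strict(0x20AC))  # 128: Euro sign round-trips through 0x80
print(encode_strict(0x0100))  # None: 'A' decodes to U+0041, not U+0100
print(encode_strict(0x0081))  # 129: round-trips, so it cannot be rejected
```

Note that a check like this can reject best fit results, but it still has no way to reject 0x81, because the two tables agree with each other on that byte.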

> As far as I can 
> tell, Microsoft has not been secretly redefining any encodings.

Not so much redefining as holding back these characters from the published definition. I was being a bit overly dramatic with the 'for some unknown reason' bit; it seems obvious that the reason is that they wanted to reserve the ability to add new characters in the future, as they did for the Euro sign. And there's nothing wrong with that, per se, though it's unfortunate that their own conversion functions can't treat these bytes as errors.

Looking at the actual files, it looks like the ones in the "best fit" directory are in a format used internally by Microsoft (at a glance, they seem to contain enough information to generate the .NLS files, including stuff like the question marks in the header and the structure of DBCS tables), and the ones in the other mappings directory are sanitized and converted to more or less the same format as the other mappings.

(As for 1255 0xCA, the comment in the best fit file suggests that it was unclear which Hebrew vowel point it was meant to be.)