latin1 and cp1252 inconsistent?

Fri Nov 16 17:33:59 EST 2012

On Fri, Nov 16, 2012 at 2:44 PM,  <buck at yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined.  0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position
6: character maps to <undefined>

>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'