latin1 and cp1252 inconsistent?
nobody at nowhere.com
Sat Nov 17 01:33:14 CET 2012
On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:
> When a user agent [browser] would otherwise use a character encoding given
> in the first column [ISO-8859-1, aka latin1] of the following table to
> either convert content to Unicode characters or convert Unicode characters
> to bytes, it must instead use the encoding given in the cell in the second
> column of the same row [windows-1252, aka cp1252].
It goes on to say:
    The requirement to treat certain encodings as other encodings according
    to the table above is a willful violation of the W3C Character Model
    specification, motivated by a desire for compatibility with legacy
    content.
IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.
Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.
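To check that claim against Python's own codecs, here is a rough sketch (Python 3 assumed; the names `differing` and `undefined` are just mine for illustration). It decodes every single byte both ways and records where the two codecs disagree, and where cp1252 has no mapping at all:

```python
# Compare how Python's latin-1 and cp1252 codecs decode each single byte.
differing = []   # bytes that decode to different characters
undefined = []   # bytes that cp1252 refuses to decode

for byte in range(256):
    raw = bytes([byte])
    as_latin1 = raw.decode("latin-1")   # latin-1 maps every byte 1:1
    try:
        as_cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(byte)
        continue
    if as_latin1 != as_cp1252:
        differing.append(byte)

print([hex(b) for b in differing])
print([hex(b) for b in undefined])
```

Everything reported should fall inside 0x80-0x9F. Note that Python's cp1252 codec leaves a handful of bytes in that range unmapped, so a strict cp1252 decode can raise where a latin-1 decode never does.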
If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.
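Something along these lines should do it; treat it as a sketch rather than tested production code. It assumes Python 3, and `build_c1_table` and `decode_lenient` are names I made up for the example. The table is built from the cp1252 codec itself, and bytes that cp1252 leaves undefined are simply left as C1 control characters:

```python
def build_c1_table():
    """Map each C1 code point to its Windows-1252 character, where one exists."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # cp1252 has no mapping for this byte; keep the C1 control as-is.
            continue
    return table

C1_TO_CP1252 = build_c1_table()

def decode_lenient(data):
    """Decode as ISO-8859-1 (never fails), then fix up C1 codes via translate()."""
    return data.decode("latin-1").translate(C1_TO_CP1252)

print(decode_lenient(b"\x93quoted\x94 caf\xe9 \x80"))
```

Because the first step is a latin-1 decode, this never raises on any input, which is the main reason to prefer it over decoding as cp1252 directly.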