[Python-ideas] Add "htmlcharrefreplace" error handler

Fri Jun 14 11:25:21 CEST 2013

On 2013-06-14, at 10:49 , Antoine Pitrou wrote:
> On Fri, 14 Jun 2013 09:44:09 +0200
> "M.-A. Lemburg" <mal at egenix.com> wrote:
>> 
>>> IMHO character references (named or numerical) should never be used in
>>> HTML (with the exception of " > and <).
>>> They exist mainly for three reasons:
>>> 1) provide a way to include characters that are not available in the
>>> used encoding (e.g. if you are using an obsolete encoding like
>>> windows-1252 but still want to use "fancy" characters);
>>> 2) to keep the HTML source ASCII-only;
>> 
>> This is the main reason for using them. HTML's default encoding
>> is Latin-1, unlike XML.
> 
> I'd like to know which good reasons there are to not use utf-8 for HTML
> pages in 2013.
> "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't
> warrant special support in Python's codec error handlers.

As far as I know M.A. is technically wrong: there is no such thing as
a default HTML encoding. Browsers have their own, possibly configurable[0],
defaults with "proprietary" heuristics, but no HTML spec defines
any kind of default, only a sequence of encoding-extraction steps to try
before falling back on heuristics.
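
To make "a sequence of encoding extraction" concrete, here is a rough
sketch in Python. The function name sniff_encoding, the order of the
steps, the 1024-byte prescan window and the windows-1252 fallback are all
assumptions for illustration; real browsers follow more elaborate rules
and add their own heuristics.

```python
import codecs
import re

# Hypothetical helper, not a real browser or stdlib API.
def sniff_encoding(body, content_type=None, fallback="windows-1252"):
    """Return (encoding, source) for a raw HTML byte string."""
    # 1. A byte order mark at the very start of the document.
    boms = (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in boms:
        if body.startswith(bom):
            return name, "bom"

    # 2. The charset parameter of the HTTP Content-Type header, if any.
    if content_type:
        m = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower(), "http"

    # 3. A <meta> declaration near the top of the document.
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                  body[:1024], re.I)
    if m:
        return m.group(1).decode("ascii").lower(), "meta"

    # 4. Nothing declared anywhere: use the (assumed) browser fallback.
    return fallback, "fallback"
```

Only step 4 is a "default", and it belongs to the browser rather than to
HTML itself.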

Most browsers tend to fall back on windows-1252 (not ASCII and not latin1;
in fact they'll usually coerce explicit ascii or latin1 requests
to windows-1252 internally) because that's what is most often encountered
(historically anyway) when no encoding is specified anywhere at all.
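
The difference between latin1 and windows-1252 is easy to see from
Python, using the standard "latin-1" and "cp1252" codecs. The sample
bytes below are made up for illustration: 0x80-0x9F are invisible C1
control codes in latin1 but printable punctuation in windows-1252, which
is presumably why browsers prefer the latter when reading legacy pages.

```python
# Made-up sample: curly quotes and a dash as a Windows editor would save them.
raw = b"\x93quoted\x94 \x96 caf\xe9"

as_latin1 = raw.decode("latin-1")  # 0x93, 0x94, 0x96 become C1 control codes
as_cp1252 = raw.decode("cp1252")   # same bytes become curly quotes and an en dash

print(repr(as_latin1))
print(repr(as_cp1252))

# Outside 0x80-0x9F the two encodings agree, e.g. 0xE9 is "é" in both.
assert as_latin1[-1] == as_cp1252[-1] == "\u00e9"
```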

A UTF-8 default is a stupid idea (for browsers) if it breaks more content
than it makes available.

[0] in Firefox's settings, Content > Fonts [Advanced] > Default Character Encoding