[Python-ideas] Add "htmlcharrefreplace" error handler

Fri Jun 14 01:37:39 CEST 2013

Hi,

On Tue, Jun 11, 2013 at 5:49 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> I propose adding an "htmlcharrefreplace" error handler, which is similar to the
> "xmlcharrefreplace" error handler but uses HTML entity names where possible.
>
>>>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
> b'&#8704; x&#8712;&#8476;'
>>>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
> b'&forall; x&isin;&real;'
>

Do you have any use cases for this, or is it just for completeness
since we already have xmlcharrefreplace?

IMHO character references (named or numerical) should never be used in
HTML (with the exception of &quot;, &gt; and &lt;).
They exist mainly for three reasons:
1) to provide a way to include characters that are not available in the
encoding in use (e.g. if you are using an obsolete encoding like
windows-1252 but still want to use "fancy" characters);
2) to keep the HTML source ASCII-only;
3) to specify a character by name if it's not possible to enter it
directly (e.g. you don't know the key combination).

1) is not a problem if you are using a UTF encoding, and if you
aren't (and you have unencodable chars) you are doing it wrong;
2) might still be valid for some situations, but in 2013 I would
expect software to deal decently with non-ASCII text;
3) is not a concern for this case, since we already have the characters
we want and we aren't entering them manually.
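
To make point 1 concrete, here is a minimal sketch (the sample string is
just an illustration I made up): with UTF-8 the text encodes directly, and
only the markup-significant characters need escaping, which html.escape()
already does.

```python
import html

# Made-up sample text containing non-ASCII characters and a "<".
text = '∀ x∈ℜ, x < x+1'

# UTF-8 can represent every character, so no error handler is needed.
encoded = text.encode('utf-8')
assert encoded.decode('utf-8') == text

# Only the markup-significant characters are turned into references.
escaped = html.escape(text)
print(escaped)  # ∀ x∈ℜ, x &lt; x+1
```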

I would therefore prefer to leave this to specific functions in the
html package, rather than adding a new error handler, so I'm -0.5 on
this (I would be -1 if it wasn't for the fact that if we want this to
work with any encoding, an error handler is indeed the simpler
solution).
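
For comparison, a rough sketch of what a plain function could look like.
The name escape_unencodable() is hypothetical, not an existing API, and
the sketch assumes an ASCII-compatible, stateless target encoding. It has
to encode character by character, which is slow and would break with
stateful codecs; that is the part an error handler gets for free.

```python
from html.entities import codepoint2name


def escape_unencodable(text, encoding='ascii'):
    """Hypothetical helper: encode `text`, replacing characters the codec
    cannot represent with HTML character references.

    Assumes `encoding` is stateless and ASCII-compatible.
    """
    out = []
    for ch in text:
        try:
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            # Prefer a named entity, fall back to a numeric reference.
            name = codepoint2name.get(ord(ch))
            ref = '&%s;' % name if name else '&#%d;' % ord(ch)
            out.append(ref.encode('ascii'))
    return b''.join(out)


print(escape_unencodable('∀ x∈ℜ'))  # b'&forall; x&isin;&real;'
```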

I also want to avoid the situation where users don't know what they
are doing and start putting entities everywhere just to be "safe"
(since this will offer a convenient way to do it), and they might also
stick with obsolete encodings just because they can use this
"workaround".

Best Regards,
Ezio Melotti

> Possible implementation:
>
> import codecs
> from html.entities import codepoint2name
>
> def htmlcharrefreplace_errors(exc):
>     if not isinstance(exc, UnicodeEncodeError):
>         raise exc
>     try:
>         replace = r'&%s;' % codepoint2name[ord(exc.object[exc.start])]
>     except KeyError:
>         return codecs.xmlcharrefreplace_errors(exc)
>     return replace, exc.start + 1
>
> codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)
>
> Even if we do not register this handler from the start, it may be worth
> providing htmlcharrefreplace_errors() in the html or html.entities module.
>
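
For anyone who wants to try the proposal, here is a self-contained variant
of the quoted handler with a usage line. It is only a sketch, not what
would necessarily be committed. It differs from the quoted code in two
ways that are my own assumptions: it replaces the whole reported span
(exc.start to exc.end) in one call, and it builds the numeric fallback
inline instead of delegating to codecs.xmlcharrefreplace_errors().

```python
import codecs
from html.entities import codepoint2name


def htmlcharrefreplace_errors(exc):
    # Only encoding errors can be handled by inserting references.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        # Named entity if HTML defines one, numeric reference otherwise.
        name = codepoint2name.get(ord(ch))
        parts.append('&%s;' % name if name else '&#%d;' % ord(ch))
    # Resume encoding right after the span that was replaced.
    return ''.join(parts), exc.end


codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

print('∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace'))
# b'&forall; x&isin;&real;'
```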