[Python-ideas] Add "htmlcharrefreplace" error handler

Fri Jun 14 17:20:15 CEST 2013

On 14/06/13 19:22, Antoine Pitrou wrote:
> On Fri, 14 Jun 2013 19:06:55 +1000
> Steven D'Aprano <steve at pearwood.info> wrote:
>> On 14/06/13 18:49, Antoine Pitrou wrote:
>>> "Keeping the HTML source ASCII-only" is just silly IMO,
>>
>> Surely no sillier than "keep the Python std lib source ASCII-only".
>
> Or than drawing stupid analogies. Do you understand the difference
> between source code and hypertext documents?

Of course I do. I don't believe that the differences are as important as the similarities. Both are text. Both are expected to be read by human beings, at least sometimes. Both may be edited in an editor, or otherwise passed through some tool, that does not handle non-ASCII text correctly, causing corruption. Both may contain characters which the author has no way of entering directly.

The similarities are far more important than the differences.

>>> and it doesn't
>>> warrant special support in Python's codec error handlers.
>>
>> We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"
>
> It's not trivial, it's additional C code in an important part of the
> language (unicode and codecs).

Or, it's 17 lines of Python. Something like this is a good start:

import codecs
from html.entities import codepoint2name

def htmlcharrefreplace_errors(exc):
    c = exc.object[exc.start]
    try:
        entity = codepoint2name[ord(c)]
    except KeyError:
        n = ord(c)
        if n <= 0xFFFF:
            replace = "\\u%04x"
        else:
            replace = "\\U%08x"
        replace = replace % n
    else:
        replace = "&{};".format(entity)
    return replace, exc.start + 1

codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)
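
A slightly fuller sketch of the same idea follows, under two assumptions that are not in the code above: the whole failing span (exc.start to exc.end) is replaced in one call rather than a single character, and characters without a named entity fall back to numeric character references, as xmlcharrefreplace does, instead of backslash escapes. The handler name "htmlcharrefreplace_sketch" is hypothetical, chosen so it cannot clash with any real handler.

```python
import codecs
from html.entities import codepoint2name


def htmlcharrefreplace_sketch(exc):
    """Illustrative handler: named entity if one exists, else &#NNNN;."""
    # Only encoding errors carry characters that can become references.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    # Assumption: process every character in the failing span at once.
    for ch in exc.object[exc.start:exc.end]:
        n = ord(ch)
        name = codepoint2name.get(n)
        if name is not None:
            parts.append("&{};".format(name))
        else:
            # Assumption: numeric fallback, mirroring xmlcharrefreplace.
            parts.append("&#{};".format(n))
    # Resume encoding just past the span that was replaced.
    return "".join(parts), exc.end


# Hypothetical handler name, used only for this sketch.
codecs.register_error("htmlcharrefreplace_sketch", htmlcharrefreplace_sketch)

# e-acute and the euro sign have HTML 4 names; the snowman (U+2603) does not.
encoded = "caf\u00e9 \u20ac \u2603".encode("ascii", "htmlcharrefreplace_sketch")
print(encoded)
```

The named entities come from html.entities.codepoint2name, which covers the HTML 4 set only, so anything outside that table takes the numeric branch.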

Is this the point where someone now argues that it's too trivial to bother putting in the standard library?

This is not new syntax. It's not a new builtin. Even if it is written in C, the code itself is not likely to be significantly more complex than the existing xmlcharrefreplace error handler, which is under 100 lines of C. (The hard part is likely to be keeping the list of entities.) There are no backwards-compatibility issues to worry about. It doesn't add a new programming idiom to the standard library. There's unlikely to be much in the way of bike-shedding about either functionality or syntax. It's merely a new error handler, with well-defined semantics and an obvious name. That's what I meant by "a trivial addition".
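
For comparison, a minimal illustration of what the existing xmlcharrefreplace handler already does. It needs no registration, and it emits numeric references only, never named entities:

```python
# The built-in handler replaces each unencodable character with &#NNNN;.
text = "caf\u00e9 \u20ac"
xml_encoded = text.encode("ascii", "xmlcharrefreplace")
print(xml_encoded)  # b'caf&#233; &#8364;'
```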

> And I haven't seen you propose a patch (when was your last patch, by
> the way?).

Does it matter? Do you think that *only* those who have contributed patches are capable of recognising a good, useful piece of functionality when they see it?

Putting people down because they have not contributed to the std lib as often as you is not open, considerate or respectful, nor is it welcoming to newcomers. Even those who are not prolific at submitting patches can contribute good ideas, and the ability of someone to write C code does not necessarily mean that they can judge good or bad ideas. Just look at PHP.

-- 
Steven