[Python-ideas] Add "htmlcharrefreplace" error handler

Fri Jun 14 12:11:01 CEST 2013

On 14.06.2013 11:43, Antoine Pitrou wrote:
> On Fri, 14 Jun 2013 11:38:46 +0200
> "M.-A. Lemburg" <mal at egenix.com> wrote:
>> On 14.06.2013 10:49, Antoine Pitrou wrote:
>>> On Fri, 14 Jun 2013 09:44:09 +0200
>>> "M.-A. Lemburg" <mal at egenix.com> wrote:
>>>>
>>>>> IMHO character references (named or numerical) should never be used in
>>>>> HTML (with the exception of &quot;, &gt; and &lt;).
>>>>> They exist mainly for three reasons:
>>>>> 1) provide a way to include characters that are not available in the
>>>>> used encoding (e.g. if you are using an obsolete encoding like
>>>>> windows-1252 but still want to use "fancy" characters);
>>>>> 2) to keep the HTML source ASCII-only;
>>>>
>>>> This is the main reason for using them. HTML's default encoding
>>>> is Latin-1, unlike XML.
>>>
>>> I'd like to know which good reasons there are to not use utf-8 for HTML
>>> pages in 2013.
>>> "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't
>>> warrant special support in Python's codec error handlers.
>>
>> Ezio and I gave reasons, but you've cut them away ;-)
> 
> Uh, no, you cut Ezio's own rebuttals to those reasons.
> Ezio's point still stands: named HTML character references have a use
> for *manual* entering of HTML text (though of course they are
> cumbersome), but that doesn't warrant a codec error handler which by
> construction is used for *automatic* generation of HTML text.

I'm not sure I follow. I've definitely had use cases for the
proposed error handler in the past and have written my own
set of tools to do such conversions.

Now instead of everyone writing their own little helper, it's
better to have a single implementation in the stdlib.
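
For illustration, a helper of this kind can be sketched with
codecs.register_error(). This is only a minimal sketch, not the proposed
stdlib implementation: the handler name "htmlcharrefreplace" is taken from
the subject of this thread, the function name html_charref_replace is
hypothetical, and the choice to fall back to numeric references for
characters without a named entity is an assumption.

```python
import codecs
from html.entities import codepoint2name


def html_charref_replace(exc):
    """Replace unencodable characters with HTML character references."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only supports encoding, not decoding or translating.
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Prefer a named reference such as &eacute;
            parts.append("&%s;" % name)
        else:
            # Assumed fallback: numeric reference, as xmlcharrefreplace does.
            parts.append("&#%d;" % ord(ch))
    # Return the replacement text and the position where encoding resumes.
    return "".join(parts), exc.end


# "htmlcharrefreplace" is not a built-in handler name; it is registered here.
codecs.register_error("htmlcharrefreplace", html_charref_replace)

text = "caf\u00e9 \u2013 \u20ac5 \u4e2d"
print(text.encode("ascii", "htmlcharrefreplace"))
# b'caf&eacute; &ndash; &euro;5 &#20013;'
```

Note that the handler only sees characters the target encoding cannot
represent, so &, < and > pass through unchanged and still need to be
escaped separately.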

I think you are forgetting that the output of such a codec
is not always meant to be sent over the wire to a browser.
It may well be used to create data that then has to be
manipulated by other tools or by humans.

One of the reasons we keep the Python stdlib (mostly) ASCII
is exactly that: to not create problems when editing source
files in editors having different character set configurations.

The same notion can be applied to HTML text.

-- 
Marc-Andre Lemburg
eGenix.com
