Mailman 3 Add "htmlcharrefreplace" error handler - Python-ideas

Add "htmlcharrefreplace" error handler

Serhiy Storchaka

11 Jun 2013

2:49 p.m.

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

    >>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
    b'&#8704; x&#8712;&#8476;'
    >>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
    b'&forall; x&isin;&real;'

Possible implementation:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        try:
            replace = r'&%s;' % codepoint2name[ord(exc.object[exc.start])]
        except KeyError:
            return codecs.xmlcharrefreplace_errors(exc)
        return replace, exc.start + 1

    codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

Even if we do not register this handler from the start, it may be worth providing htmlcharrefreplace_errors() in the html or html.entities module.


Paul Moore

11 Jun

2:53 p.m.

On 11 June 2013 15:49, Serhiy Storchaka <storchaka@gmail.com> wrote:

...

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

+1. This is usually what I want when I use xmlcharrefreplace. The implementation is simple, but I was unaware of the ability to add my own error handlers, so having this in the stdlib would improve discoverability a lot.

Paul
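For readers who, like Paul, have not met custom error handlers before, here is a minimal sketch of the protocol behind codecs.register_error. The handler name 'codepoint-tag-demo' and the [U+XXXX] tag format are assumptions made up for this illustration; they are not part of the proposal. A handler receives the exception, and returns the replacement text together with the position at which encoding should resume.

```python
import codecs

def codepoint_tag_errors(exc):
    """Replace each unencodable character with a tag such as [U+2208]."""
    # This sketch only handles encoding; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the input string; start/end delimit the failing run.
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join('[U+%04X]' % ord(ch) for ch in bad)
    # Return the replacement and the index where encoding should resume.
    return replacement, exc.end

# Hypothetical handler name, chosen only for this demonstration.
codecs.register_error('codepoint-tag-demo', codepoint_tag_errors)

tagged = 'x ∈ ℜ'.encode('ascii', 'codepoint-tag-demo')
print(tagged)  # b'x [U+2208] [U+211C]'
```

Once registered, the name can be passed as the errors argument to str.encode() in the same way as the built-in 'replace' or 'xmlcharrefreplace'.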

M.-A. Lemburg

3:04 p.m.

On 11.06.2013 16:49, Serhiy Storchaka wrote:

...

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

    >>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
    b'&#8704; x&#8712;&#8476;'
    >>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
    b'&forall; x&isin;&real;'

Possible implementation:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        try:
            replace = r'&%s;' % codepoint2name[ord(exc.object[exc.start])]
        except KeyError:
            return codecs.xmlcharrefreplace_errors(exc)
        return replace, exc.start + 1

    codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

Even if we do not register this handler from the start, it may be worth providing htmlcharrefreplace_errors() in the html or html.entities module.

+1 on that one as well :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 11 2013)

...

...
...
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

2013-07-01: EuroPython 2013, Florence, Italy ... 20 days to go 2013-07-16: Python Meeting Duesseldorf ... 35 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

Serhiy Storchaka

3:29 p.m.

11.06.13 17:49, Serhiy Storchaka написав(ла):

...

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

Or it should be named "htmlentityreplace"?

M.-A. Lemburg

3:38 p.m.

On 11.06.2013 17:29, Serhiy Storchaka wrote:

...

11.06.13 17:49, Serhiy Storchaka написав(ла):

...
I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

Or it should be named "htmlentityreplace"?

Yes, since that's the more accurate and intuitive name.

-- Marc-Andre Lemburg

Steven D'Aprano

3:52 p.m.

On 12/06/13 01:38, M.-A. Lemburg wrote:

...

On 11.06.2013 17:29, Serhiy Storchaka wrote:

...
11.06.13 17:49, Serhiy Storchaka написав(ла):

...
I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

Or it should be named "htmlentityreplace"?

Yes, since that's the more accurate and intuitive name.

Intuitive, perhaps, but I'm not sure about accurate. According to Wikipedia:

[quote]
Although in popular usage character references are often called "entity references" or even "entities", this usage is wrong.[citation needed] A character reference is a reference to a character, not to an entity. Entity reference refers to the content of a named entity. An entity declaration is created by using the <!ENTITY name "value"> syntax in a document type definition (DTD) or XML schema.
[end quote]

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_referenc...

-- Steven

M.-A. Lemburg

4:33 p.m.

On 11.06.2013 17:52, Steven D'Aprano wrote:

...

On 12/06/13 01:38, M.-A. Lemburg wrote:

...
On 11.06.2013 17:29, Serhiy Storchaka wrote:

...
11.06.13 17:49, Serhiy Storchaka написав(ла):

...
I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

Or it should be named "htmlentityreplace"?

Yes, since that's the more accurate and intuitive name.

Intuitive, perhaps, but I'm not sure about accurate. According to Wikipedia:

[quote] Although in popular usage character references are often called "entity references" or even "entities", this usage is wrong.[citation needed] A character reference is a reference to a character, not to an entity. Entity reference refers to the content of a named entity. An entity declaration is created by using the <!ENTITY name "value"> syntax in a document type definition (DTD) or XML schema. [end quote]

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_referenc...

I think the HTML standard is the correct reference here, not some "citation needed" comment ;-)

In HTML4, the official name is "character entity references".

http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#h-5.3.2

In the HTML5 draft they are now called "named character references".

http://www.w3.org/TR/html5/syntax.html#character-references

The Python module is called html.entities, so let's stick with that.

BTW: Just like with the Unicode names, a lot of code points outside the ASCII range do not have a character entity reference. I guess those should be replaced with numeric character references:

http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#h-5.3.1

Note: It's not clear whether HTML allows numeric character references outside the base plane. In theory it should be possible, but whether browsers and other tools can actually handle non-BMP 𝒞 is not obvious. It works in recent Firefox and SeaMonkey. Some examples:

http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-...

-- Marc-Andre Lemburg
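To make the point about missing names concrete, the following sketch checks a few sample characters against html.entities.codepoint2name (the HTML 4 table) and shows the numeric reference the existing 'xmlcharrefreplace' handler produces. The choice of sample characters is an assumption for illustration; whether a given browser renders the non-BMP reference is outside what this code can show.

```python
from html.entities import codepoint2name

# Sample characters: two with HTML 4 names, one BMP character without a
# name, and one non-BMP character (U+1D49E, outside the base plane).
samples = ['é', '∀', '☃', '𝒞']

rows = []
for ch in samples:
    name = codepoint2name.get(ord(ch))  # None when no HTML 4 name exists
    numeric = ch.encode('ascii', 'xmlcharrefreplace')
    rows.append((ord(ch), name, numeric))
    print('U+%04X' % ord(ch), name, numeric)
```

On Python 3.3 and later a non-BMP character is a single code point, so the numeric fallback yields one reference (here b'&#119966;') rather than a surrogate pair.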


Ethan Furman

4:18 p.m.

On 06/11/2013 07:49 AM, Serhiy Storchaka wrote:

...

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

    >>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
    b'&#8704; x&#8712;&#8476;'
    >>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
    b'&forall; x&isin;&real;'

Possible implementation:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        try:
            replace = r'&%s;' % codepoint2name[ord(exc.object[exc.start])]
        except KeyError:
            return codecs.xmlcharrefreplace_errors(exc)
        return replace, exc.start + 1

    codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

Even if we do not register this handler from the start, it may be worth providing htmlcharrefreplace_errors() in the html or html.entities module.

+1 for the idea and the name of 'htmlcharrefreplace'. -- ~Ethan~

Ezio Melotti

13 Jun

11:37 p.m.

Hi, On Tue, Jun 11, 2013 at 5:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:

...

I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

    >>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
    b'&#8704; x&#8712;&#8476;'
    >>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
    b'&forall; x&isin;&real;'

Do you have any use cases for this, or is it just for completeness since we already have xmlcharrefreplace?

IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons:

1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters);
2) to keep the HTML source ASCII-only;
3) to specify a character by name if it's not possible to enter it directly (e.g. you don't know the keys combinations);

1) is not a problem if you are using the UTF encodings, and if you aren't (and you have unencodable chars) you are doing it wrong;
2) might still be valid for some situations, but in 2014 I would expect software to deal decently with non-ASCII text;
3) is not a concern for this case, since we already have the character we want and we aren't entering them manually;

I would therefore prefer to leave this to specific functions in the html package, rather than adding a new error handler, so I'm -0.5 on this (I would be -1 if it wasn't for the fact that if we want this to work with any encoding, an error handler is indeed the simpler solution).

I also want to avoid the situation where users don't know what they are doing and start putting entities everywhere just to be "safe" (since this will offer a convenient way to do it), and they might also stick with obsolete encodings just because they can use this "workaround".

Best Regards,
Ezio Melotti
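Ezio's first point can be illustrated with a short sketch, assuming nothing beyond the standard codecs: UTF-8 can represent every character of the sample string, so no error handler is ever invoked, whereas a legacy encoding such as windows-1252 fails on the same text unless a handler such as the existing 'xmlcharrefreplace' is supplied. The variable names are hypothetical.

```python
text = '∀ x∈ℜ'

# UTF-8 encodes every character, so the error handler is irrelevant.
utf8_bytes = text.encode('utf-8')

# A legacy single-byte encoding cannot represent these characters.
try:
    text.encode('windows-1252')
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# With a handler, the unencodable characters become character references.
legacy_bytes = text.encode('windows-1252', 'xmlcharrefreplace')

print(utf8_bytes.decode('utf-8') == text)  # True
print(legacy_ok)                           # False
print(legacy_bytes)                        # b'&#8704; x&#8712;&#8476;'
```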

...

Possible implementation:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        try:
            replace = r'&%s;' % codepoint2name[ord(exc.object[exc.start])]
        except KeyError:
            return codecs.xmlcharrefreplace_errors(exc)
        return replace, exc.start + 1

    codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

Even if we do not register this handler from the start, it may be worth providing htmlcharrefreplace_errors() in the html or html.entities module.

M.-A. Lemburg

14 Jun

7:44 a.m.

On 14.06.2013 01:37, Ezio Melotti wrote:

...

Hi,

On Tue, Jun 11, 2013 at 5:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:

...
I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

    >>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
    b'&#8704; x&#8712;&#8476;'
    >>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
    b'&forall; x&isin;&real;'

Do you have any use cases for this, or is it just for completeness since we already have xmlcharrefreplace?

The purpose is the same, but in a different, also very common context. As for use cases, you already pointed out quite a few below and I'm adding a few more.

...

IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

...

3) to specify a character by name if it's not possible to enter it directly (e.g. you don't know the keys combinations);

They exist for the same reason you have named Unicode characters: to make it obvious which character you are using without having to rely on a specific encoding.

Another reason to use them is that a user might not have the needed fonts to display the characters in question.

And in some cases, you also need to use the references to escape certain characters from being interpreted using their HTML meaning, e.g. & and the ones you've given above. But that's not the use case for the error handler.

-- Marc-Andre Lemburg
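The distinction drawn in the last paragraph can be sketched as follows: an encoding error handler only sees characters the codec cannot encode, so markup-significant ASCII characters such as & and < pass through untouched. Escaping them is a separate step, shown here with html.escape applied before encoding. The sample string and variable names are assumptions for illustration.

```python
import html

raw = 'ℜ & <b>'

# The error handler is only consulted for unencodable characters,
# so '&', '<' and '>' are left exactly as they are.
only_handler = raw.encode('ascii', 'xmlcharrefreplace')

# Escape markup first, then encode; doing it in the other order would
# also escape the '&' of the generated character references.
escaped_first = html.escape(raw).encode('ascii', 'xmlcharrefreplace')

print(only_handler)   # b'&#8476; & <b>'
print(escaped_first)  # b'&#8476; &amp; &lt;b&gt;'
```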


Antoine Pitrou

8:49 a.m.

On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...

...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers. Regards Antoine.

Steven D'Aprano

9:06 a.m.

On 14/06/13 18:49, Antoine Pitrou wrote:

...

"Keeping the HTML source ASCII-only" is just silly IMO,

Surely no sillier than "keep the Python std lib source ASCII-only".

...

and it doesn't warrant special support in Python's codec error handlers.

We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?" -- Steven

Antoine Pitrou

9:22 a.m.

On Fri, 14 Jun 2013 19:06:55 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...

On 14/06/13 18:49, Antoine Pitrou wrote:

...
"Keeping the HTML source ASCII-only" is just silly IMO,

Surely no sillier than "keep the Python std lib source ASCII-only".

Or than drawing stupid analogies. Do you understand the difference between source code and hypertext documents?

...

...
and it doesn't warrant special support in Python's codec error handlers.

We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"

It's not trivial, it's additional C code in an important part of the language (unicode and codecs). And I haven't seen you propose a patch (when was your last patch, by the way?). Regards Antoine.

Stefan Drees

9:37 a.m.

On 2013-06-14 11:22 CEST, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 19:06:55 +1000 Steven D'Aprano ... wrote:

...
On 14/06/13 18:49, Antoine Pitrou wrote:

...
"Keeping the HTML source ASCII-only" is just silly IMO,

Surely no sillier than "keep the Python std lib source ASCII-only".

<ignored level="suggested"/> the difference between source code and hypertext documents?

Still in 2013, if you upload documents to at least one standardizing organization and you use utf-8 as author you are fine, as long as it only uses ASCII characters ;-) Any umlaut or other typographically utf-8'd character slipping in ends up as broken latin-1 rendering. It will take many more years, I presume, until the chain of submitted documents and servers serving the received versions is really utf-8 safe.

...

...
...
and it doesn't warrant special support in Python's codec error handlers.

We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"

It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

And I haven't seen you propose a patch <ignored level="suggested"/>.

could we try to refrain from some b.t.w.'s \? (using trigraph-safe question mark encoding, in case some tool has trigraphs still turned on :-?)

All the best,
Stefan.

Antoine Pitrou

10:07 a.m.

On Fri, 14 Jun 2013 11:37:28 +0200 Stefan Drees <stefan@drees.name> wrote:

...

...
We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"

It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

And I haven't seen you propose a patch <ignored level="suggested"/>.

could we try to refrain from some b.t.w.'s \? (using trigraph-safe question mark encoding, in case some tool has trigraphs still turned on :-?)

We could, but in this case, this was pretty much warranted. Steven suggested that a change was "trivial", so it's only fair to wonder on which grounds he can cast such a judgement (e.g. what his authority is). python-ideas may sometimes feel like a nice soapbox, but the end goal is still to have code (or docs, PEPs, etc.) to check in. People will naturally be judged, though mostly tacitly, on their contribution track record (or absence thereof). Regards Antoine.

Stefan Drees

10:37 a.m.

On 2013-06-14 12:07 CEST, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 11:37:28 +0200 Stefan Drees <stefan@drees.name> wrote:

...
...
We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"

It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

And I haven't seen you propose a patch <ignored level="suggested"/>.

could we try to refrain from some b.t.w.'s \? (using trigraph-safe question mark encoding, in case some tool has trigraphs still turned on :-?)

We could, but in this case, this was pretty much warranted. Steven suggested that a change was "trivial", so it's only fair to wonder on which grounds he can cast such a judgement (e.g. what his authority is).

me, with the sun shining outside and the summer finally arriving (again) I suggest, that for such a purpose (I won't judge on it!) and in my opinion and experience the first part "And I haven't seen you propose a patch" would have been fully sufficient, wouldn't it? Additional bad feelings possibly rooted in former experiences, behaviors and inside different areas might also be better handled in a short friendly private mail exchange, I guess.

...

python-ideas may sometimes feel like a nice soapbox, but the end goal is still to have code (or docs, PEPs, etc.) to check in. People will naturally be judged, though mostly tacitly, on their contribution track record (or absence thereof).

Well, this is not python-dev, right :-?)

Now for something completely different and coming back to an anti-relevance claim, the one that challenged the use case of "even automates constructing HTML need to resort to ASCII" I think I gave a nice anecdotal counter example[1] out of the wild, where the producer has not sufficient control over the final nodes of the publication chain.

References:

[1]: http://mail.python.org/pipermail/python-ideas/2013-June/021399.html

Now back to my soapbox - the kids are already far down the hill ... ;-)

All the best,
Stefan.

Steven D'Aprano

3:20 p.m.

On 14/06/13 19:22, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 19:06:55 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...
On 14/06/13 18:49, Antoine Pitrou wrote:

...
"Keeping the HTML source ASCII-only" is just silly IMO,

Surely no sillier than "keep the Python std lib source ASCII-only".

Or than drawing stupid analogies. Do you understand the difference between source code and hypertext documents?

Of course I do. I don't believe that the differences are as important as the similarities. Both are text. Both are expected to be read by human beings, at least sometimes. Both may be edited in an editor, or otherwise passed through some tool, that does not handle non-ASCII text correctly, causing corruption. Both may contain characters which the author has no way of entering directly. The similarities are far more important than the differences.

...

...
...
and it doesn't warrant special support in Python's codec error handlers.

We're talking about this as if it were a major change. Doesn't this count as a trivial addition? The only question in my mind is, "Are the HTML char ref rules different enough from the XML rules that Python should provide both?"

It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

Or, it's 17 lines of Python. Something like this is a good start:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        c = exc.object[exc.start]
        try:
            entity = codepoint2name[ord(c)]
        except KeyError:
            n = ord(c)
            if n <= 0xFFFF:
                replace = "\\u%04x"
            else:
                replace = "\\U%08x"
            replace = replace % n
        else:
            replace = "&{};".format(entity)
        return replace, exc.start + 1

    codecs.register_error('htmlcharrefreplace', htmlcharrefreplace_errors)

Is this the point where someone now argues that it's too trivial to bother putting in the standard library?

This is not new syntax. It's not a new builtin. Even if it is written in C, the code itself is not likely to be significantly more complex than the existing xmlcharrefreplace error handler, which is under 100 lines of C. (The hard part is likely to be keeping the list of entities.) There's no backwards compatibility issues to worry about. It doesn't add a new programming idiom to the standard library. There's unlikely to be much in the way of bike-shedding about either functionality or syntax. It's merely a new error handler, with well-defined semantics and an obvious name.

That's what I meant by "a trivial addition".

...

And I haven't seen you propose a patch (when was your last patch, by the way?).

Does it matter? Do you think that *only* those who have contributed patches are capable of recognising a good, useful piece of functionality when they see it? Putting people down because they have not contributed to the std lib as often as you is not open, considerate or respectful, nor is it welcoming to newcomers. Even those who are not prolific at submitting patches can contribute good ideas, and the ability of someone to write C code does not necessarily mean that they can judge good or bad ideas. Just look at PHP. -- Steven

Serhiy Storchaka

3:37 p.m.

14.06.13 18:20, Steven D'Aprano написав(ла):

...

On 14/06/13 19:22, Antoine Pitrou wrote:

...
It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

Or, it's 17 lines of Python. Something like this is a good start:

    import codecs
    from html.entities import codepoint2name

    def htmlcharrefreplace_errors(exc):
        c = exc.object[exc.start]
        try:
            entity = codepoint2name[ord(c)]
        except KeyError:
            n = ord(c)
            if n <= 0xFFFF:
                replace = "\\u%04x"
            else:
                replace = "\\U%08x"
            replace = replace % n

Actually '&#%d;' % n. See also my sample implementation in original post which reuses xmlcharrefreplace_errors.
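With that correction applied, the quoted handler might look like the sketch below. This is one possible reading of the fix, not code posted in the thread: the fallback branch emits a decimal numeric character reference instead of a Python-style \u escape. The registration name 'htmlcharrefreplace-fixed-demo' is a hypothetical name used only to keep the example self-contained.

```python
import codecs
from html.entities import codepoint2name

def htmlcharrefreplace_errors(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    n = ord(exc.object[exc.start])
    try:
        # Prefer the HTML 4 entity name when one exists.
        replace = '&{};'.format(codepoint2name[n])
    except KeyError:
        # Otherwise fall back to a decimal numeric character reference.
        replace = '&#%d;' % n
    # Handle one character per call; the codec calls again for the rest.
    return replace, exc.start + 1

# Hypothetical name, used only for this demonstration.
codecs.register_error('htmlcharrefreplace-fixed-demo',
                      htmlcharrefreplace_errors)

result = 'é☃𝒞'.encode('ascii', 'htmlcharrefreplace-fixed-demo')
print(result)  # b'&eacute;&#9731;&#119966;'
```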

Steven D'Aprano

3:50 p.m.

On 15/06/13 01:37, Serhiy Storchaka wrote:

...

14.06.13 18:20, Steven D'Aprano написав(ла):

...
On 14/06/13 19:22, Antoine Pitrou wrote:

...
It's not trivial, it's additional C code in an important part of the language (unicode and codecs).

Or, it's 17 lines of Python. Something like this is a good start: [...]

Actually '&#%d;' % n. See also my sample implementation in original post which reuses xmlcharrefreplace_errors.

So you did. I'm sorry for the noise, I missed your original implementation. -- Steven

Antoine Pitrou

3:54 p.m.

On Sat, 15 Jun 2013 01:20:15 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...

On 14/06/13 19:22, Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 19:06:55 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...
On 14/06/13 18:49, Antoine Pitrou wrote:

...
"Keeping the HTML source ASCII-only" is just silly IMO,

Surely no sillier than "keep the Python std lib source ASCII-only".

Or than drawing stupid analogies. Do you understand the difference between source code and hypertext documents?

Of course I do. I don't believe that the differences are as important as the similarities. Both are text. Both are expected to be read by human beings, at least sometimes.

HTML is expected to be viewed through a browser. Reading raw HTML is the exception, not the norm. Moreover, CPython's source code is supposed to be written and commented in English, meaning there's no opportunity for non-ASCII characters. However, note that *arbitrary* Python code can happily contain non-ASCII characters (including in identifiers).

...

Both may be edited in an editor, or otherwise passed through some tool, that does not handle non-ASCII text correctly, causing corruption.

Well, I'm personally ok with letting users of such incompetent tools deal with it on their own. Python needn't fix all problems in the computing world.

...

Is this the point where someone now argues that it's too trivial to bother putting in the standard library?

I'm not arguing against putting it in the standard library, I'm arguing against making it a built-in error handler. (and IMO it's not too trivial)

...

...
And I haven't seen you propose a patch (when was your last patch, by the way?).

Does it matter?

In an open source project which is ultimately driven by code contributions, yes, it does matter quite a bit. Also, in contrast with *other* open source projects, users of Python don't have the excuse of being non-programmers to block them from contributing.

...

Do you think that *only* those who have contributed patches are capable of recognising a good, useful piece of functionality when they see it?

No, but certainly they are better able to judge whichever is "trivial" or not; and how desirable it is *for them* to accept the additional maintenance burden (since you aren't the one doing any maintenance, again). Regards Antoine.

Masklinn

9:25 a.m.

On 2013-06-14, at 10:49 , Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

As far as I know M.A. is technically wrong, there is no such thing as a default HTML encoding (browsers have their own possibly configurable[0] defaults with "proprietary" heuristics, but no HTML spec defines any kind of default only a sequence of encoding extraction before falling back on heuristics).

Most browsers tend to fall back on windows-1252 (not ASCII and not latin1, in fact they'll usually coerce explicit ascii or latin1 requests to windows-1252 internally) because that's what is often encountered (historically anyway) when no encoding is specified anywhere at all. A UTF-8 default is a stupid idea (for browsers) if it breaks more content than it makes available.

[0] in Firefox's settings, Content > Fonts [Advanced] > Default Character Encoding
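The practical difference between the two encodings mentioned here shows up in the byte range 0x80-0x9F, as the following sketch suggests. The sample bytes are an assumption chosen for illustration: windows-1252 maps them to printable characters such as curly quotes and the euro sign, while Latin-1 maps them to C1 control characters.

```python
# Bytes as they might appear in a legacy page with no declared encoding.
data = b'\x93quoted\x94 \x80 5'

as_cp1252 = data.decode('windows-1252')  # curly quotes and a euro sign
as_latin1 = data.decode('latin-1')       # invisible C1 control characters

print(ascii(as_cp1252))  # '\u201cquoted\u201d \u20ac 5'
print(ascii(as_latin1))  # '\x93quoted\x94 \x80 5'
```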

M.-A. Lemburg

9:58 a.m.

On 14.06.2013 11:25, Masklinn wrote:

...

On 2013-06-14, at 10:49 , Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

As far as I know M.A. is technically wrong, there is no such thing as a default HTML encoding (browsers have their own possibly configurable[0] defaults with "proprietary" heuristics, but no HTML spec defines any kind of default only a sequence of encoding extraction before falling back on heuristics).

AFAIK, this was first defined in HTML 2.0, perhaps even earlier:

http://tools.ietf.org/html/draft-ietf-html-spec-05#section-6.1
http://tools.ietf.org/html/draft-ietf-html-spec-05#section-9.5

It's still part of HTML 4.0:

http://www.w3.org/TR/html401/sgml/intro.html

HTTP also uses Latin-1 as default:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1

But this is getting off-topic.

-- Marc-Andre Lemburg


M.-A. Lemburg

9:38 a.m.

On 14.06.2013 10:49, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

Ezio and I gave reasons, but you've cut them away ;-)

Note that error handlers can be registered in the codec registry. You don't need to add support for them to each and every codec, so the added code is minimal.

-- Marc-Andre Lemburg
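A small sketch of what registry-level handlers mean in practice: one handler, registered once, is picked up by several unrelated codecs, and each codec only calls it for the characters it cannot encode itself. The handler name 'named-ref-demo', the function name and the sample text are assumptions for illustration, not part of the proposal.

```python
import codecs
from html.entities import codepoint2name

def named_ref_errors(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    # Convert the whole unencodable run in one call.
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        parts.append(('&%s;' % name) if name else ('&#%d;' % ord(ch)))
    return ''.join(parts), exc.end

# Registered once in the codec registry under a hypothetical name.
codecs.register_error('named-ref-demo', named_ref_errors)

text = 'café ∀ €'
results = {}
for encoding in ('ascii', 'latin-1', 'windows-1252'):
    results[encoding] = text.encode(encoding, 'named-ref-demo')
    print(encoding, results[encoding])
```

Here 'é' is replaced only under ASCII, and '€' only under ASCII and Latin-1, because windows-1252 can encode it directly as the byte 0x80.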


Antoine Pitrou

9:43 a.m.

On Fri, 14 Jun 2013 11:38:46 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...

On 14.06.2013 10:49, Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

Ezio and I gave reasons, but you've cut them away ;-)

Uh, no, you cut Ezio's own rebuttals to those reasons. Ezio's point still stands: named HTML character references have a use for *manual* entering of HTML text (though of course they are cumbersome), but that doesn't warrant a codec error handler which by construction is used for *automatic* generation of HTML text. Regards Antoine.

M.-A. Lemburg

10:11 a.m.

On 14.06.2013 11:43, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 11:38:46 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
On 14.06.2013 10:49, Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 09:44:09 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...
...
IMHO character references (named or numerical) should never be used in HTML (with the exception of " > and <). They exist mainly for three reasons: 1) provide a way to include characters that are not available in the used encoding (e.g. if you are using an obsolete encoding like windows-1252 but still want to use "fancy" characters); 2) to keep the HTML source ASCII-only;

This is the main reason for using them. HTML's default encoding is Latin-1, unlike XML.

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013. "Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

Ezio and I gave reasons, but you've cut them away ;-)

Uh, no, you cut Ezio's own rebuttals to those reasons. Ezio's point still stands: named HTML character references have a use for *manual* entering of HTML text (though of course they are cumbersome), but that doesn't warrant a codec error handler which by construction is used for *automatic* generation of HTML text.

I'm not sure I follow. I've definitely had use cases for the proposed error handler in the past and have written my own set of tools to do such conversions. Now instead of everyone writing their own little helper, it's better to have a single implementation in the stdlib.

I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

One of the reasons we keep the Python stdlib (mostly) ASCII is exactly that: to not create problems when editing source files in editors having different character set configurations. The same notion can be applied to HTML text.

Paul Moore

10:43 a.m.

On 14 June 2013 11:11, M.-A. Lemburg <mal@egenix.com> wrote:

...

I'm not sure I follow. I've definitely had use cases for the proposed error handler in the past and have written my own set of tools to do such conversions.

Just as an extra data point, I have also had need for this functionality in the past. It is sometimes possible to use xmlcharrefreplace as an alternative, but having the "named" entities in the output is often useful for debugging, if nothing else. The technicalities of HTML/HTTP encodings are not so much the issue here. Much of the output of programs that would use this functionality, while ultimately intended for consumption on the web, is often read in a text editor as part of debugging and review, if nothing else. For that purpose, readable output is very useful. And sticking to ASCII, while not essential, certainly helps in an environment like Windows where UTF-8 is *not* universal (whether it should be is really not the point here). Paul
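For readers who want to see the difference in readability being discussed, here is a minimal sketch comparing the built-in "xmlcharrefreplace" with a named-entity handler in the spirit of the proposal. The function name `html_named_errors` and the registered name "demo-htmlnamed" are assumptions made for this example; only `codecs.register_error` and `html.entities.codepoint2name` are existing stdlib APIs.

```python
import codecs
from html.entities import codepoint2name

def html_named_errors(exc):
    """Sketch of a named-entity handler (name is hypothetical)."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        # Fall back to a numeric reference when HTML 4 defines no name.
        pieces.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(pieces), exc.end

codecs.register_error("demo-htmlnamed", html_named_errors)

text = "∀ x∈ℜ"
numeric = text.encode("ascii", "xmlcharrefreplace")
named = text.encode("ascii", "demo-htmlnamed")
print(numeric)  # b'&#8704; x&#8712;&#8476;'
print(named)    # b'&forall; x&isin;&real;'
```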

Stefan Drees

10:57 a.m.

On 2013-06-14 12:43 CEST, Paul Moore wrote:

...

On 14 June 2013 11:11, M.-A. Lemburg ...wrote:

I'm not sure I follow. I've definitely had use cases for the proposed error handler in the past and have written my own set of tools to do such conversions.

Just as an extra data point, I have also had need for this functionality in the past. It is sometimes possible to use xmlcharrefreplace as an alternative, but having the "named" entities in the output is often useful for debugging, if nothing else.

The technicalities of HTML/HTTP encodings are not so much the issue here. Much of the output of programs that would use this functionality, while ultimately intended for consumption on the web, is often read in a text editor as part of debugging and review, if nothing else. For that purpose, readable output is very useful. And sticking to ASCII, while not essential, certainly helps in an environment like Windows where UTF-8 is *not* universal (whether it should be is really not the point here).

just to add to this: I have grown a hard-wired reflex when handing over program source files to admins for deployment in Windows-driven HTML/HTTP environments: ensure the admin has an editor at hand to check that the clean UTF-8 encoded text files she received do not suddenly become BOM-ed under the radar, just because the admin changed some local file path in a config file or the like and subsequently stored it "subconsciously". The time otherwise lost in hunting mystery effects counts in days but feels like weeks ... And yes, I often have to deliver utf-8 files to "ease" the HTML/HTTP handling chain, but in debugging situations IMO it seems to be good to easily resort to a pure ASCII representation without writing extra routines for it. All the best, Stefan.

Alexander Belopolsky

11:17 a.m.

On Fri, Jun 14, 2013 at 6:11 AM, M.-A. Lemburg <mal@egenix.com> wrote:

...

I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

+1 On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. It is for a good reason that every browser has a view source option more or less readily available.

Antoine Pitrou

11:20 a.m.

On Fri, 14 Jun 2013 07:17:00 -0400 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:

...

On Fri, Jun 14, 2013 at 6:11 AM, M.-A. Lemburg <mal@egenix.com> wrote:

...
I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

+1

On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. It is for a good reason that every browser has a view source option more or less readily available.

If you want to *read* HTML (not write it), then you certainly want the original Unicode characters, not the garbled HTML entities meant to represent them. Regards Antoine.

Stefan Drees

11:31 a.m.

On 2013-06-14.06 13:20, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 07:17:00 -0400 Alexander Belopolsky...wrote:

...
On Fri, Jun 14, 2013 at 6:11 AM, M.-A. Lemburg ... wrote:

...
I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

+1

On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. It is for a good reason that every browser has a view source option more or less readily available.

If you want to *read* HTML (not write it), then you certainly want the original Unicode characters, not the garbled HTML entities meant to represent them.

yes when everything just works and as a consumer, but then as the producers we are :-) in the midst of a review session a debugging attempt or when seeking a workaround, the view ascii source level of about any platform comes in quite handy ... All the best, Stefan.

Antoine Pitrou

11:35 a.m.

On Fri, 14 Jun 2013 13:31:43 +0200 Stefan Drees <stefan@drees.name> wrote:

...

On 2013-06-14.06 13:20, Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 07:17:00 -0400 Alexander Belopolsky...wrote:

...
On Fri, Jun 14, 2013 at 6:11 AM, M.-A. Lemburg ... wrote:

...
I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

+1

On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. It is for a good reason that every browser has a view source option more or less readily available.

If you want to *read* HTML (not write it), then you certainly want the original Unicode characters, not the garbled HTML entities meant to represent them.

yes when everything just works and as a consumer, but then as the producers we are :-) in the midst of a review session a debugging attempt or when seeking a workaround, the view ascii source level of about any platform comes in quite handy ...

Perhaps it does, but that's not a reason to add an error handler to Python. If you want debug output, you should write your own debug routines (or, you can simply display the HTML's repr()). So I still agree with Ezio: the function may be useful as part of the stdlib, but it doesn't have to be an encoding error handler. Regards Antoine.
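As a rough illustration of the "debug routine" alternative mentioned here: in Python 3, `repr()` of a string keeps printable non-ASCII characters, so an ASCII-only view needs `ascii()` or the built-in "backslashreplace" handler. This is a sketch of one possible approach, not something proposed in the thread.

```python
html = "<p>∀ x∈ℜ</p>"

# ascii() works like repr() but escapes every non-ASCII character.
as_text = ascii(html)
# The built-in 'backslashreplace' handler gives the same escapes as bytes.
as_bytes = html.encode("ascii", "backslashreplace")

print(as_text)   # '<p>\u2200 x\u2208\u211c</p>'
print(as_bytes)  # b'<p>\\u2200 x\\u2208\\u211c</p>'
```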

Stefan Drees

11:53 a.m.

On 2013-06-14 13:35 CEST, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 13:31:43 +0200 Stefan Drees ... wrote:

...
On 2013-06-14.06 13:20, Antoine Pitrou wrote:

...
On Fri, 14 Jun 2013 07:17:00 -0400 Alexander Belopolsky...wrote:

...
On Fri, Jun 14, 2013 at 6:11 AM, M.-A. Lemburg ... wrote:

...
I think you are forgetting that the output of such a codec is not necessarily always meant for sending over the wire to some browser. It may well be used for creating data which then has to be manipulated by other tools or humans.

+1

On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. It is for a good reason that every browser has a view source option more or less readily available.

If you want to *read* HTML (not write it), then you certainly want the original Unicode characters, not the garbled HTML entities meant to represent them.

yes when everything just works and as a consumer, but then as the producers we are :-) in the midst of a review session a debugging attempt or when seeking a workaround, the view ascii source level of about any platform comes in quite handy ...

Perhaps it does, but that's not a reason to add an error handler to Python. If you want debug output, you should write your own debug routines (or, you can simply display the HTML's repr()).

So I still agree with Ezio: the function may be useful as part of the stdlib, but it doesn't have to be an encoding error handler.

+1 based on that summarizing evaluation ... surprisingly I will have to continue writing my own debug routines ;-) All the best, Stefan.

Alexander Belopolsky

12:09 p.m.

On Fri, Jun 14, 2013 at 7:35 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

So I still agree with Ezio: the function may be useful as part of the stdlib, but it doesn't have to be an encoding error handler.

I don't understand why this functionality should be implemented as anything but an encoding error handler. It can still be implemented in the html package which would either register it itself or export a handler that applications would need to register by calling codecs.register_error(). A more user-friendly solution would be to pre-register a light-weight handler that would not import html.entities and possibly most of its own implementation until the first use.
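A light-weight pre-registered handler of the kind described could look roughly like the sketch below: the import of html.entities is deferred until the handler is first called. The names `lazy_htmlcharrefreplace` and "demo-lazyhtml" are hypothetical, and this is only one way such a stub might be written.

```python
import codecs

def lazy_htmlcharrefreplace(exc):
    """Hypothetical stub: html.entities is imported only on first use."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Deferred import: registering the handler does not load html.entities.
    from html.entities import codepoint2name
    out = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        out.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(out), exc.end

codecs.register_error("demo-lazyhtml", lazy_htmlcharrefreplace)

result = "café €".encode("ascii", "demo-lazyhtml")
print(result)  # b'caf&eacute; &euro;'
```

After the first call the module is cached in sys.modules, so later calls only pay for a dictionary lookup.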

Antoine Pitrou

12:21 p.m.

On Fri, 14 Jun 2013 08:09:49 -0400 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:

...

On Fri, Jun 14, 2013 at 7:35 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...
So I still agree with Ezio: the function may be useful as part of the stdlib, but it doesn't have to be an encoding error handler.

I don't understand why this functionality should be implemented as anything but an encoding error handler. It can still be implemented in the html package which would either register it itself or export a handler that applications would need to register by calling codecs.register_error().

Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers. Regards Antoine.

Amaury Forgeot d'Arc

12:39 p.m.

2013/6/14 Antoine Pitrou <solipsis@pitrou.net>

...

Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers.

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter. -- Amaury Forgeot d'Arc

Alexander Belopolsky

1:20 p.m.

On Fri, Jun 14, 2013 at 8:39 AM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:

...

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

+1 In fact, it is not necessary to register the codecs either. We could allow any namespace that defines CodecInfo attributes or a getregentry function. The users would then be able to write

    from encodings import ascii
    x.encode(ascii)

instead of x.encode("ascii"). The benefit is that most IDEs would provide auto-completion and as-you-type error checking, and the resulting program will not have a hidden import masquerading as a builtin call.

M.-A. Lemburg

1:55 p.m.

On 14.06.2013 14:39, Amaury Forgeot d'Arc wrote:

...

2013/6/14 Antoine Pitrou <solipsis@pitrou.net>

...
Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers.

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

For the same reason we register modules in sys.modules: to be able to reference them by name, rather than by object.

Also note that codecs expect to get the error parameter as string to keep the API simple and to make short-cuts easy to implement in the code (esp. in the C implementations).
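The name-based lookup described here is visible from Python through codecs.lookup_error(), which maps a registered name back to its callable. A small sketch follows; the exception is constructed by hand purely for demonstration, and the unknown handler name is made up.

```python
import codecs

# Resolve a built-in handler by its registered name.
handler = codecs.lookup_error("xmlcharrefreplace")

# Build the kind of exception a codec would pass to the handler:
# (encoding, object, start, end, reason).
exc = UnicodeEncodeError("ascii", "∀", 0, 1, "demo")
replacement, resume_at = handler(exc)
print(replacement, resume_at)  # &#8704; 1

# An unregistered name cannot be resolved.
try:
    codecs.lookup_error("demo-no-such-handler")
    found = True
except LookupError:
    found = False
print(found)  # False
```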

M.-A. Lemburg

1:57 p.m.

On 14.06.2013 15:55, M.-A. Lemburg wrote:

...

On 14.06.2013 14:39, Amaury Forgeot d'Arc wrote:

...
2013/6/14 Antoine Pitrou <solipsis@pitrou.net>

...
Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers.

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

For the same reason we register modules in sys.modules: to be able to reference them by name, rather than by object.

Also note that codecs expect to get the error parameter as string to keep the API simple and to make short-cuts easy to implement in the code (esp. in the C implementations).

Here's the PEP: http://www.python.org/dev/peps/pep-0293/

Amaury Forgeot d'Arc

2:31 p.m.

2013/6/14 M.-A. Lemburg <mal@egenix.com>

...

...
...
By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

For the same reason we register modules in sys.modules: to be able to reference them by name, rather than by object.

Also note that codecs expect to get the error parameter as string to keep the API simple and to make short-cuts easy to implement in the code (esp. in the C implementations).

Here's the PEP: http://www.python.org/dev/peps/pep-0293/

yes, I can understand the argument: "As this requires changes to lots of C prototypes, this approach was rejected." A callable "errors" would have avoided this whole discussion: Implement some htmlcharrefreplace function in htmllib.py, don't register it at all, and let users do .encode('ascii', htmllib.htmlcharrefreplace) or implement their own without any global change to the codecs registry. import.c was once rewritten to accept PyObject everywhere, maybe unicode codecs could have a double API as well? Yes, it's a lot of work. -- Amaury Forgeot d'Arc

Andrew Barnert

15 Jun 15 Jun

12:13 a.m.

From: Amaury Forgeot d'Arc <amauryfa@gmail.com> Sent: Friday, June 14, 2013 7:31 AM

...

2013/6/14 M.-A. Lemburg <mal@egenix.com>

...
...
By the way, why is it necessary to register?

...
Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

For the same reason we register modules in sys.modules: to be able to reference them by name, rather than by object.

Also note that codecs expect to get the error parameter as string to keep the API simple and to make short-cuts easy to implement in the code (esp. in the C implementations).

The simplicity argument is pretty clear. Everywhere the docs/docstrings/comments explain how errors strings work, they'd also have to explain that it can be a callable instead, and that callables don't have to be passed to PyCodec_LookupError/codecs.lookup_error but can (which will return the argument as-is), and …

Less seriously, it would make the analogy between the codec registry and the error handler registry weaker (therefore a bit more to learn), and it would make it a bit harder to distinguish in code between the pre-looked-up string-or-callable PyObject * and the post-looked-up callable PyObject * (something you don't even have to think about today).

But I'm not sure it really saved any effort in implementing codecs. Conceivably, someone could take advantage of the string value of the errors, but everything I can find in a quick skim of _codecmodule.c and unicodeobject.c and everything I could find online does one of three things: (a) ignore it, (b) if (error) handler = PyCodec_LookupError(error), or (c) pass error along untouched to another function which does one of the above. So really, almost all code both in the stdlib and out would be the same, except that the ones implemented in C would be parsing an "O" arg instead of a "z".

...

...
Here's the PEP: http://www.python.org/dev/peps/pep-0293/

The PEP doesn't actually explain the rationale for why it doesn't use a more complicated string-or-callable API like the one I described above. Which is perfectly reasonable. Nobody asked for it until more than a decade later, and I'm not sure how good an idea it is. Borrowing a time machine to add code people will ask for years later is impressive; borrowing a time machine to add explanations for why they won't be able to have it when they ask years later would just be silly.

...

import.c was once rewritten to accept PyObject everywhere, maybe unicode codecs could have a double API as well? Yes, it's a lot of work.

I don't think changing PyCodec*/_codecs/codecs is that much work. (M.-A. Lemburg can correct me if I'm wrong.) The big problem is the fact that the API that every codec (including third-party codecs) must implement has to change, which means you end up needing two different codec interfaces, two different registries (or one dual-type registry), etc. And I think that parallel system might have to stick around until Py4k, or at least for quite a few 3.x versions.

Plus, you have to think through the API. Does Python or C-API code need to be able to distinguish old-style and new-style codecs? (If not, what happens when you pass an error by callable to what turns out to be an old-style codec? "TypeError" seems like the obvious answer, but then it's not really true that you can pass a callable as an error handler, unless you have some out-of-band knowledge about the codec you're going to be using.)

Also: while nearly any third-party codec written in Python would just magically work as a new-API codec, "nearly" isn't good enough. And there's no way to test. Which means all such existing codecs have to be treated as old-API codecs, which sucks.

In other words, even though I don't think it would actually take much work, and I like the idea, I can't see any way of fleshing out the idea that wouldn't make me hate it. Except for the obvious one: wait until py4k and just break the PyCodec* and codec-implementation interfaces.

Steven D'Aprano

14 Jun 14 Jun

3:32 p.m.

On 14/06/13 22:39, Amaury Forgeot d'Arc wrote:

...

2013/6/14 Antoine Pitrou <solipsis@pitrou.net>

...
Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers.

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

In another post, I wrote: "There's unlikely to be much in the way of bike-shedding about either functionality or syntax." I spoke too soon :-( -- Steven

Serhiy Storchaka

15 Jun 15 Jun

6:06 a.m.

14.06.13 15:39, Amaury Forgeot d'Arc wrote:

...

By the way, why is it necessary to register? Since an error handler is defined by its callback function, we could allow functions for the "errors" parameter.

Could you please open a new topic for this discussion? Sometimes I regret that callables cannot be used as error handlers, but I understand that there are reasons for that.

Serhiy Storchaka

6:20 a.m.

14.06.13 15:21, Antoine Pitrou wrote:

...

Making registration manual would indeed be a better fit for the intended use cases, IMO. I don't think such a specialized function belongs to the built-in set of error handlers.

I agree with you. A dependency of the interpreter core on the html.entities module doesn't look very good.

Alexander Belopolsky

14 Jun 14 Jun

11:33 a.m.

On Fri, Jun 14, 2013 at 7:20 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Fri, 14 Jun 2013 07:17:00 -0400 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote: ..

...
On top of that, even HTML that is sent over the wire to a browser may end up being read by a human. ..

If you want to *read* HTML (not write it), then you certainly want the original Unicode characters, not the garbled HTML entities meant to represent them.

Not necessarily. More often than not the reason to reach for the "View Source" menu item is that the page you are looking at is garbled. In this case it is frustrating to see similarly garbled source or a stream of #NNNNs.

Serhiy Storchaka

3:09 p.m.

14.06.13 11:49, Antoine Pitrou wrote:

...

I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013.

Russian text requires 2 bytes per character in utf-8 (not counting spaces, punctuation and markup) and only 1 byte per character in any special encoding (cp1251/cp866/koi8-r). Same for other European non latin-based alphabets. Some old databases contain data in one of these 8-bit encodings, and generating an HTML page in the same encoding does not require encoding/decoding at all.
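The size difference is easy to check. The sample string below is arbitrary and chosen only for illustration; it has 9 Cyrillic letters plus a comma and a space.

```python
text = "Привет, мир"  # 11 characters: 9 Cyrillic letters, ',' and ' '

utf8_len = len(text.encode("utf-8"))     # 2 bytes per Cyrillic letter
cp1251_len = len(text.encode("cp1251"))  # 1 byte per character
koi8r_len = len(text.encode("koi8-r"))   # 1 byte per character

print(utf8_len, cp1251_len, koi8r_len)  # 20 11 11
```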

...

"Keeping the HTML source ASCII-only" is just silly IMO, and it doesn't warrant special support in Python's codec error handlers.

"xmlcharrefreplace" is as good as "htmlentityreplace" and even better for this purpose.

Antoine Pitrou

3:25 p.m.

On Fri, 14 Jun 2013 18:09:16 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:

...

14.06.13 11:49, Antoine Pitrou wrote:

...
I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013.

Russian text requires 2 bytes per character in utf-8 (not counting spaces, punctuation and markup) and only 1 byte per character in any special encoding (cp1251/cp866/koi8-r). Same for other European non latin-based alphabets.

And even latin-based (e.g. latin-1), but if you really care about this, it's certainly more efficient to compress your HTTP response than trying to save space at the character level.

...

Some old databases contain data in one of these 8-bit encodings, and generating an HTML page in the same encoding does not require encoding/decoding at all.

If it doesn't require encoding/decoding, how are you going to specify an encoding error handler? Regards Antoine.

Serhiy Storchaka

15 Jun 15 Jun

6:16 a.m.

14.06.13 18:25, Antoine Pitrou wrote:

...

On Fri, 14 Jun 2013 18:09:16 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:

...
14.06.13 11:49, Antoine Pitrou wrote:

...
I'd like to know which good reasons there are to not use utf-8 for HTML pages in 2013.

Russian text requires 2 bytes per character in utf-8 (not counting spaces, punctuation and markup) and only 1 byte per character in any special encoding (cp1251/cp866/koi8-r). Same for other European non latin-based alphabets.

And even latin-based (e.g. latin-1), but if you really care about this, it's certainly more efficient to compress your HTTP response than trying to save space at the character level.

In languages with a Latin-based alphabet, usually only a small part of the characters are non-ASCII. A utf-8 encoding adds only 5-10% to the size.

...

...
Some old databases contain data in one of these 8-bit encodings, and generating an HTML page in the same encoding does not require encoding/decoding at all.

If it doesn't require encoding/decoding, how are you going to specify an encoding error handler?

The main part of the page can be generated without encoding, but a small part can contain encoded text.

Serhiy Storchaka

14 Jun 14 Jun

3 p.m.

14.06.13 02:37, Ezio Melotti wrote:

...

On Tue, Jun 11, 2013 at 5:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:

...
I propose to add "htmlcharrefreplace" error handler which is similar to "xmlcharrefreplace" error handler but use html entity names if possible.

...
...
...
>>> '∀ x∈ℜ'.encode('ascii', 'xmlcharrefreplace')
b'&#8704; x&#8712;&#8476;'
>>> '∀ x∈ℜ'.encode('ascii', 'htmlcharrefreplace')
b'&forall; x&isin;&real;'

Do you have any use cases for this, or is it just for completeness since we already have xmlcharrefreplace?

In fact, there is no *need* for the "htmlentityreplace" error handler. "xmlcharrefreplace" is enough in most cases; it is faster and its scope is wider. "htmlentityreplace" is only desired for more human-readable HTML. Perhaps it is not worth registering this error handler by default, but I see that some people want it in the stdlib. With regard to non-utf-8 encodings of HTML, of course there are reasons for their use.


47 comments

12 participants

participants (12)

Alexander Belopolsky
Amaury Forgeot d'Arc
Andrew Barnert
Antoine Pitrou
Ethan Furman
Ezio Melotti
M.-A. Lemburg
Masklinn
Paul Moore
Serhiy Storchaka
Stefan Drees
Steven D'Aprano
