[Python-ideas] Add "htmlcharrefreplace" error handler
M.-A. Lemburg
mal at egenix.com
Tue Jun 11 18:33:22 CEST 2013
On 11.06.2013 17:52, Steven D'Aprano wrote:
> On 12/06/13 01:38, M.-A. Lemburg wrote:
>> On 11.06.2013 17:29, Serhiy Storchaka wrote:
>>> 11.06.13 17:49, Serhiy Storchaka wrote:
>>>> I propose adding an "htmlcharrefreplace" error handler, which is similar to the
>>>> "xmlcharrefreplace" error handler but uses HTML entity names where possible.
>>>
>>> Or it should be named "htmlentityreplace"?
>>
>> Yes, since that's the more accurate and intuitive name.
>
> Intuitive, perhaps, but I'm not sure about accurate. According to Wikipedia:
>
> [quote]
> Although in popular usage character references are often called "entity references" or even
> "entities", this usage is wrong.[citation needed] A character reference is a reference to a
> character, not to an entity. Entity reference refers to the content of a named entity. An entity
> declaration is created by using the <!ENTITY name "value"> syntax in a document type definition
> (DTD) or XML schema.
> [end quote]
>
>
> https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
I think the HTML standard is the correct reference here, not some
"citation needed" comment ;-)
In HTML4, the official name is "character entity references".
http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#h-5.3.2
In the HTML5 draft they are now called "named character references".
http://www.w3.org/TR/html5/syntax.html#character-references
The Python module is called html.entities, so let's stick with
that.
BTW: Just like with the Unicode names, a lot of code points outside the
ASCII range do not have a character entity reference.
I guess those should be replaced with numeric character references:
http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#h-5.3.1
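
To make the fallback concrete, here is a minimal sketch of such a handler. It
assumes the name "htmlentityreplace" discussed above; the function is purely
illustrative and not an existing stdlib handler. It uses the named entity from
html.entities.codepoint2name when one exists and a decimal numeric character
reference otherwise.

```python
import codecs
from html.entities import codepoint2name


def htmlentityreplace(exc):
    """Illustrative handler (hypothetical name, not part of the stdlib)."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        name = codepoint2name.get(cp)
        if name is not None:
            # A named character entity reference exists, e.g. &eacute;
            parts.append("&%s;" % name)
        else:
            # No name available: fall back to a numeric character reference
            parts.append("&#%d;" % cp)
    # Return the replacement text and the position at which to resume encoding
    return "".join(parts), exc.end


codecs.register_error("htmlentityreplace", htmlentityreplace)

print("caf\u00e9 \u20ac \u2603".encode("ascii", "htmlentityreplace"))
# b'caf&eacute; &euro; &#9731;'
```

U+2603 has no entry in codepoint2name, so it takes the numeric route, which is
the same output "xmlcharrefreplace" would produce for it.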
Note: It's not clear whether HTML allows numeric character references
outside the base plane. In theory it should be possible, but it is not
obvious whether browsers and other tools can actually handle non-BMP
references such as &#119966; (𝒞, U+1D49E). It works in recent Firefox
and SeaMonkey.
Some examples:
http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use
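
On the Python side, a non-BMP code point already round-trips through a numeric
character reference. The short check below is only a sketch: it assumes
Python 3.4 or later for html.unescape, and it says nothing about how browsers
behave.

```python
import html

ch = "\U0001D49E"  # MATHEMATICAL SCRIPT CAPITAL C, outside the BMP

# The existing "xmlcharrefreplace" handler emits one decimal reference
# for the whole code point, not one per UTF-16 surrogate.
ref = ch.encode("ascii", "xmlcharrefreplace")
print(ref)  # b'&#119966;'

# html.unescape (Python 3.4+) maps the reference back to the character.
assert html.unescape(ref.decode("ascii")) == ch

# In UTF-16 the same character needs a surrogate pair (4 bytes), which is
# where tools that assume 16-bit characters may go wrong.
assert len(ch.encode("utf-16-le")) == 4
```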
--
Marc-Andre Lemburg
eGenix.com