[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

M.-A. Lemburg mal@lemburg.com
Mon, 05 Aug 2002 17:01:34 +0200

Oren Tirosh wrote:
> On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
>>>and the decimal character reference isn't really more useful than
>>>the named entity reference.
>>really?  converting a decimal character reference to a unicode character
>>is trivial, but how do you convert a named entity reference to a unicode
>>character?  (look it up in the htmlentitydefs?)
>>here's a trivial piece of code that converts the entitydefs dictionary to
>>an entity->unicode mapping:
>>    entitydefs_unicode = {}
>>    for entity, char in entitydefs.items():
>>        if char[:2] == "&#":
>>            char = unichr(int(char[2:-1]))
>>        else:
>>            char = unicode(char, "iso-8859-1")
>>        entitydefs_unicode[entity] = char
> Sure it's trivial but why should I be forced to do this conversion? 

Maybe because users of htmlentitydefs don't want to pay for
the extra table even though they don't use it?

> I'm sorry if I didn't explain myself so well. What I meant is not that the
> entitydefs dictionary is useless but that decimal character references are
> not useful by themselves - they are just another intermediate form.  Why
> does the dictionary convert from "&alpha;" to "&#945;" and not to the
> fully decoded form which is the single unicode character u'\u03b1'?

Because that only works for Unicode and not all applications
are written to work with Unicode. The table maps entities to
Latin-1 which is HTML's default encoding.

> I can't think of a case where numeric references are really useful by
> themselves and not as some intermediate form.  Browsers understand
> "&#945;" and "&alpha;" equally well. Humans find the named references
> easier to understand. Processing programs can't understand "&#945;"
> without first isolating the digits and converting them to a number. 
> About case sensitivity you're right - smashing case does lose some
> information. If a parser needs to understand sloppy manually-generated
> HTML with tags like "&GT;" it should be a little smarter than that.

That is application specific. The htmlentitydefs were generated
from the HTML spec files themselves, so they provide the basics
needed to work from. It's easy enough for you to write a function
which translates the basic table into anything you need.
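
A minimal sketch of such a function, written for current Python 3 rather
than the Python of this thread. The name to_unicode_table and the
three-entry sample dictionary are made up for illustration; the sample only
mimics the shape of the table discussed above (literal characters for the
Latin-1 range, decimal character references for everything else) and is not
the real htmlentitydefs data. Only decimal references are handled.

```python
def to_unicode_table(entitydefs):
    """Return a new dict mapping entity names to single Unicode characters."""
    table = {}
    for name, value in entitydefs.items():
        if value.startswith("&#") and value.endswith(";"):
            # Decimal character reference such as "&#945;": keep only the digits.
            table[name] = chr(int(value[2:-1]))
        else:
            # Already a literal character (the Latin-1 range in the basic table).
            table[name] = value
    return table


# Hypothetical sample in the shape described above, not the real table.
sample = {"alpha": "&#945;", "eacute": "\xe9", "gt": ">"}
unicode_table = to_unicode_table(sample)
```

The basic table is left untouched and the derived one is built only by code
that asks for it, so applications that do not work with Unicode pay nothing
extra.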

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/