[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

Oren Tirosh oren-py-d@hishome.net
Mon, 5 Aug 2002 17:47:03 +0300

On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
> > and the decimal character reference isn't really more useful than
> > the named entity reference.
> really?  converting a decimal character reference to a unicode character
> is trivial, but how do you convert a named entity reference to a unicode
> character?  (look it up in the htmlentitydefs?)
> here's a trivial piece of code that converts the entitydefs dictionary to
> a entity->unicode mapping:
>     entitydefs_unicode = {}
>     for entity, char in entitydefs.items():
>         if char[:2] == "&#":
>             char = unichr(int(char[2:-1]))
>         else:
>             char = unicode(char, "iso-8859-1")
>         entitydefs_unicode[entity] = char

Sure it's trivial but why should I be forced to do this conversion? I'm
sorry if I didn't explain myself so well. What I meant is not that the
entitydefs dictionary is useless but that decimal character references are
not useful by themselves - they are just another intermediate form.  Why
does the dictionary convert from "α" to "α" and not to the
fully decoded form which is the single unicode character u'\u03b1'?

I can't think of a case where numeric references are really useful by
themselves and not as some intermediate form.  Browsers understand
"α" and "α" equally well. Humans find the named references
easier to understand. Processing programs can't understand "α"
without first isolating the digits and converting them to a number. 

About case sensitivity you're right - smashing case does lose some
information. If a parser needs to understand sloppy manually-generated
HTML with tags like ">" it should be a little smarter than that.