[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

Mon, 5 Aug 2002 17:47:03 +0300

On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
> > and the decimal character reference isn't really more useful than
> > the named entity reference.
>
> really?  converting a decimal character reference to a unicode character
> is trivial, but how do you convert a named entity reference to a unicode
> character?  (look it up in the htmlentitydefs?)
>
> here's a trivial piece of code that converts the entitydefs dictionary to
> an entity->unicode mapping:
>
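>     from htmlentitydefs import entitydefs
>
>     # map each entity name to its decoded unicode character
>     entitydefs_unicode = {}
>     for entity, char in entitydefs.items():
>         if char[:2] == "&#":
>             # value is a decimal character reference such as "&#945;"
>             char = unichr(int(char[2:-1]))
>         else:
>             # value is a single Latin-1 byte
>             char = unicode(char, "iso-8859-1")
>         entitydefs_unicode[entity] = char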

Sure, it's trivial, but why should I be forced to do this conversion? I'm
sorry if I didn't explain myself very well. What I meant is not that the
entitydefs dictionary is useless, but that decimal character references are
not useful by themselves - they are just another intermediate form.  Why
does the dictionary convert from "&alpha;" to "&#945;" and not to the
fully decoded form, which is the single unicode character u'\u03b1'?
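
To make the "fully decoded form" concrete, here is a rough sketch of the
mapping I have in mind. It is written for a present-day Python 3, where the
htmlentitydefs module lives on as html.entities and provides a
name2codepoint table; the name entity_to_char is just something I made up
for illustration, not an existing API.

```python
from html.entities import name2codepoint

# Hypothetical table: entity name -> the single decoded character,
# skipping the "&#NNN;" intermediate form entirely.
entity_to_char = {name: chr(cp) for name, cp in name2codepoint.items()}

print(repr(entity_to_char["alpha"]))
```

With a table like this, looking up "alpha" gives the character directly and
no second parsing step is needed.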

I can't think of a case where numeric references are really useful by
themselves and not as some intermediate form.  Browsers understand
"&alpha;" and "&#945;" equally well. Humans find the named references
easier to understand. Processing programs can't understand "&#945;"
without first isolating the digits and converting them to a number. 
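
As an illustration of that extra step, this is roughly what a processing
program has to do with a decimal reference before it can use it. It is only
a sketch in present-day Python 3; decode_decimal_refs is a hypothetical
helper name, and it deliberately ignores hexadecimal references like
"&#x3b1;".

```python
import re

# Matches decimal character references such as "&#945;".
_DECIMAL_REF = re.compile(r"&#([0-9]+);")

def decode_decimal_refs(text):
    """Replace each decimal reference with the character it denotes."""
    # Isolate the digits, convert them to a number, then to a character.
    # chr() raises ValueError for numbers outside the Unicode range.
    return _DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

print(repr(decode_decimal_refs("&#945;")))
```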

About case sensitivity, you're right - smashing case does lose some
information. If a parser needs to understand sloppy, manually generated
HTML with references like "&GT;", it should be a little smarter than that.
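
One way to be "a little smarter", sketched for present-day Python 3 with
the html.entities.name2codepoint table: try the exact name first and fall
back to lowercase only when that fails, so that "Alpha" and "alpha" stay
distinct. The function name lookup_entity is hypothetical, and the
fallback policy is just one possible choice.

```python
from html.entities import name2codepoint

def lookup_entity(name):
    """Return the character for an entity name, or None if unknown."""
    cp = name2codepoint.get(name)
    if cp is None:
        # Exact match failed; tolerate sloppy input such as "GT".
        cp = name2codepoint.get(name.lower())
    return None if cp is None else chr(cp)

print(repr(lookup_entity("GT")), repr(lookup_entity("Alpha")))
```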

	Oren