[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

Fredrik Lundh fredrik@pythonware.com
Mon, 5 Aug 2002 15:57:10 +0200

Oren Tirosh wrote:

> In its current form I find htmlentitydefs.py pretty useless.

I use it a lot, and find it reasonably useful.  sure beats typing in
the HTML character tables myself, or writing a DTD parser.

> Names in the input in arbitrary case will not match the MixedCase
> keys in the entitydefs dictionary

people who use oddball characters may prefer to keep uppercase
letters separate from lowercase letters.  if I type "Linköping" using
a named entity, I don't want it to come out as "LinkÖping".

if you don't care, nothing stops you from using the "lower" string method.
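
a minimal sketch of what folding case does to a lookup, written for
current Python 3 (where the table lives in html.entities and
name2codepoint maps names to code points).  the helper name
lookup_folded is hypothetical, not something from the patch or the
module:

```python
from html.entities import name2codepoint

def lookup_folded(name):
    """Look up an entity name after folding it to lower case.

    Hypothetical helper for illustration only. Returns the character,
    or None if the lower-cased name is not a known entity.
    """
    codepoint = name2codepoint.get(name.lower())
    if codepoint is None:
        return None
    return chr(codepoint)

# the case-sensitive table keeps the two letters apart
upper = chr(name2codepoint["Ouml"])   # capital O with diaeresis
lower = chr(name2codepoint["ouml"])   # small o with diaeresis

# the folded lookup cannot tell them apart any more
folded = lookup_folded("Ouml")
```

with folding, "Ouml" and "ouml" both resolve to the lowercase letter,
which is exactly the Linköping problem described above.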

> and the decimal character reference isn't really more useful than
> the named entity reference.

really?  converting a decimal character reference to a unicode character
is trivial, but how do you convert a named entity reference to a unicode
character?  (look it up in the htmlentitydefs?)
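
to show how little the decimal form needs, here is a sketch in current
Python 3 (chr takes the place of unichr).  the function name
decode_decimal_reference is made up for this example:

```python
def decode_decimal_reference(ref):
    """Turn a decimal character reference such as "&#246;" into a character.

    Hypothetical helper for illustration; it only handles the decimal
    form, not hexadecimal ("&#xF6;") or named references.
    """
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError("not a decimal character reference: %r" % ref)
    # strip the leading "&#" and the trailing ";", then convert the digits
    return chr(int(ref[2:-1]))

o_umlaut = decode_decimal_reference("&#246;")
euro = decode_decimal_reference("&#8364;")
```

no table is involved at all; a named reference, by contrast, cannot be
decoded without one.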

here's a trivial piece of code that converts the entitydefs dictionary to
an entity->unicode mapping:

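    from htmlentitydefs import entitydefs

    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            # decimal character reference, e.g. "&#913;"
            char = unichr(int(char[2:-1]))
        else:
            # a single ISO-8859-1 (Latin-1) byte
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char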
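
for readers on current Python 3, a rough equivalent of the loop above,
plus one way such a mapping might be used.  this is a sketch under
assumptions: it relies on html.entities.name2codepoint instead of the
entitydefs strings, and the names entity_to_char and
replace_named_entities are invented for the example:

```python
import re
from html.entities import name2codepoint

# entity name -> one-character string, keys stay case-sensitive
entity_to_char = {name: chr(cp) for name, cp in name2codepoint.items()}

# matches references like "&ouml;" or "&frac12;"
_named_ref = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def replace_named_entities(text):
    """Replace known named entity references in text with characters."""
    def substitute(match):
        # unknown names are left exactly as they were written
        return entity_to_char.get(match.group(1), match.group(0))
    return _named_ref.sub(substitute, text)

city = replace_named_entities("Link&ouml;ping")
shouted = replace_named_entities("Link&Ouml;ping")
untouched = replace_named_entities("a &bogus; name")
```

because the keys keep their original case, "&ouml;" and "&Ouml;" still
come out as different letters.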