unescape HTML entities

Klaus Alexander Seistrup klaus at seistrup.dk
Sat Oct 28 17:40:05 EDT 2006


Rares Vernica wrote:

> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with
> "&amp;", "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only this
> dictionary is the opposite of what I need, it does not have
> "&nbsp;".

How about something like:

#v+
#!/usr/bin/env python
'''dehtml.py'''

import re
import htmlentitydefs

myrx = re.compile('&(' + '|'.join(htmlentitydefs.name2codepoint.keys()) + ');')

def dehtml(s):
    return re.sub(
        myrx,
        lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]),
        s
    )
# end def dehtml

if __name__ == '__main__':
    import sys
    print dehtml(sys.stdin.read()).encode('utf-8')
# end if

#v-

E.g.:

#v+

$ echo 'fr&aelig;kke fr&oslash;l&aring;r' | ./dehtml.py
frække frølår
$ 

#v-
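
The script above only knows named entities and is written for Python 2. As a rough sketch, assuming a Python 3 interpreter (where htmlentitydefs lives at html.entities and unichr is spelled chr), the same idea can be extended to numeric references such as "&#230;" and "&#xE6;". The names ENTITY_RX and dehtml3 are made up for this illustration; html.unescape() from the standard library (Python 3.4 and later) is shown alongside for comparison:

```python
import html
import re
from html.entities import name2codepoint

# Matches named (&aelig;), decimal (&#230;) and hexadecimal (&#xE6;) references.
ENTITY_RX = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def dehtml3(s):
    """Replace HTML character references in s; leave unknown ones untouched."""
    def replace(m):
        ref = m.group(1)
        try:
            if ref[:2].lower() == '#x':
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith('#'):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named entity
        except (KeyError, ValueError, OverflowError):
            return m.group(0)                  # unknown name or bad code point
    return ENTITY_RX.sub(replace, s)


if __name__ == '__main__':
    sample = 'fr&aelig;kke fr&oslash;l&aring;r'
    print(dehtml3(sample))
    print(html.unescape(sample))   # standard-library equivalent
```

Each reference is replaced in a single pass, so "&amp;lt;" becomes "&lt;" and is not decoded a second time. Where html.unescape() is available it is probably the simpler choice.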

-- 
Klaus Alexander Seistrup
Copenhagen, Denmark, EU
http://klaus.seistrup.dk/


