unescape HTML entities
Klaus Alexander Seistrup
klaus at seistrup.dk
Sat Oct 28 17:40:05 EDT 2006
Rares Vernica wrote:
> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with
> "&amp;", "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only this
> dictionary is the opposite of what I need, it does not have
> "&nbsp;".
How about something like:
#v+
#!/usr/bin/env python
'''dehtml.py'''

import re
import htmlentitydefs

# Match any named entity known to htmlentitydefs, e.g. "&nbsp;"
myrx = re.compile('&(' + '|'.join(htmlentitydefs.name2codepoint.keys()) + ');')

def dehtml(s):
    # Replace each matched entity with the corresponding unicode character
    return re.sub(
        myrx,
        lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]),
        s
    )
# end def dehtml

if __name__ == '__main__':
    import sys
    print dehtml(sys.stdin.read()).encode('utf-8')
# end if
#v-
E.g.:
#v+
$ echo 'fr&aelig;kke fr&oslash;l&aring;r' | ./dehtml.py
frække frølår
$
#v-
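
For readers on Python 3, where htmlentitydefs became html.entities and unichr() became chr(), the same idea might be sketched roughly as follows. This is only an illustration under those assumptions, not part of the script above; the name ENTITY_RX and the sample string are made up for the example. The standard library's html.unescape() (Python 3.4 and later) is shown next to it for comparison, since it also handles numeric references such as "&#229;", which the regex approach does not.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Same approach as dehtml.py: match only named entities listed in
# name2codepoint. ENTITY_RX is a name chosen for this sketch.
ENTITY_RX = re.compile('&(' + '|'.join(name2codepoint) + ');')


def dehtml(s):
    """Replace named HTML entities such as &aelig; with their characters."""
    return ENTITY_RX.sub(lambda m: chr(name2codepoint[m.group(1)]), s)


if __name__ == '__main__':
    sample = 'fr&aelig;kke fr&oslash;l&aring;r'
    # Regex-based replacement of named entities
    print(dehtml(sample))
    # Standard-library alternative; also decodes numeric references
    print(unescape(sample + ' &#229;'))
```

Unknown names such as "&bogus;" are left untouched by dehtml(), because the pattern only matches names present in name2codepoint.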
--
Klaus Alexander Seistrup
Copenhagen, Denmark, EU
http://klaus.seistrup.dk/