[Tutor] Html entities, beautiful soup and unicode
Gerard Flanagan
grflanagan at gmail.com
Tue Jan 19 11:29:49 CET 2010
andy wrote:
> Hi people
>
> I'm using beautiful soup to rip the uk headlines from the uk bbc page.
> This works rather well but there is the problem of html entities which
> appear in the xml feed.
> Is there an elegant/simple way to convert them into the "standard"
> output? By this I mean &pound; going to £? or do i have to use regexp?
> and where does unicode fit into all of this?
>
import re

# Fredrik Lundh, http://effbot.org/zone/re-sub.html
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3].lower() == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            import htmlentitydefs
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

print unescape('&pound;')
£
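
The function above is Python 2 code (unichr, htmlentitydefs, the print statement). As a rough sketch only, assuming Python 3.4 or later, the same approach could be ported as below: chr stands in for unichr and html.entities.name2codepoint for htmlentitydefs.name2codepoint. The sample strings are made up for illustration. The standard library's html.unescape covers the same ground and is probably the simpler choice if it is available.

```python
import html
import re
from html.entities import name2codepoint


def unescape(text):
    """Python 3 port of the function above (a sketch, not the original code)."""
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[:3].lower() == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # leave unknown references as they are
    return re.sub(r"&#?\w+;", fixup, text)


# Illustrative inputs only; they are not taken from the BBC feed.
samples = ["&pound;", "&#163;", "&#xA3;", "Fish &amp; chips", "&nosuchentity;"]
for s in samples:
    print(repr(s), "->", repr(unescape(s)))

# The standard library equivalent, for comparison.
print(html.unescape("&pound; &#163; &#xA3;"))
```

Either way the result is a unicode string, so any encoding to bytes (UTF-8, say) only happens when the text is written out.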