[Tutor] Html entities, beautiful soup and unicode
grflanagan at gmail.com
Tue Jan 19 11:29:49 CET 2010
> Hi people
> I'm using beautiful soup to rip the uk headlines from the uk bbc page.
> This works rather well but there is the problem of html entities which
> appear in the xml feed.
> Is there an elegant/simple way to convert them into the "standard"
> output? By this I mean &pound; going to £? Or do I have to use regexp?
> and where does unicode fit into all of this?
# Fredrik Lundh, http://effbot.org/zone/re-sub.html
import re, htmlentitydefs

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":  # character reference
            try:
                if text[:3].lower() == "&#x":
                    return unichr(int(text[3:-1], 16))
                return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:  # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
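The recipe above is written for Python 2 (unichr and htmlentitydefs). For anyone on Python 3, here is a rough sketch of the same idea. The sample headline string is made up for illustration, and the names ref, headline, encoded and builtin are my own. The standard library's html.unescape does much the same job, so the hand-rolled version is mainly useful for seeing what happens. On the unicode question: the function returns a Unicode string, and you only encode it to bytes (UTF-8 here, as an assumption about the output you want) when writing it out.

```python
import html
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace character references and named entities with the characters they stand for."""
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # numeric character reference, hex (&#xA3;) or decimal (&#163;)
            try:
                if ref[:3].lower() == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass  # code point out of range
        else:
            # named entity such as &pound;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass  # unknown entity name
        return ref  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)


# Hypothetical headline text, not taken from the real feed.
headline = unescape("Fuel up 5p to &pound;1.12 &#163; &#xA3; &bogus;")

# Encode only at the output boundary; inside the program keep Unicode strings.
encoded = headline.encode("utf-8")
print(encoded)

# The standard library equivalent for the common case.
builtin = html.unescape("&pound;1.12")
```

Unknown references such as &bogus; are passed through unchanged, which matches the "leave as is" behaviour of the original recipe.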