[Tutor] Html entities, beautiful soup and unicode

Tue Jan 19 11:29:49 CET 2010

andy wrote:
> Hi people
> 
> I'm using Beautiful Soup to rip the UK headlines from the UK BBC page.
> This works rather well, but there is the problem of HTML entities which
> appear in the XML feed.
> Is there an elegant/simple way to convert them into the "standard"
> output? By this I mean &#163; going to £. Or do I have to use a regexp?
> And where does Unicode fit into all of this?
> 

import re
import htmlentitydefs  # Python 2 module; named html.entities in Python 3

# Fredrik Lundh, http://effbot.org/zone/re-sub.html
def unescape(text):
    """Replace HTML/XML character references and named entities in text."""
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # numeric character reference, e.g. &#163; or &#xA3;
            try:
                if text[:3].lower() == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity, e.g. &pound;
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

print unescape('&#163;')

£
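
For anyone reading this on Python 3 (an assumption, since the code above is
Python 2), the standard library's html.unescape should cover the same ground
without a regexp. The sketch below also touches the Unicode part of the
question: the decoded result is a text string, and it only turns into bytes
when you encode it for output. The wrapper name unescape_entities and the
sample headline are made up for illustration; they are not from the BBC feed.

```python
import html


def unescape_entities(text):
    """Convert numeric and named HTML entities in text to characters."""
    # html.unescape handles decimal (&#163;), hex (&#xA3;) and named
    # (&pound;) references in a single pass.
    return html.unescape(text)


# Hypothetical headline, only here to exercise the three entity forms.
headline = "Fuel prices rise by &#163;5, &pound;10 or &#xA3;20 &amp; more"
decoded = unescape_entities(headline)
print(decoded)

# Unicode fits in at the boundary: decoded is text, and it becomes bytes
# only when encoded for a file, socket, or terminal.
utf8_bytes = decoded.encode("utf-8")
print(utf8_bytes)
```

If the terminal or output file uses a different encoding than UTF-8, pass
that encoding's name to encode() instead.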

