[Tutor] Html entities, beautiful soup and unicode

Tue Jan 19 11:59:09 CET 2010

On Tue, 19 Jan 2010 08:49:27 +0100
andy <cheesman at titan.physx.u-szeged.hu> wrote:

> Hi people
> 
> I'm using beautiful soup to rip the uk headlines from the uk bbc page.
> This works rather well but there is the problem of html entities which
> appear in the xml feed.
> Is there an elegant/simple way to convert them into the "standard"
> output? By this I mean &#163; going to £? or do i have to use regexp?
> and where does unicode fit into all of this?

Ha, ha!
What exactly do you mean by converting them into the "standard" output? What form do you expect, and what do you want to do with it?
Maybe your aim is to replace the number-coded HTML entities in a Python string with the real characters, encoded in a given format, so that you can output them. If so, one way is to use a simple regex and replace each match with a custom function. E.g.:

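# Python 2 (unichr, print statement)
import re

def rep(match):
    # turn one decimal entity, e.g. "&#163;", into an encoded character
    entity = match.group()                    # "&#xxx;"
    n = int(entity[2:-1])                     # code point, e.g. 163
    uchar = unichr(n)                         # matching unicode char
    # your destination format may be iso-8859-2 instead?
    return uchar.encode("utf-8")              # format-encoded char

source = "xxx&#161;xxx&#194;xxx&#255;xxx"
pat = re.compile(r"&#\d+;")                   # decimal entities only, not &#xA3; or &pound;
print pat.sub(rep, source)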
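
As for where unicode fits in: decoding the entities gives you unicode text, and you encode it to bytes only at the moment of output. On Python 3 (3.4 or later) the standard library's html.unescape does the entity decoding, including hex (&#xA3;) and named (&pound;) entities, so no regex is needed. A minimal sketch, assuming Python 3; the helper name decode_entities and the variables text and data are only illustrative, and utf-8 is an assumed output encoding:

```python
import html

def decode_entities(source):
    # hypothetical helper: thin wrapper around the stdlib html.unescape
    return html.unescape(source)

source = "xxx&#161;xxx&#194;xxx&#255;xxx"
text = decode_entities(source)       # str holding the real characters
data = text.encode("utf-8")          # bytes; utf-8 is an assumption, pick your own
print(data)                          # bytes repr, safe on any terminal
```

The same call turns "&#163;", "&#xA3;" and "&pound;" all into the pound sign.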

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/