[Tutor] Convert XML codes to "normal" text?

Eric Dorsey dorseye at gmail.com
Wed Mar 4 20:41:45 CET 2009


Senthil,

That worked like a charm, thank you for the help! Now my Snipts are
actually legible :)


On Wed, Mar 4, 2009 at 12:01 AM, Senthil Kumaran <orsenthil at gmail.com> wrote:

> On Wed, Mar 4, 2009 at 11:13 AM, Eric Dorsey <dorseye at gmail.com> wrote:
> > I know, for example, that the &gt; code means >, but what I don't know is
> > how to convert it in all my data to show properly? I
>
> Feedparser returns its output as HTML only, so expect HTML tags and
> entities in the output.
> What you want is to unescape the HTML entities (
> http://effbot.org/zone/re-sub.htm#unescape-html )
>
> import feedparser
> import re, htmlentitydefs
>
> def unescape(text):
>    def fixup(m):
>        text = m.group(0)
>        if text[:2] == "&#":
>            # character reference
>            try:
>                if text[:3] == "&#x":
>                    return unichr(int(text[3:-1], 16))
>                else:
>                    return unichr(int(text[2:-1]))
>            except ValueError:
>                pass
>        else:
>            # named entity
>            try:
>                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
>            except KeyError:
>                pass
>        return text # leave as is
>    return re.sub(r"&#?\w+;", fixup, text)
>
>
> d = feedparser.parse('http://snipt.net/dorseye/feed')
>
> # Print each entry's title and summary with the entities decoded.
> for entry in d['entries']:
>    print unescape(entry.title)
>    print unescape(entry.summary)
>    print
>
>
>
> HTH,
> Senthil
>
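
The quoted code is Python 2 (htmlentitydefs, unichr, the print statement).
For anyone reading this on Python 3, here is a minimal sketch of the same
regex-based approach. It assumes Python 3.4 or later; the sample string is
made up for illustration, and the feedparser call is left out so the example
runs without network access. The standard library's html.unescape() is shown
alongside as the simpler route for the common cases.

```python
import html
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs


def unescape(text):
    """Replace character references and named entities with their characters."""
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # Numeric character reference, decimal (&#62;) or hex (&#x3e;).
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # Named entity such as &gt; or &amp;.
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # unknown or malformed reference: leave as is
    return re.sub(r"&#?\w+;", fixup, text)


# Hypothetical sample text standing in for a feed entry's summary.
sample = "if a &gt; b &amp;&amp; c &lt; d: print(&quot;ok&quot;)"
print(unescape(sample))
print(html.unescape(sample))
```

Both calls decode the sample to the same text. html.unescape() also handles
some cases the regex version skips, such as entities written without the
trailing semicolon.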