[Tutor] Convert XML codes to "normal" text?

Senthil Kumaran orsenthil at gmail.com
Wed Mar 4 08:01:04 CET 2009


On Wed, Mar 4, 2009 at 11:13 AM, Eric Dorsey <dorseye at gmail.com> wrote:
> I know, for example, that the &gt; code means >, but what I don't know is
> how to convert it in all my data to show properly? I

Feedparser returns its output as HTML, so expect HTML tags and
entities in the output.
What you want is to unescape the HTML entities
( http://effbot.org/zone/re-sub.htm#unescape-html ):

import feedparser
import re, htmlentitydefs

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # raw string so the regex backslash is passed through untouched
    return re.sub(r"&#?\w+;", fixup, text)


d = feedparser.parse('http://snipt.net/dorseye/feed')

# iterate over the entries directly; no manual index counter is needed
for entry in d['entries']:
    print unescape(entry.title)
    print unescape(entry.summary)
    print
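
If you are on Python 3 (3.4 or later), the code above will not run as
written, since htmlentitydefs and unichr are Python 2 names. A minimal
sketch of the same idea, assuming the standard library's html.unescape
is acceptable in place of the regex helper; the sample strings are made
up for illustration and are not taken from the feed:

```python
# Python 3 sketch: html.unescape decodes named, decimal and hex references.
from html import unescape

# Hypothetical sample strings, standing in for feed titles/summaries.
samples = [
    "a &gt; b",       # named entity
    "caf&#233;",      # decimal character reference
    "&#x41;BC",       # hexadecimal character reference
    "&zzzunknown;",   # unknown name, expected to be left as is
]

decoded = [unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(before, "->", after)
```

The same call should work on the title and summary strings that
feedparser returns, in place of the unescape() helper defined above.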



HTH,
Senthil

