[Tutor] Convert XML codes to "normal" text?
Eric Dorsey
dorseye at gmail.com
Wed Mar 4 20:41:45 CET 2009
Senthil,
That worked like a charm, thank you for the help! Now my Snipts are
actually legible :)
On Wed, Mar 4, 2009 at 12:01 AM, Senthil Kumaran <orsenthil at gmail.com> wrote:
> On Wed, Mar 4, 2009 at 11:13 AM, Eric Dorsey <dorseye at gmail.com> wrote:
> > I know, for example, that the &gt; code means >, but what I don't know is
> > how to convert it in all my data to show properly? I
>
> Feedparser returns its output as HTML, so expect HTML tags and
> entities in the output.
> What you want is to unescape the HTML entities (
> http://effbot.org/zone/re-sub.htm#unescape-html )
>
> import feedparser
> import re, htmlentitydefs
>
> def unescape(text):
>     def fixup(m):
>         text = m.group(0)
>         if text[:2] == "&#":
>             # character reference
>             try:
>                 if text[:3] == "&#x":
>                     return unichr(int(text[3:-1], 16))
>                 else:
>                     return unichr(int(text[2:-1]))
>             except ValueError:
>                 pass
>         else:
>             # named entity
>             try:
>                 text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
>             except KeyError:
>                 pass
>         return text # leave as is
>     return re.sub("&#?\w+;", fixup, text)
>
>
> d = feedparser.parse('http://snipt.net/dorseye/feed')
>
> x=0
> for i in d['entries']:
>     print unescape(d['entries'][x].title)
>     print unescape(d['entries'][x].summary)
>     print
>     x+=1
>
>
>
> HTH,
> Senthil
>
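For anyone reading this on Python 3: the quoted function relies on unichr and htmlentitydefs, which only exist in Python 2. A minimal sketch of the same idea, assuming Python 3.4 or later, can lean on the standard library's html.unescape instead. The helper name unescape_all and the sample strings are made up for illustration; they stand in for the feed titles and summaries and do not come from the actual feed.

```python
import html


def unescape_all(strings):
    """Return a new list with HTML entities in each string decoded.

    Hypothetical helper for illustration; html.unescape handles named
    entities (&gt;), decimal references (&#233;) and hex references (&#x41;).
    """
    return [html.unescape(s) for s in strings]


# Made-up sample data standing in for feed entry titles/summaries.
samples = [
    "if a &gt; b:",
    "Tom &amp; Jerry",
    "caf&#233;",
    "&#x41;BC",
]

for line in unescape_all(samples):
    print(line)
```

Text without entities passes through unchanged, so it should be safe to run every title and summary through the helper.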