RSS feed issues, or how to read each item exactly once
nagle at animats.com
Sat Mar 21 21:12:45 CET 2009
I've been using the "feedparser" module, and it turns out that
some RSS feeds don't quite do RSS right.
On the Reuters RSS feed, the "Etag" changes about once every fifteen
minutes, even when there are no new stories. A program of mine has been
logging this:
WARNING: Feed "http://feeds.reuters.com/reuters/topNews?format=xml": Etag
changed from "YH2PzNGiblDEe3z0hw2T2PLelCs"
to "uGI/GLFvX9zQ+o4cdU2pFAetbEE" but no new content.
Etags are just an optimization, so that's not too serious. But
there are worse problems.
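A check like the one behind that warning could be sketched as follows.
This is not the logging program mentioned above, just a minimal
illustration. It assumes each entry is dict-like with "title" and
"summary" keys, as "feedparser" entries are. The function names
content_digest and etag_changed_without_news are made up for this
example.

```python
import hashlib


def content_digest(entries):
    """Return one hash covering the title and summary of every entry.

    Entries are sorted first, so a mere reordering of the same stories
    yields the same digest.
    """
    parts = sorted(
        e.get("title", "") + "\n" + e.get("summary", "") for e in entries
    )
    h = hashlib.sha1()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent entries cannot run together
    return h.hexdigest()


def etag_changed_without_news(old_etag, new_etag, old_digest, new_digest):
    """True when the server sent a new Etag but the stories are unchanged."""
    return old_etag != new_etag and old_digest == new_digest
```

With "feedparser", the entries would come from the "entries" attribute
of the result of feedparser.parse(url, etag=previous_etag), and the new
Etag from its "etag" attribute when the server supplies one.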
Sometimes the item ID for a story changes even though the story text
has not. When a story stays on the Reuters feed for more than a day, it
gets a new ID each day.
Also, a higher-priority story sometimes pushes an old story out of the
ten stories returned in the feed. But the higher-priority story may
disappear in a later feed cycle, and the old story may then come back.
So you can't actually trust the Etag or the item ID, and have to back
them up with checks of your own if you want exactly one copy of each
item. It's something that "feedparser" should perhaps do.
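One way to do such a check is to key each story on its text instead of
its ID, and to remember every key ever seen rather than only those in
the previous poll. The sketch below is an illustration under stated
assumptions, not something "feedparser" provides. It assumes entries
are dict-like with "title" and "summary" keys, and that an unchanged
story keeps the same title and summary. story_key and SeenStories are
hypothetical names.

```python
import hashlib


def story_key(entry):
    """Build a key from the story text, ignoring the feed's item ID."""
    text = entry.get("title", "") + " " + entry.get("summary", "")
    # Collapse whitespace and case so trivial re-renderings still match.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenStories:
    """Remembers every story key seen so far, across all polls."""

    def __init__(self):
        self._seen = set()

    def new_entries(self, entries):
        """Return only the entries whose text has not been seen before."""
        fresh = []
        for entry in entries:
            key = story_key(entry)
            if key not in self._seen:
                self._seen.add(key)
                fresh.append(entry)
        return fresh
```

Because the set is never pruned by poll, a story that drops out of the
ten-item window and later comes back is still recognized. A long-running
program would need to persist the set and expire old keys eventually.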
More information about the Python-list