Annoying feedparser issues
nagle at animats.com
Sat May 16 03:28:20 CEST 2009
This really isn't the fault of the "feedparser" module, but it's
I have an application which needs to read each new item from a feed
as it shows up, as efficiently as possible, because it's monitoring multiple
feeds. I want exactly one copy of each item as it comes in.
In theory, this is easy. Each time the feed is polled, pass in the
Last-Modified timestamp and ETag from the previous poll, and if nothing
has changed, a 304 status should come back.
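A minimal sketch of that polling scheme, assuming feedparser's
parse(url, etag=..., modified=...) interface. The names poll_once, state,
and the injectable parse argument are made up for illustration; they are
not part of feedparser.

```python
def poll_once(url, state, parse=None):
    """Poll a feed once; return its entries, or [] if the server says 304.

    `state` is a plain dict that carries the validators between polls.
    `parse` defaults to feedparser.parse; it is injectable only so the
    logic can be exercised without a network (an illustrative choice).
    """
    if parse is None:
        import feedparser  # third-party module
        parse = feedparser.parse

    # Send back whatever validators the previous poll returned.
    result = parse(url, etag=state.get("etag"), modified=state.get("modified"))

    # 304 Not Modified: nothing new, keep the old validators.
    if result.get("status") == 304:
        return []

    # Remember the new validators for the next poll (either may be absent).
    state["etag"] = result.get("etag")
    state["modified"] = result.get("modified")
    return list(result.get("entries", []))
```

This only saves work when the server honors the validators; a server that
changes them on every request will never answer 304.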
Results are spotty. It mostly works for Reuters. It doesn't work
for Twitter at all; Twitter updates the timestamp even when nothing changes.
So items are routinely re-read. (That has to be costing Twitter a huge
amount of bandwidth from useless polls.)
Some sites have changing feed etags because they're using multiple
servers and a load balancer. These can be recognized because the same
etags will show up again after a change.
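One way to spot that pattern, as a rough sketch: remember every etag a
feed has returned and flag one that reappears after a different value was
seen. The class name EtagHistory and its return strings are hypothetical,
and a recurring etag only suggests a load balancer; it does not prove one.

```python
class EtagHistory:
    """Tracks the etags seen for one feed (illustrative helper)."""

    def __init__(self):
        self.seen = set()   # every etag observed so far
        self.last = None    # etag from the most recent poll

    def observe(self, etag):
        """Classify an etag as 'unchanged', 'new', or 'recurring'."""
        if etag == self.last:
            return "unchanged"
        # Changed since last poll: has this value been seen before?
        status = "recurring" if etag in self.seen else "new"
        self.seen.add(etag)
        self.last = etag
        return status
```

A feed that keeps reporting "recurring" is probably cycling between a few
servers, so its etag changes say little about whether the content changed.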
Items can supposedly be de-duplicated by using each item's "id" value.
This almost works, but it's trickier than one might think. On some feeds,
an item might go away, yet come back in a later feed. This happens with
news feeds from major news sources, because they have priorities that
don't show up in RSS. High priority stories might push a low priority story
off the feed, but it may come back later. Also, every night at 00:00, some
feeds like Reuters re-number everything. The only thing that works reliably
is comparing the story text.
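A small sketch of comparing by story text, assuming a hash of the text is
an acceptable stand-in for the text itself. The names text_fingerprint and
SeenStories are invented for this example, and collapsing whitespace and
case before hashing is an added assumption, not something the feeds require.

```python
import hashlib


def text_fingerprint(text):
    """Hash the story text after collapsing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenStories:
    """Remembers fingerprints of stories already delivered."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a given story text is offered."""
        fp = text_fingerprint(text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

Because the fingerprint ignores item IDs and feed position, a story that
drops off the feed and returns, or gets renumbered overnight, is still
recognized. A story whose text is edited will be treated as new.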