Annoying feedparser issues

J Kenneth King james at agentultra.com
Tue May 19 10:12:12 EDT 2009


John Nagle <nagle at animats.com> writes:

>     This really isn't the fault of the "feedparser" module, but it's
> worth mentioning.
>
>     I have an application which needs to read each new item from a feed
> as it shows up, as efficiently as possible, because it's monitoring multiple
> feeds.  I want exactly one copy of each item as it comes in.
>
>     In theory, this is easy.  Each time the feed is polled, pass in the
> timestamp and ID from the previous poll, and if nothing has changed,
> a 304 status should come back.
>
>     Results are spotty.  It mostly works for Reuters.  It doesn't work
> for Twitter at all; Twitter updates the timestamp even when nothing changes.
> So items are routinely re-read.  (That has to be costing Twitter a huge
> amount of bandwidth from useless polls.)
>
>     Some sites have changing feed etags because they're using multiple
> servers and a load balancer. These can be recognized because the same
> etags will show up again after a change.
>
> Items can supposedly be deduplicated by using the "etag" value.
> This almost works, but it's trickier than one might think.  On some feeds,
> an item might go away, yet come back in a later feed.  This happens with
> news feeds from major news sources, because they have priorities that
> don't show up in RSS.  High priority stories might push a low priority story
> off the feed, but it may come back later.  Also, every night at 00:00, some
> feeds like Reuters re-number everything.  The only thing that works reliably
> is comparing the story text.
>
> 					John Nagle

I can't really offer much help, but I feel your pain.  I had to write
a similar system for a large company once and it hurt.  They mixed
different feed formats with different protocols, and it was quite a
mess by the end.  The law of fuzzy inputs makes this stuff tough.

It may help to create a hash from the first N bytes of the article
text, then cache all the hashes in a local dbm-style database.  We
used Berkeley DB, but it doesn't really matter.  Any way you can
generate and store a keyed signature will let you do a quick lookup to
see whether you've already processed that article.  Something along
those lines is sketched below.


