[Tutor] feedparser in python

Alan Gauld alan.gauld at yahoo.co.uk
Tue Apr 30 03:47:42 EDT 2019


On 30/04/2019 00:23, nathan tech wrote:

> The results were as follows:
> 
>      tim( a url): 2.9 seconds
> 
>      tim(the downloaded file): 1.8 seconds
> 
> 
> That tells me that roughly 1.1 seconds is network related, fair enough.

Or about 38% of the total time (1.1 of 2.9 seconds).
Both the network time and the parse time will grow
as the data size grows, so the relationship may be
close to linear. Only more extensive tests would tell.
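
A rough sketch of how such tests could be timed, using only the
standard library. The helper names timed and network_share are my own
invention, and the numbers are simply the two figures quoted above.

```python
import time

def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def network_share(total_seconds, local_seconds):
    """Fraction of the total time not accounted for by local parsing."""
    return (total_seconds - local_seconds) / total_seconds

# Figures quoted above: 2.9 s from the URL, 1.8 s from the downloaded file.
share = network_share(2.9, 1.8)
print(f"network share: {share:.0%}")
```

You could wrap each run as, e.g., timed(feedparser.parse, url) and
repeat it over feeds of different sizes to see how the share changes.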

> entire thing again, they all say use ETAG and Modified, but my feeds 
> never have them.
> 
> I've tried feeds from several sources, and none have them in the http 
> header.

Have you looked at the headers to see what they do have?

> To that end, that is why I mentioned in the previous email about .date, 
> because that seemed the most likely, but even that failed.

Again, you tell us that something failed but don't say
how it failed. Do you mean that the date field did not exist?
Why did you think it would, if you had already inspected
the headers?

Can you share some actual code that you used to check
these fields? And show us the actual headers you are
reading?
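
Something along these lines would do for looking at the headers. It is
only a sketch: it assumes the server answers HEAD requests (not all do),
and the function names and the list of "interesting" headers are my
own choice, not anything feedparser defines.

```python
import urllib.request

# Headers that are usually relevant to change detection (my selection).
CACHE_HEADERS = ("ETag", "Last-Modified", "Date", "Expires", "Cache-Control")

def fetch_headers(url):
    """Issue a HEAD request and return the response headers as (name, value) pairs."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return list(response.headers.items())

def cache_related(headers):
    """Pick out the headers useful for change detection (case-insensitive)."""
    wanted = {name.lower() for name in CACHE_HEADERS}
    return {name: value for name, value in headers if name.lower() in wanted}

# Example with a made-up header list; real output depends on the server.
sample = [("Content-Type", "application/rss+xml"),
          ("last-modified", "Tue, 30 Apr 2019 07:00:00 GMT")]
print(cache_related(sample))
```

Printing the full result of fetch_headers(url) for one or two of your
feeds and posting it would tell us a lot.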

> 1. Download a feed to the computer.
> 
> 2. Occasionally, check the website to see if the downloaded feed is out 
> of date; if it is, redownload it.

Seems a good plan. You just need to identify when changes occur.
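
One possible way to spot changes, sketched with the standard library
only. It sends a conditional GET when you have a saved ETag or
Last-Modified value, and otherwise falls back to comparing a hash of
the body. All the function names are hypothetical. Note that the hash
fallback still downloads the whole feed, so it saves parsing time but
not bandwidth.

```python
import hashlib
import urllib.error
import urllib.request

def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from saved validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def body_digest(body):
    """Fingerprint of the feed body, for servers that send no validators."""
    return hashlib.sha256(body).hexdigest()

def fetch_if_changed(url, etag=None, last_modified=None, old_digest=None):
    """Return the new body as bytes, or None if the feed looks unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the saved copy is still current
            return None
        raise
    if old_digest is not None and body_digest(body) == old_digest:
        return None
    return body
```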

Even better would be if the sites provided a web API to access
the data programmatically, but of course few sites do that...


> I did think about using threading for this, for example:

> user sees downloaded feed data only, in the background, the program 
> checks for updates on each feed, and the user may see them gradually 
> start to update.
> 
> This would work, in that execution would not fail at any time, but it 
> seems... clunky, to me I suppose? And rather data heavy for the end 
> user, especially if, as you suggest, a feed is 10 MB in size.

It is only data heavy if you download everything. If you only fetch the
headers and you only have relatively few feeds, it's a good scheme.

As an alternative, is there anything in the feed body that identifies
its creation date? Could you change your parsing mechanism to
parse the data as it arrives and stop if the date/time has not
changed? That would minimise the amount of data downloaded.
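
A minimal sketch of that idea. It assumes the feed has an RSS
<lastBuildDate> or Atom <updated> element near the top, which not every
feed does, and the function name and size limits are made up for
illustration. It reads the stream in small chunks and stops as soon as
a date turns up.

```python
import io
import re

# Assumption: the feed carries an RSS <lastBuildDate> or Atom <updated>
# element near the top. Not every feed does.
DATE_PATTERN = re.compile(
    rb"<(lastBuildDate|updated)>\s*([^<]+?)\s*</\1>")

def read_feed_date(stream, chunk_size=1024, max_bytes=16384):
    """Read stream in chunks; return the first feed date found, else None."""
    buffer = b""
    while len(buffer) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of data without finding a date
        buffer += chunk
        match = DATE_PATTERN.search(buffer)
        if match:
            return match.group(2).decode("utf-8", "replace")
    return None

# Stand-in for a network response; urlopen() results also offer read(n).
sample = io.BytesIO(
    b"<rss><channel><title>t</title>"
    b"<lastBuildDate>Tue, 30 Apr 2019 07:00:00 GMT</lastBuildDate>"
    b"<item>...</item></channel></rss>")
print(read_feed_date(sample, chunk_size=16))
```

If the returned string matches the one you saved last time, close the
connection and keep your local copy. In an Atom feed the first
<updated> is normally the feed-level one, but check that for your feeds.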

> Furthering to that, how many threads is safe?

You have a lot of I/O going on so you could run quite a few threads
without blocking issues. How many feeds do you watch? Logic
would say have one thread per feed.

But how real time does this really need to be? Would it be
terrible if updates were, say, 1 minute late? If not, a
single-threaded solution may be fine (and much simpler).
I'd certainly focus on a single-threaded solution initially. Get it
working first, then think about performance tuning.
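
To illustrate, here is a sketch of both approaches side by side.
check_feed is a hypothetical function of yours that takes a URL and
reports whether that feed changed, and max_workers=8 is an arbitrary
figure, not a recommendation. Because both functions return the same
thing you can start with the first and swap in the second later.

```python
from concurrent.futures import ThreadPoolExecutor

def check_all_sequential(feeds, check_feed):
    """Check each feed in turn; simplest to reason about and to debug."""
    return {url: check_feed(url) for url in feeds}

def check_all_threaded(feeds, check_feed, max_workers=8):
    """Same result, but the I/O waits overlap across worker threads."""
    feeds = list(feeds)  # iterated twice below, so make it a list
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(check_feed, feeds)  # results keep input order
        return dict(zip(feeds, results))
```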


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



