[Tutor] feedparser in python

nathan tech nathan-tech at hotmail.com
Mon Apr 29 19:23:36 EDT 2019


Hi there,

After reading your email, I did some further investigation.

I first did this test:

     import time

     import feedparser

     def tim(url):
         # Return how long one feedparser.parse() call takes, in seconds.
         k = time.time()
         feedparser.parse(url)
         return time.time() - k

The results were as follows:

     tim(a url): 2.9 seconds

     tim(the downloaded file): 1.8 seconds


That tells me that roughly 1.1 seconds is network related, fair enough.
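
One way to measure the two costs directly, rather than by subtraction, is to time the download and the parse as separate steps. This is only a minimal sketch: `timed` is a hypothetical helper name (not part of feedparser), and `time.perf_counter` is used because it is intended for measuring short intervals.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example usage (assumes feedparser is installed and `url` is a feed URL):
#
#     import urllib.request
#     import feedparser
#
#     raw, net_secs = timed(lambda: urllib.request.urlopen(url).read())
#     feed, parse_secs = timed(feedparser.parse, raw)
```

Timing each step several times and averaging should give steadier numbers than a single run.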

I admit, I've not tried etree, as I was specifically working with RSS 
feeds, but will aim to do so soon.

My specific problem here is that everywhere I look for how to check 
whether an RSS feed has been updated, without downloading the entire 
thing again, the advice is to use ETag and Modified, but my feeds 
never have them.

I've tried feeds from several sources, and none have them in the HTTP 
headers.

That is why I mentioned .date in my previous email: it seemed the most 
likely candidate, but even that failed.
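
For reference, the ETag / Modified approach is an HTTP conditional GET: the client sends back the validators it saved from the last response, and a server that supports them answers 304 (Not Modified) with no body. The Date header is not such a validator, since it only records when the response was generated, which may be why it did not help. The sketch below uses only the standard library; the function names are hypothetical, and it can only save bandwidth when the server actually sends an ETag or Last-Modified header. feedparser's own `parse()` also accepts `etag` and `modified` keyword arguments for the same purpose.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from saved validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on HTTP 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the saved copy is still current
            return None, etag, last_modified
        raise
```

If a server sends neither validator, it will simply return the full feed every time, whatever headers the client sends.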

My goal is thus:

1. Download a feed to the computer.

2. Occasionally check the website to see whether the downloaded feed is 
out of date; if it is, redownload it.

Ultimately, I want to complete step 2 without downloading the *entire* 
feed again, though.
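
When a server offers no ETag or Last-Modified, the download itself probably cannot be avoided, but the parse (about 1.8 of the 2.9 seconds measured above) can be skipped whenever the bytes are unchanged. The following is a rough sketch of that idea, assuming the raw feed has already been downloaded as bytes; `feed_changed` and the digest file are hypothetical names, and a feed that embeds a fresh timestamp in every response would always look changed.

```python
import hashlib
from pathlib import Path


def feed_changed(raw_bytes, digest_file):
    """Return True if raw_bytes differs from the last copy seen.

    The SHA-256 digest of the newest copy is stored in digest_file.
    """
    digest_path = Path(digest_file)
    new_digest = hashlib.sha256(raw_bytes).hexdigest()
    old_digest = digest_path.read_text() if digest_path.exists() else None
    if new_digest == old_digest:
        return False  # identical bytes: no need to parse again
    digest_path.write_text(new_digest)
    return True
```

Only when `feed_changed` returns True would the program need to hand the bytes to the parser and refresh what the user sees.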


I did think about using threading for this, for example:

program loads,

user sees downloaded feed data only, in the background, the program 
checks for updates on each feed, and the user may see them gradually 
start to update.

This would work, in that execution would not fail at any time, but it 
seems... clunky to me, I suppose? And rather data heavy for the end 
user, especially if, as you suggest, a feed is 10 MB in size.


Further to that, how many threads are safe?

Should I have my main thread, plus 4 feeds updating at once? 5? 20000?
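
A common way to keep the thread count bounded is a fixed-size pool, so the number of simultaneous downloads stays the same however many feeds there are. The sketch below is one possible shape, not a recommendation of a specific number: `check_feeds` and `check_one` are hypothetical names, and the default of 5 workers is an arbitrary assumption. Because feed checks spend most of their time waiting on the network, a handful of workers is often enough, while thousands of threads would mostly add overhead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_feeds(urls, check_one, max_workers=5):
    """Run check_one(url) for each URL on a small pool of worker threads.

    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad feed should not stop the rest
                results[url] = exc
    return results
```

Here `check_one` would be whatever function downloads and checks a single feed; the main thread stays free to show the already downloaded data.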

Any help is appreciated.

Thanks

Nate

On 29/04/2019 08:43, Alan Gauld via Tutor wrote:
> On 29/04/2019 01:26, nathan tech wrote:
>
>> Most recently, I have started work using feedparser.
> I've never heard of it let alone used it so there may
> be another forum where you can get specific answers.
> But let me ask...
>
>> I noticed, almost straight away, it's a  bit slow.
> How do you measure slow? What speed did you expect?
> What other xml parsers have you tried? etree for example?
> How much faster was it compared to feedparser?
>
>> For instance:
>>
>>       url="http://feeds.bbci.co.uk/news/rss.xml"
>>       f1=feedparser.parse(url)
> So it looks like the parser is doing more than just
> parsing; it is also fetching the data over the net.
> How long does that take? Could it be a slow connection
> or server?
>
> Can you try parsing a feed stored on the local
> machine to eliminate that portion of the work?
> Is it much faster? If so, it's the network causing the issue.
>
>> On some feeds, this can take a few seconds, on the talk python to me
>> feed, it takes almost 10!
> How big is the feed? If it's many megabytes then 10s might
> not be too bad.
>
>> This, obviously, is not ideal when running a program which checks for
>> updates every once in a while. Talk about slooooow!
> When I talk about "sloooooow" I'm thinking about
> something that takes a long time relative to how long
> it would take me manually. If downloading and parsing
> these feeds by hand would take you 5 minutes per feed
> then 10s is quite fast...
>
> But if parsing by hand takes 30s then 10s would indeed
> be sloooow.
>
>> Similarly, this doesn't seem to work:
>>
>>       f2=feedparser.parse(url, f.headers["date"])
> define "doesn't work"?
> Does the PC crash? Does it not fetch the data?
> Does it fail to find "date"?
> Do you get an error message - if so what?
>
>> What am I doing wrong?
> No idea, you haven't given us enough information.
>

