Fastest way to retrieve and write html contents to file
DFS
nospam at dfs.com
Mon May 2 03:37:09 EDT 2016
On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>>     r = urllib2.urlopen(webpage)
>>     f = open(webfile,"w")
>>     f.write(r.read())
>>     f.close()
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" % (endTime-startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.
You get 1.8 seconds total for the 10 loops? That's less than half as
fast as my results. Surprising.
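(For anyone reproducing this: a self-contained sketch of what a harness like timeGetHTML.py presumably does, in the thread's Python 2 idiom. The URL, filename, and loop count below are stand-ins, not the originals.)

import time
import urllib2
import requests  # third-party: pip install requests

webpage = "http://www.usdirectory.com/"  # stand-in URL
webfile = "page.html"                    # stand-in output file
loops = 10

def fetch_urllib2(url):
    return urllib2.urlopen(url).read()

def fetch_requests(url):
    return requests.get(url).content

for name, fetch in [("urllib2", fetch_urllib2), ("requests", fetch_requests)]:
    startTime = time.clock()
    for i in range(loops):
        data = fetch(webpage)
        # 'with' guarantees the file is closed on each pass; note that a
        # bare f.close without parentheses, as in the quoted snippet, is a no-op
        with open(webfile, "w") as f:
            f.write(data)
    endTime = time.clock()
    print "Finished %s in %.2g seconds" % (name, endTime - startTime)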
> I'm again going back to the point of: its fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.
Speed is always meaningful.
I know Python is relatively slow, but it's a cool, concise, powerful
language. I'm extremely impressed by how tight the code can get.
> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.
True. And that has been my assumption - though not with a 10MB file.
> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.
Good point. Test below.
> If you believe otherwise, demonstrate it.
http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2
It's a 58854 byte file when saved to disk (the smaller file was 3546
bytes), so this one is 16.6x larger. If runtime scaled linearly with
size, I would expect Python to take 16.6 * 0.88 = 14.6 seconds (0.88
seconds being my earlier result for the small file).
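(As a sanity check on Stephen's constant-overhead point, here's a toy model. The overhead and throughput numbers are invented for illustration, not measured:

# Toy model: total time = fixed per-request overhead + size / throughput.
overhead = 0.4       # seconds per request: DNS, TCP handshake, server latency (made up)
throughput = 500e3   # bytes per second (made up)
for size in (3546, 58854):
    t = overhead + size / throughput
    print "%6d bytes -> %.2f s (overhead = %.0f%% of total)" % (
        size, t, 100 * overhead / t)

Under a model like that, the 16.6x larger file takes nowhere near 16.6x as long, which is the shape of the results below.)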
10 loops per run
1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds
wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds
The measured times are a little more than 1/3 of my linear estimate -
so good news.
(When I was doing these tests, some of the Python results were 0.75
seconds - way too fast - so I checked: no data had been written to the
file, and I couldn't even open the webpage in a browser. It looks like
I had been temporarily blocked from the site. After a couple of
minutes, I was able to access it again.)
I noticed urllib and pycurl returned the HTML as-is, but urllib2 and
requests added enhancements that should make the data easier to parse.
Based on speed, functionality, and documentation, I believe I'll be
using the requests HTTP library (I will actually be doing a small
amount of web scraping).
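A minimal sketch of the requests fetch I have in mind (the URL and filename are placeholders; the raise_for_status() call would also have caught the blocked-by-the-site case above):

import requests

url = "http://www.usdirectory.com/"   # placeholder
r = requests.get(url, timeout=10)
r.raise_for_status()                  # raises on 4xx/5xx instead of silently saving nothing
with open("page.html", "wb") as f:
    f.write(r.content)                # raw bytes, exactly as served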
VBScript
1st run: 7.70 seconds
2nd run: 5.38 seconds
3rd run: 7.71 seconds
So Python matches or beats VBScript on this much larger file. Kewl.