website catcher

Mike Meyer mwm at
Sun Jul 3 22:24:22 CEST 2005

"jwaixs" <jwaixs at> writes:

> If I should put the parsedwebsites in, for example, a tablehash it will
> be at least 5 times faster than just putting it in a file that needs to
> be stored on a slow harddrive. Memory is a lot faster than harddisk
> space. And if there would be a lot of people asking for a page all of
> them have to open that file. if that are 10 requests in 5 minutes
> there's no real worry. If they are more that 10 request per second you
> really have a big problem and the framework would probably crash or
> will run uber slow. That's why I want to open the file only one time
> and keep it saved in the memory of the server where it don't need to be
> opened each time some is asking for it.

While Diez gave you some good reasons not to worry about this, and had
some great advice, he missed one important reason you shouldn't worry
about this:

Your OS almost certainly has a disk cache.

This means that if you get 10 requests for a page in a second, the
first one will come off the disk and wind up in the OS disk cache. The
next nine requests will get the pages from the OS disk cache, and not
go to the disk at all.
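
A rough way to see this for yourself is to time repeated reads of the
same file. This is only a minimal sketch: the helper name timed_reads,
the temporary file, and the page contents are made up for illustration,
and the numbers depend entirely on your OS and hardware. Because the
file has just been written, even the first read will probably come from
the cache here; a truly cold read needs a file the OS has not touched
recently.

```python
import os
import tempfile
import time


def timed_reads(path, count=10):
    """Read `path` `count` times; return (last contents, elapsed seconds per read)."""
    timings = []
    data = b""
    for _ in range(count):
        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()
        timings.append(time.perf_counter() - start)
    return data, timings


def demo():
    # Hypothetical stand-in for a parsed page stored on disk.
    page = b"<html><body>parsed page</body></html>\n" * 1000
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(page)
        data, timings = timed_reads(path)
        assert data == page
        for i, elapsed in enumerate(timings, 1):
            print("read %2d: %.6f s" % (i, elapsed))
    finally:
        os.remove(path)


if __name__ == "__main__":
    demo()
```

If the later reads are no slower than you can tolerate, the OS cache is
already doing the job the in-memory table was meant to do.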

When you keep these pages in memory yourself, you're basically
declaring that they are so important that you don't trust the OS to
cache them properly. Exactly how your extra memory use interacts with
the disk cache varies with the OS, but there's a fair chance that
you're cutting down on the amount of disk cache the system will have
available.

In the end, if the OS disagrees with you about how important your
pages are, it will win. The memory holding your pages will get swapped
out to disk, and will have to be read back from disk even though you
have them stored in memory, with the extra overhead of a page fault
when your process tries to access the swapped-out memory.

A bunch of very smart people have spent a lot of time making modern
operating systems perform well. Worrying about things that the OS is
already worrying about is generally a waste of time - a clear case of
premature optimization.
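
If you still suspect the disk is the bottleneck, measure before
building anything. The sketch below compares the per-call cost of
re-reading a file with the cost of a lookup in an in-memory dict, using
the standard timeit module. The names read_from_disk, MemoryCache, and
compare are hypothetical, and nothing here is taken from the framework
under discussion.

```python
import timeit


def read_from_disk(path):
    """The simple approach: open and read the file on every request."""
    with open(path, "rb") as f:
        return f.read()


class MemoryCache:
    """The proposed alternative: keep each page in a dict after the first read."""

    def __init__(self):
        self._pages = {}

    def get(self, path):
        if path not in self._pages:
            self._pages[path] = read_from_disk(path)
        return self._pages[path]


def compare(path, number=1000):
    """Return (seconds per disk read, seconds per cache lookup) for `path`."""
    cache = MemoryCache()
    disk = timeit.timeit(lambda: read_from_disk(path), number=number)
    mem = timeit.timeit(lambda: cache.get(path), number=number)
    return disk / number, mem / number
```

Call compare() with the path of one of your real pages. The dict lookup
will almost always be faster per call; the question to ask is whether
the difference is large compared to everything else a request has to
do.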

Mike Meyer <mwm at>
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

More information about the Python-list mailing list