[Tutor] MemoryError !!! Help Required

Mon Apr 7 18:40:51 CEST 2008

Hint: a MemoryError suggests that his dicts have (probably) filled up the
process's address space. That is 1-3 GB on Linux and probably around 2 GB on
Windows, at least for 32-bit versions.

So storing every whole URL in memory is probably out of the question, and
storing them only in some form of files might be rather slow. One compromise
would be to keep sets of hashes for the URLs that have been seen. That way
unseen URLs can be recognized without a visit to the disk, while URLs whose
hash value is in the seen set can be queued at the end, so that all URLs for
one hostname are checked together when that host's file is read.
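
A rough sketch of the hash-set idea, in case it helps. The names
(SeenHashes, classify, recheck_later) and the choice of a truncated SHA-1
digest are my own assumptions for illustration, not anything from the
original crawler:

```python
import hashlib


class SeenHashes:
    """Remember URLs by a short fixed-size digest instead of the full string."""

    def __init__(self, digest_bytes=8):
        # digest_bytes is a hypothetical tuning knob: fewer bytes means
        # smaller keys but more false "seen" matches.
        self.digest_bytes = digest_bytes
        self._hashes = set()

    def _key(self, url):
        return hashlib.sha1(url.encode("utf-8")).digest()[:self.digest_bytes]

    def add(self, url):
        self._hashes.add(self._key(url))

    def maybe_seen(self, url):
        # False means definitely unseen; True may also be a hash collision.
        return self._key(url) in self._hashes


seen = SeenHashes()
recheck_later = []  # URLs whose hash matched; verify these against the files


def classify(url):
    """Return "new" for a certainly unseen URL, else queue it for a recheck."""
    if seen.maybe_seen(url):
        recheck_later.append(url)
        return "recheck"
    seen.add(url)
    return "new"
```

A URL classified as "new" never needs a disk lookup. The ones collected in
recheck_later can be grouped by hostname and compared against the on-disk
records in one pass.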

Andreas

On Monday, 07.04.2008, at 09:36 -0500, W W wrote:
> I don't have a lot of experience, but I would suggest dictionaries
> (which use hash values).
> 
> A possible scenario would be something similar to Andreas':
> 
> visited = dict()
> 
> url = "http://www.monty.com"
> file = "/spam/holyhandgrenade/three.html"
> 
> visited[url] = file
> 
> unvisited = dict()
> 
> url = "http://www.bringoutyourdead.org"
> file = "/fleshwound.html"
> 
> unvisited[url] = file
> 
> url = "http://129.29.3.59"
> file = "foo.php"
> 
> unvisited[url] = file
> 
> (of course, functions, loops, etc. would clear up some repetitions)
> 
> Now that I think about it... It would probably work better to keep the
> visited urls in a dict (assuming that list is smaller) and the
> unvisited ones in a FIFO file, though I'm not 100% sure on that.
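
One possible way to sketch that split: a set in memory plus a plain-text
FIFO file for the URLs still to fetch. FileQueue, enqueue_if_new and the
file name unvisited.txt are hypothetical names for this example. The set is
called "known" because it has to cover queued URLs as well as visited ones,
otherwise the same link would be queued twice:

```python
import os
import tempfile


class FileQueue:
    """Append-only FIFO of URLs kept on disk, one URL per line."""

    def __init__(self, path):
        self.path = path
        self._read_pos = 0           # byte offset of the next unread line
        open(path, "ab").close()     # make sure the file exists

    def put(self, url):
        with open(self.path, "ab") as f:
            f.write(url.encode("utf-8") + b"\n")

    def get(self):
        """Return the oldest unread URL, or None if the queue is empty."""
        with open(self.path, "rb") as f:
            f.seek(self._read_pos)
            line = f.readline()
            self._read_pos = f.tell()
        if not line:
            return None
        return line.decode("utf-8").rstrip("\n")


def enqueue_if_new(url, known, queue):
    """Queue a URL unless it has been visited or queued before."""
    if url in known:
        return False
    known.add(url)
    queue.put(url)
    return True


# Hypothetical location for the queue file.
queue = FileQueue(os.path.join(tempfile.mkdtemp(), "unvisited.txt"))
known = set()
for link in ["http://www.monty.com",
             "http://www.bringoutyourdead.org",
             "http://www.monty.com"]:
    enqueue_if_new(link, known, queue)
first = queue.get()
```

Only the read offset and the set stay in memory; the to-be-downloaded list
itself lives on disk, however long it gets.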
> 
> If you were simply unconcerned with speed, you could easily keep both
> lists stored as csv files, and load each to compare against each URL,
> 
> for url in newurl:
>     try:
>         visited[url]
>     except KeyError:
>         # This means the URL hasn't been visited
>         pass
> 
> That's probably the easiest way to check whether a URL is already a key
> in the dict. If you do go the route of reading each file, it might be a
> good idea to create a directory for each first character of the URL
> (after http://), plus a separate one for http://www., since those are
> the most common (and yes, some sites like www.uca.edu don't allow
> http://uca.edu).
> 
> Good luck!
> -Wayne
> 
> On 4/7/08, Andreas Kostyrka <andreas at kostyrka.org> wrote:
> >
> >  On Monday, 07.04.2008, at 00:32 -0500, Luke Paireepinart wrote:
> >
> > > devj wrote:
> >  > > Hi,
> >  > > I am making a web crawler using Python. To avoid duplicate URLs, I have to
> >  > > maintain lists of downloaded URLs and to-be-downloaded URLs, of which the
> >  > > latter grows exponentially, resulting in a MemoryError exception. What are
> >  > > the possible ways to avoid this?
> >  > >
> >  > get more RAM, store the list on your hard drive, etc. etc.
> >  > Why are you trying to do this?  Are you sure you can't use existing
> >  > tools for this such as wget?
> >  > -Luke
> >
> >
> > Also traditional solutions involve e.g. remembering a hash value.
> >
> >  Plus if you go for a simple file based solution, you probably should
> >  store it by hostname, e.g.:
> >  http://123.45.67.87/abc/def/text.html => open("123/45/67/87",
> >  "w").write("/abc/def/text.html")
> >  (guess you need to run os.makedirs as needed :-P)
> >
> >  This makes it scalable (by not storing too many files in one
> >  directory, and by leaving out the common element so the files are
> >  smaller and faster to read), while keeping the code relatively simple.
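
A minimal sketch of the per-hostname layout, under a few assumptions of my
own: the helper names (record_url, url_on_disk) are made up, and the paths
go into a file called paths.txt inside the host's directory rather than
into a file named after the last host component. The latter would clash
when one hostname is a prefix of another (example.com versus
example.com.au):

```python
import os
import tempfile
from urllib.parse import urlsplit


def _host_file(base_dir, url):
    """Map a URL to (file for its host, path part to store in that file)."""
    parts = urlsplit(url)
    host = parts.hostname or "_nohost"       # hypothetical fallback name
    host_dir = os.path.join(base_dir, *host.split("."))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return os.path.join(host_dir, "paths.txt"), path


def record_url(base_dir, url):
    """Append the URL's path to the file kept for its hostname."""
    filename, path = _host_file(base_dir, url)
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "a", encoding="utf-8") as f:
        f.write(path + "\n")


def url_on_disk(base_dir, url):
    """Check the host's file for the URL's path (linear scan of one file)."""
    filename, path = _host_file(base_dir, url)
    if not os.path.exists(filename):
        return False
    with open(filename, encoding="utf-8") as f:
        return any(line.rstrip("\n") == path for line in f)


base = tempfile.mkdtemp()                    # hypothetical storage root
record_url(base, "http://123.45.67.87/abc/def/text.html")
found = url_on_disk(base, "http://123.45.67.87/abc/def/text.html")
missing = url_on_disk(base, "http://123.45.67.87/other.html")
```

Since only the path is written, each file stays small, and a lookup never
reads more than the one file belonging to that host.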
> >
> >  Another solution would be shelve, but you have to keep in mind that if
> >  you are unlucky you might lose the database. (Some of the DBs that
> >  anydbm might pick do not survive power loss or other problems too well.)
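
For what it's worth, a small shelve sketch could look roughly like this.
The database location is a throwaway temporary directory chosen for the
example, and the caveat above about losing the database on a crash still
applies:

```python
import os
import shelve
import tempfile

# Hypothetical location; a real crawler would use a fixed path.
db_path = os.path.join(tempfile.mkdtemp(), "visited")

# A shelf behaves like a dict whose contents live on disk.
with shelve.open(db_path) as visited:
    visited["http://www.monty.com"] = "/spam/holyhandgrenade/three.html"
    visited.sync()   # ask shelve to flush pending writes to disk

# Reopening the shelf later shows the entry is still there.
with shelve.open(db_path) as visited:
    already_seen = "http://www.monty.com" in visited
    never_seen = "http://www.bringoutyourdead.org" in visited
```

Keys must be strings, which suits URLs, and membership tests go to the
database file instead of holding every URL in RAM.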
> >
> >
> >  Andreas
> >
> >
> >  > _______________________________________________
> >  > Tutor maillist  -  Tutor at python.org
> >  > http://mail.python.org/mailman/listinfo/tutor