[Tutor] MemoryError !!! Help Required

Mon Apr 7 16:36:40 CEST 2008

I don't have a lot of experience, but I would suggest dictionaries
(which use hash values).

A possible scenario would be something similar to Andreas's suggestion:

visited = dict()

url = "http://www.monty.com"
file = "/spam/holyhandgrenade/three.html"

visited[url] = file

unvisited = dict()

url = "http://www.bringoutyourdead.org"
file = "/fleshwound.html"

unvisited[url] = file

url = "http://129.29.3.59"
file = "foo.php"

unvisited[url] = file

(of course, functions, loops, etc. would clear up some repetitions)

Now that I think about it... It would probably work better to keep the
visited urls in a dict (assuming that list is smaller) and the
unvisited ones in a FIFO file, though I'm not 100% sure on that.
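
Here is a minimal sketch of that dict-plus-FIFO-file idea, not a tested
crawler. The FileQueue class, the crawl_order function and the
one-URL-per-line file layout are all made up for illustration. The queue
only ever appends to the file and remembers how far it has read, so the
unvisited URLs never have to sit in memory at the same time.

```python
class FileQueue:
    """FIFO queue of URLs kept on disk, one URL per line (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.offset = 0               # byte position of the next unread line
        open(path, "ab").close()      # make sure the file exists

    def push(self, url):
        # Append to the end of the file; nothing is held in memory.
        with open(self.path, "ab") as f:
            f.write((url + "\n").encode("utf-8"))

    def pop(self):
        """Return the oldest unread URL, or None if the queue is empty."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            line = f.readline()
            if not line:
                return None
            self.offset = f.tell()
        return line.decode("utf-8").rstrip("\n")


def crawl_order(queue, visited):
    """Yield each queued URL once, skipping any that are already in visited."""
    while True:
        url = queue.pop()
        if url is None:
            break
        if url in visited:
            continue
        # Placeholder value; a real crawler would store the saved file name.
        visited[url] = None
        yield url
```

The file grows forever in this version; a real crawler would probably
want to truncate or rotate it once everything in it has been read.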

If you aren't concerned with speed, you could simply keep both lists
stored as csv files, and load each one to compare against each new URL:

for url in newurl:
    try:
        visited[url]
    except KeyError:
        # This means the URL hasn't been visited
        pass

That's probably the easiest way to check whether a URL is already a key
in the dict. If you go the file-based route (reading each file), it
might be a good idea to create a dir for each first character of the
URL after http://, plus a separate one for http://www., since those are
the most common (and yes, some sites like www.uca.edu don't allow
http://uca.edu).
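
The directory-per-first-character idea could look roughly like this. It
is only a sketch under my own assumptions: bucket_for, was_visited,
mark_visited and the visited.txt file name are hypothetical, and I am
reading "a separate one for http://www." as a www/ directory with its
own per-character subdirectories.

```python
import os


def bucket_for(url):
    """Pick a directory name from the first character after the prefix."""
    if url.startswith("http://www."):
        rest = url[len("http://www."):]
        # Assumption: www. hosts get their own tree, e.g. www/u for uca.edu
        return os.path.join("www", rest[:1].lower() or "_")
    if url.startswith("http://"):
        rest = url[len("http://"):]
    else:
        rest = url
    return rest[:1].lower() or "_"


def was_visited(root, url):
    """Scan only the one small file that could contain this URL."""
    path = os.path.join(root, bucket_for(url), "visited.txt")
    if not os.path.exists(path):
        return False
    with open(path, encoding="utf-8") as f:
        return any(line.rstrip("\n") == url for line in f)


def mark_visited(root, url):
    """Append the URL to its bucket file, creating the dir if needed."""
    directory = os.path.join(root, bucket_for(url))
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "visited.txt"), "a", encoding="utf-8") as f:
        f.write(url + "\n")
```

Each lookup then reads one bucket file instead of the whole list, which
is slower than a dict but keeps memory use flat.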

Good luck!
-Wayne

On 4/7/08, Andreas Kostyrka <andreas at kostyrka.org> wrote:
>
>  On Monday, 07.04.2008, at 00:32 -0500, Luke Paireepinart wrote:
>
> > devj wrote:
>  > > Hi,
>  > > I am making a web crawler using Python. To avoid duplicate URLs, I have to
>  > > maintain lists of downloaded URLs and to-be-downloaded URLs, of which the
>  > > latter grows exponentially, resulting in a MemoryError exception. What are
>  > > the possible ways to avoid this?
>  > >
>  > get more RAM, store the list on your hard drive, etc. etc.
>  > Why are you trying to do this?  Are you sure you can't use existing
>  > tools for this such as wget?
>  > -Luke
>
>
> Also traditional solutions involve e.g. remembering a hash value.
>
>  Plus if you go for a simple file based solution, you probably should
>  store it by hostname, e.g.:
>  http://123.45.67.87/abc/def/text.html => file("123/45/67/87",
>  "w").write("/abc/def/text.html")
>  (guess you need to run os.makedirs as needed :-P)
>
>  This makes it scalable (by not storing too many files in one
>  directory, and by leaving out the common element so the files are
>  smaller and faster to read), while keeping the code relatively simple.
>
>  Another solution would be shelve, but you have to keep in mind that if
>  you are unlucky you might lose the database. (Some of the DBs that
>  anydbm might use may not survive power loss, or other problems, too well.)
>
>
>  Andreas
>
>
>  > _______________________________________________
>  > Tutor maillist  -  Tutor at python.org
>  > http://mail.python.org/mailman/listinfo/tutor

-- 
To be considered stupid and to be told so is more painful than being
called gluttonous, mendacious, violent, lascivious, lazy, cowardly:
every weakness, every vice, has found its defenders, its rhetoric, its
ennoblement and exaltation, but stupidity hasn't. - Primo Levi