Too big of a list? and other problems

Mon May 22 21:05:40 EDT 2006

"Brian" <bnblazer at gmail.com> wrote in message
news:1148343552.037687.214420 at j55g2000cwa.googlegroups.com...
> First off, I am sorry for cluttering this group with my inept
> questions, but I am stuck again despite a few hours of hair pulling.
>

Don't apologize for getting stuck, especially after you have made an honest
effort at solving your own problems.

> I have a function (below) that takes a list of html pages that have
> images on them (not porn but boats).  This function then (supposedly)
> goes through and extracts the links to those images and puts them into
> a list, appending with each iteration of the for loop.  The list of
> html pages is 82 items long and each page has multiple image links.
> When the function gets to item 77 or so, the list gets all funky.
> Sometimes it goes empty, and others it is a much more abbreviated list
> than I expect - it should have roughly 750 image links.
>
> When I looked at it while running, it appears as if my regex is
> actually appending a tuple (I think) of the results it finds to the
> list.  My best guess is that the list is getting too big and croaks.

750 elements is really pretty modest in the universe of Python lists.  This
should not be an issue.

> Since one of the objects of the function is also to be able to count
> the items in the list, I am getting some strange errors there as well.
>
> Here is the code:
>
> def countPics(linkList):
>     foundPics = []
>     count = 0
>     for link in linkList:
>         picPage = urllib.urlopen(
>             "http://continuouswave.com/whaler/cetacea/" + link)
>         count = count +1
>         print 'got page', count
>         html = picPage.read()
>         picPage.close()
>         pics = re.compile(r"images/.*\.jpeg")
>         foundPics.append(pics.findall(html))
>         #print len(foundPics)
>     print "found", len(foundPics), "pictures"
>     print foundPics
>
> Again, I sincerely appreciate the answers, time and patience this group
> is giving me.
>
> Thank you for any help you can provide in showing me where I am going
> wrong.
> Brian
>

I'm not overly familiar with the workings of re.findall so I ran these
statements on the Python command line:

>>> r = re.compile("A.B")
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDFA_B")
['AJB', 'AUB', 'A_B']
>>> print list(r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDFA_B"))
['AJB', 'AUB', 'A_B']
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDF")
['AJB', 'AUB']
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSSLJF:SDF")
['AJB']
>>> print type(r.findall("SLDKJFOIWUEAJBLJEQUSSLJF:SDF"))
<type 'list'>

Everything looks just as one would expect: findall returns an ordinary list
of the matched strings, not a tuple.
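
One consequence worth checking in countPics, sketched here in current
Python 3 syntax (print is a function): because findall returns a list,
append adds that whole list as a single element, so len(foundPics) would count
pages rather than pictures, while extend adds the individual matches. The
page strings and the simplified pattern below are made up for illustration.

```python
import re

# Simplified pattern for this demo only (an assumption, not the original regexp).
pics = re.compile(r"images/\w+\.jpeg")

# Two made-up page bodies standing in for the downloaded HTML.
pages = [
    '<img src="images/a.jpeg"> <img src="images/b.jpeg">',
    '<img src="images/c.jpeg">',
]

appended = []
extended = []
for html in pages:
    appended.append(pics.findall(html))  # adds one list per page
    extended.extend(pics.findall(html))  # adds each match individually

print(len(appended), appended)  # one entry per page
print(len(extended), extended)  # one entry per picture
```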

A minor nit is that you *don't* have to compile your pics regexp in the body
of the loop.  Move the

        pics = re.compile(r"images/.*\.jpeg")

statement to before the start of the for loop. You can safely reuse the
compiled pattern on each successive web page without recompiling it. That is
the purpose of compiling regexps in the first place; otherwise, you could
just call re.findall(r"images/.*\.jpeg", html) directly. But this should not
account for the odd behavior you describe.
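
A minimal sketch of that rearrangement, in current Python 3 syntax. The
names find_pics and sample_pages are hypothetical, and the pages are made-up
strings instead of fetched HTML so the snippet runs without network access.

```python
import re

# Compiled once, outside the loop, and reused for every page.
PICS = re.compile(r"images/.*\.jpeg")

def find_pics(pages):
    """Return every match of PICS across an iterable of HTML strings."""
    found = []
    for html in pages:
        # No re.compile call here; the loop only runs the search.
        found.extend(PICS.findall(html))
    return found

# Made-up page bodies with one image reference per line.
sample_pages = [
    "images/bow.jpeg\nimages/stern.jpeg\n",
    "images/helm.jpeg\n",
]

result = find_pics(sample_pages)
print("found", len(result), "pictures")
```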

How is countPics being called?  Are you accidentally calling it multiple
times?  This would explain why the list of found pics goes back to zero
(since you reset it at the start of the function).
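
To illustrate the reset, here is a small sketch in current Python 3 syntax.
The function collect and its inputs are hypothetical stand-ins, not your
actual code.

```python
def collect(items):
    found = []            # a fresh, empty list on every call
    found.extend(items)
    return found

first = collect(["a.jpeg", "b.jpeg"])
second = collect([])      # a second call does not see the earlier results

print(len(first), len(second))
```

If countPics were called once per batch of links, each call would start over
from an empty list in the same way.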

-- Paul