python tags on websites timeout problem

Lee Harr missive at
Mon Jul 21 00:36:41 CEST 2003

In article <cdac0350.0307191527.755df3e1 at>, jeff wrote:
> Hiya
> im trying to pull tags off a website using python ive got a few things
> running that have the potential to work its just i cant get them to
> becuase  of certain errors?
> basically i dont what to download the images and all the stuff just
> the html and then work from there, i think its timing out because its
> trying to downlaod the images as well which i dont what to do as this
> would decrease the speed of what im trying to achieve, the URL used is
> only that for an example

A web page is made up of many separate components. When you
"download a webpage" you generally are fetching the HTML code,
and you will not get any images unless you specifically
download those by their own URLs.
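
To make that concrete, here is a minimal sketch in modern Python 3 (standard library only) that lists the image URLs an HTML page merely refers to. The sample markup and the ImageLister class are made up for illustration. Each src it finds would need its own separate request before any image data was transferred.

```python
from html.parser import HTMLParser


class ImageLister(HTMLParser):
    """Collect the src attribute of every <img> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Made-up page content; fetching a page returns only markup like this.
SAMPLE_HTML = (
    '<html><body><a href="/a">link</a>'
    '<img src="/pics/one.png"><img src="/pics/two.png">'
    "</body></html>"
)

lister = ImageLister()
lister.feed(SAMPLE_HTML)
print(lister.sources)
```

The images show up only as URLs inside the HTML text, so fetching the page alone does not pull them down.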

> this is my source
> --------------------------------------------------------------------------------
> #!/usr/bin/env python
> import re
> import urllib
> file = urllib.urlretrieve(""
> , "temp1.tmp")

Two things:

Don't use "file" as the name of your variable, as that shadows
the built-in file type, which is now the standard way to access
a file (used instead of open).

Why save the file and then read it back in?

I might do something like...

text = urllib.urlopen('')
for line in text.readlines():
    print line
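
For readers on Python 3, roughly the same idea might look like the sketch below. It assumes urllib.request.urlopen (the Python 3 successor to urllib.urlopen) and UTF-8 text. The helper names decoded_lines and fetch_lines are hypothetical, and the in-memory "response" stands in for a real network fetch so the snippet runs offline.

```python
import io
import urllib.request


def decoded_lines(response, encoding="utf-8"):
    """Yield each line of a binary file-like object as text, without line endings."""
    for raw in response:
        yield raw.decode(encoding, errors="replace").rstrip("\r\n")


def fetch_lines(url, timeout=10):
    """Fetch url and return its lines; no temporary file is written."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return list(decoded_lines(response))


# Offline stand-in for a response object (made-up content).
fake_response = io.BytesIO(b"<html>\n<a href='/x'>x</a>\n</html>\n")
for line in decoded_lines(fake_response):
    print(line)
```

Passing a timeout to urlopen also keeps a dead connection from hanging the script indefinitely.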

> # searching the file content line by line:
> keyword = re.compile(r"</a>")
> for line in text:
>     result = (line)
>     if result:
>        print, ":", line,

There are no parentheses in your regex, so I do not
think you will ever have a group(1).

>>> import re
>>> keyword = re.compile(r"</a>")
>>> x = 'abc </a> def'
>>> z = keyword.search(x)
>>> z.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group

>>> keyword = re.compile(r"(</a>)")
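
As a runnable illustration of the difference (written for Python 3; the sample line is made up), group(0) is always the whole match, while group(1) exists only when the pattern contains a parenthesized group:

```python
import re

line = 'abc <a href="/x">text</a> def'  # made-up sample input

no_group = re.compile(r"</a>")
with_group = re.compile(r"(</a>)")

m = no_group.search(line)
print(m.group(0))          # the whole match is always available
try:
    m.group(1)             # no parentheses in the pattern, so no group 1
except IndexError as exc:
    print("IndexError:", exc)

m2 = with_group.search(line)
print(m2.group(1))         # the parenthesized part of the match
```

If only the whole match is wanted, group(0) (or simply group()) works without adding parentheses to the pattern.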

> --------------------------------------------------------------------------------
> and these are the errors im getting

> C:\Python22>python
> Traceback (most recent call last):
>   File "", line 5, in ?
>     file = urllib.urlretrieve("
> 8&oe=UTF-8&q=rabbit" , "temp1.tmp")

Is this newline (between image and 8) really there? Maybe
there is a problem with the URL...
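
If a line break did sneak into the string, one simple guard is to strip all whitespace from the URL before using it. This is only a sketch (Python 3 syntax): the helper name strip_url_whitespace and the example.com URL are made up, and it assumes the URL should contain no literal whitespace at all.

```python
def strip_url_whitespace(url):
    """Remove spaces, tabs and newlines that were pasted into a URL by accident."""
    # str.split() with no arguments splits on any run of whitespace
    return "".join(url.split())


# Hypothetical URL with a stray newline in the middle.
broken = "http://www.example.com/search?hl=en&q=\nrabbit"
print(repr(strip_url_whitespace(broken)))
```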

>   File "C:\PYTHON22\lib\", line 80, in urlretrieve
>     return _urlopener.retrieve(url, filename, reporthook, dat
>   File "C:\PYTHON22\lib\", line 210, in retrieve
>     fp =, data)
>   File "C:\PYTHON22\lib\", line 178, in open
>     return getattr(self, name)(url)
>   File "C:\PYTHON22\lib\", line 292, in open_http
>     h.endheaders()
>   File "C:\PYTHON22\lib\", line 695, in endheaders
>     self._send_output()
>   File "C:\PYTHON22\lib\", line 581, in _send_outpu
>     self.send(msg)
>   File "C:\PYTHON22\lib\", line 548, in send
>     self.connect()
>   File "C:\PYTHON22\lib\", line 532, in connect
>     raise socket.error, msg
> --------------------------------------------------------------------------------

I think maybe you are just not getting any response at
all from your attempt to fetch. Can you get any other URL?
Maybe Google is watching user-agent strings to try to keep
spiders out of their pages?
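
One way to test both guesses, sketched for Python 3's urllib.request: send an explicit User-Agent header, set a timeout, and catch the connection error instead of letting the traceback end the script. The helper names, the User-Agent string, and the example.com URL are all assumptions for illustration. Whether any particular site filters on user agent is not something this snippet can confirm.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-fetcher)"):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def try_fetch(url, timeout=10):
    """Return the page body as bytes, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print("fetch failed:", exc)
        return None


# Building the request does not touch the network.
req = build_request("http://www.example.com/")
print(req.get_header("User-agent"))
```

Calling try_fetch on a second, unrelated URL is a quick way to tell a general connectivity problem from one specific to the original site.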
