Is it possible to download only the <head> of a web page?
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Fri Sep 5 00:21:56 EDT 2008
On Thu, 04 Sep 2008 18:53:33 -0300, Fredrik Lundh <fredrik at pythonware.com>
wrote:
> Rex wrote:
>
>> I am writing a script that executes a bunch of queries through a form
>> on a website and reads the results. I am only interested in the
>> <title> section in the <head> of each web page. Currently, each page
>> the server returns is about 100kb and contains a bunch of HTML and
>> Javascript, all of which I don't need; I don't want to waste bandwidth
>> or consume too much of the server's resources. I just need the <title>
>> string.
>
> you need to issue a GET request to get the HTML head section, which
> almost always means that the server will build the entire page before
> sending it to you (so it can set content-length etc).
>
> you can save on network traffic by parsing the data as it arrives, and
> stopping when you've gotten the TITLE element:
>
> http://effbot.org/librarybook/sgmllib.htm
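A minimal sketch of the parse-as-it-arrives idea quoted above. It is an illustration, not the code from the linked page: sgmllib was removed in Python 3, so this swaps in html.parser.HTMLParser, which also accepts data in pieces through feed(). The names TitleParser, title_from_chunks and fetch_title are made up for this example, and decoding the page as UTF-8 is an assumption.

```python
import codecs
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []
        self.title = None  # set once </title> has been parsed

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.title = "".join(self.parts).strip()


def title_from_chunks(chunks):
    """Feed text chunks to the parser and stop at the first complete title."""
    parser = TitleParser()
    for chunk in chunks:
        parser.feed(chunk)
        if parser.title is not None:
            return parser.title
    return None


def fetch_title(url, chunk_size=1024):
    """Read the response in small pieces; stop reading once the title is known."""
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(url) as resp:
        raw = iter(lambda: resp.read(chunk_size), b"")
        return title_from_chunks(decoder.decode(c) for c in raw)
```

Leaving the with block closes the connection, so the rest of the page is not read, although whatever is already sitting in network buffers has still been sent.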
Another alternative would be to estimate how many bytes it takes to reach the
<title> tag, and issue a GET with a Range header. The server will very
likely still have to build the entire page, but, provided it honors Range,
it won't send more bytes than requested. (If the requested size turns out
not to be enough, one can issue another GET asking for more data.)
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
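A rough sketch of the Range approach, assuming Python 3 and urllib.request. The helper names (extract_title, range_request, fetch_title), the 2048-byte step and the 64 KB cap are all arbitrary choices for illustration. A server is free to ignore Range and answer 200 with the full body, so the code checks for 206 before trusting the partial content.

```python
import re
import urllib.error
import urllib.request

TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(data):
    """Return the raw bytes inside a complete <title> element, else None."""
    match = TITLE_RE.search(data)
    return match.group(1).strip() if match else None


def range_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) of url."""
    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (start, end))
    return req


def fetch_title(url, chunk=2048, limit=65536):
    """Ask for the page a piece at a time until the title shows up."""
    data = b""
    while len(data) < limit:
        req = range_request(url, len(data), len(data) + chunk - 1)
        try:
            with urllib.request.urlopen(req) as resp:
                if resp.status != 206:
                    # Range was ignored: the body starts at byte 0, so read
                    # a bounded prefix of it and stop.
                    return extract_title(resp.read(limit))
                part = resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 416:  # asked for bytes past the end of the page
                break
            raise
        if not part:
            break
        data += part
        title = extract_title(data)
        if title is not None:
            return title
    return None
```

Since the original poster's pages come from form queries, each extra GET may rerun the query on the server, so it is probably worth picking a first range big enough to succeed most of the time.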
--
Gabriel Genellina