Python web client anyone?

Ng Pheng Siong ngps at
Mon Oct 15 16:14:34 CEST 2001

According to Paul Rubin  <phr-n2001d at>:
> What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start.  By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>.  The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.

Surely, you find the tags by parsing "<a href=blah.html>" (sic), not by
looking for "this is the anchor text"?
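
That said, one pass over the HTML can give you both: parse the <a href=...> tags and keep the text between each open and close tag, then look links up by that text. The following is only a minimal sketch, not anything LWP or htmllib provides. It assumes Python 3's html.parser in place of htmllib, and the names AnchorTextParser and find_href are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (anchor text, href) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # (text, href) pairs in document order
        self._href = None      # href of the <a> currently open, if any
        self._chunks = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks do not matter.
            text = " ".join("".join(self._chunks).split())
            self.links.append((text, self._href))
            self._href = None


def find_href(html, anchor_text):
    """Return the href of the first link whose text matches, else None."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text == anchor_text:
            return href
    return None


page = '<p>See <a href=blah.html>this is the\n anchor text</a>.</p>'
print(find_href(page, "this is the anchor text"))
```

"Clicking" the link is then a matter of issuing a GET for whatever find_href returns; matching is exact after whitespace normalisation, which may be too strict for real pages.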

> I may not have read the htmllib docs carefully enough but it looks more
> intended for formatting/displaying HTML than parsing it.  Are your
> DOM extensions available?

htmllib parses fine enough. Here's a demo from M2Crypto. It seems to work,
too. ;-)

"""M2Crypto.SSL.Session client demo: This program requests a URL from 
a HTTPS server, saves the negotiated SSL session id, parses the HTML 
returned by the server, then requests each HREF in a separate thread 
using the saved SSL session id.

Copyright (c) 1999-2000 Ng Pheng Siong. All rights reserved."""

RCS_id='$Id:,v 1.2 2000/09/11 14:52:29 ngps Exp ngps $'

from M2Crypto import Err, Rand, SSL, X509, threading
m2_threading = threading; del threading

import formatter, getopt, htmllib, sys
from threading import Thread
from socket import gethostname

def handler(sslctx, host, port, href, recurs=0, sslsess=None):

    s = SSL.Connection(sslctx)
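    if sslsess:
        # Reuse the saved SSL session before connecting.
        s.set_session(sslsess)
        s.connect((host, port))
    else:
        # First connection: negotiate a new session and save it.
        s.connect((host, port))
        sslsess = s.get_session()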
    #print sslsess.as_text()

    if recurs:
        p = htmllib.HTMLParser(formatter.NullFormatter())

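    f = s.makefile("rw")
    # Despite its name, href holds the complete HTTP request string.
    f.write(href)
    f.flush()

    while 1:
        data = f.read()
        if not data:
            break
        if recurs:
            p.feed(data)

    if recurs:
        p.close()

    f.close()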

    if recurs:
        for a in p.anchorlist:
            req = 'GET %s HTTP/1.0\r\n\r\n' % a
            thr = Thread(target=handler, 
                        args=(sslctx, host, port, req, recurs-1, sslsess))
            print "Thread =", thr.getName()
            thr.start()

if __name__ == '__main__':

    m2_threading.init()  # set up M2Crypto's thread support before starting threads
    Rand.load_file('../randpool.dat', -1)

    host = ''
    port = 443
    req = '/'

    optlist, optarg = getopt.getopt(sys.argv[1:], 'h:p:r:')
    for opt in optlist:
        if '-h' in opt:
            host = opt[1]
        elif '-p' in opt:
            port = int(opt[1])
        elif '-r' in opt:
            req = opt[1]
    ctx = SSL.Context('sslv3')
    ctx.set_verify(SSL.verify_none, 10)
    req = 'GET %s HTTP/1.0\r\n\r\n' % req

    start = Thread(target=handler, args=(ctx, host, port, req, 1))
    print "Thread =", start.getName()
    start.start()
    start.join()

