[Tutor] Titles from a web page

Alan Gauld alan.gauld at btinternet.com
Thu May 5 10:13:11 CEST 2011


"louis leichtnam" <l.leichtnam at gmail.com> wrote

> I'm trying to write a program that looks in a webpage to find the
> titles of a subsection of the page:
>
> Can you help me out? I tried using regular expressions but I keep
> hitting walls and I don't know what to do...

Regular expressions are the wrong tool for parsing HTML unless
you are searching for something very simple.

There is an html parser in the Python standard library (*) that you
can use if the HTML is reasonably well formed. If it's sloppy you
would be better off with something like BeautifulSoup or lxml.
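
As a rough illustration of the standard library route, here is a minimal
sketch. It assumes Python 3, where the HTMLParser class lives in the
html.parser module, and it assumes the "titles" you want are the h1..h6
heading elements. The class name HeadingCollector, the helper
extract_titles and the SAMPLE page are all made up for the example, so
adjust the tag names to whatever your page really uses.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1>..<h6> element (hypothetical helper)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a heading"

    def handle_starttag(self, tag, attrs):
        # Start buffering text when a heading opens.
        if tag in self.HEADINGS:
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a heading.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # When the heading closes, store its accumulated text.
        if tag in self.HEADINGS and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Invented sample page, standing in for the real web page.
SAMPLE = """<html><body>
<h1>Main page</h1>
<p>Intro text</p>
<h2>First <em>section</em></h2>
<h2>Second section</h2>
</body></html>"""


def extract_titles(html):
    """Return the heading texts found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


print(extract_titles(SAMPLE))
# ['Main page', 'First section', 'Second section']
```

To run it against a live page you would fetch the HTML first (for
example with urllib.request.urlopen) and pass the decoded text to
extract_titles.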

If the page is written in XHTML then you could also use the
ElementTree module (xml.etree.ElementTree), which is designed for
XML parsing.
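
For the XHTML case, something along these lines might work. It is only a
sketch: it assumes the page is well-formed XML in the standard XHTML
namespace and that the titles are h2 elements. The names xhtml_titles
and SAMPLE_XHTML are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree prefixes every tag with its namespace in curly braces.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Invented sample document, standing in for the real page.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body>
<h1>Main page</h1>
<h2>First <em>section</em></h2>
<h2>Second section</h2>
</body>
</html>"""


def xhtml_titles(text, tag="h2"):
    """Return the text of every element named `tag` in an XHTML string."""
    root = ET.fromstring(text)
    # itertext() also picks up text inside nested tags such as <em>.
    return ["".join(el.itertext()).strip() for el in root.iter(XHTML_NS + tag)]


print(xhtml_titles(SAMPLE_XHTML))
# ['First section', 'Second section']
```

Bear in mind that ET.fromstring raises a ParseError on anything that is
not well-formed XML, so this approach only suits pages that really are
valid XHTML.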

(*) In fact there are two: htmllib and HTMLParser. The former is more
powerful but more complex. Some brief examples can be found
in my tutor here:

http://www.alan-g.me.uk/tutor/tutwebc.htm

Note that the topic is not complete; the last few sections are
placeholders only.

HTH,

Alan G. 



