Extract Title from HTML documents
Walter Dörwald
walter at livinglogic.de
Fri Nov 5 04:59:43 EST 2004
Nickolay Kolev wrote:
> Hi all,
>
> I am looking for a way to extract the titles of HTML documents. I have
> made an honest attempt at doing it, and it even works. Is there an
> easier (faster / more efficient / clearer) way?
You might try XIST (http://www.livinglogic.de/Python/xist):
---
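from ll.xist import parsers, xfind
from ll.xist.ns import html

# Parse the file into an XIST node tree
e = parsers.parseFile("test.html", tidy=True)
# Take the first <title> element anywhere in the tree and print its text
print unicode(xfind.first(e//html.title))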
---
(With tidy=True this uses libxml2's HTML parser internally, so broken HTML should not be a problem.)
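
If installing XIST is not an option, something along the following lines might also do, using only the standard library's html.parser module. This is just a rough sketch for Python 3, not XIST code: TitleParser and extract_title are names made up for this example, and html.parser is likely to be less forgiving of badly broken markup than libxml2.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any later <title> elements
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the title with whitespace collapsed, or None if no closed title was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    if not parser.done:
        return None
    return " ".join("".join(parser.parts).split())
```

To use it on a file, read the file into a string first (with whatever encoding the document uses) and pass that string to extract_title().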
Bye,
Walter Dörwald