Parsing HTML, extracting text and changing attributes.

Jay Loden python at jayloden.com
Mon Jun 18 18:36:49 CEST 2007


Neil Cerutti wrote:
> You could get good results, and save yourself some effort, using
> links or lynx with the command line options to dump page text to
> a file. Python would still be needed to automate calling links or
> lynx on all your documents.

The OP was looking for a way to parse out part of the file and apply classes to certain types of tags. Using lynx or links wouldn't help here: both dump the page as plain text, and the goal isn't to strip all the formatting.

Someone else mentioned lxml, but as I understand it, lxml's XML parser will only work if the input is well-formed XHTML. Assuming it's not (real-world HTML almost never is), perhaps BeautifulSoup will fare better.

http://www.crummy.com/software/BeautifulSoup/documentation.html
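
For illustration, here is a rough sketch of the kind of thing BeautifulSoup can do for this task: parse sloppy markup, add a class to certain tags, and pull out the text. It assumes the current bs4 package (BeautifulSoup 4), which is newer than this message, and the sample markup, the add_class helper, and the class names "body-text" and "external" are all made up for the example.

```python
from bs4 import BeautifulSoup

# Sloppy sample markup (hypothetical): uppercase tag, unquoted attributes,
# and no closing </body> or </html>.
html = (
    "<html><body>"
    "<P>First paragraph</P>"
    "<p class=intro>Second with a <a href=page.html>link</a></p>"
)


def add_class(soup, tag_name, class_name):
    """Add class_name to every tag_name element, keeping existing classes.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    for tag in soup.find_all(tag_name):
        # With the default settings, "class" is returned as a list of names.
        classes = tag.get("class", [])
        if class_name not in classes:
            tag["class"] = classes + [class_name]
    return soup


# "html.parser" is the standard-library parser backend, so nothing beyond
# bs4 itself needs to be installed.
soup = BeautifulSoup(html, "html.parser")

add_class(soup, "p", "body-text")
add_class(soup, "a", "external")

# Extract the text of each paragraph; the markup itself stays in the tree.
texts = [p.get_text() for p in soup.find_all("p")]

print(texts)
print(soup)  # the modified document, formatting preserved
```

Unlike dumping the page through lynx or links, the document keeps its tags here; only the class attributes change, and the text is available separately if it's needed.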

-Jay


