[Tutor] Parsing HTML file

Chris Heisel chris at heisel.org
Thu Dec 11 16:12:56 EST 2003


Hi,

I'm working on a Python script that will go through a series of 
directories and parse some HTML files.

I'd like to be able to read the HTML and extract certain components and 
put them into a MySQL database.

For instance, in these files there will be a document title like this:
<h2 class="header">This is the documents header</h2>

There would be content marked like this:
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->

What is the best way to approach this problem?

I was reading up on htmllib and HTMLParser. Should I use them or do some 
regexp searches of the files for "<h2 class="header">*</h2>"?
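
To make the regexp option concrete, here is a rough sketch of what it might look like. It assumes Python 3 and that the markup is exactly as regular as in the samples above (same attribute order, same quoting, same comment text); the names SAMPLE, TITLE_RE, CONTENT_RE and extract_with_regex are made up for illustration.

```python
import re

# Sample document assembled from the snippets in this message.
SAMPLE = """<h2 class="header">This is the documents header</h2>
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->"""

# Non-greedy groups so each match stops at the first closing marker.
# re.DOTALL lets "." match newlines, since the content spans several lines.
TITLE_RE = re.compile(r'<h2 class="header">(.*?)</h2>', re.DOTALL)
CONTENT_RE = re.compile(r'<!--START CONTENT-->(.*?)<!--END CONTENT-->', re.DOTALL)


def extract_with_regex(html):
    """Return (title, content); either is None if its marker is missing."""
    title = TITLE_RE.search(html)
    content = CONTENT_RE.search(html)
    return (
        title.group(1).strip() if title else None,
        content.group(1).strip() if content else None,
    )


title, content = extract_with_regex(SAMPLE)
```

This only holds up while every file uses the exact same markup; an extra attribute or different whitespace inside the h2 tag would make the title pattern miss.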

If I should use htmllib or HTMLParser, are there any suggestions on their use?

I gather that I can set event handlers for, say, an <h2> tag, but can I 
set event handlers for classes, like <h2 class="header">, or for blocks 
of comments like <!--START CONTENT--> and <!--END CONTENT-->?
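
For what it's worth, here is a minimal sketch of the event-handler idea, assuming Python 3, where the parser lives in html.parser (in Python 2 the module is named HTMLParser). The parser has no per-class handlers as such, but handle_starttag receives the tag's attributes, so the class can be checked there, and handle_comment receives the text of each comment. The DocExtractor class and the extract helper are hypothetical names, not part of the library.

```python
from html.parser import HTMLParser


class DocExtractor(HTMLParser):
    """Collect the <h2 class="header"> text and the markup between
    the START CONTENT and END CONTENT comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.content = []        # pieces of markup inside the content block
        self._title_parts = []
        self._in_title = False
        self._in_content = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "header")]
        if tag == "h2" and ("class", "header") in attrs:
            self._in_title = True
            self._title_parts = []
        elif self._in_content:
            # Keep the tag exactly as it appeared in the source.
            self.content.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()
        elif self._in_content:
            self.content.append("</%s>" % tag)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)
        elif self._in_content:
            self.content.append(data)

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        marker = data.strip()
        if marker == "START CONTENT":
            self._in_content = True
        elif marker == "END CONTENT":
            self._in_content = False


def extract(html):
    """Return (title, content) for one HTML document."""
    parser = DocExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, "".join(parser.content).strip()
```

Calling extract() on the text of one file should give back the header text and the content block as a string, which could then be handed to whatever code does the database insert.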

In a perfect world I would have gotten all this data in an XML format, 
which would make my life easier, but the files are already there in HTML 
and I've got to figure out how to extract some of the semantic content 
and stuff it into a MySQL DB...

Many, many thanks in advance for your help,

Chris




