[Tutor] Parsing HTML file
Chris Heisel
chris at heisel.org
Thu Dec 11 16:12:56 EST 2003
Hi,
I'm working on a Python script that will go through a series of
directories and parse some HTML files.
I'd like to be able to read the HTML and extract certain components and
put them into a MySQL database.
For instance, in these files there will be a document title like this:
<h2 class="header">This is the documents header</h2>
There would be content marked like this:
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->
What is the best way to approach this problem?
I was reading up on htmllib and HTMLParser. Should I use them or do some
regexp searches of the files for "<h2 class="header">*</h2>"?
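For reference, the regexp route could be sketched roughly as follows. This is only a minimal illustration, assuming the markup is as regular as the samples above (exact attribute order and quoting, one header and one content block per file). The names SAMPLE, TITLE_RE, CONTENT_RE and extract_with_regex are hypothetical, made up for this sketch.

```python
import re

# Hypothetical sample that mirrors the markup described in this post.
SAMPLE = """<h2 class="header">This is the documents header</h2>
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->
"""

# Non-greedy groups so each match stops at the first closing marker.
# re.DOTALL lets "." span newlines inside the content block.
TITLE_RE = re.compile(r'<h2 class="header">(.*?)</h2>', re.DOTALL)
CONTENT_RE = re.compile(r'<!--START CONTENT-->(.*?)<!--END CONTENT-->', re.DOTALL)


def extract_with_regex(html):
    """Return (title, content); either is None when its marker is missing."""
    title = TITLE_RE.search(html)
    content = CONTENT_RE.search(html)
    return (
        title.group(1).strip() if title else None,
        content.group(1).strip() if content else None,
    )


title, content = extract_with_regex(SAMPLE)
print(title)
print(content)
```

A pattern like this breaks as soon as the tag gains an extra attribute or different spacing, which is the usual argument for a real parser over regexps.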
If I should use htmllib and HTMLParser, any suggestions on their use?
I gather that I can set event handlers for, say, an <h2> tag, but can I
set event handlers for classes, like <h2 class="header">, or for blocks
of comments like <!--START CONTENT--> and <!--END CONTENT-->?
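To make the question concrete, here is a rough sketch of what handlers keyed on a class attribute and on comment markers might look like. It assumes a current Python 3, where the parser lives in html.parser (the module was named HTMLParser at the time of this post). The class name DocExtractor and the SAMPLE string are hypothetical. The handler is still registered per tag; the class is checked inside it by looking at the attribute list.

```python
from html.parser import HTMLParser

# Hypothetical sample that mirrors the markup described in this post.
SAMPLE = """<h2 class="header">This is the documents header</h2>
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->
"""


class DocExtractor(HTMLParser):
    """Collects the <h2 class="header"> text and the marked content block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.content_parts = []
        self._title_buf = []
        self._in_title = False
        self._in_content = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; exact match on class only.
        if tag == "h2" and dict(attrs).get("class") == "header":
            self._in_title = True
            self._title_buf = []
        elif self._in_content:
            # Keep the start tag as it appeared in the source.
            self.content_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_buf).strip()
        elif self._in_content:
            self.content_parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self._in_title:
            self._title_buf.append(data)
        elif self._in_content:
            self.content_parts.append(data)

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == "START CONTENT":
            self._in_content = True
        elif marker == "END CONTENT":
            self._in_content = False


parser = DocExtractor()
parser.feed(SAMPLE)
parser.close()
print(parser.title)
print("".join(parser.content_parts).strip())
```

The exact comparison on class would miss an element carrying several classes, such as class="header big"; splitting the attribute value on whitespace would be one way to loosen it.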
In a perfect world I would have gotten all this data in an XML format,
that would make my life easier, but the files are already there in HTML
and I've got to figure out how to extract some of the semantic content
and stuff it into a MySQL DB...
Many, many thanks in advance for your help,
Chris