Help Parsing an HTML File
Stefan Behnel
stefan_ml at behnel.de
Sat Feb 16 02:41:45 EST 2008
egonslokar at gmail.com wrote:
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> </html>
> ------HTML Example End------
Now, what ugly markup is that? You will never manage to get any HTML compliant
parser return the "segmentX" stuff in there. I think your best bet is really
going for pyparsing or regular expressions (and I actually recommend pyparsing
here).
Stefan
More information about the Python-list
mailing list