Help Parsing an HTML File
Mike Driscoll
kyosohma at gmail.com
Fri Feb 15 17:06:42 EST 2008
On Feb 15, 3:28 pm, egonslo... at gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> Any tips, advice and guidance is greatly appreciated.
>
> Thanks,
>
> Egon
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
> RoséSegmentDIV3-1
> PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
> BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
> YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4
> YellowSegmentDIV2-4 No-Value
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> <h1>PinkH1-2</h1>
> <h2>PinkH2-2</h2>
> <div>PinkDIV2-2</div>
> <div "segment1">PinkSegmentDIV1-2</div><br>
> <br>
> <comment></comment>
>
> <h1>BlackH1-3</h1>
> <h2>BlackH2-3</h2>
> <div>BlackDIV2-3</div>
> <div "segment1">BlackSegmentDIV1-3</div><br>
>
> <h1>YellowH1-4</h1>
> <h2>YellowH2-4</h2>
> <div>YellowDIV2-4</div>
> <div "segment1">YellowSegmentDIV1-4</div><br>
> <div "segment2">YellowSegmentDIV2-4</div><br>
>
> </html>
> ------HTML Example End------
Pyparsing, ElementTree and lxml are all good candidates as well.
BeautifulSoup is the one that copes with malformed HTML, though, and
your sample is malformed: <div "segment1"> is not a valid attribute.
http://pyparsing.wikispaces.com/
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/
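
For what it's worth, here is a rough sketch of the extraction itself. It uses
only the standard library's html.parser (Python 3) rather than any of the
packages above, so treat it as one possible approach, not the recommended one.
The names RecordParser, html_to_tsv, COLUMNS and SAMPLE are made up for
illustration. It assumes every <h1> starts a new record, and that the parser
reports the stray quoted token in <div "segment1"> as a valueless attribute
name, which is how CPython's html.parser appears to behave.

```python
from html.parser import HTMLParser

# Column order taken from the header line of the desired output file.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]


class RecordParser(HTMLParser):
    """Collects one dict per <h1>...</h1> group (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._column = None  # column the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.records.append({})  # assumption: <h1> opens a new record
            self._column = "H1"
        elif tag == "h2":
            self._column = "H2"
        elif tag == "div":
            # Assumption: the malformed <div "segment1"> shows up as an
            # attribute whose *name* is the quoted token, with no value.
            name = attrs[0][0].strip("\"'") if attrs else ""
            if name.startswith("segment"):
                self._column = name.capitalize()  # segment1 -> Segment1
            else:
                self._column = "DIV"
        else:
            self._column = None  # <br>, <comment>, <html>, ... are ignored

    def handle_endtag(self, tag):
        self._column = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._column and self.records:
            self.records[-1][self._column] = text


def html_to_tsv(html):
    """Return the tab-separated text, header line first."""
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    lines = ["\t".join(COLUMNS)]
    for record in parser.records:
        # Missing segments are filled with the No-Value placeholder.
        lines.append("\t".join(record.get(c, "No-Value") for c in COLUMNS))
    return "\n".join(lines)


# A shortened copy of the HTML example from the question.
SAMPLE = """<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<br>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<comment></comment>
</html>"""

if __name__ == "__main__":
    print(html_to_tsv(SAMPLE))
```

To run it on the real file, read it with the right encoding first, e.g.
open(path, encoding="utf-8").read(), and pass the resulting string to
html_to_tsv. The encoding is a guess; the question only says "unicode".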
Mike