Help Parsing an HTML File

egonslokar at egonslokar at
Fri Feb 15 22:28:51 CET 2008

Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single Unicode file that has descriptions of hundreds of
objects. The file closely resembles the HTML example pasted below.

I need to parse the file, extract the data from the HTML, and produce
a tab-separated file that looks like the output file below.

Any tips, advice, or guidance would be greatly appreciated.



/Please note that the first line of the output file contains the column headers./
------Tab Separated Output File Begin------
H1	H2	DIV	Segment1	Segment2	Segment3
RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
PinkH1-2	PinkH2-2	PinkDIV2-2	PinkSegmentDIV1-2	No-Value	No-Value
BlackH1-3	BlackH2-3	BlackDIV2-3	BlackSegmentDIV1-3	No-Value	No-Value
YellowH1-4	YellowH2-4	YellowDIV2-4	YellowSegmentDIV1-4	YellowSegmentDIV2-4	No-Value
------Tab Separated Output File End------

------HTML Example Begin------

<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>

<div "segment1">PinkSegmentDIV1-2</div><br>

<div "segment1">BlackSegmentDIV1-3</div><br>

<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

------HTML Example End------
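
For illustration, here is a minimal sketch of one possible approach to the
segment columns only. It rests on assumptions that the sample suggests but
does not confirm: every segment is written exactly as <div "segmentN">text</div>,
and the objects are separated by blank lines. The names SEGMENT_RE,
segments_to_rows and to_tsv are made up for this example. Because the sample
markup is not well-formed HTML, the sketch uses a regular expression rather
than a real HTML parser.

```python
import re

# Assumption: each segment looks exactly like the sample markup,
# i.e. <div "segmentN">text</div> with a bare quoted token in the tag.
SEGMENT_RE = re.compile(r'<div\s+"segment(\d+)">(.*?)</div>', re.DOTALL)
MISSING = "No-Value"


def segments_to_rows(text, n_segments=3):
    """Return one list of segment values per blank-line-separated block."""
    rows = []
    # Assumption: objects are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        found = {int(num): value.strip()
                 for num, value in SEGMENT_RE.findall(block)}
        if not found:
            continue  # skip blocks that contain no segment divs
        # Fill the gaps so every row has the same number of columns.
        rows.append([found.get(i, MISSING) for i in range(1, n_segments + 1)])
    return rows


def to_tsv(rows, n_segments=3):
    """Join the rows into tab-separated text, with a header line first."""
    header = ["Segment%d" % i for i in range(1, n_segments + 1)]
    return "\n".join("\t".join(row) for row in [header] + rows) + "\n"


sample = """\
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>

<div "segment1">PinkSegmentDIV1-2</div><br>
"""

rows = segments_to_rows(sample)
print(to_tsv(rows))
```

For a real file, the text would be read with something like
open(path, encoding="utf-8").read(), assuming the file is UTF-8. The H1, H2
and DIV columns are left out because the sample does not show how they are
marked up.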
