Help Parsing an HTML File
Peter Otten
__peter__ at web.de
Sat Feb 16 04:10:33 EST 2008
Stefan Behnel wrote:
> egonslokar at gmail.com wrote:
>> I have a single unicode file that has descriptions of hundreds of
>> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>>
>> I need to parse the file in such a way to extract data out of the html
>> and to come up with a tab separated file that would look like OUTPUT-
>> FILE below.
>>
>> =====OUTPUT-FILE=====
>> /please note that the first line of the file contains column headers/
>> ------Tab Separated Output File Begin------
>> H1 H2 DIV Segment1 Segment2 Segment3
>> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
>> ------Tab Separated Output File End------
>>
>> =====HTML-EXAMPLE=====
>> ------HTML Example Begin------
>> <html>
>>
>> <h1>RoséH1-1</h1>
>> <h2>RoséH2-1</h2>
>> <div>RoséDIV-1</div>
>> <div "segment1">RoséSegmentDIV1-1</div><br>
>> <div "segment2">RoséSegmentDIV2-1</div><br>
>> <div "segment3">RoséSegmentDIV3-1</div><br>
>> <br>
>> <br>
>>
>> </html>
>> ------HTML Example End------
>
> Now, what ugly markup is that? You will never manage to get any HTML
> compliant parser return the "segmentX" stuff in there. I think your best
> bet is really going for pyparsing or regular expressions (and I actually
> recommend pyparsing here).
>
> Stefan
In practice the following might be sufficient:

from BeautifulSoup import BeautifulSoup

def chunks(bs):
    # Group the tags into records; every <h1> starts a new record.
    chunk = []
    for tag in bs.findAll(["h1", "h2", "div"]):
        if tag.name == "h1":
            if chunk:
                yield chunk
            chunk = []
        chunk.append(tag)
    # Do not forget the last record.
    if chunk:
        yield chunk

def process(filename):
    bs = BeautifulSoup(open(filename))
    for chunk in chunks(bs):
        columns = [tag.string for tag in chunk]
        # Pad short records on the right so every row has six columns.
        columns += ["No Value"] * (6 - len(columns))
        print "\t".join(columns)

if __name__ == "__main__":
    process("example.html")

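The biggest caveat is that only columns at the end of a row may be left out:
values are assigned to columns by position, so a tag missing from the middle
of a record shifts every later value one column to the left.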
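
If columns in the middle of a row can be missing too, one option is to key
each value by its tag name or "segmentN" label instead of by position. The
following is only a rough sketch along the lines of Stefan's regular
expression suggestion, not a tested solution for the real file. It is written
for Python 3 and uses only the re module; the names HEADERS, TAG_RE, rows and
to_tsv are made up for this example, and it assumes the markup is as regular
as in HTML-EXAMPLE (no nested tags inside h1, h2 or div).

```python
import re

# Column order taken from the OUTPUT-FILE header line in the original post.
HEADERS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# Matches <h1>..</h1>, <h2>..</h2>, <div>..</div> and <div "segmentN">..</div>.
# Assumption: no nested markup inside these elements.
TAG_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"(segment\d+)")?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

def rows(html, missing="No Value"):
    """Yield one list of column values per record; <h1> starts a record."""
    row = None
    for name, segment, text in TAG_RE.findall(html):
        # Use the "segmentN" label when present, else the tag name.
        key = (segment or name).lower()
        if key == "h1":
            if row is not None:
                yield [row.get(h.lower(), missing) for h in HEADERS]
            row = {}
        # Anything that appears before the first <h1> is ignored.
        if row is not None:
            row[key] = text.strip()
    if row is not None:
        yield [row.get(h.lower(), missing) for h in HEADERS]

def to_tsv(html):
    """Return the header line plus one tab separated line per record."""
    lines = ["\t".join(HEADERS)]
    lines.extend("\t".join(r) for r in rows(html))
    return "\n".join(lines)
```

With this approach a record that lacks, say, its "segment2" div still gets
its "segment3" value in the last column, at the price of relying on the
markup staying regular enough for the pattern to match.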
Peter