Parsing HTML
Frederic Rentsch
anthra.norell at vtxmail.ch
Wed Feb 14 08:13:05 EST 2007
mtuller wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> <tr >
> <td headers="col1_1" style="width:21%" >
> <span class="hpPageText" >LETTER</span></td>
> <td headers="col2_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >33,699</span></td>
> <td headers="col3_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >1.0</span></td>
> <td headers="col4_1" style="width:13%; text-align:right" >
> </tr>
>
> What is shown is only a small section.
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing
> the html with pyparsing, and the examples will get it to print all
> instances with span, of which there are a hundred or so when I use:
>
> for srvrtokens in printCount.searchString(printerListHTML):
>     print srvrtokens
>
> If I set the last line to srvrtokens[3] I get the values, but I don't
> know how to grab a single line and then set that as a variable.
>
> I have also tried Beautiful Soup, but had trouble understanding the
> documentation, and HTMLParser doesn't seem to do what I want. Can
> someone point me to a tutorial or give me some pointers on how to
> parse html where there are multiple lines with the same tags and then
> be able to go to a certain line and grab a value and set a variable's
> value to that?
>
>
> Thanks,
>
> Mike
>
>
Posted problems rarely provide exhaustive information. That's just not
possible. I have lately been taking shots in the dark, suggesting a
stream-editing approach to extracting data from HTML files. The
mainstream approach is to use a parser (Beautiful Soup or pyparsing).
Often nothing more is attempted than locating and extracting some text
irrespective of page layout. This can sometimes be done with a simple
regular expression, or with a stream editor if a regular expression
gets too unwieldy. The advantage of the stream editor over a parser is
that it doesn't mobilize an arsenal of unneeded functionality and
therefore tends to be simpler, faster and shorter to implement. The
editor's inability to understand structure isn't a shortcoming when
structure doesn't matter; it can even be an advantage with malformed
input that would send a parser on a tough and potentially hazardous
mission to no purpose at all.
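If the goal really is just that one number, the re module alone may
already do the job, with no SE and no parser. A rough sketch (the
pattern and the name r2 are mine, printerListHTML is the OP's variable
holding the page source, and I'm assuming the wanted figure always
sits in the span following the col2_1 cell):

>>> import re
>>> r2 = re.compile (r'(?is)headers="col2_1".*?<span[^>]*>([\d,.]+)</span>')
>>> m = r2.search (printerListHTML)     # the whole page as one string
>>> value = m.group (1)                 # '33,699' on the snippet above
>>> int (value.replace (',', ''))       # drop the comma before it goes to the database
33699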
SE doesn't impose the study of massive documentation, nor the
memorization of dozens of classes, methods and what not. The following
four lines would solve the OP's problem (provided the post really is all
there is to the problem):
>>> import re, SE   # http://cheeseshop.python.org/pypi/SE/2.3
>>> Filter = SE.SE ('<EAT> "~(?i)col[0-9]_[0-9](.|\n)*?/td>~==SOME SPLIT MARK"')
>>> r = re.compile ('(?i)(col[0-9]_[0-9])(.|\n)*?([0-9,]+)</span')
>>> for line in Filter (s).split ('SOME SPLIT MARK'):
	print r.search (line).group (1, 3)
('col2_1', '33,699')
('col3_1', '0')
('col4_1', '7,428')
-----------------------------------------------------------------------
Input:
>>> s = '''
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col5_1" style="width:13%; text-align:right" >
<span class="hppagetext" >7,428</span></td>
</tr>'''
The SE object handles file input too:
>>> for line in Filter ('file_name', '').split ('SOME SPLIT MARK'):   # '' commands string output
	print r.search (line).group (1, 3)
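And if the page structure does come to matter after all, Beautiful
Soup is less work than its documentation makes it look. A rough,
untested sketch, assuming the BeautifulSoup 3 API and the column names
from the OP's post:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup (s)                        # s as above, or the whole page
>>> cell = soup.find ('td', {'headers': 'col2_1'})  # the cell that holds the count
>>> cell.span.string                                # should give u'33,699'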