Pyparsing: Non-greedy matching?
Peter Fein
pfein at pobox.com
Thu Dec 30 19:56:48 EST 2004
I'm trying to use pyparsing to write a screenscraper. I've got some
arbitrary HTML text I define as opener & closer. In between is the HTML
data I want to extract. However, the data may contain the same
characters as used in the closer (but not the exact same text,
obviously). I'd like to get the *minimal* amount of data between these.
Here's an example (whitespace may differ):
from pyparsing import *
test=r"""<tr class="tableTopSpace"><td></td></tr>
<tr class="tableTitleDark"><td class="tableTitleDark">Job
Information</td></tr><tr><td><table width="100%" border="0"
cellspacing="3"><tr>
<td width="110" valign="top"><div align="right"><strong>Job Title:
</strong></div></td>
<td class="ccDisplayCell">Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td></tr>
<tr>
<td width="110" valign="top"><div align="right"><strong>Employer:
</strong></div></td>
<td width="200" nowrap class="ccDisplayCell"><table><tr><td colspan="2"
valign="top">Global Megacorp</td></tr></table></td><td>
<script>
function escapecomp(){
}
"""
data=Combine(OneOrMore(Word(printables)), adjacent=False,
joinString=" ")
title_open=Literal(r"""<td width="110" valign="top"><div
align="right"><strong>Job Title: </strong></div></td>
<td class="ccDisplayCell">""")
title_open.suppress()
title_close=Literal(r"""</td>""")
title_close.suppress()
title=title_open + data + title_close
title2=title_open + (data | title_close)
>>> title.scanString(test).next()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
StopIteration
>>> title2.scanString(test).next()
((['<td width="110" valign="top"><div align="right"><strong>Job Title:\n
</strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td> </tr>
<tr> <td width="110" valign="top"><div align="right"><strong>Employer:
</strong></div></td> <td width="200" nowrap
class="ccDisplayCell"><table><tr><td colspan="2" valign="top">Global
Megacorp</td></tr></table></td> <td> <script> function escapecomp(){
}'], {}), 182, 656)
>>>
I'd expected title to work, but it doesn't match at all. ;( In other
test variants, title2 gives extra stuff at the end though not
necessarily to the end of the string (due to unprintable characters,
perhaps).
I want a ParseResult more like:
['<td width="110" valign="top"><div align="right"><strong>Job Title:\n
</strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man', '</td>']
I sort of understand why title2 works as it does (the OneOrMore just
slurps up everything), but for the life of me I can't figure out how to
fix it. ;) Is there a way of writing something similar to RE's ".*?" ?
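For comparison, here is a rough sketch of the regex behaviour I mean, using
the standard re module rather than pyparsing. The html string is a shortened,
made-up stand-in for the test data above, and the variable names are only
illustrative:

```python
import re

# Shortened, hypothetical stand-in for the HTML in the post.
html = (
    '<td class="ccDisplayCell">Big Old <B STYLE="background-color:#FFEF95">'
    'Head Honcho</B> Boss Man</td></tr>\n'
    '<tr><td colspan="2" valign="top">Global Megacorp</td></tr>'
)

# Escape the literal opener/closer so they are matched verbatim.
opener = re.escape('<td class="ccDisplayCell">')
closer = re.escape('</td>')

# Greedy: ".*" runs on to the LAST closer in the text.
greedy = re.search(opener + r'(.*)' + closer, html, re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST closer after the opener.
minimal = re.search(opener + r'(.*?)' + closer, html, re.DOTALL)

print(minimal.group(1))
```

The greedy pattern behaves roughly like title2 (it swallows the following
rows too), while the non-greedy one returns only the job title cell, which
is the result I'm after.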
--Pete
--
Peter Fein pfein at pobox.com 773-575-0694
Basically, if you're not a utopianist, you're a schmuck. -J. Feldman