Pyparsing: Non-greedy matching?

Thu Dec 30 19:56:48 EST 2004

I'm trying to use pyparsing to write a screenscraper.  I've got some
arbitrary HTML text that I define as an opener & closer.  In between is
the HTML data I want to extract.  However, the data may contain the same
characters as used in the closer (but not the exact same text,
obviously).  I'd like to get the *minimal* amount of data between these.

Here's an example (whitespace may differ):

from pyparsing import *

test=r"""<tr class="tableTopSpace"><td></td></tr>
<tr class="tableTitleDark"><td class="tableTitleDark">Job
Information</td></tr><tr><td><table width="100%" border="0"
cellspacing="3"><tr>
<td width="110" valign="top"><div align="right"><strong>Job Title:
      </strong></div></td>
<td class="ccDisplayCell">Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td></tr>
<tr>
<td width="110" valign="top"><div align="right"><strong>Employer:
        </strong></div></td>
<td width="200" nowrap class="ccDisplayCell"><table><tr><td colspan="2"
valign="top">Global Megacorp</td></tr></table></td><td>
    <script>
    function escapecomp(){
    }
"""

data=Combine(OneOrMore(Word(printables)), adjacent=False, 
joinString=" ")

title_open=Literal(r"""<td width="110" valign="top"><div
align="right"><strong>Job Title:      </strong></div></td>
<td class="ccDisplayCell">""")
title_open.suppress()

title_close=Literal(r"""</td>""")
title_close.suppress()

title=title_open + data + title_close
title2=title_open + (data | title_close)

>>> title.scanString(test).next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

>>> title2.scanString(test).next()
((['<td width="110" valign="top"><div align="right"><strong>Job Title:\n
     </strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td> </tr>
<tr> <td width="110" valign="top"><div align="right"><strong>Employer:
</strong></div></td> <td width="200" nowrap
class="ccDisplayCell"><table><tr><td colspan="2" valign="top">Global
Megacorp</td></tr></table></td> <td> <script> function escapecomp(){
}'], {}), 182, 656)
>>> 

I'd expected title to work, but it doesn't match at all. ;(  In other
test variants, title2 gives extra stuff at the end though not
necessarily to the end of the string (due to unprintable characters,
perhaps).

I want a ParseResult more like:
['<td width="110" valign="top"><div align="right"><strong>Job Title:\n
     </strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B
STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man', '</td>']

I sort of understand why title2 works as it does (the OneOrMore just
slurps up everything), but for the life of me I can't figure out how to
fix it. ;) Is there a way of writing something similar to RE's ".*?" ?
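
To make the question concrete, here is the behaviour I mean, sketched
with the standard re module rather than pyparsing.  The sample string
and the names opener, closer, greedy and minimal are made up for this
illustration (a shortened stand-in for the page above), not part of my
real scraper:

```python
import re

# Hypothetical, shortened stand-in for the real page.
sample = ('<td class="ccDisplayCell">Big Old <B>Head Honcho</B> Boss Man</td></tr>'
          '<tr><td>Global Megacorp</td></tr>')

# Escape the literal opener/closer so they are matched character for character.
opener = re.escape('<td class="ccDisplayCell">')
closer = re.escape('</td>')

# Greedy: ".*" runs on to the LAST closer in the string.
greedy = re.search(opener + r'(.*)' + closer, sample, re.DOTALL).group(1)

# Non-greedy: ".*?" stops at the FIRST closer, which is what I'm after.
minimal = re.search(opener + r'(.*?)' + closer, sample, re.DOTALL).group(1)

print(greedy)
print(minimal)
```

The greedy version behaves like my title2 (it swallows the following
cells too), while the non-greedy one returns just the first cell's
contents.  I'm looking for the pyparsing equivalent of the second.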

--Pete

-- 
Peter Fein                 pfein at pobox.com                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman