Saving search results in a dictionary

Lukas Holcik xholcik1 at fi.muni.cz
Fri Jun 18 09:00:11 EDT 2004


Hi Paul, and thanks for the reply,

Why is the pyparsing module better than re? That is a question I have to 
ask before I can use it; no offense meant. I found the Regular Expression 
HOWTO (a PDF on python.org) and learned that compiled patterns have a 
finditer method, which can accomplish this task quite easily:

     import re

     regexp = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)
     links = {}   # a variable named "dict" would shadow the builtin
     # text holds the HTML being scanned
     for match in regexp.finditer(text):
         links[match.group("pcdata")] = match.group("href")
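
One thing to watch with the snippet above: the dict is keyed by the link 
text, so two links that share the same text collapse into a single entry, 
and the pattern only matches double-quoted hrefs.

By the way, regarding my other question about the &entities;: re.sub also 
accepts a function as the replacement argument, which is called with each 
match object, so the entitydefs lookup can happen once per match. A quick 
sketch (only lightly tested):

     import re
     from htmlentitydefs import entitydefs

     def expand_entity(match):
         # look up the entity name; leave unknown entities untouched
         return entitydefs.get(match.group(1), match.group(0))

     entity_re = re.compile("&([a-zA-Z]+);")
     print entity_re.sub(expand_entity, "a &lt; b &amp; c")
     # prints: a < b & c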

---------------------------------------_.)--
|  Lukas Holcik (xholcik1 at fi.muni.cz)  (\=)*
----------------------------------------''--

On Thu, 17 Jun 2004, Paul McGuire wrote:

> "Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
> news:Pine.LNX.4.60.0406171557330.16166 at nymfe30.fi.muni.cz...
>> Hi everyone!
>>
>> How can I simply search text for regexps (let's say <a
>> href="(.*?)">(.*?)</a>) and save all URLs (1) and link contents (2) in a
>> dictionary { name : URL }? In a single pass, if possible.
>>
>> Or how can I replace the html &entities; in a string
>> "blablabla&blablabal&balbalbal" with the chars they stand for, using
>> re.sub? I found out they are stored in a dict (from htmlentitydefs import
>> entitydefs). I thought about this functionality:
>>
>> regexp = re.compile("&[a-zA-Z]+;")
>> regexp.sub(entitydefs[r'\1'], url)
>>
>> but it can't work, because the r'\1' has to be consumed directly by sub
>> and cannot be used independently like that (at least I think so). Any
>> ideas? Thanks in advance.
>>
>> -i
>>
>> ---------------------------------------_.)--
>> |  Lukas Holcik (xholcik1 at fi.muni.cz)  (\=)*
>> ----------------------------------------''--
> Lukas -
>
> Here is an example script from the upcoming 1.2 release of pyparsing.  It is
> certainly not a one-liner, but it should be fairly easy to follow.  (This
> example makes two passes over the input, but only to show two different
> output styles - the dictionary creation is done in a single pass.)
>
> Download pyparsing at http://pyparsing.sourceforge.net .
>
> -- Paul
>
> # URL extractor
> # Copyright 2004, Paul McGuire
> from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
>        Word,dblQuotedString,alphanums
> import urllib
> import pprint
>
> # Define the pyparsing grammar for a URL, that is:
> #    URLlink ::= <a href= URL>linkText</a>
> #    URL ::= doubleQuotedString | alphanumericWordPath
> # Note that whitespace may appear just about anywhere in the link.  Note also
> # that it is not necessary to explicitly show this in the pyparsing grammar;
> # by default, pyparsing skips over whitespace between tokens.
> linkOpenTag = (Literal("<") + "a" + "href" + "=").suppress() + \
>                ( dblQuotedString | Word(alphanums+"/") ) + \
>                Suppress(">")
> linkCloseTag = Literal("<") + "/" + CaselessLiteral("a") + ">"
> link = linkOpenTag + CharsNotIn("<") + linkCloseTag.suppress()
>
> # Go get some HTML with some links in it.
> serverListPage = urllib.urlopen( "http://www.yahoo.com" )
> htmlText = serverListPage.read()
> serverListPage.close()
>
> # scanString is a generator that loops through the input htmlText, and for
> # each match yields the tokens and start and end locations (for this
> # application, we are not interested in the start and end values).
> for toks,strt,end in link.scanString(htmlText):
>    print toks.asList()
>
> # Rerun scanString, but this time create a dict of text:URL key-value pairs.
> # Need to reverse the tokens returned by link, using a parse action.
> link.setParseAction( lambda st,loc,toks: [ toks[1], toks[0] ] )
>
> # Create dictionary from list comprehension, assembled from each pair of
> # tokens returned from a matched URL.
> pprint.pprint(
>    dict( [ toks for toks,strt,end in link.scanString(htmlText) ] )
>    )


