[Tutor] Regular expression

Sat Oct 9 04:22:29 CEST 2004

pyparsing is a fairly new parsing module for Python - 
http://pyparsing.sourceforge.net/. With pyparsing you build up a syntax out 
of simple building blocks. I've been wanting to try it out. I found it very 
easy to use. Here is what it looks like with kumar's original example:

 >>> s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
 >>> from pyparsing import *

I'll build the parser from the inside out, starting with simple tokens and 
combining them to recognize more and more complex parts of the complete 
string. First I create a parse token to represent the key portion of each 
entry. A keyToken is a run of any number of contiguous letters and numbers:
 >>> keyToken = Word(alphanums)
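
To see what Word(alphanums) accepts on its own, here is a small sketch. It uses parseString(), which the post only gets to further down, and the name key_token and the sample text are mine, just for illustration:

```python
from pyparsing import Word, alphanums

# A Word built from alphanums matches a run of letters and digits.
key_token = Word(alphanums)

# parseString() matches at the start of the text and, by default, stops
# at the first character that cannot be part of the Word ('|' here).
first = key_token.parseString("AKT|PI3K").asList()
print(first)
```

The '|' is not in alphanums, so the match ends right before it.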

The scanString() method of a parser element searches a string for anything 
that matches the element. It is a handy way to check that you are on the 
right track. scanString() is a generator function so you have to pass the 
result to list() if you want to print it out. keyToken matches all the 
words in the string:
 >>> list(keyToken.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15), 
((['BCL'], {}), 16, 19), ((['CDC42'], {}), 20, 25), ((['IKK'], {}), 26, 29),
  ((['RAC1'], {}), 30, 34), ((['RAL'], {}), 35, 38), ((['RALBP1'], {}), 39, 
45)]
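
Each item in that list is a (tokens, start, end) tuple. As a rough sketch (the variable names are mine), you can unpack the tuples and use the offsets to slice the matched text back out of the original string:

```python
from pyparsing import Word, alphanums

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
key_token = Word(alphanums)

# scanString() is a generator; each item is (tokens, start, end).
matches = [(tokens.asList(), start, end)
           for tokens, start, end in key_token.scanString(s)]

for toks, start, end in matches:
    # s[start:end] is the text that produced the tokens.
    print(toks, start, end, s[start:end])
```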

valueToken will match the pieces of the value lists. It's the same as 
keyToken, just a run of alphanumeric characters:
 >>> valueToken = Word(alphanums)

Now here is something more interesting - valueList matches one or more 
valueTokens separated by colons:
 >>> valueList = delimitedList(valueToken, delim=':')
 >>> list(valueList.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15), 
((['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {}), 16, 45)]

It matches the keys, too, but that's just because we haven't given it 
enough context yet. Notice how the list 'BCL', 'CDC42', etc. has been 
collected for us.
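
Here is delimitedList() tried on its own, away from the full string. This is only a sketch with made-up inputs and my own variable names. It shows that the colons are dropped from the result and that a single item with no delimiter still matches:

```python
from pyparsing import Word, alphanums, delimitedList

value_list = delimitedList(Word(alphanums), delim=':')

# Several items: the ':' delimiters do not appear in the tokens.
values = value_list.parseString('BCL:CDC42:IKK').asList()

# One item and no delimiter is still a valid list.
single = value_list.parseString('PI3K').asList()

print(values)
print(single)
```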

Now let's start putting the key and the valueList together. pyparsing lets 
you do this just by adding parser elements together. You include literal 
elements by adding in the strings that represent them:
 >>> entry = '[' + keyToken + '|' + valueList + ']'
 >>> list(entry.scanString(s))
[((['[', 'AKT', '|', 'PI3K', ']'], {}), 0, 10), ((['[', 'RHOA', '|', 'BCL', 
'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1', ']'], {}), 10, 46)]

That's pretty cool! entry separates the key and the valueList. We don't 
really want the literals in the token list, though. We can tell pyparsing 
to suppress them:
 >>> entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
 >>> list(entry.scanString(s))
[((['AKT', 'PI3K'], {}), 0, 10), ((['RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 
'RAL', 'RALBP1'], {}), 10, 46)]
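
The effect of Suppress is easier to see on a tiny input. A minimal sketch, with a shortened grammar of my own that only handles '[KEY]':

```python
from pyparsing import Suppress, Word, alphanums

key_token = Word(alphanums)

# Plain strings added to an element become literals that show up as tokens.
with_literals = ('[' + key_token + ']').parseString('[AKT]').asList()

# Suppress() still requires the brackets but leaves them out of the tokens.
without_literals = (Suppress('[') + key_token
                    + Suppress(']')).parseString('[AKT]').asList()

print(with_literals)
print(without_literals)
```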

That looks like we're getting somewhere. Let's add one more rule, to find 
multiple entries:
 >>> entryList = ZeroOrMore(entry)
 >>> list(entryList.scanString(s))
[((['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], 
{}), 0, 46)]

Now we've matched the whole string with a single parser element, but the 
list of tokens is all glommed together again! Not to worry...pyparsing lets 
you define actions associated with each parser element. We can add an 
action to the 'entry' element that pulls out the tokens we want and puts 
them in a dictionary:
 >>> dd = {}
 >>> def processEntry(s, loc, toks):
...     key, value = toks[0], toks[1:]
...     dd[key] = value
...
 >>> entry.setParseAction(processEntry)

processEntry() gets three arguments: the string being parsed, the location 
of the match, and the tokens that matched the associated rule. toks is 
actually a ParseResults object, but it acts a lot like a list. We can use 
the first token as a key and the rest of the list as the value for a 
dictionary.
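
If you want to watch the parse action being called, here is a sketch that records its arguments instead of filling a dictionary. The function name record_entry, the calls list and the shortened input string are mine, not part of the original example:

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, delimitedList

s = '[AKT|PI3K][RHOA|BCL:CDC42]'
calls = []

def record_entry(text, loc, toks):
    # text is the whole input string, loc is where this entry starts,
    # and toks holds the tokens matched by the entry.
    calls.append((loc, toks[0], toks[1:]))

entry = (Suppress('[') + Word(alphanums) + Suppress('|')
         + delimitedList(Word(alphanums), delim=':') + Suppress(']'))
entry.setParseAction(record_entry)

# The action runs once for every entry that matches.
ZeroOrMore(entry).parseString(s)
print(calls)
```

The action returns nothing, so the tokens pass through to the caller unchanged.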

Finally we use entryList.parseString() to activate the parser and apply the 
parse action:
 >>> entryList.parseString(s)
(['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {})
 >>> dd
{'RHOA': ['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], 'AKT': ['PI3K']}

dd is now the dictionary requested by the original poster :-)
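
A different route to the same dictionary, which I did not use above, is pyparsing's Group class. Wrapping entry in Group keeps each entry's tokens in their own nested list, so no global variable or parse action is needed. Treat this as a sketch with names of my own choosing:

```python
from pyparsing import (Group, Suppress, Word, ZeroOrMore, alphanums,
                       delimitedList)

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

entry = (Suppress('[') + Word(alphanums) + Suppress('|')
         + delimitedList(Word(alphanums), delim=':') + Suppress(']'))

# Group() nests the tokens of each entry instead of flattening them.
entry_list = ZeroOrMore(Group(entry))

grouped = entry_list.parseString(s).asList()
# First token of each group is the key, the rest are the values.
result = {item[0]: item[1:] for item in grouped}
print(result)
```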

Here is the whole program:

from pyparsing import *

s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

# Global variables to accumulate results
dd = {}

# Syntax definition
keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = delimitedList(valueToken, delim=':')
entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
entryList = ZeroOrMore(entry)

def processEntry(s, loc, toks):
    key, value = toks[0], toks[1:]
    dd[key] = value

entry.setParseAction(processEntry)

entryList.parseString(s)
print(dd)

By the way, delimitedList() is just a shortcut. We could have written this 
with the same result:
 >>> valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)
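
A quick way to convince yourself the two spellings agree is to run both on the same text. This is only a sketch, and the variable names are mine:

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, delimitedList

value_token = Word(alphanums)

shortcut = delimitedList(value_token, delim=':')
spelled_out = value_token + ZeroOrMore(Suppress(':') + value_token)

text = 'BCL:CDC42:IKK:RAC1:RAL:RALBP1'
a = shortcut.parseString(text).asList()
b = spelled_out.parseString(text).asList()

# Both forms drop the colons and return the same tokens.
print(a == b)
```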

Kent

At 07:42 PM 10/7/2004 -0700, Chad Crabtree wrote:
>Danny Yoo wrote:
> >We can parse this pretty informally, by using regular expressions. But
> >there's also a fairly systematic way we can attack this:  we can go all
> >out and use a token/parser approach.  Would you like to hear about that?
> >
>I don't know about kumar but I would love to hear about this because
>I've been reading about it but it has not sunk in yet.