[Tutor] Regular expression
Kent Johnson
kent_johnson at skillsoft.com
Sat Oct 9 04:22:29 CEST 2004
pyparsing is a fairly new parsing module for Python -
http://pyparsing.sourceforge.net/. With pyparsing you build up a syntax out
of simple building blocks. I've been wanting to try it out. I found it very
easy to use. Here is what it looks like with kumar's original example:
>>> s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
>>> from pyparsing import *
I'll build the parser from the inside out, starting with simple tokens and
combining them to recognize more and more complex parts of the complete
string. First I create a parse token to represent the key portion of each
entry. A keyToken is a run of any number of contiguous letters and numbers:
>>> keyToken = Word(alphanums)
The scanString() method of a parser element searches a string for anything
that matches the element. It is a handy way to check that you are on the
right track. scanString() is a generator function so you have to pass the
result to list() if you want to print it out. keyToken matches all the
words in the string:
>>> list(keyToken.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
((['BCL'], {}), 16, 19), ((['CDC42'], {}), 20, 25), ((['IKK'], {}), 26, 29),
((['RAC1'], {}), 30, 34), ((['RAL'], {}), 35, 38), ((['RALBP1'], {}), 39,
45)]
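If the nested tuples above are hard to read, here is a small sketch (not part of the original session) that unpacks each scanString() result into the matched word and its start and end offsets. The name `matches` is just something I made up for illustration:

```python
from pyparsing import Word, alphanums

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
keyToken = Word(alphanums)

# scanString() lazily yields (tokens, start, end) triples, one per match.
# toks[0] is the single word matched by keyToken.
matches = [(toks[0], start, end) for toks, start, end in keyToken.scanString(s)]

for word, start, end in matches:
    # The offsets index into the original string, so s[start:end] is the word.
    print(word, start, end, s[start:end])
```

The start and end offsets are ordinary slice indices into s, which makes it easy to see exactly what each match covered.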
valueToken will match the pieces of the value lists. It's the same as
keyToken, just a run of alphanumeric characters:
>>> valueToken = Word(alphanums)
Now here is something more interesting - valueList matches one or more
valueTokens separated by colons:
>>> valueList = delimitedList(valueToken, delim=':')
>>> list(valueList.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
((['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {}), 16, 45)]
It matches the keys, too, but that's just because we haven't given it
enough context yet. Notice how the list 'BCL', 'CDC42', etc. has been
collected for us.
Now let's start putting the key and the valueList together. pyparsing lets
you do this just by adding parser elements together. You include literal
elements by adding in the strings that represent them:
>>> entry = '[' + keyToken + '|' + valueList + ']'
>>> list(entry.scanString(s))
[((['[', 'AKT', '|', 'PI3K', ']'], {}), 0, 10), ((['[', 'RHOA', '|', 'BCL',
'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1', ']'], {}), 10, 46)]
That's pretty cool! entry separates the key and the valueList. We don't
really want the literals in the token list, though. We can tell pyparsing
to suppress them:
>>> entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
>>> list(entry.scanString(s))
[((['AKT', 'PI3K'], {}), 0, 10), ((['RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1',
'RAL', 'RALBP1'], {}), 10, 46)]
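To make the difference between plain string literals and Suppress() concrete, here is a minimal sketch on a cut-down input. It assumes pyparsing's default behaviour of turning a bare string into a Literal; the names `implicit`, `explicit` and `quiet` are mine, not from the session above:

```python
from pyparsing import Literal, Suppress, Word, alphanums

keyToken = Word(alphanums)

# A bare string added to a parser element is treated as a Literal by default.
implicit = '[' + keyToken + ']'
explicit = Literal('[') + keyToken + Literal(']')

# Suppress still requires the brackets to match, but drops them from the tokens.
quiet = Suppress('[') + keyToken + Suppress(']')

with_brackets = implicit.parseString('[AKT]').asList()
spelled_out = explicit.parseString('[AKT]').asList()
key_only = quiet.parseString('[AKT]').asList()

print(with_brackets)
print(spelled_out)
print(key_only)
```

The first two parsers keep the brackets in the token list; the third keeps only the key.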
That looks like we're getting somewhere. Let's add one more rule, to find
multiple entries:
>>> entryList = ZeroOrMore(entry)
>>> list(entryList.scanString(s))
[((['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'],
{}), 0, 46)]
Now we've matched the whole string with a single parser element, but the
list of tokens is all glommed together again! Not to worry...pyparsing lets
you define actions associated with each parser element. We can add an
action to the 'entry' element that pulls out the tokens we want and puts
them in a dictionary:
>>> dd = {}
>>> def processEntry(s, loc, toks):
...     key, value = toks[0], toks[1:]
...     dd[key] = value
...
>>> entry.setParseAction(processEntry)
processEntry() gets three arguments. The third one contains the tokens that
match the associated rule. toks is actually a ParseResults object, but it
acts a lot like a list. We can use the first token as a key and the rest of
the list as the value for a dictionary.
Finally we use entryList.parseString() to activate the parser and apply the
parse action:
>>> entryList.parseString(s)
(['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {})
>>> dd
{'RHOA': ['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], 'AKT': ['PI3K']}
dd is now the dictionary requested by the original poster :-)
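One caveat worth sketching: because entryList is a ZeroOrMore, parseString() stops quietly at the first thing it can't match instead of complaining. Passing parseAll=True makes it raise ParseException when input is left over. The malformed string `bad` below is a made-up example, and I've spelled valueList out by hand rather than using delimitedList():

```python
from pyparsing import ParseException, Suppress, Word, ZeroOrMore, alphanums

keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)
entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
entryList = ZeroOrMore(entry)

bad = '[AKT|PI3K][RHOA|BCL:'  # hypothetical input, truncated mid-entry

# Default behaviour: parsing stops after the last complete entry, no error.
partial = entryList.parseString(bad).asList()

# With parseAll=True the unparsed remainder is reported as a ParseException.
try:
    entryList.parseString(bad, parseAll=True)
    rejected = False
except ParseException:
    rejected = True

print(partial, rejected)
```

Whether silent truncation is acceptable depends on where the data comes from; for hand-edited input I would probably want the exception.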
Here is the whole program:
from pyparsing import *
s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
# Global variables to accumulate results
dd = {}
# Syntax definition
keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = delimitedList(valueToken, delim=':')
entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
entryList = ZeroOrMore(entry)
def processEntry(s, loc, toks):
    key, value = toks[0], toks[1:]
    dd[key] = value
entry.setParseAction(processEntry)
entryList.parseString(s)
print(dd)
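If the global dictionary bothers you, a possible variation (a sketch, not what I did above) is to wrap each entry in pyparsing's Group so the tokens for one entry stay together, and then build the dictionary from the parse results afterwards. The name `result` is just for illustration:

```python
from pyparsing import Group, Suppress, Word, ZeroOrMore, alphanums

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)

# Group keeps each entry's tokens in their own sub-list instead of
# flattening everything into one long token list.
entry = Group(Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']'))
entryList = ZeroOrMore(entry)

# asList() gives plain nested lists: [[key, value, ...], [key, value, ...]]
result = {}
for item in entryList.parseString(s).asList():
    result[item[0]] = item[1:]

print(result)
```

This trades the parse action for a short loop after parsing, which some people find easier to follow.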
By the way, delimitedList() is just a shortcut; we could have written this
with the same result:
>>> valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)
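A quick way to convince yourself the two spellings agree is to run both on the same text. This is only a sketch; the sample string `text` and the names `shortcut` and `longhand` are made up for the check:

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, delimitedList

valueToken = Word(alphanums)

shortcut = delimitedList(valueToken, delim=':')
longhand = valueToken + ZeroOrMore(Suppress(':') + valueToken)

text = 'BCL:CDC42:IKK'  # sample input for the comparison

a = shortcut.parseString(text).asList()
b = longhand.parseString(text).asList()

# Both forms drop the colons and return just the values.
print(a)
print(b)
```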
Kent
At 07:42 PM 10/7/2004 -0700, Chad Crabtree wrote:
>Danny Yoo wrote:
> >We can parse this pretty informally, by using regular expressions. But
> >there's also a fairly systematic way we can attack this: we can go all
> >out and use a token/parser approach. Would you like to hear about that?
> >
>I don't know about kumar but I would love to hear about this because
>I've been reading about it but it has not sunk in yet.