[Tutor] token parser article

Chad Crabtree flaxeater at yahoo.com
Sun Oct 10 05:15:08 CEST 2004


pyparsing (http://pyparsing.sourceforge.net/) is a fairly new parsing
module for Python. With pyparsing you build up a syntax out of simple
building blocks. I've been wanting to try it out, and I found it very
easy to use. Here is what it looks like with kumar's original example:

>>> s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
>>> from pyparsing import *

I'll build the parser from the inside out, starting with simple tokens
and combining them to recognize more and more complex parts of the
complete string. First I create a parse token to represent the key
portion of each entry. A keyToken is a run of any number of contiguous
letters and numbers:
>>> keyToken = Word(alphanums)

The scanString() method of a parser element searches a string for
anything that matches the element. It is a handy way to check that you
are on the right track. scanString() is a generator function, so you
have to pass the result to list() if you want to print it out. keyToken
matches all the words in the string:
>>> list(keyToken.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
 ((['BCL'], {}), 16, 19), ((['CDC42'], {}), 20, 25), ((['IKK'], {}), 26, 29),
 ((['RAC1'], {}), 30, 34), ((['RAL'], {}), 35, 38), ((['RALBP1'], {}), 39, 45)]
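
As a small aside (not part of the original session), each item that
scanString() yields is a (tokens, start, end) triple, and the two offsets
index into the original string. A minimal sketch, assuming the same s and
keyToken as above; the name matches is only for illustration:

```python
from pyparsing import Word, alphanums

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
keyToken = Word(alphanums)

# Unpack each (tokens, start, end) triple; tokens[0] is the matched word.
matches = [(tokens[0], start, end)
           for tokens, start, end in keyToken.scanString(s)]

for word, start, end in matches:
    # Slicing the input with the reported offsets recovers the match.
    assert s[start:end] == word
    print(word, start, end)
```

This is handy when you care about where a match was found as well as
what was matched.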

valueToken will match the pieces of the value lists. It's the same as
keyToken, just a run of alphanumeric characters:
>>> valueToken = Word(alphanums)

Now here is something more interesting - valueList matches one or more
valueTokens separated by colons:
>>> valueList = delimitedList(valueToken, delim=':')
>>> list(valueList.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
 ((['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {}), 16, 45)]

It matches the keys, too, but that's just because we haven't given it
enough context yet. Notice how the list 'BCL', 'CDC42', etc. has been
collected for us.
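
To see why the keys match as well, it may help to try valueList on its
own. A list of one item is still a valid delimited list, so a lone word
with no colon matches too. This is just a sketch; the names single and
several are mine:

```python
from pyparsing import Word, alphanums, delimitedList

valueList = delimitedList(Word(alphanums), delim=':')

# A single word is a delimited list of length one.
single = valueList.parseString('PI3K').asList()

# Several words separated by colons are collected into one token list;
# the colons themselves are dropped.
several = valueList.parseString('BCL:CDC42:IKK').asList()

print(single)
print(several)
```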

Now let's start putting the key and the valueList together. pyparsing
lets you do this just by adding parser elements together. You include
literal elements by adding in the strings that represent them:
>>> entry = '[' + keyToken + '|' + valueList + ']'
>>> list(entry.scanString(s))
[((['[', 'AKT', '|', 'PI3K', ']'], {}), 0, 10),
 ((['[', 'RHOA', '|', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1', ']'],
   {}), 10, 46)]

That's pretty cool! entry separates the key and the valueList. We don't
really want the literals in the token list, though. We can tell
pyparsing to suppress them:
>>> entry = (Suppress('[') + keyToken + Suppress('|') + valueList +
...          Suppress(']'))
>>> list(entry.scanString(s))
[((['AKT', 'PI3K'], {}), 0, 10),
 ((['RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {}), 10, 46)]

That looks like we're getting somewhere. Let's add one more rule, to
find multiple entries:
>>> entryList = ZeroOrMore(entry)
>>> list(entryList.scanString(s))
[((['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'],
   {}), 0, 46)]

Now we've matched the whole string with a single parser element, but
the list of tokens is all glommed together again! Not to worry:
pyparsing lets you define actions associated with each parser element.
We can add an action to the 'entry' element that pulls out the tokens
we want and puts them in a dictionary:
>>> dd = {}
>>> def processEntry(s, loc, toks):
...     key, value = toks[0], toks[1:]
...     dd[key] = value
...
>>> entry.setParseAction(processEntry)

processEntry() gets three arguments. The third one contains the tokens
that match the associated rule. toks is actually a ParseResults object,
but it acts a lot like a list. We can use the first token as a key and
the rest of the list as the value for a dictionary.
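
For the curious, the first two arguments of a parse action are the string
being parsed and the location where the match starts. Here is a rough
sketch that records the location and tokens for each entry instead of
building the dictionary. recordEntry and seen are hypothetical names used
only for this illustration:

```python
from pyparsing import Word, alphanums, Suppress, ZeroOrMore, delimitedList

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
entry = (Suppress('[') + Word(alphanums) + Suppress('|') +
         delimitedList(Word(alphanums), delim=':') + Suppress(']'))

seen = []  # hypothetical accumulator, one item per matched entry

def recordEntry(text, loc, toks):
    # text: the full input string; loc: where this entry starts;
    # toks: the tokens matched by this entry (brackets and bar suppressed).
    seen.append((loc, toks.asList()))

entry.setParseAction(recordEntry)
ZeroOrMore(entry).parseString(s)

for loc, tokens in seen:
    print(loc, tokens)
```

The action runs once per entry, which is why hooking it onto 'entry'
rather than 'entryList' keeps each key with its own values.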

Finally we use entryList.parseString() to activate the parser and apply
the parse action:
>>> entryList.parseString(s)
(['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {})
>>> dd
{'RHOA': ['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], 'AKT': ['PI3K']}

dd is now the dictionary requested by the original poster :-)
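
If the global dd bothers you, one possible variation is to wrap the whole
thing in a function that builds and returns its own dictionary. This is
only a sketch of that idea, not something from the original thread, and
parseEntries is a name I made up:

```python
from pyparsing import Word, alphanums, Suppress, ZeroOrMore, delimitedList

def parseEntries(text):
    """Return a dict mapping each key to its list of values."""
    result = {}
    token = Word(alphanums)
    entry = (Suppress('[') + token + Suppress('|') +
             delimitedList(token, delim=':') + Suppress(']'))

    def processEntry(s, loc, toks):
        # First token is the key; the rest are the values.
        result[toks[0]] = list(toks[1:])

    entry.setParseAction(processEntry)
    ZeroOrMore(entry).parseString(text)
    return result

s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'
print(parseEntries(s))
```

Each call gets a fresh dictionary, so parsing a second string does not
pick up leftovers from the first.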

Here is the whole program:

from pyparsing import *

s='[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

# Global variables to accumulate results
dd = {}

# Syntax definition
keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = delimitedList(valueToken, delim=':')
entry = (Suppress('[') + keyToken + Suppress('|') + valueList +
         Suppress(']'))
entryList = ZeroOrMore(entry)

def processEntry(s, loc, toks):
    key, value = toks[0], toks[1:]
    dd[key] = value

entry.setParseAction(processEntry)

entryList.parseString(s)
print(dd)

By the way, delimitedList() is just a shortcut. We could have written
this with the same result:
>>> valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)
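
A quick way to convince yourself the two spellings agree is to run both
on the same input and compare the tokens. A minimal sketch; the names
shortcut, longhand and sample are only for illustration:

```python
from pyparsing import Word, alphanums, Suppress, ZeroOrMore, delimitedList

valueToken = Word(alphanums)
shortcut = delimitedList(valueToken, delim=':')
longhand = valueToken + ZeroOrMore(Suppress(':') + valueToken)

sample = 'BCL:CDC42:IKK'
a = shortcut.parseString(sample).asList()
b = longhand.parseString(sample).asList()

# Both forms drop the colons and return the same list of words.
print(a == b, a)
```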

Kent

At 07:42 PM 10/7/2004 -0700, Chad Crabtree wrote:
>Danny Yoo wrote:
> >We can parse this pretty informally, by using regular expressions. But
> >there's also a fairly systematic way we can attack this:  we can go all
> >out and use a token/parser approach.  Would you like to hear about that?
> >
> >
>I don't know about kumar but I would love to hear about this because
>I've been reading about it but it has not sunk in yet.
>
>_______________________________________________
>Tutor maillist  -  Tutor at python.org
>http://mail.python.org/mailman/listinfo/tutor
