[Tutor] token parser article

Chad Crabtree flaxeater at yahoo.com
Sun Oct 10 05:15:08 CEST 2004

pyparsing is a fairly new parsing module for Python:
http://pyparsing.sourceforge.net/. With pyparsing you build up a syntax out
of simple building blocks. I've been wanting to try it out, and I found it
very easy to use. Here is what it looks like with kumar's original data:

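>>> from pyparsing import *
>>> # s is reconstructed from the match offsets printed in this session
>>> s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'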

I'll build the parser from the inside out, starting with simple tokens and
combining them to recognize more and more complex parts of the string.
First I create a parse token to represent the key portion of an entry. A
keyToken is a run of any number of contiguous letters and numbers:
>>> keyToken = Word(alphanums)

The scanString() method of a parser element searches a string for text
that matches the element. It is a handy way to check that you are on the
right track. scanString() is a generator function, so you have to pass the
result to list() if you want to print it out. keyToken matches all the
words in the string:
>>> list(keyToken.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
 ((['BCL'], {}), 16, 19), ((['CDC42'], {}), 20, 25), ((['IKK'], {}), 26, 29),
 ((['RAC1'], {}), 30, 34), ((['RAL'], {}), 35, 38), ((['RALBP1'], {}), 39, 45)]
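
As a standalone check of the scanString() step, here is a minimal sketch. The input string is an assumption, reconstructed from the match offsets shown in this session, and asList() is used only to print plain lists instead of ParseResults objects.

```python
from pyparsing import Word, alphanums

# Assumed sample input, reconstructed from the offsets in the session output.
s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

keyToken = Word(alphanums)

# scanString() yields one (tokens, start, end) tuple per match.
matches = [(toks.asList(), start, end)
           for toks, start, end in keyToken.scanString(s)]

for tokens, start, end in matches:
    # The slice s[start:end] is the exact text that was matched.
    print(tokens, start, end, s[start:end])
```

Slicing the input with the reported offsets is a quick way to confirm that each match covers the text you expect.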

valueToken will match the pieces of the value lists. It's the same as
keyToken, just a run of alphanumeric characters:
>>> valueToken = Word(alphanums)

Now here is something more interesting: valueList matches one or more
valueTokens separated by colons:
>>> valueList = delimitedList(valueToken, delim=':')
>>> list(valueList.scanString(s))
[((['AKT'], {}), 1, 4), ((['PI3K'], {}), 5, 9), ((['RHOA'], {}), 11, 15),
 ((['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {}), 16, 45)]

It matches the keys, too, but that's just because we haven't given it
enough context yet. Notice how the list 'BCL', 'CDC42', etc. has been
collected for us.
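
To see the "one or more" behaviour in isolation, the sketch below runs valueList on two short strings of my own choosing (they are illustrative inputs, not part of kumar's data). Note that recent pyparsing releases also offer a DelimitedList class; delimitedList is used here to match the session.

```python
from pyparsing import Word, alphanums, delimitedList

valueToken = Word(alphanums)
valueList = delimitedList(valueToken, delim=':')

# Several values: the colons are consumed but not kept as tokens.
many = valueList.parseString('BCL:CDC42:IKK').asList()

# A single value with no delimiter at all is still a valid list.
single = valueList.parseString('PI3K').asList()

print(many)
print(single)
```

This is also why valueList matched the keys earlier: a lone word is a perfectly good one-item list.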

Now let's start putting the key and the valueList together. In pyparsing
you do this just by adding parser elements together. You include literal
elements by adding in the strings that represent them:
>>> entry = '[' + keyToken + '|' + valueList + ']'
>>> list(entry.scanString(s))
[((['[', 'AKT', '|', 'PI3K', ']'], {}), 0, 10), ((['[', 'RHOA', '|', 'BCL',
'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1', ']'], {}), 10, 46)]

That's pretty cool! entry separates the key and the valueList. We don't
really want the literals in the token list, though. We can tell pyparsing
to suppress them:
>>> entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
>>> list(entry.scanString(s))
[((['AKT', 'PI3K'], {}), 0, 10), ((['RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1',
'RAL', 'RALBP1'], {}), 10, 46)]
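
The difference between plain string literals and Suppress() can be compared side by side. This is a small sketch; the names noisy and quiet are mine, and the input string is reconstructed from the offsets shown in the session.

```python
from pyparsing import Suppress, Word, alphanums, delimitedList

# Assumed sample input, reconstructed from the session output.
s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

keyToken = Word(alphanums)
valueList = delimitedList(Word(alphanums), delim=':')

# Plain strings are turned into literal elements and appear in the tokens.
noisy = '[' + keyToken + '|' + valueList + ']'

# Suppress() still requires the literal to match but drops it from the tokens.
quiet = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')

noisy_tokens = [toks.asList() for toks, _, _ in noisy.scanString(s)]
quiet_tokens = [toks.asList() for toks, _, _ in quiet.scanString(s)]

print(noisy_tokens[0])
print(quiet_tokens[0])
```

Both grammars accept exactly the same text; only the reported tokens differ.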

That looks like we're getting somewhere. Let's add one more rule, to
match multiple entries:
>>> entryList = ZeroOrMore(entry)
>>> list(entryList.scanString(s))
[((['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'],
{}), 0, 46)]

Now we've matched the whole string with a single parser element, but the
list of tokens is all glommed together again! Not to worry: pyparsing lets
you define actions associated with each parser element. We can add an
action to the 'entry' element that pulls out the tokens we want and puts
them in a dictionary:
>>> dd = {}
>>> def processEntry(s, loc, toks):
...     key, value = toks[0], toks[1:]
...     dd[key] = value
...
>>> entry.setParseAction(processEntry)

processEntry() gets three arguments. The third one contains the tokens that
match the associated rule. toks is actually a ParseResults object, but it
acts a lot like a list. We can use the first token as a key and the rest of
the list as the value for a dictionary.

Finally we use entryList.parseString() to activate the parser and apply the
parse action:
>>> entryList.parseString(s)
(['AKT', 'PI3K', 'RHOA', 'BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], {})
>>> dd
{'RHOA': ['BCL', 'CDC42', 'IKK', 'RAC1', 'RAL', 'RALBP1'], 'AKT': ['PI3K']}

dd is now the dictionary requested by the original poster :-)
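
If you would rather not keep the results in a module-level dictionary, the same parse-action idea can be wrapped in a function. This is only a sketch: parse_entries and store are hypothetical names, and the call to asList() is my addition so that the stored values are plain lists regardless of how a given pyparsing version slices its results object.

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, delimitedList


def parse_entries(text):
    """Return a dict mapping each key to its list of values."""
    result = {}

    def store(s, loc, toks):
        # Copy the tokens to a plain list, then split key from values.
        items = toks.asList()
        result[items[0]] = items[1:]

    keyToken = Word(alphanums)
    valueList = delimitedList(Word(alphanums), delim=':')
    entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
    entry.setParseAction(store)

    # The parse action runs once for every entry that is matched.
    ZeroOrMore(entry).parseString(text)
    return result


# Assumed sample input, reconstructed from the session output.
dd = parse_entries('[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]')
print(dd)
```

Building the grammar inside the function means each call gets its own dictionary, so repeated calls do not mix results.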

Here is the whole program:

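from pyparsing import *

# Input data: kumar's original string (reconstructed from the offsets above)
s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

# Global variables to accumulate results
dd = {}

# Syntax definition
keyToken = Word(alphanums)
valueToken = Word(alphanums)
valueList = delimitedList(valueToken, delim=':')
entry = Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']')
entryList = ZeroOrMore(entry)

# Parse action: store the first token as the key and the rest as the value
def processEntry(s, loc, toks):
    key, value = toks[0], toks[1:]
    dd[key] = value

entry.setParseAction(processEntry)

# Run the parser; the parse action fills dd as each entry is matched
entryList.parseString(s)

print(dd)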
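
For comparison, here is a sketch of a different technique that avoids the parse action altogether: wrapping each entry in Group() keeps its tokens in their own sublist, and the dictionary is built after parsing. This is not what the program above does; it is an alternative, and the input string is again the one reconstructed from the session.

```python
from pyparsing import Group, Suppress, Word, ZeroOrMore, alphanums, delimitedList

# Assumed sample input, reconstructed from the session output.
s = '[AKT|PI3K][RHOA|BCL:CDC42:IKK:RAC1:RAL:RALBP1]'

keyToken = Word(alphanums)
valueList = delimitedList(Word(alphanums), delim=':')

# Group() nests each entry's tokens instead of flattening them together.
entry = Group(Suppress('[') + keyToken + Suppress('|') + valueList + Suppress(']'))
entryList = ZeroOrMore(entry)

# asList() converts the nested results into plain nested lists.
parsed = entryList.parseString(s).asList()

# Each sublist is [key, value, value, ...].
dd = {item[0]: item[1:] for item in parsed}
print(dd)
```

The trade-off is that the grammar stays free of side effects, at the cost of one extra pass over the parsed results.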

By the way, delimitedList() is just a shortcut; we could have written this
with the same result:
>>> valueList = valueToken + ZeroOrMore(Suppress(':') + valueToken)
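
A quick way to convince yourself that the two forms agree is to run both on the same text. In this sketch the names shortcut and longhand are mine, and the test string is just the value portion of the second entry.

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, delimitedList

valueToken = Word(alphanums)

# The helper and the hand-written grammar it stands for.
shortcut = delimitedList(valueToken, delim=':')
longhand = valueToken + ZeroOrMore(Suppress(':') + valueToken)

text = 'BCL:CDC42:IKK:RAC1:RAL:RALBP1'
from_shortcut = shortcut.parseString(text).asList()
from_longhand = longhand.parseString(text).asList()

print(from_shortcut == from_longhand)
```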


At 07:42 PM 10/7/2004 -0700, Chad Crabtree wrote:
>Danny Yoo wrote:
> >We can parse this pretty informally, by using regular expressions.
> >there's also a fairly systematic way we can attack this:  we can
> >out and use a token/parser approach.  Would you like to hear about
> >
> >
>I don't know about kumar but I would love to hear about this because
>I've been reading about it but it has not sunk in yet.
>Tutor maillist  -  Tutor at python.org
