[Tutor] Tokenizing Help

Thu Apr 23 16:24:16 CEST 2009

For the given test case, this pyparsing sample parses the data without
having to anticipate every possible 2-letter key:

from pyparsing import *

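integer = Word(nums)
# punctuation is matched but dropped from the results
DASH = Literal('-').suppress()
LT = Literal('<').suppress()
GT = Literal('>').suppress()

# an entry starts with a reference number such as <567>
entrynum = LT + integer + GT
# a key is exactly 2 capital letters at the start of a line, then a dash
keycode = Word(alphas.upper(), exact=2)
key = GoToColumn(1).suppress() + keycode + DASH
# a value runs up to the next key, the next entry number, or end of input
data = Group(key("key") + Empty() +
             SkipTo(key | entrynum | StringEnd())("value"))
entry = entrynum("refnum") + OneOrMore(data)("data")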

# test is the input text from the given test case
for e in entry.searchString(test):
    print(e.refnum)
    for dd in e.data:
        print(dd.key, ':', dd.value)
    print()

Prints:

['567']
['AU'] : Bibliographical Theory and Practice - Volume 1 - The AU  - Tag 
and its applications
['AB'] : Texts in Library Science

['568']
['AU'] : Bibliographical Theory and Practice - Volume 2 - The
['AB'] : Tag and its applications
['AB'] : Texts in Library Science

['569']
['AU'] : AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  -
['AU'] : AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU
['AB'] : AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  -
['AU'] : AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU
['ZZ'] : Somewhat nonsensical case
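
To try this without the original input, here is a self-contained sketch.
The sample string is only a guess at the input, reconstructed from the
first two records of the printed output; the names sample and records are
not from the original code. It assumes Python 3 and a pyparsing release
that provides GoToColumn and the camelCase method names.

```python
from pyparsing import (Empty, GoToColumn, Group, Literal, OneOrMore,
                       SkipTo, StringEnd, Word, alphas, nums)

# ASSUMPTION: hypothetical input rebuilt from the printed output;
# the real test string is not shown in this message.
sample = """\
<567>
AU  - Bibliographical Theory and Practice - Volume 1 - The AU  - Tag
and its applications
AB  - Texts in Library Science
<568>
AU  - Bibliographical Theory and Practice - Volume 2 - The
AB  - Tag and its applications
AB  - Texts in Library Science
"""

integer = Word(nums)
DASH = Literal('-').suppress()
LT = Literal('<').suppress()
GT = Literal('>').suppress()

entrynum = LT + integer + GT
keycode = Word(alphas.upper(), exact=2)
# GoToColumn(1) only lets a key match at the start of a line, so the
# "AU  - Tag" in the middle of the first title is not taken as a key.
key = GoToColumn(1).suppress() + keycode + DASH
data = Group(key("key") + Empty() +
             SkipTo(key | entrynum | StringEnd())("value"))
entry = entrynum("refnum") + OneOrMore(data)("data")

# Collect (refnum, [(key, value), ...]) for each entry found.
records = []
for e in entry.searchString(sample):
    pairs = [(dd.key[0], dd.value) for dd in e.data]
    records.append((e.refnum[0], pairs))

for refnum, pairs in records:
    print(refnum)
    for k, v in pairs:
        print(k, ':', v)
    print()
```

The point to notice is that the mid-line "AU  - Tag" stays inside the
first value, while the "AB" at the start of a line in record 568 starts a
new key, as in the output shown above.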

If you find that you also have to accept keycodes that consist of a capital
letter followed by a digit (like "B7"), change the keycode definition to:

keycode = Word(alphas.upper(), alphanums.upper(), exact=2)
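
A quick way to check what the modified keycode accepts is a sketch like
the following. The helper name accepts and the candidate strings are made
up for illustration; it assumes Python 3.

```python
from pyparsing import ParseException, Word, alphas, alphanums

# First character: a capital letter. Second character: a capital
# letter or a digit. exact=2 keeps the key at two characters.
keycode = Word(alphas.upper(), alphanums.upper(), exact=2)

def accepts(text):
    """Return True if text starts with a valid keycode (hypothetical helper)."""
    try:
        keycode.parseString(text)
        return True
    except ParseException:
        return False

for candidate in ("AU", "B7", "7B", "b7"):
    print(candidate, accepts(candidate))
```

"AU" and "B7" should be accepted, while "7B" and "b7" should be rejected,
because the first character must still be a capital letter.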

-- Paul