A nice way to use regex for complicate parsing

Paul McGuire ptmcg at austin.rr.com
Thu Mar 29 11:33:50 EDT 2007


On Mar 29, 9:42 am, Shane Geiger <sgei... at ncee.net> wrote:
> It would be worth learning pyparsing to do this.
>

Thanks to Shane and Steven for the ref to pyparsing.  I also was
struck by this post, thinking "this is pyparsing written in re's and
dicts".

The approach you are taking is *very* much like the thought process I
went through when first implementing pyparsing.  I wanted to easily
compose expressions from other expressions.  In your case, you are
string interpolating using a cumulative dict of prior expressions.
Pyparsing uses various subclasses of the ParserElement class, with
operator definitions for alternation ("|" for first match or "^" for
longest match), composition ("+"), and negation ("~").  Pyparsing
also uses its own extended results construct, ParseResults, which
supports named results fields, accessible using list indices, dict
keys, or attribute names.
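
As a small illustration of these operators and of the three ways to read a named result, here is a sketch. The expressions and input strings are made up for this example and are not part of the grammar in this post.

```python
from pyparsing import (Combine, Literal, ParseException, Suppress, Word,
                       alphas, nums)

integer = Word(nums)
real = Combine(Word(nums) + "." + Word(nums))

# "|" builds a MatchFirst: the first alternative that matches wins
first_match = (integer | real).parseString("3.14").asList()    # ['3']

# "^" builds an Or: all alternatives are tried, the longest match wins
longest_match = (integer ^ real).parseString("3.14").asList()  # ['3.14']

# "~" builds a NotAny: a lookahead that fails if the expression matches
not_end = ~Literal("end") + Word(alphas)
try:
    not_end.parseString("end")
    end_rejected = False
except ParseException:
    end_rejected = True

# "+" builds an And; setResultsName attaches names to the matched tokens
pair = (Word(alphas).setResultsName("key")
        + Suppress("=")
        + Word(nums).setResultsName("value"))
result = pair.parseString("size=1234")

by_index = result[1]       # list index
by_key = result["value"]   # dict key
by_attr = result.value     # attribute name
```

All three lookups return the same token, so the choice is mostly a matter of which reads best at the call site.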

Here is the pyparsing treatment of your example (I may not have gotten
every part correct, but my point is more the similarity of our
approaches).  Note the access to the smtp parameters via the Dict
transformer.

-- Paul


from pyparsing import *

# <dotnum> ::= <snum> "." <snum> "." <snum> "." <snum>
intgr = Word(nums)
dotnum = Combine(intgr + "." + intgr + "." + intgr + "." + intgr)

# <dot-string> ::= <string> | <string> "." <dot-string>
string_ = Word(alphanums)
dotstring = Combine(delimitedList(string_,"."))

# <domain> ::=  <element> | <element> "." <domain>
domain = dotnum | dotstring

# <q> ::= any one of the 128 ASCII characters except <CR>, <LF>, quote
#         ("), or backslash (\)
# <x> ::= any one of the 128 ASCII characters (no exceptions)
# <qtext> ::=  "\" <x> | "\" <x> <qtext> | <q> | <q> <qtext>
# <quoted-string> ::=  """ <qtext> """
quotedString = dblQuotedString  # <- just use pre-defined expr from pyparsing

# <local-part> ::= <dot-string> | <quoted-string>
localpart = (dotstring | quotedString).setResultsName("localpart")

# <mailbox> ::= <local-part> "@" <domain>
mailbox = Combine(localpart + "@" + domain).setResultsName("mailbox")

# <path> ::= "<" [ <a-d-l> ":" ] <mailbox> ">"
# also accept address without <>
path = "<" + mailbox + ">" | mailbox

# esmtp-keyword    ::= (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
esmtpkeyword = Word(alphanums,alphanums+"-")

# ; syntax and values depend on esmtp-keyword
# esmtp-value      ::= 1*<any CHAR excluding "=", SP, and all
#                      control characters (US ASCII 0-31 inclusive)>
esmtpvalue = Regex(r'[^= \t\r\n\f\v]*')

# esmtp-parameter  ::= esmtp-keyword ["=" esmtp-value]
esmtpparameters = Dict(
    ZeroOrMore( Group(esmtpkeyword + Suppress("=") + esmtpvalue) ) )

# esmtp-cmd        ::= inner-esmtp-cmd [SP esmtp-parameters] CR LF
esmtp_addr = path + \
                Optional(esmtpparameters,default=[])\
                .setResultsName("parameters")

# 'tests' is the list of sample strings from the original post
for t in tests:
    # strip a leading SMTP command keyword, if present
    for keyword in [ 'MAIL FROM:', 'RCPT TO:' ]:
        keylen=len(keyword)
        if t[:keylen].upper()==keyword:
            t=t[keylen:]
            break

    try:
        match = esmtp_addr.parseString(t)
        print 'MATCH'
        print match.dump()
        # some sample code to access elements of the parameters "dict"
        if "SIZE" in match.parameters:
            print "SIZE is", match.parameters.SIZE
        print
    except ParseException,pe:
        print 'DONT match', t

prints:
MATCH
['<', ['johnsmith@addresscom'], '>']
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: []

MATCH
[['johnsmith@addresscom']]
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: []

MATCH
['<', ['johnsmith@addresscom'], '>', ['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
- OTHER: foo@bar.com
- SIZE: 1234
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: [['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
  - OTHER: foo@bar.com
  - SIZE: 1234
SIZE is 1234

MATCH
[['johnsmith@addresscom'], ['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
- OTHER: foo@bar.com
- SIZE: 1234
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: [['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
  - OTHER: foo@bar.com
  - SIZE: 1234
SIZE is 1234

MATCH
['<', ['"tom@is.a> legal=email"@addresscom'], '>']
- mailbox: ['"tom@is.a> legal=email"@addresscom']
  - localpart: "tom@is.a> legal=email"
- parameters: []

MATCH
[['"tom@is.a> legal=email"@addresscom']]
- mailbox: ['"tom@is.a> legal=email"@addresscom']
  - localpart: "tom@is.a> legal=email"
- parameters: []

MATCH
['<', ['"tom@is.a> legal=email"@addresscom'], '>', ['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
- OTHER: foo@bar.com
- SIZE: 1234
- mailbox: ['"tom@is.a> legal=email"@addresscom']
  - localpart: "tom@is.a> legal=email"
- parameters: [['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
  - OTHER: foo@bar.com
  - SIZE: 1234
SIZE is 1234

MATCH
[['"tom@is.a> legal=email"@addresscom'], ['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
- OTHER: foo@bar.com
- SIZE: 1234
- mailbox: ['"tom@is.a> legal=email"@addresscom']
  - localpart: "tom@is.a> legal=email"
- parameters: [['SIZE', '1234'], ['OTHER', 'foo@bar.com']]
  - OTHER: foo@bar.com
  - SIZE: 1234
SIZE is 1234
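
The Dict access used in the test loop can also be tried in isolation. The following is a minimal sketch, not the full grammar: the input string is made up for illustration, and the value pattern uses "+" rather than "*" on the assumption that an empty value should be rejected, as the "1*" in the grammar comment suggests.

```python
from pyparsing import (Dict, Group, Regex, Suppress, Word, ZeroOrMore,
                       alphanums)

keyword = Word(alphanums, alphanums + "-")
# assumption: "+" instead of "*" so that an empty value is not accepted
value = Regex(r"[^= \t\r\n\f\v]+")

# Each Group is one [keyword, value] pair; Dict makes the first token of
# every group a key that maps to the rest of that group.
parameters = Dict(ZeroOrMore(Group(keyword + Suppress("=") + value)))

# made-up input for illustration
params = parameters.parseString("SIZE=1234 OTHER=foo@bar.com")

as_list = params.asList()   # the pairs are still available as nested lists
has_size = "SIZE" in params # membership test on the keys
size = params.SIZE          # attribute-style access
other = params["OTHER"]     # dict-style access
```

The parsed pairs remain in their original order in the list view, while the keys give direct access to any one parameter without scanning.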



