Python and Regular Expressions

Charles C. Sanders at DeleteThis.Bom.GOV.AU
Thu Apr 8 08:02:39 EDT 2010


"Nobody" <nobody at nowhere.com> wrote in message 
news:pan.2010.04.08.10.12.59.594000 at nowhere.com...
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
>
>>> Regular expressions != Parsers
>>
>> True, but lots of parsers *use* regular expressions in their
>> tokenizers.  In fact, if you have a pure Python parser, you can often
>> get huge performance gains by rearranging your code slightly so that
>> you can use regular expressions in your tokenizer, because that
>> effectively gives you access to a fast, specialized C library that is
>> built into practically every Python interpreter on the planet.
>
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
>
[snip]

Really? I am only a Python newbie, but what about the following...

import re

# (name, pattern) pairs; the alternatives to be tried.
rr = [
  ( "id",    '([a-zA-Z][a-zA-Z0-9]*)' ),
  ( "int",   '([+-]?[0-9]+)' ),
  ( "float", '([+-]?[0-9]+\.[0-9]*)' ),
  ( "float", '([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
]
tlist = [ t[0] for t in rr ]

# One combined, anchored pattern: group 1 is the whole alternation,
# groups 2-5 are the four alternatives.
pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
p = re.compile(pat)

ss = [ ' annc', '1234', 'abcd', '  234sz ', '-1.24e3', '5.' ]
for s in ss:
  m = p.match(s)
  if m:
    # Exactly one of groups 2-5 matched; its offset indexes tlist.
    ix = [ i-2 for i in range(2,6) if m.group(i) ]
    print "'"+s+"' matches and has type", tlist[ix[0]]
  else:
    print "'"+s+"' does not match"

output:
' annc' matches and has type id
'1234' matches and has type int
'abcd' matches and has type id
'  234sz ' does not match
'-1.24e3' matches and has type float
'5.' matches and has type float

This seems to me to match a (small) set of regular expressions and
indirectly return the index of the matched expression, without
doing a sequential loop over the regular expressions.
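
For what it is worth, the group-counting can be avoided by giving each
alternative a named group and asking the match object which group fired:
m.lastgroup gives the name of the matched alternative directly. A small
sketch of that variant (the names "float1"/"float2" are mine, since group
names within one pattern must be unique, so '5.' reports float1 and
'-1.24e3' reports float2 here):

import re

rr = [
  ( "id",     r'[a-zA-Z][a-zA-Z0-9]*' ),
  ( "int",    r'[+-]?[0-9]+' ),
  ( "float1", r'[+-]?[0-9]+\.[0-9]*' ),
  ( "float2", r'[+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+' )
]
# Wrap each sub-pattern in a named group; the outer group is
# non-capturing so that lastgroup reports the named alternative.
pat = '^ *(?:' + '|'.join([ '(?P<%s>%s)' % t for t in rr ]) + ') *$'
p = re.compile(pat)

for s in [ ' annc', '1234', 'abcd', '  234sz ', '-1.24e3', '5.' ]:
  m = p.match(s)
  if m:
    print "'"+s+"' matches and has type", m.lastgroup
  else:
    print "'"+s+"' does not match"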

Of course there is a loop over the results of the match to determine
which sub-expression matched, but a good regexp library (which
I presume Python has) should match the sub-expressions without
looping over them. The techniques to do this were well known in
the 1970s when the first versions of lex were written.
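
For an actual tokeniser the same idea is normally used without the
^...$ anchors: scan the whole input with finditer and let lastgroup
name each token. A rough sketch along those lines (the token names,
the merged float pattern and the whitespace handling are just
illustrative choices of mine):

import re

# One named group per token type.  Python's alternation takes the
# first alternative that matches, not the longest, so FLOAT must come
# before INT or '1.5' would come out as the integer '1'.  (The two
# float patterns above are merged here via an optional exponent.)
token_pat = re.compile('|'.join([
  r'(?P<FLOAT>[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?)',
  r'(?P<INT>[+-]?[0-9]+)',
  r'(?P<ID>[a-zA-Z][a-zA-Z0-9]*)',
  r'(?P<SKIP>[ \t]+)',
]))

def tokenize(text):
  tokens = []
  for m in token_pat.finditer(text):
    if m.lastgroup != 'SKIP':            # drop whitespace
      tokens.append((m.lastgroup, m.group()))
  return tokens

# Characters no pattern matches (the '=' and '+' below) are silently
# skipped by finditer; a real tokeniser would add patterns or an
# error rule for them.
print tokenize('x1 = -1.24e3 + 42')
# [('ID', 'x1'), ('FLOAT', '-1.24e3'), ('INT', '42')]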

Not that I would recommend tricks like this. The regular
expression would quickly get out of hand for any non-trivial
list of regular expressions to match.

Charles
