Python and Regular Expressions
C.Sanders at DeleteThis.Bom.GOV.AU
Thu Apr 8 14:02:39 CEST 2010
"Nobody" <nobody at nowhere.com> wrote in message
news:pan.2010.04.08.10.12.59.594000 at nowhere.com...
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
>>> Regular expressions != Parsers
>> True, but lots of parsers *use* regular expressions in their
>> tokenizers. In fact, if you have a pure Python parser, you can often
>> get huge performance gains by rearranging your code slightly so that
>> you can use regular expressions in your tokenizer, because that
>> effectively gives you access to a fast, specialized C library that is
>> built into practically every Python interpreter on the planet.
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
I am only a python newbie, but what about ...
import re

rr = [
    ( "id",    r'([a-zA-Z][a-zA-Z0-9]*)' ),
    ( "int",   r'([+-]?[0-9]+)' ),
    ( "float", r'([+-]?[0-9]+\.[0-9]*)' ),
    ( "float", r'([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
]
tlist = [ t[0] for t in rr ]
pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
p = re.compile(pat)
ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
for s in ss:
    m = p.match(s)
    if m:
        # groups 2..5 are the four alternatives; exactly one is non-None
        ix = [ i-2 for i in range(2,6) if m.group(i) ]
        print "'"+s+"' matches and has type", tlist[ix[0]]
    else:
        print "'"+s+"' does not match"
' annc' matches and has type id
'1234' matches and has type int
'abcd' matches and has type id
' 234sz ' does not match
'-1.24e3' matches and has type float
'5.' matches and has type float
This seems to me to match a (small) set of regular expressions and
indirectly return the index of the matched expression, without
doing a sequential loop over the regular expressions.
Of course there is a loop over the results of the match to determine
which sub-expression matched, but a good regexp library (which
I presume Python has) should match the sub-expressions without
looping over them. The techniques to do this were well known in
the 1970s when the first versions of lex were written.
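In fact Python's re module can report which alternative matched without even that Python-side scan over the groups: give each alternative a name with (?P<name>...) and read MatchObject.lastgroup, which names the group that matched. A minimal sketch of the same classifier (group names must be unique, so the exponent form gets its own name, float_exp, rather than sharing "float"):

```python
import re

# Same token classes as the rr table above, but each alternative
# is a named group; lastgroup identifies the winner directly.
pat = re.compile(
    r'^ *(?:'
    r'(?P<id>[a-zA-Z][a-zA-Z0-9]*)'
    r'|(?P<float_exp>[+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)'
    r'|(?P<float>[+-]?[0-9]+\.[0-9]*)'
    r'|(?P<int>[+-]?[0-9]+)'
    r') *$')

def token_type(s):
    m = pat.match(s)
    # lastgroup is the name of the matched alternative; no loop needed
    return m.lastgroup if m else None
```
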
Not that I would recommend tricks like this. The regular
expression would quickly get out of hand for any non-trivial
list of regular expressions to match.
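For an actual tokenizer of the kind Patrick describes, the same idea scales to a whole input string: join the named alternatives once, then repeatedly match at the current position and let lastgroup tell you the token type. The token names and the tokenize helper below are my own illustration, not anything posted in the thread:

```python
import re

# Illustrative token table; FLOAT must precede INT so that the
# anchored-at-position match prefers the longer numeric form.
token_spec = [
    ('FLOAT', r'[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?'),
    ('INT',   r'[+-]?[0-9]+'),
    ('ID',    r'[a-zA-Z][a-zA-Z0-9]*'),
    ('SKIP',  r'[ \t]+'),
]
tok_re = re.compile('|'.join('(?P<%s>%s)' % (n, p) for n, p in token_spec))

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = tok_re.match(text, pos)      # match at pos, not search
        if not m:
            raise ValueError('bad input at position %d' % pos)
        if m.lastgroup != 'SKIP':        # drop whitespace tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

The single compiled alternation does the dispatch in C; the Python loop only advances the position, which is where the performance gain over a per-pattern sequential scan comes from.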
More information about the Python-list mailing list