Awkward code using multiple regexp searches
Andrew Dalke
adalke at mindspring.com
Fri Sep 10 03:01:07 EDT 2004
Topher Cawlfield wrote:
> But a few times already I've found myself
> writing the following bit of awkward code when parsing text files. Can
> anyone suggest a more elegant solution?
>
> rexp1 = re.compile(r'blah(dee)blah')
> rexp2 = re.compile(r'hum(dum)')
> for line in inFile:
>     reslt = rexp1.search(line)
>     if reslt:
>         something = reslt.group(1)
>     else:
>         reslt = rexp2.search(line)
>         if reslt:
>             somethingElse = reslt.group(1)
I usually solve this particular case with a 'continue':
for line in inFile:
    reslt = rexp1.search(line)
    if reslt:
        something = reslt.group(1)
        continue
    reslt = rexp2.search(line)
    if reslt:
        somethingElse = reslt.group(1)
        continue
Still more cumbersome than the Perl equivalent.
You could do a trick like this
import re
class Match:
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        self.m = self.pat.match(s)
        return bool(self.m)
    def __nonzero__(self):
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)
    def start(self, x):
        return self.m.start(x)
    def end(self, x):
        return self.m.end(x)
pat1 = Match("A(.*)")
pat2 = Match("BA(.*)")
pat3 = Match("BB(.*)")
def test(s):
    if pat1(s): print "Looks like", pat1.group(1)
    elif pat2(s): print "no, it is", pat2.group(1)
    elif pat3(s): print "really?", pat3.group(1)
    else: print "Never mind."
>>> test("ABCDE")
Looks like BCDE
>>> test("BACDE")
no, it is CDE
>>> test("BBCDE")
really? CDE
>>> test("CBBDE")
Never mind.
>>>
This is much more along the lines of what you want,
but it conflates the idea of search object and
match object and makes your code more susceptible
to subtle breaks. Consider
digits = Match(r"(\s*(\d+)\s*)")
def divisor(s):
    if s[:1] == "/":
        if digits(s[1:]):
            return int(digits.group(2))
        raise TypeError("nothing after the /")
    # no fraction, use 1 as the divisor
    return 1
def fraction(s):
    if digits(s):
        denom = divisor(s[digits.end(1):])
        return int(digits.group(2)), denom
    raise TypeError("does not start with a number")
>>> fraction("4/5")
(5, 5)
>>>
But as a Perl programmer you are perhaps used to this,
because Perl does the same conflation and so has
the same problems. (I think. It's been a while ...
Nope! The regexp search results appear to be "my"
variables now. When I started with perl4 all variables
were either global or "dynamically scoped"-ish with
local.)
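For comparison, here is a rough sketch of fraction() without the
shared state. Each function keeps the match object it got back in a
local variable, so the call to divisor() cannot overwrite the result
the caller still needs. The name DIGITS is my own, and this version
is written for Python 3; treat it as one possible arrangement, not
the only fix.

```python
import re

# Hypothetical module-level pattern; same regexp as the Match example.
DIGITS = re.compile(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        m = DIGITS.match(s[1:])   # local match object, invisible to the caller
        if m:
            return int(m.group(2))
        raise TypeError("nothing after the /")
    # no fraction, use 1 as the divisor
    return 1

def fraction(s):
    m = DIGITS.match(s)
    if m:
        denom = divisor(s[m.end(1):])
        # m still refers to the match on the numerator
        return int(m.group(2)), denom
    raise TypeError("does not start with a number")

print(fraction("4/5"))  # (4, 5)
print(fraction("7"))    # (7, 1)
```

The cost is the extra assignment line the original question wanted
to avoid; the gain is that the numerator survives the nested call.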
> I'm a little bit worried about doing the following in Python, since I'm
> not sure if the compiler is smart enough to avoid doing each regexp
> search twice:
>
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)
>
> In many cases I am worried about efficiency as these scripts parse a
> couple GB of text!
It isn't smart enough. To make it that smart would require
a lot more work. For example, how does it know that the
implementation of "rexp1.search(line)" always returns the
same value? Or even that "rexp1.search" returns the
same bound method?
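A small sketch of why the two calls cannot simply be merged. The
CountingPattern class is hypothetical, made up only so that each
search() call leaves a visible trace; the sample line is also mine.
Written for Python 3.

```python
import re

class CountingPattern:
    """Hypothetical wrapper, used only to make each search() call visible."""
    def __init__(self, pattern):
        self.pat = re.compile(pattern)
        self.calls = 0

    def search(self, s):
        self.calls += 1          # side effect: a call the compiler may not drop
        return self.pat.search(s)

rexp1 = CountingPattern(r"blah(dee)blah")
line = "xx blahdeeblah yy"

# The form from the question: search() really runs twice.
if rexp1.search(line):
    something = rexp1.search(line).group(1)
print(something, rexp1.calls)   # dee 2

# Nothing stops other code from rebinding the method between the two calls.
rexp1.search = lambda s: None
print(rexp1.search(line))       # None
```

Since "rexp1.search" is looked up at run time on every use, the
compiler would have to prove that neither the lookup nor the call
can change before it could reuse the first result.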
Andrew
dalke at dalkescientific.com