matching multiple regexs to a single line...
Alexander Sendzimir
lists at battleface.com
Tue Nov 19 18:49:42 EST 2002
# Alex, the reason I used sre is that I feel it is the right module to use
# with Python 2.2.2; re is only a wrapper around it. However, my knowledge in
# this area is limited and could stand to be corrected.
# I should have stated in my last post that, if speed is an issue,
# then I might not be coding in Python but rather in a language that
# compiles to native code for a real processor, not a virtual machine.
# I will say that what I don't like about the approach you propose is that it
# throws information away. Functionally, it leads to brittle code which is hard
# to maintain and can lead to errors which are very hard to track down if one
# is not familiar with the implementation.
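# One way to read "throws information away": a single joined pattern reports
# only the first alternative that matches, while trying each pattern on its
# own can report every pattern that matches. The sketch below is a minimal
# illustration of the second style, not code from this thread; the helper
# name all_matches is hypothetical, and the patterns are copied from the
# script at the end of this message.

```python
import re

# Same sub-patterns as in the script at the end of this message.
regex_patterns = [r'abcd(cd){1,4}', r'abcd', r'abcd(cd)?', r'ab', r'zh192']

# Compile each pattern separately so that no group numbers are shared.
compiled = [re.compile(p) for p in regex_patterns]

def all_matches(line):
    """Return the index of every pattern that matches at the start of line."""
    return [i for i, rx in enumerate(compiled) if rx.match(line)]

# all_matches('abcdcd') -> [0, 1, 2, 3]
# all_matches('zh192')  -> [4]
```

# The price is one match attempt per pattern for every line, which is the
# speed trade-off discussed above.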
# This now said, yours is definitely a fast approach and under the right
# circumstances would be very useful. In my experience, clear writing takes
# precedence over clever implementation (most of the time). There are
# exceptions.
# As for the code itself, the lastindex attribute counts every capturing
# group. If there are groups within any of the regular expressions defined,
# then lastindex no longer lines up one-to-one with the intended expressions.
# So, it becomes very difficult to determine which outermost expression
# lastindex refers to. Perhaps you can see the maintenance problems
# arising here?
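# To make the bookkeeping concrete, here is a rough sketch of what it takes
# to map lastindex back to the intended expression: count the groups inside
# each sub-pattern and record which group number its wrapping parentheses
# receive. The names outer_group_numbers and which_pattern are hypothetical,
# and the sketch assumes the sub-patterns contain no numbered backreferences
# (those would shift once the patterns are joined).

```python
import re

# Same sub-patterns as in the script at the end of this message.
regex_patterns = [r'abcd(cd){1,4}', r'abcd', r'abcd(cd)?', r'ab', r'zh192']

def outer_group_numbers(patterns):
    """Return the group number given to each sub-pattern's wrapping group."""
    numbers = []
    next_group = 1
    for pat in patterns:
        numbers.append(next_group)
        # One group for the wrapping parentheses, plus the groups inside.
        next_group += 1 + re.compile(pat).groups
    return numbers

outer = outer_group_numbers(regex_patterns)        # [1, 3, 4, 6, 7]
group_to_pattern = dict((g, i) for i, g in enumerate(outer))
regex = re.compile('(' + ')|('.join(regex_patterns) + ')')

def which_pattern(line):
    """Return the index of the sub-pattern that matched, or None."""
    mo = regex.match(line)
    if mo is None:
        return None
    # The wrapping group closes last, so lastindex is its number.
    return group_to_pattern[mo.lastindex]

# which_pattern('ab1234')   -> 3
# which_pattern('abcd1234') -> 1
```

# The table has to be rebuilt whenever a sub-pattern gains or loses a group,
# which is exactly the maintenance burden in question.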
# The script at the end of this message produces the following output:
# (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
# ab1234 6
# abcd1234 3
# abcdcd 1
# abcdcdcdcd 1
# zh192 7
# abcdefghij 3
# I've labelled the groups below for reference.
# (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
# 1    2          3      4    5      6    7
# As you can see, some expressions are 'identified' by more than one
# group. This is not desirable and is difficult to maintain. If you
# change the grouping in any of the regular expressions, then you have
# some possibly heavy code changes to make down the line. The approach
# you commented on is more general and easier to change, with the required
# adjustments falling into place.
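# For comparison, one possible way to keep a single combined pattern without
# hand-counted group numbers is to name each wrapping group and read
# mo.lastgroup instead of mo.lastindex. This is not something proposed in
# this thread; the names p0, p1, ... and the helper matched_name are
# hypothetical, chosen only for this sketch.

```python
import re

# Same sub-patterns as in the script at the end of this message.
regex_patterns = [r'abcd(cd){1,4}', r'abcd', r'abcd(cd)?', r'ab', r'zh192']

# Wrap each sub-pattern in a named group: (?P<p0>...)|(?P<p1>...)|...
named = '|'.join(['(?P<p%d>%s)' % (i, pat)
                  for i, pat in enumerate(regex_patterns)])
regex = re.compile(named)

def matched_name(line):
    """Return the name of the sub-pattern that matched, or None."""
    mo = regex.match(line)
    if mo is None:
        return None
    # lastgroup is the name of the group that lastindex refers to; the
    # named wrapping group closes after any unnamed groups inside it.
    return mo.lastgroup

# matched_name('abcdcd') -> 'p0'
# matched_name('ab1234') -> 'p3'
```

# Inner groups can then be added or removed without renumbering anything.
# Note that p2 can never be reported with these patterns, because an earlier
# alternative already matches any line that abcd(cd)? would match.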
# There might be a few other issues. However, I haven't had time to fully
# explore them.
# Thanks, Alex.
# Alex. ;-)
import re
import sys
lines = [
r'ab1234',
r'abcd1234',
r'abcdcd',
r'abcdcdcdcd',
r'zh192',
r'abcdefghij' ]
regex_patterns = [
r'abcd(cd){1,4}',
r'abcd',
r'abcd(cd)?',
r'ab',
r'zh192' ]
regex_bigpat = '(' + ')|('.join( regex_patterns ) + ')'
print regex_bigpat
regex = re.compile( regex_bigpat )
for line in lines :
    mo = regex.match( line )
    if mo :
        print line, mo.lastindex
    else :
        print line, 'not matched'
sys.exit( 0 )