matching multiple regexs to a single line...

Alexander Sendzimir lists at battleface.com
Tue Nov 19 18:49:42 EST 2002


# Alex, the reason I used sre is that I feel it is the right module to use
# with Python 2.2.2; re is only a wrapper around it. However, my knowledge in
# this area is limited and could stand to be corrected.

# I should have stated in my last post that, if speed is an issue,
# then I might not be coding in Python but rather in a language that
# compiles to native code for a real processor and not to bytecode
# for a virtual machine.

# I will say that what I don't like about the approach you propose is that it
# throws information away. Functionally, it leads to brittle code which is hard
# to maintain and can lead to errors which are very hard to track down if one
# is not familiar with the implementation.
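
# To make "throws information away" concrete, here is a minimal sketch. It is
# not the code from the earlier post. It tries each pattern on its own and
# reports every one that matches, whereas a single alternation reports only
# the first. The helper name matching_patterns is hypothetical, and the
# sketch is written for Python 3 rather than 2.2.2.

```python
import re

regex_patterns = [
    r'abcd(cd){1,4}',
    r'abcd',
    r'abcd(cd)?',
    r'ab',
    r'zh192',
]

# Compile each pattern separately so every one keeps its own identity.
compiled = [re.compile(p) for p in regex_patterns]

def matching_patterns(line):
    """Return the indices of all patterns that match at the start of line."""
    return [i for i, rx in enumerate(compiled) if rx.match(line)]

print(matching_patterns('abcdcd'))   # [0, 1, 2, 3]
print(matching_patterns('zh192'))    # [4]
```

# For 'abcdcd' four of the five patterns match, but the combined regular
# expression can only ever report one of them.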

# That said, yours is definitely a fast approach and under the right
# circumstances would be very useful. In my experience, clear writing takes
# precedence over clever implementation (most of the time). There are
# exceptions.

# As for the code itself, the lastindex attribute counts groups. If
# any of the regular expressions defined contains groups of its own, then
# lastindex no longer maps one-to-one onto the intended expressions.
# So, it becomes very difficult to determine which outermost expression
# lastindex refers to. Perhaps you can see the maintenance problems
# arising here?
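
# A small sketch of the effect, assuming Python 3: the same alternative 'ab'
# gets a different lastindex as soon as an earlier alternative gains an inner
# group. The names flat and nested are only for illustration.

```python
import re

# Two alternatives, no inner groups: 'ab' is group 2.
flat = re.compile(r'(abcd)|(ab)')

# Same alternatives, but the first one now contains an inner group,
# which pushes 'ab' from group 2 to group 3.
nested = re.compile(r'(abcd(cd)?)|(ab)')

print(flat.match('ab12').lastindex)    # 2
print(nested.match('ab12').lastindex)  # 3
```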

# The code included here produces the following output:

#	(abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
#	ab1234 6
#	abcd1234 3
#	abcdcd 1
#	abcdcdcdcd 1
#	zh192 7
#	abcdefghij 3

# I've labelled the groups below for reference.

# (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
# 1    2          3      4    5      6    7
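
# The numbering above can also be derived mechanically. This is a sketch
# under the assumption that every pattern is wrapped in exactly one outer
# group, as regex_bigpat does; outer_group_numbers is a hypothetical helper,
# written for Python 3. It relies on the groups attribute of a compiled
# pattern, which gives the number of capturing groups in that pattern.

```python
import re

regex_patterns = [
    r'abcd(cd){1,4}',
    r'abcd',
    r'abcd(cd)?',
    r'ab',
    r'zh192',
]

def outer_group_numbers(patterns):
    """Return the group number of the outer group wrapping each pattern."""
    numbers = []
    next_group = 1
    for p in patterns:
        numbers.append(next_group)
        # Skip the outer group itself plus any groups inside the pattern.
        next_group += 1 + re.compile(p).groups
    return numbers

print(outer_group_numbers(regex_patterns))  # [1, 3, 4, 6, 7]
```

# Groups 2 and 5 are the inner (cd) groups, so they never head an expression.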

# As you can see, some expressions are 'identified' by more than one
# group. This is not desirable and is difficult to maintain. If you
# change the grouping in any of the regular expressions, then you have
# some possibly heavy code changes to make down the line. The approach
# you commented on is more general and easier to change, with the
# required adjustments falling into place.
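
# For comparison, one way to keep a single combined regular expression
# without depending on group numbers is to name the outer groups and read
# lastgroup instead of lastindex. This is a different technique from either
# approach discussed in this thread, sketched for Python 3; the group names
# p0, p1, ... and the helper which_pattern are hypothetical.

```python
import re

regex_patterns = [
    r'abcd(cd){1,4}',
    r'abcd',
    r'abcd(cd)?',
    r'ab',
    r'zh192',
]

# Wrap each pattern in a named outer group: (?P<p0>...)|(?P<p1>...)|...
regex_bigpat = '|'.join(
    '(?P<p%d>%s)' % (i, p) for i, p in enumerate(regex_patterns)
)
regex = re.compile(regex_bigpat)

def which_pattern(line):
    """Return the name of the outer group that matched, or None."""
    mo = regex.match(line)
    if mo is None:
        return None
    return mo.lastgroup

print(which_pattern('ab1234'))
print(which_pattern('abcdcd'))
```

# Adding or removing inner groups changes the numbering but not the names.
# It still reports only the first alternative that matches.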

# There might be a few other issues. However, I haven't had time to fully
# explore them.

# Thanks, Alex.

#     Alex.  ;-)


import re
import sys

lines = [
	r'ab1234',
	r'abcd1234',
	r'abcdcd',
	r'abcdcdcdcd',
	r'zh192',
	r'abcdefghij' ]

regex_patterns = [
	r'abcd(cd){1,4}',
	r'abcd',
	r'abcd(cd)?',
	r'ab',
	r'zh192' ]


regex_bigpat = '(' + ')|('.join( regex_patterns ) + ')'
print regex_bigpat

regex = re.compile( regex_bigpat )

for line in lines :
	mo = regex.match( line )
	if mo :
		print line, mo.lastindex
	else :
		print line, 'not matched'


sys.exit( 0 )



