ANN: 'rex', a module for easy creation and use of regular expressions

Kenneth McDonald kenneth.m.mcdonald at sbcglobal.net
Fri Jun 11 00:24:48 CEST 2004


This is a 'pre-release' release of the 'rex' module, a Python module
intended to ease the use of regular expressions. While I think the code
is currently quite useful, I'm advising against using it except on an
experimental basis, as the API is subject to change. One of the purposes
of this release is to solicit feedback on the final API.

The module is available as two text files appended at the end of this
message: '__init__.py' should be placed in a folder called 'rex' on
your Python path, while the test file can be placed anywhere, and
simply does some basic testing of rex. I'd attempted to release this
about a week ago with a uuencoded copy of the module, but I believe my
news service disallowed that message; at least, I never saw it on my
news feed. Sorry if this is a repeat.

Immediately below are a couple of snippets from the internal rex documentation,
to give you an idea of what rex does. I hope this will intrigue you enough
to try it and give me feedback. rex is free software.

At this point, I would greatly appreciate feedback. Do you think this
could be a useful module? Do you have any suggestions for the API?
What would you like to see for additional functionality?

If there is sufficient enthusiasm, I would certainly welcome code
or documentation contributions. But we'll worry about that if it
turns out enough people are interested... :-)


A Bit About rex...
==================

'rex' stands for any of (your choice):
	- Regular Expression eXtensions
	- Regular Expressions eXpanded
	- Rex, King of Regular Expressions (ha, ha ha, ha).
	
rex provides a completely different way of writing regular expressions (REs). 
You do not use strings to write any part of the RE _except_ for 
regular expression literals. No escape characters, metacharacters,
etc. Regular expression operations, such as repetition, alternation,
concatenation, etc., are done via Python operators, methods, or 
functions.
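
To make the idea concrete, here is a minimal sketch of how Python operators can stand in for RE syntax. It is not the rex implementation (that is appended below); the names Rx, _lit and _group are hypothetical and exist only for this illustration, and the sketch is written for current Python 3 rather than the Python of this release.

```python
import re

class Rx(str):
    """Hypothetical str subclass: '+' concatenates, '|' alternates."""

    def __add__(self, other):
        # Concatenation: wrap alternations so 'a|b' + 'c' becomes '(?:a|b)c'.
        return Rx(_group(self) + _group(_lit(other)))

    def __or__(self, other):
        # Alternation binds loosest, so the operands need no extra grouping.
        return Rx("%s|%s" % (self, _lit(other)))

def _lit(s):
    """Escape a plain string; leave an existing Rx untouched."""
    return s if isinstance(s, Rx) else Rx(re.escape(s))

def _group(rx):
    # Naive check: any '|' in the text triggers a non-capturing group.
    return "(?:%s)" % rx if "|" in rx else str(rx)

sign = _lit("+") | "-"      # an optional-free sign: plus or minus
pattern = sign + "42"       # still a string, usable with the re module

print(bool(re.fullmatch(pattern, "-42")))   # True
print(bool(re.fullmatch(pattern, "42")))    # False
```

Because the result is still a string, it can be handed to any code that expects an ordinary RE, which is the same design choice rex makes.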

As an example, take a look at the definition of an RE matching a complex
number, an example included in test_rex.py. The rex Python code to do this is:
	
	COMPLEX= 		PAT.aFloat['re']			+ \
					PAT.anyWhitespace 			+ \
					ALT("+", "-")['op']			+ \
					PAT.anyWhitespace			+ \
					PAT.aFloat['im'] 			+ \
					'i'

while the analogous RE is:
	
	(?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i

The rex code is more verbose than
the simple RE (which, by the way, was the RE generated by the rex code,
and is pretty much what you'd produce by hand). It is also FAR easier to
read, modify, and debug. And, it illustrates how easy it is to reuse rex patterns:
PAT.aFloat and PAT.anyWhitespace are predefined patterns provided in rex which
match, respectively, a string representation of a floating point number (no exponent),
and a sequence of zero or more whitespace characters.
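
The generated RE can be tried out directly with the standard re module. The following is a small sketch for current Python 3; parse_complex is a hypothetical helper name, not something rex provides. One detail shows up in such a check: the '.' in '(?:.\d*)?' is not escaped, so it matches any character and not only a decimal point.

```python
import re

# The RE exactly as shown above (generated by rex for COMPLEX).
COMPLEX_RE = (r"(?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*"
              r"(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i")

def parse_complex(text):
    """Return (re, op, im) as strings, or None if text does not match."""
    m = re.fullmatch(COMPLEX_RE, text)
    if m is None:
        return None
    return m.group("re"), m.group("op"), m.group("im")

print(parse_complex("3+4i"))           # ('3', '+', '4')
print(parse_complex("5.78- +3.14i"))   # ('5.78', '-', '+3.14')
print(parse_complex("3x5+4i"))         # ('3x5', '+', '4'): '.' matched the 'x'
```

Writing the dot as '\.' in a hand-written version would restrict the real and imaginary parts to genuine decimal numbers.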


FILE __init__.py
============================================================================
'''Module to provide a better interface to regular expressions.

LICENSE

This is the 'rex' module, written by Kenneth M. McDonald, (c) 2004 Kenneth M. McDonald.
You may use it for any reason, subject to the following simple conditions:
	
	1) You may not distribute or publish modified versions of this
	module or files in it, unless you change the name of the module. 
	
	2) You may incorporate this code into other files. If you use more
	than 30 lines (total) of code/text from this module in another file or group of files, you must
	include a line stating "Some of the code in this file/these files are from the
	free Python module 'rex', by Kenneth M. McDonald."
	
	3) If you sell or otherwise gain revenue from a product using this
	code, then:
		
		a) The product or its documentation must state, in a reasonably 
		prominent place, "Some of the functionality of this program is
		provided by freely available source code", or words to that effect.
		Incorporating this phrase in the 'About' box of the program, or in
		the introduction to the User's manual, will satisfy the requirement
		of 'reasonable prominence'. You do not need to mention this module
		specifically, nor do you need to indicate what functionality this module
		is used for.
		
		b) You must ensure the buyers of the product have access to the
		source code for this module. You can do this by including this module
		with your software, by providing a URL to a download site for this
		module (I do not maintain such a URL), or by other similar means.

INTRODUCTION

'rex' stands for any of (your choice):
	- Regular Expression eXtensions
	- Regular Expressions eXpanded
	- Rex, King of Regular Expressions (ha, ha ha, ha).
	
rex provides a completely different way of writing regular expressions (REs). You
do not use strings to write any part of the RE _except_ for 
regular expression literals. No escape characters, metacharacters,
etc. Regular expression operations, such as repetition, alternation,
concatenation, etc., are done via Python operators, methods, or 
functions.

The major advantages of rex are:
	
	- rex expressions are checked for well-formedness by the Python
		parser; this will typically provide earlier and easier-to-understand
		diagnoses of syntactically malformed regular expressions
		
	- rex expressions are all strings! They are, in fact, a specialized subclass
		of strings, which means you can pass them to existing code
		which expects REs. [NOTE: This may change in the future.]
		
	- rex goes to some lengths to produce REs which are similar to
		those written by hand, i.e. it tries to avoid unnecessary use of
		nongrouping parentheses, uses special escape sequences
		where possible, writes 'A?' instead of 'A{0,1}', etc. In general,
		rex tries to produce concise REs, on the theory that if you
		really need to read the buggers at some point, it's easier to
		read simpler ones than more complex ones.
		
	- [This is the biggie.] rex permits complex REs to be built up easily
		from smaller parts. In fact, a rex definition for a complex RE is likely
		to end up looking somewhat like a mini grammar.
		
	- [Another biggie.] As an ancillary to the above, rex permits REs to be easily reused.
	
As an example, take a look at the definition of an RE matching a complex
number, an example included in test_rex.py. The rex Python code to do this is:
	
	COMPLEX= 		PAT.aFloat['re']			+ \
					PAT.anyWhitespace 		+ \
					ALT("+", "-")['op']			+ \
					PAT.anyWhitespace		+ \
					PAT.aFloat['im'] 			+ \
					'i'

while the analogous RE is:
	
	(?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i

The rex code is more verbose than
the simple RE (which, by the way, was the RE generated by the rex code,
and is pretty much what you'd produce by hand). It is also FAR easier to
read, modify, and debug. And, it illustrates how easy it is to reuse rex patterns:
PAT.aFloat and PAT.anyWhitespace are predefined patterns provided in rex which
match, respectively, a string representation of a floating point number (no exponent),
and a sequence of zero or more whitespace characters.

USE

This is a quick overview of how to use rex. See documentation associated
with a specific method/function/name for details on that entity.	

In the following, we use the abbreviation RE to refer to standard regular
expressions defined as strings, and the word 'rexp' to refer to rex objects
which denote regular expressions.

	- The starting point for building a rexp is either rex.PAT, 
		which we'll just call PAT, or rex.CHAR, which we'll just call CHAR.
		CHAR builds rexps which match single character strings. PAT builds rexps
		which match strings of varying lengths.
		
	- PAT(string) returns a rexp which will match exactly the string given, and nothing else. 
	
	- PAT._someattribute_ returns (for defined attributes) a corresponding rexp. 
		For example, PAT.aDigit returns a rexp matching a single digit.
		
	- CHAR(a1, a2, . . .) returns a rexp matching a single character from a set
		of characters defined by its arguments. For example, CHAR("-", ["0","9"], ".")
		matches the characters necessary to build basic floating point numbers.
		See CHAR docs for details.
		
	- Now assume that A, B, C,... are rexps. The following Python expressions
		(_not_ strings) may be used to build more complex rexps:
			
			- A | B | C . . . : returns a rexp which matches a string if any of the operands
				match that string. Similar to "A|B|C" in normal REs, except of course you can't
				use Python code to define a normal RE.
				
			- A + B + C ...: returns a rexp which matches a string if all of A, B, C match consecutive
				substrings of the string in succession. Like "ABC" in normal REs.
				
			- A*n : returns a rexp which matches A repeated a number of times determined by n.
				This replaces '?', '+', and '*' as used in normal REs. See docs for details.
				
			- A**n : Like A*n, but does nongreedy matching.
				
			- +A : positive lookahead assertion: matches if A matches, but doesn't
				consume any of the input.
				
			- ~+A : negative lookahead assertion: matches if A _doesn't_ match,
				but doesn't consume any of the input.
				
			- -A, ~-A : positive and negative lookback assertions. Like lookahead assertions,
				but in the other direction.
				
			- A[name] : name must be a string: anything matched by A can be referred
				to by the given name in the match result object. (This is the equivalent
				of named groups in the re module).
				
			- A.group() : A will be in an unnamed group, referable by number.
			
	- In addition, a few other operations can be done:
		
		- Some of the attributes defined in PAT have "natural inverses"; for such
			attributes, the inverse may be taken. For example, ~PAT.aDigit is
			a pattern matching any character except a digit.
			
		- Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a pattern
			matching anything except a vowel.
			
		- 'ALT' gives a different way to denote alternation: ALT(A, B, C,...) does
			the same thing as A | B | C | . . ., except that none of the arguments
			to ALT need be rexps; any which are normal strings will be converted
			to a rexp using PAT.
			
		- 'PAT' can take multiple arguments: PAT(A, B, C,...), which gives the same
			result as PAT(A) + PAT(B) + PAT(C) + . . .  .
			
	- Finally, a very convenient shortcut is that only the first object in a sequence of 
		operator/method calls needs to be a rexp; all others will be automatically
		converted as if PAT(...) had been called on them. For example, the
		sequence A | "hello" is the same as A | PAT("hello").

'CHAR' USE

CHAR(args...) defines a character class. Arguments are any number of strings or two-tuples/two-element lists.
	
	eg. 
		CHAR("ab-z") 
	is the same as the regular expression r"[ab\-z]". NOTE that there are no 'character range metacharacters';
	the preceding defines a character class containing four characters, one of which is a '-'. 
	
	This is a character
	class containing a backslash, hyphen, and open/close brackets:
		
		CHAR(r"\-[]")     or        CHAR("\\-[]")
		
	Note that raw strings (or doubled backslashes) are still needed to get past normal Python string escaping.
	
	To define ranges, do this :
		
		CHAR(["a","z"], ["A","Z"])
		
	To define inverse ranges, use the ~ operator, e.g. to define the class of all non-numeric characters:
		
		~CHAR(["0","9"])
		
	Character classes cannot (yet) be doubly negated: ~~CHAR("A") is an error.	
'''

import re, string

def _escapeSpecialRangeChars(char):
	'''Function to escape characters which have a special meaning in character ranges.
	We don't actually need to escape '[', but I think it makes the string representation a little
	less confusing.'''
	if char in "^-\\[]": return "\\"+char
	else: return char

class _rexobj(str):
	'''Class of strings which are to be treated as regular expressions.'''
	def __init__(self, s):
		str.__init__(self, s)
		
	def compile(self, ignorecase=False, locale=False, multiline=False, dotmatchesnewline=False, unicode=False):
		flags = 	(ignorecase and re.IGNORECASE) | \
				(locale and re.LOCALE) | \
				(multiline and re.MULTILINE) | \
				(dotmatchesnewline and re.DOTALL) | \
				(unicode and re.UNICODE)
		return re.compile(self, flags)
		
	def __mul__(self, num):
		'''Greedy repetition operator'''
		return self.repeat(num)
		
	def __pow__(self, num):
		'''Nongreedy repetition operator'''
		return self.repeat(num, greedy=False)
		
	def __pos__(self):
		'''Lookahead assertion'''
		return _relookaheadassertion("(?=%s)"  % (self, ))
			
	def __neg__(self):
		'''Lookback assertion'''
		return _relookbackassertion("(?<=%s)"  % (self, ))
		
	def __contains__(self, text):
		'''Another abuse of operators. A regular expression can be considered as the set
		of all strings it can match (or generate, for those of you who know the theory.)
		The code
		
			if text in rexp: body
			
		executes 'body' if text is in the set of strings generated by the rexp, and skips it otherwise.
		'''
		pattern = PAT.stringStart + self + PAT.stringEnd
		return bool( pattern.match(text) )

	def repeat(self, num=0, greedy=True, doc=None):
		'''This is the repetition function. However, repetition is normally done 
		with the * (greedy) operator, or ** (nongreedy) operator,
		like so:
			A*3 : Three or more occurrences of A. In re terms, same as A{3,}
			A*0 : In re terms, same as A*
			A*1: In re terms, same as A+
			A*(2, 4) : 2-4 occurrences of A. In re terms, same as A{2,4}
			A*-5 : Up to 5 occurrences of A. In re terms, same as A{0,5}
			A*-1 : In re terms, same as A?
			
			A**x : nongreedy versions of the above, like A*?, A+?, A{1,3}?.
			
		You can use repeat() as a functional equivalent; in this case, the
		'num' parameter is what you would pass as the second argument
		to */**, 'greedy' is a boolean determining if the operation should
		be greedy or nongreedy, and 'doc' can be used for a documentation
		comment (but is not currently used in any way.)
		'''
		min=0
		max=None
		if isinstance(num, int): 
			if num >=0: min=num
			else: max=-num
		else:
			assert isinstance(num, tuple) and len(num)==2
			min, max = num
			
		if greedy: nongreedy=""
		else: nongreedy="?"
		del greedy
		if min==0 and max==None:
			return _reblock(str(self.block())+"*"+nongreedy)
		elif min==1 and max==None:
			return _reblock(str(self.block())+"+"+nongreedy)
		elif min==0 and max==1:
			return _reblock(str(self.block())+"?"+nongreedy)
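		else:
			# An open-ended upper bound is written '{min,}'; formatting None directly would give '{min,None}'.
			if max==None: maxstr = ""
			else: maxstr = max
			return _reblock("%s{%s,%s}%s" % (self.block(), min, maxstr, nongreedy))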
 
	def name(self, name, doc=None):
		'''Enclose this RE in a named group.'''
		return _reblock('(?P<%s>%s)' % (name, self))
		
	def __getitem__(self, key):
		if isinstance(key, str): return self.name(key)
		else: return str.__getitem__(self, key)
		
	def group(self, doc=None):
		'''Enclose this RE in a numbered group.'''
		return _reblock('(%s)' % (self,))
		
	def __or__(self, other):
		'''Alternation (choice) operator.'''
		other = _convert(other)
		if self.precedence() > _realt.precedence(): 
			self = self.block()
		if other.precedence() > _realt.precedence(): 
			other = other.block()
		return _realt('%s|%s' % (self, other))

	def block(self, doc=None):
		'''Enclose this RE, if necessary, in an anonymous block, i.e. '(?:...)'. If
		the RE is already in a block, it will be returned unchanged.'''
		if self.precedence()==0: return self
		else: return _reblock('(?:%s)' % (self,))		

	def precedence():
		'''Indicates the binding precedence of the top-level operators used to form this
		RE. 0 is highest precedence.'''
		raise NotImplementedError, "precedence() should be defined in a subclass."
	precedence = staticmethod(precedence)

	def __add__(self, other):
		'''Concatenation operator.'''
		other = _convert(other)
		if self.precedence() > _recat.precedence(): 
			self = self.block()
		if other.precedence() > _recat.precedence(): 
			other = other.block()
		return _recat('%s%s' % (self, other))
		
	def itersearch(self, text, matched=None):
		'''Iterates sequentially through nonoverlapping substrings in 'text'
		which match and do not match self. See docs on match results.
		
		@param matched: If None (the default), returns both substrings from text
		matching the pattern, and those substrings between the matches. If true,
		returns only the matching substrings. If false (other than None), returns only the nonmatching substrings.
		
		EXAMPLE:
			To print all digit sequences in a string:
				for matchresult in PAT.someDigits.itersearch(aString):
					if matchresult: print matchresult[0]
			To print all the sequences from the string that _weren't_ digits:
				for matchresult in PAT.someDigits.itersearch(aString):
					if not matchresult: print matchresult[0]
		'''
		re = self.compile()
		start = 0
		while True:
			match = re.search(text, start)
			# If we couldn't find any more matches, yield whatever nonmatching remnant is left, and return
			if not match:
				if not matched and start < len(text): 
					yield MatchResult(text[start:], start, len(text))
				return
			else:
				# If there was an unmatched substring before the found match, yield it.
				if not matched and start < match.start(): yield MatchResult(text[start:match.start()], start, match.start())
				if matched==None or matched: yield MatchResult(match)
				start = match.end()
				
	def iterstrings(self, text, matched=None):
		'''Convenience function; simply extracts and returns group 0 from an underlying call
		to itersearch().'''
		for m in self.itersearch(text, matched): yield m[0]
				
	def search(self, text, matched=None):
		'''Returns the first element from a search of the text using itersearch and the 'matched' param'''
		return self.itersearch(text, matched).next()
		
	def match(self, text):
		'''@todo: Not yet fully implemented, please don't use.'''
		return MatchResult(self.compile().match(text))
		
				
class MatchResult(object):
	'''A MatchResult specifies what was found by an attempt to look for a pattern
	in a string.'''
	def __init__(self, match, start=None, end=None):
		self.__match = match
		if self:
			self.__start = match.start()
			self.__end = match.end()
		else:
			self.__start = start
			self.__end = end
			
	def __nonzero__(self):
		'''A match result is 'true' if it is the result of successfully finding
		a pattern in a target string, in which case self.__match will be
		an re.MatchObj or something like that. However, match results
		may also be used to indicate the part of a target string that a
		pattern _failed_ to match; in this case, self.__match will be
		that part of the target, and the match result will be 'false'.'''
		return not (isinstance(self.__match, str) or self.__match == None)
		
	def __getitem__(self, key):
		return self.get(key)
		
	def expand(self, expansion):
		'''@todo: don't use this yet.'''
		return self.__match.expand(expansion)
	
	def get(self, key=0):
		'''Returns a string matching a group identified by name or by index. If this is a 
		failed match result (see docs for '__nonzero__' in this class), the only allowable
		index is 0, which indicates the entire unmatched string.'''
		if not self:
			if key==0: return self.__match
			else: raise KeyError, "Invalid group index: "+ `key` + " (a failed match result only has one group, indexed by 0)."
		result = self.__match.group(key)
		if result==None: raise KeyError, "Invalid group index: "+ `key`
		return result
		
	start = property(fget=lambda self: self.__start, doc="The starting position of the match result in the search string")
	end = property(fget=lambda self: self.__end, doc="The ending position of the match result in the search string")
		
	string = property(fget=lambda self: self.get(), doc="The string found by the match result.")
	
	def __str__(self): return self.string
		
		
class _recat(_rexobj):
	'''RE constructed via concatenation'''
	def __init__(self, s):
		_rexobj.__init__(self, s)

	def precedence(): 
		'''The precedence of a single character can always be considered as 0, since
		it can't be split into subparts.'''
		#if len(self)==1: return 0
		return 1
	precedence = staticmethod(precedence)
		
class _realt(_rexobj):
	'''re constructed from alternation operators'''
	def __init__(self, s):
		_rexobj.__init__(self, s)
		
	def precedence(): return 2
	precedence = staticmethod(precedence)
				
class _reblock(_rexobj):
	'''Superclass for all classes of res which are in a block of 
	some sort, i.e. which do not need to be enclosed in parentheses
	in order to assure associativity.'''
	def __init__(self, s):
		_rexobj.__init__(self, s)
		
	def precedence(): return 0
	precedence = staticmethod(precedence)
	
class _relookaheadassertion(_reblock):
	_inverses = {'=':'!', '!':'='}
	def __init__(self, s):
		_reblock.__init__(self, s)
	
	def __invert__(self):
		assert self[0:2] == "(?"
		return _relookaheadassertion("(?" + self._inverses[self[2]] + self[3:])
		
class _relookbackassertion(_reblock):
	_inverses = {'=':'!', '!':'='}
	def __init__(self, s):
		_reblock.__init__(self, s)
	
	def __invert__(self):
		assert self[0:3] == "(?<"
		return _relookbackassertion("(?<" + self._inverses[self[3]] + self[4:])
		
class _range(_reblock):
	def __init__(self, s):
		_reblock.__init__(self, s)

	def __invert__(self):
		return _inverserange(self[0] + "^" +self[1:])
		
class _inverserange(_reblock):
	def __init__(self, s):
		_reblock.__init__(self, s)

class _reprimitive(_reblock):
	'''Atomic REs, i.e. not composed of other REs. Typically single-characters, special sequences, etc.'''
	def __init__(self, s):
		_reblock.__init__(self, s)
		
	def __invert__(self):
		if self in _primitiveInverses: return _primitiveInverses[self]
		raise NotImplementedError, 'No inverse for ' + self

def RAW(s):
	'''@todo: This is a hack to let something else work. It will go away.'''
	return _reprimitive(s)
	
def _convert(s):
	'''Convert s to a literal rexobj; if it is already a rexobj, leave it unchanged.'''
	if isinstance(s, _rexobj): return s
	else: return _recat(re.escape(s))

def _CHARfun(*args):
	strings = ["["]
	for a in args:
		if isinstance(a, _reprimitive):
			strings.append(a)
		elif isinstance(a, str):
			strings.append("".join(map(_escapeSpecialRangeChars, a)))
		else:
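			# The documentation writes ranges both as tuples and as two-element lists, so accept either.
			assert isinstance(a, (tuple, list)) and len(a)==2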
			startchar, endchar = a
			strings.append('%s-%s' %(_escapeSpecialRangeChars(startchar), _escapeSpecialRangeChars(endchar)))
			
	strings.append("]")
	return _range("".join(strings))
	
class _CHAR(object):

	def __call__(self, *args):
		return _CHARfun(*args)
		
CHAR = _CHAR()
			
MAYBE = -1
ANY = 0
SOME = 1

class _PAT(object):
	dot = _reprimitive(".")
	aChar = _reprimitive(r"[\s\S]") # Any character, including newline. (A '.' inside [...] would only match a literal dot.)
	aDigit = _reprimitive(r'\d')
	aWhitespace = _reprimitive(r'\s')
	aBackslash = _reprimitive(r'\\')
	anAlphanum = _reprimitive(r'\w')
	aLetter = _CHARfun(('a','z'), ('A','Z'))
	aPunctuationMark = _CHARfun("""~`!@#$%^&*()_-+={[}]|\:;"'<,>.?/""") # The standard US keyboard punctuation marks: does _not_ include whitespace chars.
	stringStart = _reprimitive(r'\A')
	stringEnd = _reprimitive(r'\Z')
	wordBorder = _reprimitive(r'\b')
	emptyString = _reprimitive('')
	someDigits = aDigit * 1
	anyDigits = aDigit * 0
	someLetters = aLetter*1
	anyLetters = aLetter * 0
	someChars = aChar*1
	anyChars = aChar * 0
	someWhitespace = aWhitespace*1
	anyWhitespace = aWhitespace*0
	anInt = (_convert("+")|"-")*MAYBE + someDigits
	aFloat = anInt + (_recat(".") + anyDigits)*MAYBE
	
	def __call__(self, arg, *rest):
		'''Returns a _rexobj, a subclass of the builtin string class, which happens to know it
		is a regular expression.'''
		arg = _convert(arg)
		for next in rest:
			arg = arg + next
		return arg
		
PAT = _PAT()

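# The inverses are wrapped in _reprimitive so that the result of ~ still supports the rex operators.
_primitiveInverses = {
	PAT.aDigit : _reprimitive(r'\D'),
	PAT.aWhitespace : _reprimitive(r'\S'),
	PAT.wordBorder : _reprimitive(r'\B'),
	PAT.anAlphanum: _reprimitive(r'\W'),
	}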
# Fill in the reverse mappings
for key, val in _primitiveInverses.items():
	_primitiveInverses[val] = key
	
def ALT(arg, *rest):
	'''ALT(a, b, c,...) is the same as PAT(a) | b | c | ...'''
	arg = _convert(arg)
	for next in rest:
		arg = arg | next
	return arg

if __name__ == '__main__':
	print "Look in rex/__init__.py for documentation. Look in rex/_test/test_rex.py for some examples of using rex."



=============================================================================
END __init__.py
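
The repetition conventions documented in repeat() (A*0, A*1, A*-1, A*(2, 4) and so on) can be summarised as one small standalone function. This is only a sketch for current Python 3, following the docstring rather than reproducing the module code; quantifier is a hypothetical name and is not part of rex.

```python
def quantifier(num, greedy=True):
    """Map a rex-style repetition count to an RE quantifier suffix."""
    if isinstance(num, tuple):
        lo, hi = num            # (2, 4) -> between 2 and 4 times
    elif num >= 0:
        lo, hi = num, None      # n >= 0 -> at least n times
    else:
        lo, hi = 0, -num        # n < 0  -> at most -n times

    if (lo, hi) == (0, None):
        suffix = "*"
    elif (lo, hi) == (1, None):
        suffix = "+"
    elif (lo, hi) == (0, 1):
        suffix = "?"
    else:
        # An open upper bound is written '{n,}' in RE syntax.
        suffix = "{%s,%s}" % (lo, "" if hi is None else hi)

    # A trailing '?' turns any quantifier into its nongreedy form.
    return suffix + ("" if greedy else "?")

for count in (0, 1, -1, 3, -5, (2, 4)):
    print(count, "->", quantifier(count))
```

Passing greedy=False corresponds to the ** operator, e.g. quantifier(0, greedy=False) gives '*?'.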



FILE test_rex.py
=============================================================================
import unittest
from rex import *

class rex_test(unittest.TestCase):
	
	COMPLEX= 		PAT.aFloat['re']			+ \
					PAT.anyWhitespace 		+ \
					ALT("+", "-")['op']			+ \
					PAT.anyWhitespace		+ \
					PAT.aFloat['im'] 			+ \
					'i'
					
	def testNames(self):
		'''Test extraction of data from named groups.'''
		results = []
		for c in self.COMPLEX.itersearch("3+4i  5.78- +3.14i   1.  +2.i"):
			if c: results.append(c['re'] + c['op'] + c['im'] + 'i')
		self.assertEquals(results, ["3+4i", "5.78-+3.14i", "1.+2.i"])
	
	def testCharacterRanges(self):
		aRange = CHAR(('a','z'), 'C', '+', '\\', '\t', "F-H[]")
		self.assert_('c' in aRange)
		self.assert_('X' not in aRange)
		self.assert_('\t' in aRange, 'Does the raw tab in the char range get processed correctly?')
		self.assert_('\n' not in aRange)
		self.assert_('G' not in aRange)
		self.assert_('-' in aRange)
		self.assert_('[' in aRange and ']' in aRange)
		
	def testPrecedence(self):
		'''Tests to ensure that '+' functions correctly as having higher precedence than '|'.
		This test is necessary because rex pulls a few tricks to avoid creating too many
		nongrouping parentheses in patterns.'''
		pattern = PAT('a') + 'b' | 'c' + 'd'
		self.assert_('ab' in pattern)
		self.assert_('a' not in pattern)
		self.assert_('b' not in pattern)
		self.assert_('cd' in pattern)
		self.assert_('c' not in pattern)
		self.assert_('d' not in pattern)
		
		pattern2 = PAT('a') | 'b' + 'c' | 'd'
		self.assert_('bc' in pattern2)
		self.assert_('a' in pattern2)
		self.assert_('d' in pattern2)
		self.assert_('b' not in pattern2)
		self.assert_('c' not in pattern2)
		
	def testLookAheadBack1(self):
		'''Tests lookahead and lookback assertions.'''
		phrase1 = "12wordOne ()wordTwo_wordThree_wordFour  wordFive! wordSix3 wordSeven"

		# For testing purposes, this definition of a 'word' isn't really a word.
		aWordPat1 = PAT(
				~-(PAT.aDigit | PAT("_") | PAT.aLetter), # The character before the possible word can't be a digit, underscore, or letter.
				PAT.someLetters, # Find the word.
				(+(PAT.aPunctuationMark | PAT.aWhitespace | PAT.stringEnd)) # must be followed by something which will end a word.
		)
		self.assertEquals(
			list(aWordPat1.iterstrings(phrase1, matched=True)), 
			['wordTwo', 'wordFive', 'wordSeven']
		)

	def testLookAheadBack2(self):
		'''Tests lookahead and lookback assertions.'''
		phrase1 = "12wordOne ()wordTwo_wordThree_wordFour  wordFive! wordSix3 wordSeven"

		# For testing purposes, this definition of a 'word' isn't really a word.
		aWordPat2 = PAT(
				-(~PAT.aLetter),	# Check that the character before the first letter of a possible word isn't a letter...
				~-PAT('_'),	#...or an underscore.
				PAT.someLetters, # Find the word.
				~+PAT.aLetter, # The following character can't be a letter.
				~+(PAT("!")) # but only accept words not followed by an exclamation!
		)
		self.assertEquals(
			list(aWordPat2.iterstrings(phrase1, matched=True)), 
			['wordOne', 'wordTwo', 'wordSix', 'wordSeven']
		)

if __name__=='__main__':
	unittest.main()
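
The tests above rely on itersearch() and iterstrings() walking a string as alternating matched and unmatched pieces. A minimal sketch of that kind of splitting, using only the standard re module and current Python 3, might look as follows; split_matches is a hypothetical name and the (flag, substring) pairs are a simplification of rex's MatchResult objects.

```python
import re

def split_matches(pattern, text):
    """Yield (matched, substring) pairs covering text from left to right."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            yield False, text[pos:m.start()]   # gap before this match
        yield True, m.group(0)                 # the match itself
        pos = m.end()
    if pos < len(text):
        yield False, text[pos:]                # trailing remainder

pieces = list(split_matches(r"\d+", "ab12cd345"))
print(pieces)
# [(False, 'ab'), (True, '12'), (False, 'cd'), (True, '345')]
```

Filtering on the first element of each pair gives the equivalent of the matched=True and matched=False modes.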
	
	


