text processing problem

Paul McGuire ptmcg at austin.rr.com
Thu Apr 7 22:32:50 EDT 2005


Maurice -

Here is a pyparsing treatment of your problem.  It is certainly more
verbose, but hopefully easier to follow and to maintain later (changing
the set of valid word characters, for instance).  pyparsing skips
whitespace between tokens by default, so tabs and newlines inside a
match are handled without cluttering up the expression definition.  The
example also shows how to *not* match "<X> (<X>)" inside a quoted
string (in case this becomes a requirement).
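
For comparison, a regular-expression version of the same substitution
might look roughly like the sketch below.  It is only an illustration:
the WORD pattern and the collapse_duplicates helper are assumptions
based on the word definition used in this message, and it makes no
attempt to skip quoted strings.

```python
import re

# Assumed word pattern: a letter followed by letters, digits,
# '-', '_', '.', or '$' (mirrors the word definition in this post).
WORD = r"[A-Za-z][A-Za-z0-9_$.-]*"

# word, optional whitespace, '(', the same word again (\1), ')'
DUPLICATE = re.compile(r"(" + WORD + r")\s*\(\s*\1\s*\)")

def collapse_duplicates(text):
    """Replace 'X (X)' with 'X' (hypothetical helper, not a pyparsing API)."""
    return DUPLICATE.sub(r"\1", text)

print(collapse_duplicates("This is (is) a match."))
print(collapse_duplicates("This is (isn't) a match."))
```

The whitespace handling has to be spelled out with \s* at every place
where whitespace may appear, which is the clutter the pyparsing version
avoids.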

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import *

LPAR = Literal("(")
RPAR = Literal(")")

# define a word as beginning with an alphabetic character followed by
# zero or more alphanumerics, -, _, ., or $ characters
word = Word(alphas, alphanums+"-_$.")

targetExpr = word.setResultsName("first") + \
            LPAR + word.setResultsName("second") + RPAR

# this will match any 'word ( word )' arrangement, but we want to
# reject matches if the two words aren't the same
# (s = input string, l = match location, tokens = matched tokens)
def matchWords(s,l,tokens):
    if tokens.first != tokens.second:
        raise ParseException(s,l,"words do not match")
    # keep only the first word, dropping the parenthesized repeat
    return tokens[0]
targetExpr.setParseAction( matchWords )


testdata = """
This is (is) a match.
This is (isn't) a match.
I.B.M.\t\t\t(I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring(  Spring  ).
"""
print(testdata)

print(targetExpr.transformString(testdata))

print("\nNow don't process ()'s inside quoted strings...")
targetExpr.ignore(quotedString)
print(targetExpr.transformString(testdata))

Prints out:
This is (is) a match.
This is (isn't) a match.
I.B.M.			(I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring(  Spring  ).


This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the" Spring.


Now don't process ()'s inside quoted strings...

This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the(the)" Spring.
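
To make the quoted-string case concrete without pyparsing, here is a
rough standard-library sketch of the same idea.  It is an assumption
for illustration only, not how pyparsing works internally: the WORD
pattern and the collapse_outside_quotes helper are hypothetical names,
and only double-quoted strings without embedded quotes are recognized.

```python
import re

# Assumed word pattern: a letter followed by letters, digits,
# '-', '_', '.', or '$'.
WORD = r"[A-Za-z][A-Za-z0-9_$.-]*"

# First alternative consumes a whole double-quoted string;
# second alternative is 'X (X)' with the repeated word captured as \1.
PATTERN = re.compile(r'"[^"]*"|(' + WORD + r")\s*\(\s*\1\s*\)")

def collapse_outside_quotes(text):
    """Replace 'X (X)' with 'X', leaving double-quoted strings untouched."""
    def replace(match):
        if match.group(1) is None:
            # A quoted string matched: return it unchanged.
            return match.group(0)
        return match.group(1)
    return PATTERN.sub(replace, text)

print(collapse_outside_quotes('Paris in "the(the)" Spring(  Spring  ).'))
```

Matching the quoted string first means the scan resumes after the
closing quote, so anything inside the quotes is never examined.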



