How to insert string in each match using RegEx iterator

Wed Jun 10 10:36:33 EDT 2009

On Jun 10, 5:17 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jun 9, 11:13 pm, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
>
> > By what method would a string be inserted at each instance of a RegEx
> > match?
>
> Some might say that using a parsing library for this problem is
> overkill, but let me just put this out there as another data point for
> you.  Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
> that allow you to embellish the matched tokens, and create a new
> string containing the modified text for each match of a pyparsing
> expression.  Hmm, maybe the code example is easier to follow than the
> explanation...
>
> from pyparsing import Word, nums, Regex
>
> # an integer is a 'word' composed of numeric characters
> integer = Word(nums)
>
> # or use this if you prefer
> integer = Regex(r'\d+')
>
> # attach a parse action to prefix 'INSERT ' before the matched token
> integer.setParseAction(lambda tokens: "INSERT " + tokens[0])
>
> # use transformString to search through the input, applying the
> # parse action to all matches of the given expression
> test = '123 abc 456 def 789 ghi'
> print integer.transformString(test)
>
> # prints
> # INSERT 123 abc INSERT 456 def INSERT 789 ghi
>
> I offer this because often the simple examples that get posted are
> just the barest tip of the iceberg of what the poster eventually plans
> to tackle.
>
> Good luck in your Pythonic adventure!
> -- Paul

Thanks for all of the instant feedback. I have enumerated three
responses below:

First response:

Peter,

I wonder if you (or anyone else) might attempt a different explanation
for the use of the special sequence '\1' in the RegEx syntax.

The Python documentation explains:

\number
    Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or
'55 55', but not 'the end' (note the space after the group). This
special sequence can only be used to match one of the first 99 groups.
If the first digit of number is 0, or number is 3 octal digits long,
it will not be interpreted as a group match, but as the character with
octal value number. Inside the '[' and ']' of a character class, all
numeric escapes are treated as characters.

In practice, this appears to be the key device in your
clever solution:

>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)
'abc INSERT 123 def INSERT 456 ghi INSERT 789'

>>> re.compile(r"(\d+)").sub(r"INSERT ", string)
'abc INSERT  def INSERT  ghi INSERT '

I don't, however, precisely understand what is meant by "the group of
the same number" -- or maybe I do, but it isn't explicit. Is this just
a shorthand reference to match.group(1) -- if that were valid --
implying that the group match result is printed in the compile
execution?
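
One way to test that reading is the minimal sketch below. It assumes the
`string` used in the session above was 'abc 123 def 456 ghi 789' (inferred
from the output shown; the assignment itself is not in this message), and the
variable names are only illustrative. It compares the '\1' replacement
template with a callable that calls match.group(1) explicitly, and also shows
'\1' used inside a pattern, as in the documentation excerpt.

```python
import re

# Assumed input, reconstructed from the output shown in the session above.
text = "abc 123 def 456 ghi 789"
pattern = re.compile(r"(\d+)")

# In a replacement template, \1 stands for the text captured by group 1
# of the current match.
with_template = pattern.sub(r"INSERT \1", text)

# The same substitution written with a callable: sub() passes in each
# match object, and group(1) is read from it explicitly.
with_callable = pattern.sub(lambda m: "INSERT " + m.group(1), text)

# Inside a pattern, \1 must match the same text that group 1 matched.
repeated = re.compile(r"(.+) \1")
matches_the_the = repeated.match("the the") is not None
matches_the_end = repeated.match("the end") is not None

print(with_template)
print(with_callable)
```

Both substitutions give the same string, which suggests that '\1' in the
replacement can be read as shorthand for match.group(1) evaluated for each
match. Nothing is printed by compile() itself; sub() builds and returns the
new string.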

Second response:

I've encountered a problem with my RegEx learning curve which I'll be
posting in a new thread -- how to escape hash characters # in strings
being matched, e.g.:

>>> string = re.escape('123#456')
>>> match = re.match(r'\d+', string)
>>> print match
<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()
123
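
A hedged sketch of what may be going on here, since the exact problem is not
stated in this message: re.escape() is intended for text that will be used as
(part of) a pattern, so applying it to the string being searched adds a
backslash to the data itself. The variable names below are illustrative only.

```python
import re

subject = "123#456"

# re.escape() returns the text with special characters backslash-escaped,
# so that it can be embedded in a pattern as a literal.
literal_pattern = re.escape(subject)

# '#' needs no escaping in an ordinary (non-verbose) pattern.
plain = re.match(r"\d+#\d+", subject)

# The escaped text, used as the pattern, matches the original subject.
escaped_as_pattern = re.match(literal_pattern, subject)

# Escaping the subject instead changes the data to 123\#456, so the same
# pattern no longer matches past the digits.
escaped_as_subject = re.match(r"\d+#\d+", re.escape(subject))

print(plain.group())
print(escaped_as_pattern.group())
print(escaped_as_subject)
```

Under these assumptions the first two calls match the whole of '123#456',
while the third returns None because of the inserted backslash.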

Third response:

Paul,

Thanks for referring me to the Pyparsing module. I'm thoroughly
enjoying Python, but I'm not prepared right now to say I've mastered
the Pyparsing module. As I continue my work, however, I'll be tackling
the problem of parsing addresses, exactly as the Pyparsing module
example illustrates. I'm sure I'll want to use it then.