Backreference within a character class

Andrew M. Kuchling akuchlin at mems-exchange.org
Thu Feb 24 16:53:30 EST 2000


taashlo at sandia.gov writes:
> Using the re module, let's say that I want to match "XIX" but not "XXX"
> or "WOW" but not "WWW".  In my first attempt I used r"(.)([^\1])\1".
> This, of course, didn't work because the "\1" in a character class isn't
> interpreted as a backreference.

If you're trying to match 3-character words with the same letters in
positions 1 and 3, but not 2, then a lookahead negation would do it:

import re
pat = re.compile(r"(.)(?!\1).\1")

The steps here are: 1) match a character, 2) assert that the
backreference \1 doesn't match at this point, 3) consume the next
character (needed because assertions are zero-width and don't consume
any characters), and 4) match \1.  (Alternatively, if the 3-character
string is in variable S, 'if (S[0] == S[2] and S[0] != S[1])' would do it.)
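
As a quick check of the above, here is a minimal sketch that runs the
lookahead pattern over the four words from the question and compares it
with the plain indexing test.  The helper name plain_check and the
results dict are made up for illustration; they aren't part of the re
module.

```python
import re

# The lookahead pattern: a character, then a different character,
# then the first character again.
pat = re.compile(r"(.)(?!\1).\1")

def plain_check(S):
    # Non-regex alternative for a 3-character string:
    # positions 1 and 3 equal, position 2 different.
    return len(S) == 3 and S[0] == S[2] and S[0] != S[1]

results = {}
for word in ["XIX", "XXX", "WOW", "WWW"]:
    results[word] = pat.match(word) is not None
    print(word, results[word], plain_check(word))
```

Both tests should agree on these inputs: "XIX" and "WOW" are accepted,
"XXX" and "WWW" are rejected.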

On a theoretical plane: If you wanted to match general strings of the
form ABA, where A != B and A, B are of arbitrary non-zero length, I think
this isn't possible with regexes (of either Python or Perl varieties),
because in step 3 you couldn't consume as many characters as were
matched by the first group.  Anyone see a clever way I've missed?
(Another jeu d'esprit.)  You'd have to do it by matching the pattern
r"(.+)(.+)\1", and then verifying that group 2 != group 1 in Python
code.
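
The match-then-verify idea might look something like the sketch below.
It makes a few assumptions that aren't in the text above: the whole
string has to be of the form ABA (hence re.fullmatch, which needs
Python 3.4 or later), the function name match_aba is invented for the
example, and the length of A is fixed on each attempt.  That last point
is a variation on the bare r"(.+)(.+)\1": with the bare pattern the
engine reports only the first split it finds, so a split where A == B
can hide another split where A != B (e.g. "aaaaaa").

```python
import re

def match_aba(s):
    """Return (A, B) if s has the form A + B + A with A != B, else None.

    Hypothetical helper: tries every possible length for A, matches
    with a regex, then verifies group 2 != group 1 in Python code.
    """
    # A and B are non-empty, so A is at most (len(s) - 1) // 2 long.
    for n in range(1, (len(s) - 1) // 2 + 1):
        # Group 1 is exactly n characters; group 2 is whatever lies
        # between the two copies of group 1.
        m = re.fullmatch(r"(.{%d})(.+)\1" % n, s, re.DOTALL)
        if m and m.group(1) != m.group(2):
            return m.group(1), m.group(2)
    return None

print(match_aba("XIX"))
print(match_aba("XXX"))
print(match_aba("abcXYabc"))
```

The regex does the structural part (same text at both ends) and the
comparison of the two groups is left to ordinary Python.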

(thinking aloud) I don't think lookbehinds would help, in either full
generality or in the limited Perl implementation.  Implementing
lookbehinds in full generality would be fun; going to have to take a
crack at it when SRE becomes available...

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Testing? That's scheduled for first thing after 3.0 ships. Quality is job
Floating Point Error; Execution Terminated.
    -- Benjamin Ketcham, on applications for Microsoft Windows, in
       _comp.os.unix.advocacy_.



