Backreference within a character class
Andrew M. Kuchling
akuchlin at mems-exchange.org
Thu Feb 24 16:53:30 EST 2000
taashlo at sandia.gov writes:
> Using the re module, let's say that I want to match "XIX" but not "XXX"
> or "WOW" but not "WWW". In my first attempt I used r"(.)([^\1])\1".
> This, of course, didn't work because the "\1" in a character class isn't
> interpreted as a backreference.
If you're trying to match 3-character words with the same letters in
positions 1 and 3, but not 2, then a lookahead negation would do it:
    pat = re.compile(r"(.)(?!\1).\1")
The steps here are: 1) match a character; 2) assert that the
backreference \1 doesn't match at this point; 3) consume the character,
because assertions are zero-width and don't consume any characters
themselves; and 4) match \1. (Alternatively, if the 3-character string
is in variable S, 'if S[0] == S[2] and S[0] != S[1]' would do it.)
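To make the four steps concrete, here is a minimal sketch, assuming a modern Python 3 (re.fullmatch did not exist when this was posted). The sample words and the helper name same_ends_different_middle are made up for illustration. The last lines show what I'd expect from the original attempt: inside a character class, \1 is read as an octal escape for chr(1), not as a backreference, so the class excludes almost nothing.

```python
import re

# Pattern from the post: a char, a negative lookahead on \1, any char, then \1.
pat = re.compile(r"(.)(?!\1).\1")

def same_ends_different_middle(s):
    """Plain-Python check for a 3-character string (hypothetical helper)."""
    return len(s) == 3 and s[0] == s[2] and s[0] != s[1]

for word in ("XIX", "WOW", "XXX", "WWW"):
    # The regex and the plain-Python check agree on these sample words.
    print(word, bool(pat.fullmatch(word)), same_ends_different_middle(word))

# The original attempt: [^\1] means "any character except chr(1)",
# so it also accepts "XXX".
naive = re.compile(r"(.)([^\1])\1")
print(bool(naive.fullmatch("XXX")))
```

For fixed 3-character input the plain-Python test is probably the clearer choice; the regex earns its keep when the check is part of a larger pattern.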
On a theoretical plane: if you wanted to match general strings of the
form ABA, where A != B and A and B are of arbitrary non-zero length, I
think this isn't possible with regexes (of either Python or Perl
varieties), because in step 3 you couldn't consume as many characters as
were matched by the first group. Anyone see a clever way I've missed?
(Another jeu d'esprit.) You'd have to do it by matching the pattern
r"(.+)(.+)\1", and then verifying that group 2 != group 1 in Python
code.
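A rough sketch of that match-then-verify idea, assuming a modern Python 3 and assuming the whole string has to be of the form ABA. One wrinkle: a single match of r"(.+)(.+)\1" reports only the first split the engine finds, so a rejected match doesn't prove that no other split works ("aaaaaa" first splits as aa/aa/aa, yet a/aaaa/a is fine). The sketch therefore pins the length of A and tries each length in turn. The function name is_aba is hypothetical.

```python
import re

def is_aba(s):
    """Return True if s == A + B + A for some non-empty A and B with A != B."""
    # A needs n characters at each end and B at least one, so 2*n + 1 <= len(s).
    for n in range(1, (len(s) - 1) // 2 + 1):
        # With the length of A fixed at n, the split of s is unique.
        m = re.fullmatch(r"(.{%d})(.+)\1" % n, s, re.DOTALL)
        # The regex checks the ABA shape; the A != B test is done in Python.
        if m and m.group(2) != m.group(1):
            return True
    return False

for s in ("XIX", "XXX", "abcXYZabc", "ababab", "aaaaaa"):
    print(s, is_aba(s))
```

Looping over the lengths costs a few extra match attempts but sidesteps the question of which split the backtracking engine happens to report first.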
(thinking aloud) I don't think lookbehinds would help, in either full
generality or in the limited Perl implementation. Implementing
lookbehinds in full generality would be fun; going to have to take a
crack at it when SRE becomes available...
--
A.M. Kuchling http://starship.python.net/crew/amk/
Testing? That's scheduled for first thing after 3.0 ships. Quality is job
Floating Point Error; Execution Terminated.
-- Benjamin Ketcham, on applications for Microsoft Windows, in
_comp.os.unix.advocacy_.