Regular expression question -- exclude substring
James Stroud
jstroud at mbi.ucla.edu
Mon Nov 7 22:39:55 EST 2005
On Monday 07 November 2005 17:31, Kent Johnson wrote:
> James Stroud wrote:
> > On Monday 07 November 2005 16:18, google at fatherfrost.com wrote:
> >>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> >>This works:
> >>
> >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> >
> > The non-greedy operator is actually acting as expected. Non-greedy
> > operators are "forward looking", not "backward looking": the engine
> > starts at the first place a match can begin (the first '00') and then
> > extends '.*?' only as far as the first '01' that lets the complete
> > pattern match. The greedy '.*', by contrast, matches as much as it
> > can, gobbling up every '01' before the last one that still allows a
> > match, because these also match '.*'.
> > For example:
> >
> > py> rgx = re.compile(r"(00.*01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> > py> rgx = re.compile(r"(00.*?01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
> ??? not in my Python:
> >>> rgx = re.compile(r"(00.*01) target_mark")
> >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
> >>> rgx = re.compile(r"(00.*?01) target_mark")
> >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
>
>
> Since target_mark only occurs once in the string, the greedy and non-greedy
> matches are the same in this case.
Somehow my cutting and pasting got messed up: the test string lost its
second 'target_mark'. It should be:
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
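For anyone who wants to rerun this outside the interactive prompt, here is a minimal self-contained sketch of the session above. The variable names are illustrative and not from the thread. The last part tries the pattern with a leading greedy group that was suggested earlier in the thread, on the string with a single 'target_mark'. It suggests that the leading '(.*)' is what pushes the start of the match forward to the last usable '00', which the non-greedy '.*?' alone does not do.

```python
import re

# Test string from the corrected session: 'target_mark' occurs twice.
two_marks = ('00 noise1 01 noise2 00 target 01 target_mark '
             '00 dowhat 01 target_mark')

lazy = re.compile(r"(00.*?01) target_mark")
greedy = re.compile(r"(00.*01) target_mark")

# Non-greedy: each match starts at the first available '00' and stops at
# the first '01' that is followed by ' target_mark'.
lazy_hits = lazy.findall(two_marks)
print(lazy_hits)

# Greedy: one match that runs to the last '01' followed by ' target_mark'.
greedy_hits = greedy.findall(two_marks)
print(greedy_hits)

# Pattern from earlier in the thread: the leading greedy (.*) consumes as
# much as possible, so group 2 begins at the last '00' that still matches.
one_mark = '00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01'
anchored = re.compile(r"(.*)(00.*?01) target_mark")
last_block = anchored.search(one_mark).group(2)
print(last_block)
```

Note that the first non-greedy hit still begins at the very first '00', so the non-greedy form on its own does not exclude the earlier '00 ... 01' noise.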
Sorry about that.
James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/