Regexes: How to handle escaped characters

John Machin sjmachin at lexicon.net
Thu May 17 17:06:03 EDT 2007


On May 18, 6:00 am, Torsten Bronger <bron... at physik.rwth-aachen.de>
wrote:
> Hello!
>
> James Stroud writes:
> > Torsten Bronger wrote:
>
> >> I need some help with finding matches in a string that has some
> >> characters which are marked as escaped (in a separate list of
> >> indices).  Escaped means that they must not be part of any match.
>
> >> [...]
>
> > You should probably provide examples of what you are trying to do
> > or you will likely get a lot of irrelevant answers.
>
> Example string: u"Hollo", escaped positions: [4].  Thus, the second
> "o" is escaped and must not be found by the regexp searches.
>
> Instead of re.search, I call the function guarded_search(pattern,
> text, offset) which takes care of escaped characters.  Thus, while
>
>     re.search("o$", string)
>
> will find the second "o",
>
>     guarded_search("o$", string, 0)

Huh? Did you mean 4 instead of zero?

>
> won't find anything.

Quite apart from the confusing use of "escape", your requirements are
still as clear as mud. Try writing up docs for your "guarded_search"
function. Supply test cases showing what you expect to match and what
you don't expect to match. Is "offset" the offset in the text? If so,
don't you really want a set of "forbidden" offsets, not just one?
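
As a starting point for such test cases, the sketch below checks only
what the plain re module does with the example from the quoted message,
and writes down the hoped-for guarded behaviour as data. The name
expected_cases and its tuple layout are made up for illustration, and
passing the forbidden offsets as a list is a guess at the interface, not
something the quoted message specifies.

```python
import re

text = u"Hollo"

# Plain re.search knows nothing about forbidden offsets, so it finds the
# second "o", which sits at offset 4.
m = re.search(u"o$", text)
plain_span = m.span()

# Hypothetical spec for guarded_search:
# (pattern, text, forbidden offsets, expected spans of accepted matches)
expected_cases = [
    (u"o$", u"Hollo", [4], []),              # only candidate covers offset 4
    (u"o", u"Hollo", [4], [(1, 2)]),         # the first "o" is still allowed
    (u"o", u"Hollo", [], [(1, 2), (4, 5)]),  # nothing forbidden
]
```

A table like this pins down the intended semantics before any wrapper
is written.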

>  But how to program "guarded_search"?
> Actually, it is about changing the semantics of the regexp syntax:
> "." no longer means "any character except newline" but "any
> character except newline and characters marked as escaped".

Make up your mind whether you are "escaping" characters [likely to be
interpreted by many people as position-independent] or "escaping"
positions within the text.

>  And so
> on, for all syntax elements of regular expressions.  Escaped
> characters must spoil any match, however, the regexp machine should
> continue to search for other matches.
>

Whatever your exact requirement, it would seem unlikely to be so
wildly popularly demanded as to warrant inclusion in the "regexp
machine". You would have to write your own wrapper, something like the
following totally-untested example of one possible implementation of
one possible guess at what you mean:

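import re

def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    """Yield matches of pattern in text that cover no forbidden offset."""
    regex = re.compile(pattern)
    pos = 0
    # pos may equal len(text) so an empty match at the very end is tried
    while pos <= len(text):
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        # Reject the match if any forbidden offset lies in [start, end)
        for bad_pos in forbidden_offsets:
            if start <= bad_pos < end:
                break
        else:
            yield m
        if overlap or end == start:
            # Step one character past the start; for a zero-length match
            # this keeps the same match from being found forever
            pos = start + 1
        else:
            pos = end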
8<-------
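
For what it's worth, here is how such a wrapper might be exercised on
the "Hollo" example. This is only a sketch under the same guess at the
requirements: the function is restated in compact form so the snippet
runs on its own, it includes a guard against zero-length matches, and
the variable names (plain, guarded_end, guarded_any) are made up for
illustration.

```python
import re

def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    # Same idea as the wrapper above, restated so this snippet runs alone.
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        # Accept the match only if no forbidden offset lies in [start, end).
        if not any(start <= bad < end for bad in forbidden_offsets):
            yield m
        # Advance; step by one after an empty match to avoid looping forever.
        pos = start + 1 if (overlap or end == start) else end

text = "Hollo"
plain = re.search("o$", text).span()                              # (4, 5)
guarded_end = [m.span() for m in guarded_search("o$", text, [4])]  # []
guarded_any = [m.span() for m in guarded_search("o", text, [4])]   # [(1, 2)]
print(plain, guarded_end, guarded_any)
```

Note that a rejected match is simply skipped: the wrapper does not ask
the regex engine for a shorter alternative starting at the same
position, so this is filtering after the fact, not a change to what "."
means.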

HTH,
John



