Regexes: How to handle escaped characters
Torsten Bronger
bronger at physik.rwth-aachen.de
Fri May 18 03:35:06 EDT 2007
Hi there!
John Machin writes:
> On May 18, 6:00 am, Torsten Bronger <bron... at physik.rwth-aachen.de>
> wrote:
>
>> [...]
>>
>> Example string: u"Hollo", escaped positions: [4]. Thus, the
>> second "o" is escaped and must not be found by the regexp
>> searches.
>>
>> Instead of re.search, I call the function guarded_search(pattern,
>> text, offset) which takes care of escaped characters. Thus, while
>>
>> re.search("o$", string)
>>
>> will find the second "o",
>>
>> guarded_search("o$", string, 0)
>
> Huh? Did you mean 4 instead of zero?
No, the "offset" parameter is like the "pos" parameter in the search
method of regular expression objects. It's like
guarded_search("o$", string[offset:])
Actually, my real guarded_search even has an "endpos" parameter,
too.
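
A minimal sketch of the intended behaviour for the "Hollo" example. The
signature below is an assumption made for illustration: the escaped
positions are passed in explicitly, and "offset" is handed to the compiled
pattern as its "pos" argument. It is not the real guarded_search.

```python
import re

def guarded_search(pattern, text, escaped_positions, offset=0):
    """Return the first match at or after `offset` that covers no escaped
    index, or None. Hypothetical helper, not the function from the post."""
    regex = re.compile(pattern)
    escaped = set(escaped_positions)
    # finditer accepts a start position, like the "pos" parameter of search()
    for m in regex.finditer(text, offset):
        if not any(i in escaped for i in range(m.start(), m.end())):
            return m
    return None

text = u"Hollo"
plain = re.search("o$", text)               # finds the second "o"
guarded = guarded_search("o$", text, [4])   # the second "o" is escaped
first_o = guarded_search("o", text, [4])    # the first "o" is still allowed
```

Here plain matches at index 4, guarded is None, and first_o matches at
index 1.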
> [...]
>
> Quite apart from the confusing use of "escape", your requirements are
> still as clear as mud. Try writing up docs for your "guarded_search"
> function.
Note that I don't want to add functionality to the stdlib, I just
want to solve my tiny annoying problem. Okay, here is a more
complete story:
I've specified a simple text document syntax, like reStructuredText,
MediaWiki, LaTeX or whatever. I already have a preprocessor for it,
and now I am trying to implement the parser. A section heading looks like
this:
    Introduction
    ============
Thus, my parser searches (among many other things) for
r"\n\s*={4,}\s*$". However, the author can escape any character
with a backslash:
    Introduction        or        Introduction
    \===========                  ====\=======
This means the first (or fifth) equals sign is a literal equals sign
and not part of a heading underline. It must not be interpreted as
the start of a section. The preprocessor generates
u"===========" with escaped_positions=[0]. (Or [4], in the
right-hand case.)
This is why I cannot use normal search methods.
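
A small sketch of the problem, using the heading pattern from above. The
helper name find_heading_underline is hypothetical, and the escaped
positions are given here as indices into the whole text rather than into
the underline alone, which is an assumption made for the example.

```python
import re

HEADING_RE = re.compile(r"\n\s*={4,}\s*$")

def find_heading_underline(text, escaped_positions):
    """Return the underline match, or None if there is no match or the
    match covers an escaped character. Illustrative only."""
    m = HEADING_RE.search(text)
    if m is None:
        return None
    if any(m.start() <= i < m.end() for i in escaped_positions):
        return None
    return m

title = "Introduction\n"
real_heading = title + "============"   # unescaped underline
preprocessed = title + "==========="    # what remains of \===========

# A plain search cannot tell the two cases apart:
naive = HEADING_RE.search(preprocessed)

# With the escaped index (position 0 of the underline) it is rejected:
guarded_left = find_heading_underline(preprocessed, [len(title) + 0])
# Same for the right-hand case, where position 4 is escaped:
guarded_right = find_heading_underline(preprocessed, [len(title) + 4])
ok = find_heading_underline(real_heading, [])
```

The naive search reports a heading in the preprocessed text, while both
guarded calls return None and the real heading is still found.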
> [...]
>
> Whatever your exact requirement, it would seem unlikely to be so
> wildly popularly demanded as to warrant inclusion in the "regexp
> machine". You would have to write your own wrapper, something like
> the following totally-untested example of one possible
> implementation of one possible guess at what you mean:
>
> import re
> def guarded_search(pattern, text, forbidden_offsets, overlap=False):
>     regex = re.compile(pattern)
>     pos = 0
>     while True:
>         m = regex.search(text, pos)
>         if not m:
>             return
>         start, end = m.span()
>         for bad_pos in forbidden_offsets:
>             if start <= bad_pos < end:
>                 break
>         else:
>             yield m
>         if overlap:
>             pos = start + 1
>         else:
>             pos = end
> 8<-------
This is similar to my current approach. However, it also finds too
many matches for patterns like "^a", because it starts a fresh search
at different positions.
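
As an illustration of how restarting can over-match an anchored pattern,
the sketch below compares a restart on a slice with a restart via the
"pos" argument of a compiled pattern. It relies only on the documented
behaviour of the re module, where "^" stays tied to the real beginning of
the string when "pos" is used. Whether this covers every anchor a parser
needs is not something the example shows.

```python
import re

text = "aaa"
anchored = re.compile("^a")

# Restarting on a slice: every slice has its own "beginning", so "^a"
# matches at each restart position.
sliced_hits = [pos for pos in range(len(text)) if anchored.search(text[pos:])]

# Restarting with pos: "^" still refers to the real start of the string,
# so only the restart at position 0 matches.
pos_hits = [pos for pos in range(len(text)) if anchored.search(text, pos)]
```

Here sliced_hits is [0, 1, 2], while pos_hits is [0].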
Bye,
Torsten.
--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: bronger at jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)