[Tutor] search/match file position q
Martin A. Brown
martin at linux-ip.net
Wed Oct 8 00:05:07 CEST 2014
Good afternoon Clayton,
> !A regex doesn't understand the structure of an html document. For
> !example you need to keep track of the nesting level manually to find
> !the cells of the inner of two nested tables.
> !
> !> question still remains: does the
> !> search start at the beginning of the line each time or does it step
> !> forward from the last search?
> !
> !re.search() doesn't keep track of prior searches; whatever string you
> !feed it (in your case a line cut out of an html document) is searched.
> !
>
> So, you are saying that each regex starts at the beginning of the
> long line? Is there a way to start the next search at the end of
> the last one?
Well, it depends on how you are using the re module (if you really
want to do that). Have a look at:
https://docs.python.org/2/library/re.html#re.RegexObject.search
But...I'll add my voice to the admonition against using regex here.
Consider the following events that could happen in the future
after you have labored over your program and are able to get it to
work, based on today's HTML.
1. Somebody inserts a line-break in the middle of the element you
were searching for with regex.
2. A week from now, somebody runs 'tidy' on the HTML or changes
or removes the line endings.
3. Somebody adds an HTML comment which causes your regex to match.
These are the first three reasons that occur to me for why regex is
the wrong tool for the job here, given that you know precisely the
format of the data. It is HTML.
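To make those scenarios concrete, here is a small sketch. The pattern and
the HTML snippets are invented for illustration (they are not taken from
your document), but the behaviour is typical of a line-oriented regex:

```python
import re

# Hypothetical pattern: grab the text of a <td> cell that sits on one line.
cell = re.compile(r'<td>(.*?)</td>')

today = '<tr><td>42</td></tr>'
# Scenarios 1 and 2: a line break shows up inside the element.
wrapped = '<tr><td>42\n</td></tr>'
# Scenario 3: somebody leaves an old cell behind in a comment.
commented = '<!-- <td>old</td> --><tr><td>42</td></tr>'

found_today = cell.findall(today)
found_wrapped = cell.findall(wrapped)
found_commented = cell.findall(commented)

print(found_today)      # ['42']
print(found_wrapped)    # [] because '.' does not match a newline by default
print(found_commented)  # ['old', '42'] because the comment matches too
```

Adding re.DOTALL would rescue the second case, but the comment would still
match, and the pattern knows nothing about nesting.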
The good thing is that there are other tools for processing HTML.
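One of them ships with Python itself. What follows is only a rough
sketch, assuming Python 3 (the module is called HTMLParser in Python 2);
the class name InnerCells and the sample HTML are made up for this
example. It tracks the table nesting level for you and collects the cells
of the inner of two nested tables:

```python
from html.parser import HTMLParser


class InnerCells(HTMLParser):
    """Collect the text of <td> cells at one table nesting depth."""

    def __init__(self, depth):
        super().__init__()
        self.depth = depth      # which nesting level we care about
        self.level = 0          # how many <table> tags are currently open
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.level += 1
        elif tag == 'td' and self.level == self.depth:
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'table':
            self.level -= 1
        elif tag == 'td' and self.level == self.depth:
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# Invented sample: a line break inside a cell and a commented-out cell.
html_doc = """<table><tr><td>outer
<table>
<tr><!-- <td>old</td> --><td>inner
one</td><td>inner two</td></tr>
</table>
</td></tr></table>"""

parser = InnerCells(depth=2)
parser.feed(html_doc)
parser.close()

# Collapse whitespace so line breaks inside a cell do not matter.
cells = [' '.join(c.split()) for c in parser.cells]
print(cells)  # ['inner one', 'inner two']
```

The comment and the line break do not disturb the result, because the
parser deals with tags rather than lines. Beautiful Soup, suggested
elsewhere in this thread, gives you a friendlier interface on the same
idea.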
Anyway, if you want to use regexes, nobody can stop you, so see
below, a bit of nonsense text which you can search for 2 distinct
instances of the string "ei" [0].
> !> I will check out beautiful soup as suggested
> !> in a subsequent mail; I'd still like to finish this process:<}}
> !Do you say that when someone points out that you are eating your shoe?
> Depends on the flavor of the shoe:<)))
Root beer float.
-Martin
[0] If you really, really want to use regex, here's an example of how to
keep track of where you last sought, and how to search from
that place in the string.
from __future__ import print_function
import re

def main():
    s = 'Wo lattenzaun aneinander erhaltenen vorpfeifen grasgarten.'
    pattern = re.compile('ei', re.IGNORECASE)
    # The second argument to search() is the position to start from.
    matched = pattern.search(s, 0)
    while matched:
        endpos = matched.end()
        print(matched.group(0), matched.start(), matched.end())
        # Resume the search where the previous match ended.
        matched = pattern.search(s, endpos)

if __name__ == '__main__':
    main()
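
If all you need is every non-overlapping match in turn, the re module can
do the position bookkeeping for you. A minimal variant of the same search
using finditer(), with the same sample string (the name hits is just for
this example):

```python
import re

s = 'Wo lattenzaun aneinander erhaltenen vorpfeifen grasgarten.'
pattern = re.compile('ei', re.IGNORECASE)

# finditer() yields one match object per non-overlapping match,
# scanning left to right, so no explicit start position is needed.
hits = [(m.group(0), m.start(), m.end()) for m in pattern.finditer(s)]
print(hits)  # [('ei', 16, 18), ('ei', 41, 43)]
```

The explicit loop is still the one to use if you want to decide, between
matches, where the next search should begin.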
--
Martin A. Brown
http://linux-ip.net/