[Tutor] search/match file position q

Martin A. Brown martin at linux-ip.net
Wed Oct 8 00:05:07 CEST 2014


Good afternoon Clayton,

> !A regex doesn't understand the structure of an html document. For
> !example you need to keep track of the nesting level manually to find
> !the cells of the inner of two nested tables.
> !
> !> question still remains: does the
> !> search start at the beginning of the line each time or does it step
> !> forward from the last search?
> !
> !re.search() doesn't keep track of prior searches; whatever string you
> !feed it (in your case a line cut out of an html document) is searched.
> !
>
> So, you are saying that each regex starts at the beginning of the 
> long line? Is there a way to start the next search at the end of 
> the last one?

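Well, it depends on how you are using the re module (if you really 
want to do that).  The module-level re.search() always starts at the 
beginning of the string, but the search() method of a compiled 
pattern accepts an optional starting position.  Have a look at: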

   https://docs.python.org/2/library/re.html#re.RegexObject.search

But...I'll add my voice to the admonition against using regex here.

Consider the following events that could happen in the future 
after you have labored over your program and are able to get it to 
work, based on today's HTML.

   1. Somebody inserts a line-break in the middle of the element you
      were searching for with regex.
   2. A week from now, somebody runs 'tidy' on the HTML or changes
      or removes the line endings.
   3. Somebody adds an HTML comment which causes your regex to match.

These are the first three reasons that occur to me for why regex is 
the wrong tool for the job here, given that you know precisely the 
format of the data.  It is HTML.
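
To make events 1 and 3 concrete, here is a minimal sketch.  The 
pattern CELL, the helper cells() and the sample strings are all made 
up for illustration; they are not taken from your data.

```python
import re

# Hypothetical pattern: assumes a whole <td> element sits on one line.
CELL = re.compile(r'<td>(.*?)</td>')

def cells(html):
    """Return the text of every <td> element the regex can find."""
    return CELL.findall(html)

# Today's HTML: the regex appears to work.
original = '<tr><td>alpha</td><td>beta</td></tr>'
# Event 1: a line break lands in the middle of an element.
rewrapped = '<tr><td>alpha</td><td>be\nta</td></tr>'
# Event 3: a comment contains something that looks like a cell.
commented = '<!-- <td>old</td> --><tr><td>alpha</td><td>beta</td></tr>'

print(cells(original))   # ['alpha', 'beta']
print(cells(rewrapped))  # ['alpha']  (the second cell is silently lost)
print(cells(commented))  # ['old', 'alpha', 'beta']  (a cell that is not there)
```

Neither failure raises an error; the program just returns the wrong 
cells.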

The good thing is that there are other tools for processing HTML.
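
As one example, the standard library ships an HTML parser.  The 
sketch below is only that, a sketch: the class name CellCollector 
and the sample string are made up, it is written for Python 3 (the 
module is called HTMLParser in Python 2), and it does not attempt to 
track nested tables.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while we are between <td> and </td>
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Text inside comments never arrives here; only real cell text does.
        if self.in_cell:
            self.cells[-1] += data

sample = '<!-- <td>old</td> --><tr><td>alpha</td><td>be\nta</td></tr>'
collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.cells)   # ['alpha', 'be\nta']
```

The comment is skipped and the line break inside the second cell is 
kept as ordinary text rather than breaking the match.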

Anyway, if you want to use regexes, nobody can stop you, so see 
below [0] for a bit of nonsense text in which you can search for two 
distinct instances of the string "ei".

> !> I will check out beautiful soup as suggested
> !> in a subsequent mail; I'd still like to finish this process:<}}

> !Do you say that when someone points out that you are eating your shoe?
> Depends on the flavor of the shoe:<)))

Root beer float.

-Martin

  [0] If you really, really want to use regex, here's an example of how to
      keep track of where you last sought, and how to search from
      that place in the string.

        from __future__ import print_function

        import re

        def main():
            s = 'Wo lattenzaun aneinander erhaltenen vorpfeifen grasgarten.'
            pattern = re.compile('ei', re.IGNORECASE)
            # A compiled pattern's search() takes an optional start position.
            matched = pattern.search(s, 0)
            while matched:
                endpos = matched.end()
                print(matched.group(0), matched.start(), matched.end())
                # Resume the search where the previous match ended.
                matched = pattern.search(s, endpos)

        if __name__ == '__main__':
            main()
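
If all you need is every match in order, the re module can do the 
position bookkeeping itself.  A minimal sketch using finditer() on 
the same nonsense string; the variable name hits is made up, and 
this is an alternative to the loop above rather than a description 
of it.

```python
import re

s = 'Wo lattenzaun aneinander erhaltenen vorpfeifen grasgarten.'
pattern = re.compile('ei', re.IGNORECASE)

# finditer() yields one match object per non-overlapping match,
# scanning left to right, so no manual start position is needed.
hits = [(m.group(0), m.start(), m.end()) for m in pattern.finditer(s)]
print(hits)   # [('ei', 16, 18), ('ei', 41, 43)]
```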


-- 
Martin A. Brown
http://linux-ip.net/

