(htmllib) How to capture text that includes tags?

Paul Rubin http
Wed Nov 5 17:59:18 CET 2003


I've generally found that trying to parse the whole page with
regexps isn't appropriate.  Here's a class that I use sometimes.
Basically you do something like

  b = buf(urllib.urlopen(url).read())

and then search around for patterns you expect to find in the page:

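  b.search("name of the product")
  b.rsearch('<a href="')    # leaves point at the start of the match
  b.search('<a href="')     # step forward past the opening quote
  href = b.up_to('"')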

Note that there's an esearch method that lets you do forward searches
for regexps (it defaults to case-insensitive, since that's usually what
you want for html).  Unfortunately, the re module only scans forward, so
there's no simple way to implement backwards regexp searches.
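
One workaround, if you can live with rescanning the text from the start
each time, is to run the regexp forward and keep the last match that ends
before the current position.  This is only a rough sketch and not part of
the class below: rsearch_regex and the sample page are made-up names for
illustration, and because finditer yields non-overlapping matches, it is
not a true right-to-left scan.

```python
import re

def rsearch_regex(text, pat, point, flags=re.I):
    """Return the last match of pat lying entirely before point, or None.

    Hypothetical helper: emulates a backward search with a forward scan.
    """
    last = None
    # finditer(string, pos, endpos) only considers text[0:point]
    for m in re.compile(pat, flags).finditer(text, 0, point):
        last = m
    return last

# Made-up sample page: two links, then the text we search for first.
page = '<a href="/a">first</a> <A HREF="/b">second</a> widget'
point = page.index("widget")

m = rsearch_regex(page, r'<a\s+href="', point)
# Take everything from just after the opening quote up to the closing one.
href = page[m.end():page.index('"', m.end())]
print(href)
```

The cost is linear in the size of the page for every backward search,
which is usually tolerable for a single html document.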

Maybe I'll clean up the interface for this thing sometime.

================================================================

import re

class buf:
    def __init__(self, text=''):
        self.buf = text
        self.point = 0
        self.stack = []

    def seek(self, offset, whence='set'):
        if whence=='set':
            self.point = offset
        elif whence=='cur':
            self.point += offset
        elif whence=='end':
            self.point = len(self.buf) - offset
        else:
            raise ValueError("whence must be one of ('set','cur','end')")

    def save(self):
        self.stack.append(self.point)

    def restore(self):
        self.point = self.stack.pop()

    def search(self, str):
        p = self.buf.index(str, self.point)
        self.point = p + len(str)
        return self.point

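    def esearch(self, pat, *opts):
        # Flags default to case-insensitive; several flags are OR-ed together.
        flags = 0
        for opt in (opts or [re.I]):
            flags |= opt
        g = re.compile(pat, flags).search(self.buf, self.point)
        if g is None:
            # Mirror search(), which raises ValueError when nothing is found.
            raise ValueError("pattern not found")
        self.point = g.end()
        return self.point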

    def rsearch(self, str):
        p = self.buf.rindex(str, 0, self.point)
        self.point = p
        return self.point

    def up_to(self, str):
        a = self.point
        b = self.search(str)
        # search() leaves point just past the delimiter, so drop its full length
        return self.buf[a:b-len(str)]



