searching backwards in a string

Steve Holden sholden at holdenweb.com
Wed Feb 13 07:42:11 EST 2002


"Paul Rubin" <phr-n2002a at nightsong.com> wrote in ...
> "Steve Holden" <sholden at holdenweb.com> writes:
> > Paul, this thread's probably now old enough for you to tell us what the
> > real problem is! Why exactly do you need to search backwards from the
> > 50,000th character to find the beginning of an HTML tag?
>
> Suppose I'm parsing the file and I see a </table> tag and I want to
> find the matching <table> tag.  It could be pretty far back in the file.
> That's what I was doing when I encountered this question.  But searching
> backwards is a normal thing to want to do in general--for example it's
> a standard command in any decent text editor.
>
> Anyway, I just entered a sourceforge bug about it being missing from
> Python's re module.
>
I'd be very surprised if this meets with any response other than "this is
not a bug".
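
For what it's worth, a backwards search can be approximated with what the re
module already offers. The sketch below is only an illustration: the helper
name rsearch and the sample string are made up for this example, and it simply
scans forward and keeps the last match before a given position. Note that it
finds the nearest preceding <table> tag, which is not necessarily the matching
one when tables are nested. For a literal substring, the string method rfind
does the same job directly.

```python
import re

def rsearch(pattern, text, endpos=None):
    """Return the last match of pattern lying entirely before endpos, or None.

    Hypothetical helper: the re module has no backwards search, so this
    scans forward and remembers the final non-overlapping match.
    """
    if endpos is None:
        endpos = len(text)
    last = None
    for m in re.compile(pattern).finditer(text, 0, endpos):
        last = m
    return last

html = "<table><tr><td>a</td></tr></table><p>text</p>"
close = html.find("</table>")                  # position of the closing tag
m = rsearch(r"<table\b[^>]*>", html, close)    # nearest <table> before it
print(m.start(), m.group())

# For a plain substring, str.rfind searches from the right directly.
print(html.rfind("<table", 0, close))
```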

Frankly, if you see a </table> tag and you have no idea where the matching
<table> tag appears, then whatever you are doing to the HTML file, you
certainly aren't parsing it!
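
To make that concrete: a parser that keeps a stack of open tags always knows
which <table> a given </table> closes, so there is nothing to search for. The
following is a minimal sketch, assuming a modern Python 3 and its standard
html.parser module; the class name TableMatcher and the sample document are
invented for illustration.

```python
from html.parser import HTMLParser

class TableMatcher(HTMLParser):
    """Record (open_pos, close_pos) for each <table>...</table> pair."""

    def __init__(self):
        super().__init__()
        self.open_tables = []   # stack: positions of <table> tags not yet closed
        self.pairs = []         # completed (open_pos, close_pos) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # getpos() gives (line number, column offset) of the current tag
            self.open_tables.append(self.getpos())

    def handle_endtag(self, tag):
        # Ignore a stray </table> that has no open <table> to match.
        if tag == "table" and self.open_tables:
            self.pairs.append((self.open_tables.pop(), self.getpos()))

doc = "<table>\n<tr><td><table>\n</table></td></tr>\n</table>\n"
matcher = TableMatcher()
matcher.feed(doc)
matcher.close()
for opened, closed in matcher.pairs:
    print("table opened at", opened, "closed at", closed)
```

The inner table is reported first because its closing tag is seen first; the
top of the stack is always the most recently opened table.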

Don't know whether this will help: it's an example from "Python Web
Programming" that shows you how to extract the table structure from an HTML
file.

import htmllib, urllib, formatter, sys

def Usage():
    print """
Usage: python showtbls.py URL
"""

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)
        self.tblindent = 0      # current table nesting depth

    def start_table(self, attrs):
        # Called for each <table>: print it indented by nesting depth,
        # showing only the width and cellspacing attributes.
        sys.stdout.write("%s<table" % ("    " * self.tblindent, ))
        for k, v in attrs:
            if k in ("width", "cellspacing"):
                sys.stdout.write(' %s="%s"' % (k, v))
        print ">"
        self.tblindent += 1

    def end_table(self):
        # Called for each </table>: step back out one level before printing.
        self.tblindent -= 1
        print "%s</table>" % ("    " * self.tblindent, )

def parse(url, formatter):
    # Fetch the page and feed it through the parser in one go.
    f = urllib.urlopen(url)
    data = f.read()
    f.close()
    p = myHTMLParser(formatter)
    p.feed(data)
    p.close()

if len(sys.argv) != 2:
    Usage()
else:
    # NullFormatter discards the page text; only the table tags are printed.
    fmt = formatter.NullFormatter()
    parse(sys.argv[1], fmt)
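
The htmllib module used above does not exist in Python 3. A rough equivalent
of the same idea, assuming Python 3's standard html.parser module, might look
like the sketch below. The names TableOutline and outline and the sample markup
are made up for this illustration, and it works on a string rather than
fetching a URL so that it can be run as-is.

```python
from html.parser import HTMLParser

class TableOutline(HTMLParser):
    """Collect an indented outline of the <table> tags in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current table nesting depth
        self.lines = []     # outline lines, one per table tag

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # Keep only the attributes the original example displayed.
        shown = "".join(' %s="%s"' % (k, v) for k, v in attrs
                        if k in ("width", "cellspacing"))
        self.lines.append("%s<table%s>" % ("    " * self.depth, shown))
        self.depth += 1

    def handle_endtag(self, tag):
        if tag != "table":
            return
        self.depth -= 1
        self.lines.append("%s</table>" % ("    " * self.depth))

def outline(html_text):
    """Return the table outline of html_text as a list of lines."""
    p = TableOutline()
    p.feed(html_text)
    p.close()
    return p.lines

sample = ('<table width="100%"><tr><td>'
          '<table cellspacing="2" border="1"></table>'
          '</td></tr></table>')
print("\n".join(outline(sample)))
```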

regards
 Steve
--
Consulting, training, speaking: http://www.holdenweb.com/
Author, Python Web Programming: http://pydish.holdenweb.com/pwp/

"This is Python.  We don't care much about theory, except where it
intersects with useful practice."  Aahz Maruch on c.l.py