Mining strings from an HTML document.

Thu Jan 26 04:54:55 EST 2006

Runsun Pan helped me out with the following:

    You can also try the following very primitive solution that I
    sometimes use to extract simple information in a quick and dirty way:

    def extract(text,s1,s2):
        ''' Extract strings wrapped between s1 and s2.

        >>> t="""this is a <span>test</span> for <span>extract()</span>
        ... that <span>does multiple extract</span> """
        >>> extract(t,'<span>','</span>')
        ['test', 'extract()', 'does multiple extract']

        '''
        beg = [1,0][text.startswith(s1)]
        tmp = text.split(s1)[beg:]
        end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
        return [ x.split(s2)[0] for x in tmp if
                 len(x.split(s2))>1][:end]

This will help out a *lot*! Thank you. This is a better bet than the
parser in this particular implementation because the data I need is not
encapsulated in tags! Field names are within <b></b> tags, followed by
plain text data and terminated by a <br> tag. This was my main problem
with a parser, but your extract function solves it beautifully!
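
For anyone following along, here is a rough sketch of how I expect to use
it on my kind of data. The sample string and its field names are made up
for illustration (not my real page), and stripping the whitespace around
each value is my own assumption about the layout:

```python
def extract(text, s1, s2):
    """Return every substring of text found between s1 and s2."""
    beg = [1, 0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp) + 1][text.endswith(s2)]
    return [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1][:end]

# Hypothetical sample in the layout described above:
# <b>field name</b> plain text value <br>
page = "<b>Name:</b> Widget<br><b>Price:</b> 9.99<br>"

# Field names sit between <b> and </b>.
names = extract(page, "<b>", "</b>")

# Values are untagged, so use the closing </b> and the <br> as delimiters.
values = [v.strip() for v in extract(page, "</b>", "<br>")]

# Pair each field name with its value.
record = dict(zip(names, values))
print(record)
```

The trick is that the delimiters do not have to be a matching pair of
tags, so the plain text between </b> and <br> can be pulled out the same
way as the field names.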

I'm posting back to the NG just in case it is of value to anyone
else.

Could you/anyone explain the four lines of code to me, though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets, as in: beg = [1,0][text.startswith(s1)]?

Thanks for the help!
-d-