Mining strings from a HTML document.
Derick van Niekerk
derickvn at gmail.com
Thu Jan 26 04:54:55 EST 2006
Runsun Pan helped me out with the following:
You can also try the following very primitive solution that I sometimes
use to extract simple information in a quick and dirty way:
def extract(text,s1,s2):
    ''' Extract strings wrapped between s1 and s2.

    >>> t="""this is a <span>test</span> for <span>extract()</span>
    ... that <span>does multiple extract</span> """
    >>> extract(t,'<span>','</span>')
    ['test', 'extract()', 'does multiple extract']
    '''
    beg = [1,0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
    return [ x.split(s2)[0] for x in tmp if
             len(x.split(s2))>1][:end]
This will help out a *lot*! Thank you. This is a better bet than the
parser in this particular implementation because the data I need is not
encapsulated in tags! Field names are within <b></b> tags followed by
plain text data and ended with a <br> tag. This was my main problem
with a parser, but your extract function solves it beautifully!
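For example, here is a minimal sketch of how extract() could be applied to data laid out that way. The sample record and its field names are made up for illustration; the real page will differ. Using '<b>' and '</b>' as the markers pulls out the field names, and using '</b>' and '<br>' pulls out the plain text that follows each one:

```python
def extract(text, s1, s2):
    """Return every substring of text wrapped between s1 and s2."""
    beg = [1, 0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp) + 1][text.endswith(s2)]
    return [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1][:end]

# Hypothetical sample record: field name in <b></b>, plain text value, then <br>.
page = "<b>Title:</b> Example<br><b>Year:</b> 2006<br>"

# Field names sit between <b> and </b>.
names = extract(page, "<b>", "</b>")

# Values sit between </b> and the following <br>; strip the stray spaces.
values = [v.strip() for v in extract(page, "</b>", "<br>")]

# Pair each field name with its value.
record = dict(zip(names, values))
print(record)
```

This only works if every field follows the same pattern; a missing <br> or a nested <b> would throw the pairing off.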
I'm posting back to the NG just in case it is of value to anyone
else.
Could you/anyone explain the 4 lines of code to me though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets, as in: beg = [1,0][text.startswith(s1)]?
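For anyone else reading along, here is how the expression appears to break down (a sketch, not necessarily how Runsun would explain it). The first pair of brackets builds a two-element list and the second pair indexes into it. Because True and False behave as the integers 1 and 0, the boolean picks one of the two elements. The variable names below are only for illustration:

```python
text = "<span>a</span>"
s1 = "<span>"

# bool is a subclass of int, so True == 1 and False == 0.
choices = [1, 0]                      # first pair of brackets: a list literal
starts = text.startswith(s1)          # True here
beg = choices[starts]                 # second pair: index 1 -> the value 0

# The same decision written out in full.
if text.startswith(s1):
    beg_explicit = 0
else:
    beg_explicit = 1

# When the test is False, index 0 is used and the value 1 comes back.
other = [1, 0]["plain text".startswith(s1)]
```

So beg is 0 when the text begins with s1 and 1 otherwise, which decides whether the first chunk produced by split() is kept or skipped.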
Thanks for the help!
-d-