[Tutor] beautifulSoup and .next iteration
Jon Crump
jjcrump at myuw.net
Thu Apr 5 22:02:56 CEST 2007
As a complete tyro, I've broken my teeth on this web-page scraping
problem. Several times I've wanted to scrape pages in which the only
identifying elements are positional rather than syntactical: that is,
pages in which everything is a sibling, and there's no way to predict how
many siblings there are in each section headed by an empty named anchor.
I've been trying to use BeautifulSoup to scrape these, and it's not clear
to me which is worse: my grasp of Python in general or of BeautifulSoup
in particular. Here's a stripped-down example of the sort of thing I mean:
<html>
<body>
  <a name="A1"></a>
  <p>paragraph 1</p>
  <p>paragraph 1.A</p>
  <ul>
    <li>some line</li>
    <li>another line</li>
  </ul>
  <p>paragraph 1.B</p>
  <a name="A2"></a>
  <p>paragraph 2</p>
  <p>paragraph 2.B</p>
  <a name="A3"></a>
  <p>paragraph 3</p>
  <table>
    <tr><td>some</td><td>data</td></tr>
  </table>
</body>
</html>
I want to end up with some container, say a list, containing something
like this:
[
[A1, paragraph 1, paragraph 1.A, some line, another line, paragraph 1.B]
[A2, paragraph 2, paragraph 2.B]
[A3, paragraph 3, some, data]
]
I've tried things like this (just using print for now; I think I'll be
able to build the lists or whatever once I get the basic idea):
anchors = soup.findAll('a', {'name': re.compile('^A.*$')})
for x in anchors:
    print x
    x = x.next
    while getattr(x, 'name') != 'a':
        print x
And I get into endless loops. I can't help thinking there are simple and
obvious ways to do this, probably many, but as a rank beginner I'm just
not seeing them.
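(For what it's worth, the while loop above never advances x inside its body, which is why it spins forever.) The grouping itself doesn't strictly need BeautifulSoup's .next at all; here is only a sketch, in current Python 3 syntax, using the standard library's HTMLParser: start a new list at every named anchor and append each non-blank text node to the current list. The class name AnchorGrouper is my own invention for illustration, not anything from BeautifulSoup.

```python
from html.parser import HTMLParser  # stdlib; the module was HTMLParser in Python 2

class AnchorGrouper(HTMLParser):
    """Start a new section at each <a name="...">; collect text into it."""

    def __init__(self):
        super().__init__()
        self.sections = []          # list of [anchor_name, text, text, ...]

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            name = dict(attrs).get("name")
            if name:                # each named anchor opens a new section
                self.sections.append([name])

    def handle_data(self, data):
        text = data.strip()
        if text and self.sections:  # skip whitespace and pre-anchor text
            self.sections[-1].append(text)

html = """<html><body>
<a name="A1"></a><p>paragraph 1</p><p>paragraph 1.A</p>
<ul><li>some line</li><li>another line</li></ul><p>paragraph 1.B</p>
<a name="A2"></a><p>paragraph 2</p><p>paragraph 2.B</p>
<a name="A3"></a><p>paragraph 3</p>
<table><tr><td>some</td><td>data</td></tr></table>
</body></html>"""

grouper = AnchorGrouper()
grouper.feed(html)
for section in grouper.sections:
    print(section)
# ['A1', 'paragraph 1', 'paragraph 1.A', 'some line', 'another line', 'paragraph 1.B']
# ['A2', 'paragraph 2', 'paragraph 2.B']
# ['A3', 'paragraph 3', 'some', 'data']
```

This sidesteps sibling counting entirely: the parser sees the document as a flat event stream, so "everything is a sibling" stops being a problem.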
Can someone wise in the ways of screen scraping give me a clue?
thanks,
Jon