[Tutor] beautifulSoup and .next iteration

Jon Crump jjcrump at myuw.net
Thu Apr 5 22:02:56 CEST 2007


As a complete tyro, I've broken my teeth on this web-page scraping 
problem. I've several times wanted to scrape pages in which the only 
identifying features are positional rather than syntactic: everything 
is a sibling of everything else, and there's no way to predict how 
many siblings there are in each section headed by an empty named 
anchor. I've been trying to use BeautifulSoup to scrape them, and it's 
not clear to me which is worse: my grasp of Python in general or of 
BeautifulSoup in particular. Here's a stripped-down example of the 
sort of thing I mean:

<html>
<body>
<a name="A1"></a>
<p>paragraph 1</p>
<p>paragraph 1.A</p>
<ul>
   <li>some line</li>
   <li>another line</li>
</ul>
<p>paragraph 1.B</p>

<a name="A2"></a>
<p>paragraph 2</p>
<p>paragraph 2.B</p>

<a name="A3"></a>
<p>paragraph 3</p>
<table>
   <tr><td>some</td><td>data</td></tr>
</table>
</body>
</html>

I want to end up with some container, say a list, containing something 
like this:
[
   [A1, paragraph 1, paragraph 1.A, some line, another line, paragraph 1.B]
   [A2, paragraph 2, paragraph 2.B]
   [A3, paragraph 3, some, data]
]
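
Spelled out as actual Python (quotes and commas added just to be 
explicit about what I'm after), that would be:

sections = [
    ['A1', 'paragraph 1', 'paragraph 1.A', 'some line', 'another line',
     'paragraph 1.B'],
    ['A2', 'paragraph 2', 'paragraph 2.B'],
    ['A3', 'paragraph 3', 'some', 'data'],
]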
I've tried things like this (just using print for now; I think I'll 
be able to build the lists or whatever once I get the basic idea):

anchors = soup.findAll('a', { 'name' : re.compile('^A.*$')})
for x in anchors:
   print x
   x = x.next
   while getattr(x, 'name') != 'a':
       print x

This gets me into endless loops. I can't help thinking there are 
simple and obvious ways to do this, probably many, but as a rank 
beginner I'm not seeing them.
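
My best guess is that the trouble is that x never advances inside the 
while loop, so it keeps testing the same node forever. The closest 
I've come to what I want is the sketch below. It assumes BeautifulSoup 
3 (the version I have) and that .next walks the parse tree in plain 
document order; it seems to produce the lists above for the toy page, 
but I have no idea whether it's the right way to go about it:

import re
from BeautifulSoup import BeautifulSoup, NavigableString

html = """
<html>
<body>
<a name="A1"></a>
<p>paragraph 1</p>
<p>paragraph 1.A</p>
<ul>
   <li>some line</li>
   <li>another line</li>
</ul>
<p>paragraph 1.B</p>

<a name="A2"></a>
<p>paragraph 2</p>
<p>paragraph 2.B</p>

<a name="A3"></a>
<p>paragraph 3</p>
<table>
   <tr><td>some</td><td>data</td></tr>
</table>
</body>
</html>
"""

soup = BeautifulSoup(html)
anchors = soup.findAll('a', {'name': re.compile('^A.*$')})

sections = []
for anchor in anchors:
    # Each section starts with the anchor's name attribute (A1, A2, ...).
    section = [anchor['name']]
    node = anchor.next
    # Walk forward in document order until we hit the next <a> tag or
    # fall off the end of the document. Unlike my attempt above, node
    # is advanced on every pass through the loop.
    while node is not None and getattr(node, 'name', None) != 'a':
        if isinstance(node, NavigableString):
            # Tags turn up again as their text children while walking
            # .next, so only the strings need collecting; skip the
            # whitespace-only ones between tags.
            text = node.strip()
            if text:
                section.append(text)
        node = node.next
    sections.append(section)

for section in sections:
    print section

The differences from my attempt above are that node = node.next 
happens on every pass, the loop also stops when node is None at the 
end of the document, and getattr is given a default of None so the 
test doesn't depend on plain strings having a tag name.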

Can someone wise in the ways of screen scraping give me a clue?

thanks,
Jon

