[Tutor] BeautifulSoup - getting cells without new line characters

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sat Apr 1 22:37:41 CEST 2006



> > Have you read a Python tutorial? It seems like some of the things you
> > are struggling with might be addressed in general Python material.
>
>
> You consider a thing about me. If I ask something it is because I cannot
> find the solution. I do not it by whim.

Hello Jonas,

Yes, but don't take Kent's question as a personal insult --- he's asking
because it looks like you're having trouble interpreting error messages or
considering border cases.

If you're doing HTML parsing, there's an unspoken assumption that you've
already mastered basic programming.  After reading the questions you're
asking, I agree with Kent; you're struggling with things that you should
have already covered in tutorials.

Out of curiosity, what tutorials have you looked at?  Give us links, and
we'll take a look and evaluate them for accuracy.  Some Python tutorials
are good, but some of them out there are quite bad too.  The ones linked
from:

    http://wiki.python.org/moin/BeginnersGuide/NonProgrammers

and:

    http://wiki.python.org/moin/BeginnersGuide/Programmers

should be ok.



> * for rows in table('tr'): print rows('td')
>
> it fails when i'm going to get data of each cell using:
>
> for rows in table('tr'): print rows('td')[0]

Just to clarify: when you say it "fails", please try to be more specific.
What exactly does Python report as the error?


I see that you mention the error message here:

    http://mail.python.org/pipermail/tutor/2006-March/046103.html

But are you looking at the error message and trying to understand what
it's saying?  It says that the cell doesn't have a zeroth element. And
this is probably true! I wouldn't disbelieve the computer on this one.
*grin*
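
To see that kind of error in isolation, here is a small sketch in plain Python. It assumes that rows('td') gives back a list of the matching cells, and that the error in question is an IndexError; the lists and the first_cell helper are made up for illustration and are not part of BeautifulSoup.

```python
# Hypothetical stand-ins for what rows('td') might give back.
cells_in_data_row = ["first cell", "second cell"]
cells_in_header_row = []   # a row that contains no TD elements at all


def first_cell(cells):
    """Return the first item of cells, or None if the list is empty."""
    try:
        return cells[0]        # raises IndexError when cells is empty
    except IndexError:
        return None


print(first_cell(cells_in_data_row))
print(first_cell(cells_in_header_row))
```

Asking an empty list for element zero is what triggers the error, so the fix has to deal with the empty case before indexing.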


TD elements probably won't have nested sub-elements.  They may have a
'string' attribute, though, which is what Kent's example used to pull
the text out of the TD.

But your reply to his message doesn't look like it even responds to Kent's
example.  It is unclear to us why you're not reading or understanding his
example, and just going off and doing something else.  If you don't
understand a reply, try asking a question about it: we'll be happy to
elaborate.  Try not to go off so quickly and ignore responses:  it gives
the impression that you don't care enough to read through things.



Anyway, the program snippet above makes assumptions, so let's get those
out of the way.  Concretely:

    for rows in table('tr'):
        print rows('td')[0]

makes an assumption that is not necessarily true:

    * It assumes that each row has a td element.

Do you understand the border case here?  In particular:

    * What if you hit a TR table row that does not have any TD columns?

From viewing the wiki web page you gave as an example, I can see several
TR's in the page's content that do not have TD's, but only TH's.  I'm not
certain what BeautifulSoup will do in this situation --- I suspect that
it'll return None --- but in any case, your code has to account for this
possibility.
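
One way to account for it is sketched below. This is only an illustration under a few assumptions: it uses the newer bs4 package with Python's built-in "html.parser" rather than the BeautifulSoup module from this thread, the HTML table is invented, and first_cells is a hypothetical helper name. As far as I know, calling a tag like row("td") gives back an empty list when nothing matches, so the check is written as "if not cells", which would also cover None.

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# Invented example table: the first row has only TH header cells.
HTML = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>
    alpha
  </td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


def first_cells(html):
    """Collect the text of the first TD of every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find("table")("tr"):
        cells = row("td")      # calling a tag is shorthand for find_all
        if not cells:          # header-only row: there is nothing to index
            continue
        # get_text(strip=True) drops the surrounding whitespace and newlines
        results.append(cells[0].get_text(strip=True))
    return results


print(first_cells(HTML))
```

The header row is skipped rather than indexed, and the newline characters around "alpha" are stripped when the cell text is pulled out.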


