newb: BeautifulSoup

crybaby joemystery123 at gmail.com
Fri Sep 21 08:42:34 EDT 2007


I added extra td tags to your example, but for some reason I am
getting None for the second of these statements:

print all_tds[0].string
print all_tds[8].string


from BeautifulSoup import BeautifulSoup

doc = """
<html>
    <head>
        <title></title>
    </head>
    <body>
        <table>
        </table>

        <table>
            <tr><td>hello</td></tr>
            <tr><td>world</td><td>goodbye</td></tr>
              <tr>
              <td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
              <td align=right width=80><font size=2 face="New Times Roman,Times,Serif"> 48.884 </font></td>
              <td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
              <td align=right width=80><font size=2 face="New Times Roman,Times,Serif"> 49.950 </font></td>
              <td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
              <td align=right width=80><font size=2 face="New Times Roman,Times,Serif"> 69.322 </font></td>
              <td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
              <td align=right width=80><font size=2 face="New Times Roman,Times,Serif"> 99.740 </font></td>
              <td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
            </tr>
        </table>
    </body>
</html>
"""

soup = BeautifulSoup(doc)

tables = soup.findAll('table')
target_table = tables[1]

all_tds = target_table.findAll('td')
print all_tds[0].string
print all_tds[8].string
tds_str = all_tds[8].string
print tds_str

The output I am getting is the following:

>>> hello
None
None

I am not sure why I am getting None for all_tds[8] (all_tds[0] prints
hello as expected):

print all_tds[8].string
print tds_str
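
A possible explanation, offered as an assumption rather than something
confirmed in this thread: in BeautifulSoup 3, .string is only set when a
tag's single child is a text node. The cell at all_tds[8] wraps its
number in a <font> tag, so its only child is another tag and .string
comes back as None. In BeautifulSoup 3 something like
''.join(td.findAll(text=True)) should collect the nested text instead.
The sketch below shows the same idea using only the Python 3 standard
library (html.parser), not BeautifulSoup; the CellText class and the
shortened document are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of every <td>, including text inside nested tags."""

    def __init__(self):
        super().__init__()
        self.cells = []   # one accumulated string per <td>
        self.depth = 0    # greater than 0 while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside <font> (or any other nested tag) still lands here.
        if self.depth:
            self.cells[-1] += data


# Shortened stand-in for the table in the post (assumption).
doc = """
<table>
  <tr><td>hello</td></tr>
  <tr><td><img src="/img/spacer.gif" alt="|"/></td>
      <td align=right width=80><font size=2> 69.322 </font></td></tr>
</table>
"""

parser = CellText()
parser.feed(doc)
parser.close()

cells = [c.strip() for c in parser.cells]
print(cells)
```

The image-only cell yields an empty string rather than None, which
makes it easy to skip spacer cells before looking at the numbers.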

On Sep 21, 3:38 am, 7stud <bbxx789_0... at yahoo.com> wrote:
> On Sep 20, 9:04 pm, crybaby <joemystery... at gmail.com> wrote:
>
> > I need to traverse an HTML page with a big table that has many rows and
> > columns.  For example, how do I go to the 35th td tag and use a regex to
> > retrieve the content?  After that is done, how do I move down 15 td tags
> > from the 35th (35+15) and use a regex to retrieve the content there?
>
> 1) You can find your table using one of these methods:
>
> a)
> target_table = soup.find('table', id='car_parts')
>
> b)
> tables = soup.findAll('table')
> target_table = tables[2]
>
> The tables are put in a list in the order that they appear on the
> page.
>
> 2) You can get all the td's in the table using this statement:
>
> all_tds = target_table.findAll('td')
>
> 3) You can get the contents of the tags using these statements:
>
> print all_tds[34].string
> print all_tds[49].string
>
> Here is an example:
>
> from BeautifulSoup import BeautifulSoup
>
> doc = """
> <html>
>     <head>
>         <title></title>
>     </head>
>     <body>
>         <table>
>         </table>
>
>         <table>
>             <tr><td>hello</td></tr>
>             <tr><td>world</td><td>goodbye</td></tr>
>         </table>
>     </body>
> </html>
> """
>
> soup = BeautifulSoup(doc)
>
> tables = soup.findAll('table')
> target_table = tables[1]
>
> all_tds = target_table.findAll('td')
> print all_tds[0].string
> print all_tds[2].string
>
> --output:--
> hello
> goodbye
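
The steps quoted above stop at printing the cell contents. A minimal
sketch of the remaining part of the original question (apply a regex to
the 35th td, then to the one 15 cells further on) might look like the
following. It is written for Python 3 and works on a plain list of
strings, so the cells list, the NUMBER pattern, and the extract helper
are all hypothetical stand-ins; in practice the list would be built
from the text of each entry in all_tds.

```python
import re

# Hypothetical cell texts standing in for the text of each td in all_tds.
cells = ["price %.3f USD" % (i * 1.5) for i in range(60)]

# Assumed pattern: a decimal number such as 48.884.
NUMBER = re.compile(r"-?\d+\.\d+")


def extract(cells, start, step):
    """Return the first decimal number in cells[start], cells[start + step], ..."""
    found = []
    for i in range(start, len(cells), step):
        match = NUMBER.search(cells[i])
        # Keep None for cells with no number so positions stay aligned.
        found.append(match.group() if match else None)
    return found


# The 35th td is index 34; 15 cells further on is index 49.
print(extract(cells, 34, 15))
```

Indexing is zero-based, which is why the 35th and 50th cells correspond
to the all_tds[34] and all_tds[49] used in the quoted reply.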




