python-parser running Beautiful Soup needs to be reviewed

Alexander Kapps alex.kapps at web.de
Sat Dec 11 17:26:12 EST 2010


On 11.12.2010 22:38, Stef Mientki wrote:
> On 11-12-2010 17:24, Martin Kaspar wrote:
>> Hello community,
>>
>> I am new to Python and to Beautiful Soup as well.
>> It is said to be a great tool for parsing and extracting content, so here I
>> am:
>>
>> I want to take the content of a <td> tag of a table in an HTML
>> document. For example, I have this table:
>>
>> <table class="bp_ergebnis_tab_info">
>>      <tr>
>>              <td>
>>                       This is a sample text
>>              </td>
>>
>>              <td>
>>                       This is the second sample text
>>              </td>
>>      </tr>
>> </table>
>>
>> How can I use BeautifulSoup to take the text "This is a sample text"?
>>
>> Should I use
>> soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get
>> the whole table?
>>
>> See the target http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323
>>
>> Well, what do we have to do first?
>>
>> The first thing is to find the table.
>>
>> I do this with find. Using find rather than findAll returns the first
>> match directly (rather than a list of all matches, in which case we'd
>> have to add an extra [0] to take the first element of the list):
>>
>>
>> table = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
>>
>> Then use find again, this time on the table, to find the first td:
>>
>> first_td = table.find('td')
>>
>> Then we have to use renderContents() to extract the textual contents:
>>
>> text = first_td.renderContents()
>>
>> ... and the job is done (though we may also want to use strip() to
>> remove leading and trailing whitespace):
>>
>> trimmed_text = text.strip()
>>
>> This should give us:
>>
>>
>> print trimmed_text
>> This is a sample text
>>
>> as desired.
>>
>>
>> What do you think about the code? I would love to hear from you!
> I've no opinion.
> I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen ;-)

Really? While I'm by no means an expert, I find it very easy to work 
with. It's very well structured IMHO.

> So the simplest solution I came up with:
>
> from BeautifulSoup import BeautifulSoup
>
> Text = """
> <table class="bp_ergebnis_tab_info">
>      <tr>
>              <td>
>                       This is a sample text
>              </td>
>
>              <td>
>                       This is the second sample text
>              </td>
>      </tr>
> </table>
> """
> Content = BeautifulSoup ( Text )
> print Content.find('td').contents[0].strip()
>>>> This is a sample text
>
> And now I wonder how to get the contents of the next td!

Content = BeautifulSoup ( Text )
for td in Content.findAll('td'):
     print td.string.strip() # or td.renderContents().strip()
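
For anyone without Beautiful Soup at hand, roughly the same loop can be sketched with the standard library's html.parser under Python 3. Note that this is a swapped-in technique, not Beautiful Soup's API: the CellCollector class and the td_texts function are hypothetical names made up for this illustration, and the sketch assumes that <td> elements are not nested inside each other.

```python
# Sketch only: standard-library html.parser, used here in place of Beautiful Soup.
from html.parser import HTMLParser

HTML = """
<table class="bp_ergebnis_tab_info">
    <tr>
        <td>
            This is a sample text
        </td>

        <td>
            This is the second sample text
        </td>
    </tr>
</table>
"""


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the stripped text of every <td>."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell texts, in document order
        self._parts = None   # text chunks of the <td> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []  # start collecting text for a new cell

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            # Join the chunks and strip surrounding whitespace, like strip() above.
            self.cells.append("".join(self._parts).strip())
            self._parts = None


def td_texts(html):
    """Return the stripped text of each <td> in html (assumes no nested <td>)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


for text in td_texts(HTML):
    print(text)
```

Unlike the find() call on the table earlier in the thread, this sketch does not filter by the table's class attribute; it simply walks every td in the input.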


