Parsing html with Beautifulsoup
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Dec 13 05:58:55 EST 2009
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies <jspies at sun.ac.za>
wrote:
> Gabriel Genellina wrote:
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies at sun.ac.za>
>> wrote:
>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example)
>>>
>>> sunentint for <img src=icons/group.png> <a
>>> href=#OBJ_sunetint>sunetint</A><BR>
>>>
>>> and still provide the text-parts in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>>
> Thanks for your reply. Here are a few lines:
>
> <!------- Rule 1 ------->
> <tr style="background-color: #ffffff"><td class=normal>2</td><td><img
> src=icons/usrgroup.png> All Users at Any<br><td><im$
> </td><td><img src=icons/any.png> Any<br></td><td><img
> src=icons/clientencrypt.png> clientencrypt</td><td><img src$
> </td><td> </td></tr>
I *think* I finally understand what you want (your previous example above
confused me).
If you want Rule 1 to generate a line like this:
2,All Users at Any,<im$,Any,clientencrypt,,
this code should serve as a starting point:
# assumes BeautifulSoup 3 and that html holds the page source
from BeautifulSoup import BeautifulSoup

lines = []
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            # join all text nodes in the cell, flattening \n and &nbsp;
            text = ' '.join(
                s.replace('\n',' ').replace('&nbsp;',' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

import csv
with open("output.csv","wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)
cell.findAll(text=True) returns a list of all text nodes inside a <td>
cell; I replace every \n and &nbsp; in each text node with a space, and
join them all. lines is a list of lists (one inner list per row, each
entry one cell), as expected by the csv module used to write the output
file.
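
To make the cleanup step concrete, here is a minimal sketch that runs without BeautifulSoup. The helper name cell_text and the sample text nodes are made up for illustration; they only mimic what cell.findAll(text=True) might return for three cells of one row. It writes to an in-memory buffer using the Python 3 csv conventions.

```python
import csv
import io

def cell_text(text_nodes):
    """Join the text nodes of one cell into a single cleaned-up string."""
    return ' '.join(
        s.replace('\n', ' ').replace('&nbsp;', ' ')
        for s in text_nodes).strip()

# Hypothetical text nodes for three cells of one table row.
row_nodes = [
    ['2'],
    ['\n', ' All Users at Any'],
    ['&nbsp;'],
]

# One inner list per row, one entry per cell.
lines = [[cell_text(nodes) for nodes in row_nodes]]

# newline='' leaves the csv module's own line endings untouched.
buf = io.StringIO(newline='')
csv.writer(buf).writerows(lines)
output = buf.getvalue()
```

When writing to a real file on Python 3, the equivalent would be open("output.csv", "w", newline="") rather than the "wb" mode used above, which is the Python 2 convention.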
--
Gabriel Genellina