Removing certain tags from html files
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Fri Jul 27 14:45:29 EDT 2007
On Fri, 27 Jul 2007 17:40:23 +0000, sebzzz wrote:
> My question, since I'm quite new to python, is about what tool I
> should use to remove the table, tr and td tags, but not what's
> enclosed in it. I think BeautifulSoup isn't good for that because it
> removes what's enclosed as well.
Then take hold of the content and add it to the parent. Something like
this should work:
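from BeautifulSoup import BeautifulSoup


def remove(soup, tagname):
    # Drop every `tagname` tag but hand its children over to its parent.
    for tag in soup.findAll(tagname):
        # Copy the list: re-parenting a child may alter tag.contents.
        contents = list(tag.contents)
        parent = tag.parent
        tag.extract()
        for child in contents:
            # Note: the children land at the end of the parent.
            parent.append(child)


def main():
    source = '<a><b>This is a <c>Test</c></b></a>'
    soup = BeautifulSoup(source)
    print soup
    remove(soup, 'b')
    print soup


if __name__ == '__main__':
    main()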
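If BeautifulSoup is not at hand, the same idea (drop the tag, keep what it encloses) can be sketched with the standard library alone. This is only a rough illustration, assuming Python 3 and its html.parser module; the names TagStripper and strip_tags are made up for this example. It re-emits the input piece by piece and skips the start and end tags whose names are listed:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emit the input, skipping the listed tags."""

    def __init__(self, drop):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        # The parser reports tag names in lower case, so compare that way.
        self.drop = set(name.lower() for name in drop)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.drop:
            # Raw text of the start tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag not in self.drop:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.drop:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def strip_tags(source, names):
    """Return `source` without the tags in `names`, keeping their content."""
    parser = TagStripper(names)
    parser.feed(source)
    parser.close()
    return ''.join(parser.parts)


print(strip_tags('<a><b>This is a <c>Test</c></b></a>', ['b']))
```

For the test string above this prints <a>This is a <c>Test</c></a>, and unlike the append() approach the content stays where the removed tag was. It does not repair broken markup the way BeautifulSoup does, so treat it as a sketch only.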
> Is re the good module for that? Basically, if I make an iteration that
> scans the text and tries to match every occurrence of a given regular
> expression, would it be a good idea?
No, regular expressions are not a very good idea. They get very
complicated very quickly while often still missing some corner cases.
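To give an idea of such corner cases, here is a small sketch (Python 3, standard library only). The pattern NAIVE_TD is just a guess at what a first attempt might look like, not something taken from your code:

```python
import re

# Hypothetical first attempt: match an opening or closing <td> tag.
NAIVE_TD = re.compile(r'</?td[^>]*>')

simple = '<td class="x">cell</td>'
tricky = '<td title="a > b">cell</td>'   # '>' inside an attribute value
commented = '<!-- <td> -->'              # tag-like text inside a comment

print(NAIVE_TD.sub('', simple))     # prints: cell
print(NAIVE_TD.sub('', tricky))     # prints:  b">cell
print(NAIVE_TD.sub('', commented))  # the comment's content gets altered
```

The simple case works, but the '>' inside the attribute ends the match too early and leaves debris behind, and the comment is modified although it contains no real tag. Each such case can be patched, which is exactly how the expression grows complicated.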
Ciao,
Marc 'BlackJack' Rintsch