[Tutor] Remove certain tags in html files

Sebastien Noel sebastien at solutions-linux.org
Fri Jul 27 22:04:45 CEST 2007


Here is the code to deal with one given file (the code to iterate over all the 
files is working, and I will glue both together when the second one does 
what I want):

It's a little long, but I wanted to include all of it so that maybe I can get 
some tips to speed things up, because it's pretty slow.

import BeautifulSoup, tidy

file = open("index.htm", "r")
soup = BeautifulSoup.BeautifulSoup(file)
file.close()

#remove unnecessary things (scripts, styles, ...)
for script in soup("script"):
    script.extract()

for style in soup("style"):
    style.extract()
   
#remove comments
comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
[comment.extract() for comment in comments]

#the following removes things specific to the pages I'm working with,
#don't mind the langcanada things
#I was just too lazy to change the name of this variable each time
#I think this is an area that could be done differently to get more speed

langcanada = soup.findAll("img", src="graphics/button_f.jpg")
[img.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("img", src="graphics/button_e.jpg")
[img.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("img", src="http://u1.extreme-dm.com/i.gif")
[img.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("a", href="research/disclaimer.htm")
[img.parent.extract() for img in langcanada]

comments = soup.findAll(text=" ")
[comment.extract() for comment in comments]

langcanada = soup.findAll("img", id="logo")
[img.parent.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("img", id="about")
[img.parent.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("img", src="images/navbgrbtm.jpg")
[img.parent.parent.parent.parent.extract() for img in langcanada]

langcanada = soup.findAll("img", src="images/navbgrtop.jpg")
[img.parent.parent.parent.parent.extract() for img in langcanada]

#delete class attributes
   
for divs in range(len(soup.findAll("div"))):
    le_div = soup.findAll("div")[divs]
    del le_div["class"]
   
for paras in range(len(soup.findAll("p"))):
    le_par = soup.findAll("p")[paras]
    del (le_par["class"])
   
for imgs in range(len(soup.findAll("img"))):
    le_img = soup.findAll("img")[imgs]
    del (le_img["hspace"])
    del (le_img["vspace"])
    del (le_img["border"])
   
# Add some class attributes
for h1s in range(len(soup.findAll("h1"))):
    le_h1 = soup.findAll("h1")[h1s]
    le_h1["class"] = "heading1_main"
   
for h2s in range(len(soup.findAll("h2"))):
    le_h2 = soup.findAll("h2")[h2s]
    le_h2["class"] = "heading2_main"

for h3s in range(len(soup.findAll("h3"))):
    le_h3 = soup.findAll("h3")[h3s]
    le_h3["class"] = "heading3_main"
   
for h4s in range(len(soup.findAll("h4"))):
    le_h4 = soup.findAll("h4")[h4s]
    le_h4["class"] = "heading4_main"
   
for h5s in range(len(soup.findAll("h5"))):
    le_h5 = soup.findAll("h5")[h5s]
    le_h5["class"] = "heading5_main"
   
# links, makes difference between internal and external ones
for links in range(len(soup.findAll("a"))):
    le_link = soup.findAll("a")[links]
    le_href = le_link.get("href", "")  # anchors without an href would raise KeyError
    if (le_href.startswith("""http://caslt.org""") or
            le_href.startswith("""http://www.caslt.org""")):
        le_link["class"] = "caslt_link"
    elif le_href.startswith("""http://"""):
        le_link["class"] = "external_link"
    else:
        le_link["class"] = "caslt_link"
   
del (soup.body["onload"])

# This is what needs to be done:
###### change tables to divs
###### remove all td tags
###### remove all tr tags



# Tidying

soup = soup.prettify()
erreurs = ""

tidy_options = {"tidy-mark": 0,
"wrap": 0,
"wrap-attributes": 0,
"indent": "auto",
"output-xhtml": 1,
"doctype": "loose",
"input-encoding": "utf8",
"output-encoding": "utf8",
"break-before-br": 1,
"clean": 1,
"logical-emphasis": 1,
"drop-font-tags": 1,
"enclose-text": 1,
"alt-text": " ",
"write-back": 1,
"error-file": erreurs,
"show-warnings": 0,
"quiet": 1,
"drop-empty-paras": 1,
"drop-proprietary-attributes": 1,
"join-classes": 1,
"join-styles": 1,
"show-body-only": 1,
"word-2000": 1,
"force-output": 1}

soup_tidy = tidy.parseString(soup, **tidy_options)

outputfile = open("index2.htm", "w")
outputfile.write(str(soup_tidy))
outputfile.close()
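
A possible speed-up for the loops above, not benchmarked: each
"for x in range(len(soup.findAll(...)))" loop calls soup.findAll again on
every pass, so the whole tree is searched once per matching tag. Iterating
directly over a single findAll result should avoid that. The helper names
strip_attributes and set_heading_classes are made up for this sketch; soup is
the object built at the top of the script.

```python
def strip_attributes(soup, tag_name, attr_names):
    """Delete the given attributes from every <tag_name>, searching the tree once."""
    for tag in soup.findAll(tag_name):  # one search instead of one per iteration
        for attr in attr_names:
            # The original script already relies on "del tag[attr]" being
            # harmless when the attribute is absent.
            del tag[attr]


def set_heading_classes(soup, levels=range(1, 6)):
    """Give h1..h5 the classes heading1_main..heading5_main."""
    for level in levels:
        for heading in soup.findAll("h%d" % level):
            heading["class"] = "heading%d_main" % level
```

With these, the attribute section would shrink to calls such as
strip_attributes(soup, "div", ["class"]),
strip_attributes(soup, "img", ["hspace", "vspace", "border"]) and
set_heading_classes(soup).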


Alan Gauld wrote:
> "Sebastien Noel" <sebastien at solutions-linux.org> wrote
>
>   
>> My question, since I'm quite new to python, is about what tool I 
>> should
>> use to remove the table, tr and td tags, but not what's enclosed in 
>> it.
>> I think BeautifulSoup isn't good for that because it removes what's
>> enclosed as well.
>>     
>
> BS can do what you want, you must be missing something. One of the
> most basic examples of using BS is to print an HTML file as plain text
> - ie stripping just the tags. So it must be possible.
>
> Can you put together a short example of the code you are using?
>
> You can use lower level parsers but BS is generally easier, but until
> we know what you are doing it's hard to guess what might be wrong.
>
> Alan G. 
>   
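
For comparison, the lower-level route mentioned above might look roughly like
this. It is only a sketch, and it assumes Python 3's standard html.parser
module rather than the Python 2 setup used in my script. The names DROP,
RENAME, TableUnwrapper and unwrap_tables are made up for the example. It drops
the tr and td tags, turns table into div (discarding the table's attributes),
and copies everything else through unchanged; th, tbody and friends are not
handled.

```python
from html.parser import HTMLParser

DROP = {"tr", "td"}          # tags removed; their contents are kept
RENAME = {"table": "div"}    # tags replaced by another tag


class TableUnwrapper(HTMLParser):
    """Copy the input through, unwrapping <tr>/<td> and renaming <table>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        HTMLParser.__init__(self, convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            return
        if tag in RENAME:
            self.out.append("<%s>" % RENAME[tag])  # attributes are discarded
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied as written.
        if tag not in DROP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in DROP:
            self.out.append("</%s>" % RENAME.get(tag, tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def unwrap_tables(html):
    """Return html with table markup unwrapped, keeping the enclosed content."""
    parser = TableUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling unwrap_tables on the prettified string just before the tidy step would
be one place to try it.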


