Use BeautifulSoup to delete certain tag while keeping its content

Paul Boddie paul at boddie.org.uk
Sun Sep 7 02:37:52 CEST 2008


On 6 Sep, 17:11, "Jackie Wang" <jackie.pyt... at gmail.com> wrote:
>
> I have the following html code:
>
> <td valign="top" headers="col1">
>  <font size="2">
>   Center Bank
>   <br />
>   Los Angeles, CA
>  </font>
> </td>
>
> <td valign="top" headers="col1">
>  <font size="2">
>   Salisbury Bank and Trust Company
>   <font face="arial, helvetica" size="2" color="#0000000">
>    <br />
>    Lakeville, CT
>   </font>
>  </font>
> </td>
>
> How should I delete the 'font' tags while keeping the content inside?

This sounds like an editing exercise, really. If you're comfortable
learning a new tool, I can recommend XSLT for this kind of job. Here's
the stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="font">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

This just describes two things: firstly, that you want to recognise
font elements and to include their contents, not each element's start
and end tags; secondly, that all other parts of the document should be
copied.
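
As a rough illustration of those same two rules outside XSLT, here is
a minimal sketch using only the html.parser module from the Python 3
standard library. The names FontStripper and strip_font_tags are made
up for this example and are not part of any library, and the sketch
assumes it is acceptable to re-emit end tags in lower case.

```python
from html.parser import HTMLParser


class FontStripper(HTMLParser):
    """Copy a document through unchanged, minus <font> start/end tags."""

    def __init__(self):
        # Keep entities such as &amp; as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    # Rule 1: for font elements, emit nothing for the tags themselves;
    # their contents still arrive through the other handlers below.
    # Rule 2: everything else is copied to the output.
    def handle_starttag(self, tag, attrs):
        if tag != "font":
            # get_starttag_text() returns the start tag as it was written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> arrive here.
        if tag != "font":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "font":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_font_tags(html_text):
    """Return html_text with font tags removed and their contents kept."""
    parser = FontStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_font_tags(s) on the markup from the question should
leave the td elements, the text and the br elements in place, with
both the outer and the nested font tags gone.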

You can apply stylesheets using a number of XSL processors. The
xsltproc program is usually available where libxslt is installed, and
although I'm sure others will be along to tell you all about their
favourite libraries and tools, here's how I use mine within Python:

# XSLTools: http://www.python.org/pypi/XSLTools
# libxml2dom: http://www.python.org/pypi/libxml2dom
import XSLTools.XSLOutput
import libxml2dom
# If s is the document text...
d = libxml2dom.parseString(s)
# Save the above stylesheet to a file somewhere, then...
proc = XSLTools.XSLOutput.Processor(["/tmp/no-font.xsl"])
# Get the result document
d2 = proc.get_result(d)
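
For comparison, the xsltproc program can also be driven from Python
without any extra libraries. This is only a sketch: the function names
and the html_input parameter are invented for the example, and it
assumes that xsltproc is on the PATH and that the stylesheet has been
saved to a file as described above.

```python
import shutil
import subprocess


def xsltproc_command(stylesheet_path, document_path, html_input=False):
    """Build the argument list: xsltproc [--html] STYLESHEET DOCUMENT."""
    command = ["xsltproc"]
    if html_input:
        # --html makes xsltproc read the input with the HTML parser,
        # which is more forgiving than the XML parser.
        command.append("--html")
    command += [stylesheet_path, document_path]
    return command


def transform_with_xsltproc(stylesheet_path, document_path, html_input=False):
    """Run xsltproc and return the transformed document as text."""
    if shutil.which("xsltproc") is None:
        raise RuntimeError("xsltproc was not found on PATH")
    result = subprocess.run(
        xsltproc_command(stylesheet_path, document_path, html_input),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if xsltproc reports failure
    )
    return result.stdout
```

For instance, transform_with_xsltproc("/tmp/no-font.xsl", "page.xml")
would run the stylesheet over a hypothetical file page.xml and return
the result as a string.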

Anyway, this is just one option of many to deal with this kind of
problem.

Paul


