How to apply text changes to HTML, keeping it intact if inside "a" tags

vbfoobar at vbfoobar at
Wed Sep 27 05:51:06 CEST 2006


I have HTML input to which I apply some changes.

Feature 1:
I want to transform all the text, but if the text is inside
an "a href" tag, I want to leave the text as it is.

The HTML is not necessarily well-formed, so
I would like to do that using BeautifulSoup (or
maybe another tolerant parser).

As a test case, suppose I want to uppercase all the text
except the text that is within "a href" tags:

ExampleString = """
    <footag>Lorem Ipsum</footag> is simply
    dummy text of <a href="junk.html">the printing</a> and
    <a href="junk2.html">typesetting <b>industry</b>.</a>
"""

When applying the text transform, I want to obtain:

    <footag>LOREM IPSUM</footag> IS SIMPLY
    DUMMY TEXT OF <a href="junk.html">the printing</a> AND
    <a href="junk2.html">typesetting <b>industry</b>.</a>
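
One possible way to approach Feature 1, as a minimal sketch only: instead of
BeautifulSoup it uses the standard library's html.parser, which also tolerates
markup that is not well-formed. The names OutsideLinkTransformer and
transform_outside_links are made up for this illustration. The sketch re-emits
start tags as written, rebuilds end tags from the lowercased tag name, and
would also transform the contents of script and style elements, so it may need
adjusting for real input.

```python
from html.parser import HTMLParser


class OutsideLinkTransformer(HTMLParser):
    """Re-emit HTML, transforming only text that is outside <a href=...>."""

    def __init__(self, transform):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.transform = transform  # hypothetical callable: str -> str
        self.out = []               # pieces of the rebuilt document
        self.in_link = False        # True while inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.in_link = True
        self.out.append(self.get_starttag_text())  # start tag as written

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text inside a link is copied; everything else is transformed.
        self.out.append(data if self.in_link else self.transform(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def transform_outside_links(html, transform):
    """Return html with transform applied to text outside <a href> tags."""
    parser = OutsideLinkTransformer(transform)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With the test case above, transform_outside_links(ExampleString, str.upper)
should leave both links untouched and uppercase the rest. Any other
str -> str function can be passed in place of str.upper.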

Feature 2:
Another thing I may want to do: If the text I would normally
transform is inside an "a href" tag, then do not transform it,
but insert the result of text transformation just after the "</a>".

Using the same example as input, applying
Feature 2 would give something like this:

    <footag>LOREM IPSUM</footag> IS SIMPLY
    DUMMY TEXT OF <a href="junk.html">the printing</a><feat2>THE PRINTING</feat2> AND
    <a href="junk2.html">typesetting <b>industry</b>.</a><feat2>TYPESETTING <b>INDUSTRY</b>.</feat2>
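
Feature 2 might be sketched along the same lines, again with the standard
library's html.parser rather than BeautifulSoup. While inside an "a href" tag,
the sketch copies the link unchanged and collects a transformed copy of its
contents, which it emits inside a wrapper element right after the "</a>". The
names LinkEchoTransformer, transform_with_echo and the wrapper parameter are
assumptions made for this illustration, and a link that is never closed is
simply copied without an echo.

```python
from html.parser import HTMLParser


class LinkEchoTransformer(HTMLParser):
    """Transform text outside links; echo transformed link content after </a>."""

    def __init__(self, transform, wrapper="feat2"):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.transform = transform  # hypothetical callable: str -> str
        self.wrapper = wrapper      # hypothetical name of the wrapper element
        self.out = []               # pieces of the rebuilt document
        self.echo = None            # transformed copy, a list while inside a link

    def _emit(self, markup):
        # Markup goes unchanged into the output and into the pending echo.
        self.out.append(markup)
        if self.echo is not None:
            self.echo.append(markup)

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.out.append(self.get_starttag_text())
            self.echo = []  # start collecting the transformed copy
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.echo is not None:
            self.out.append("</a>")
            self.out.append("<%s>%s</%s>" % (self.wrapper, "".join(self.echo), self.wrapper))
            self.echo = None
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        if self.echo is None:
            self.out.append(self.transform(data))
        else:
            self.out.append(data)                   # link text stays as it is
            self.echo.append(self.transform(data))  # transformed copy for later

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def transform_with_echo(html, transform, wrapper="feat2"):
    """Return html with transformed link content echoed after each </a>."""
    parser = LinkEchoTransformer(transform, wrapper)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling transform_with_echo(ExampleString, str.upper) on the same test case
should keep the inner <b> tags in the echoed copy and uppercase only its text.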

Thanks for your help
