Wildcard for string replacement?!?!

Mark McEahern marklists at mceahern.com
Mon Mar 10 20:41:14 CET 2003

> I'm working for over a week on this script but I can't make my way out.
> The whole idea is to replace (better say delete) anything that stands
> between the <td> and </td> tag of an html file.

One thing wrong with your pseudocode is that you assume a <td>blah</td>
element never spans more than one line.
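
To make the multi-line problem concrete, here is a small sketch.  The sample
string and the variable names are made up for illustration; they are not
taken from your script.  It compares substituting line by line with
substituting over the whole text using re.DOTALL:

```python
import re

# Made-up sample: the second cell spans two lines.
sample = "<td>one</td>\n<td>two\nthree</td>\n"

# Without re.DOTALL, '.' does not match a newline.
cell = re.compile(r'(<td>).*?(</td>)')
# With re.DOTALL, '.' matches newlines too, so a cell may span lines.
cell_dotall = re.compile(r'(<td>).*?(</td>)', re.DOTALL)

# Line-by-line substitution only empties the cell that fits on one line.
by_line = ''.join(cell.sub(r'\1\2', line)
                  for line in sample.splitlines(True))

# Substituting over the whole text empties both cells.
whole = cell_dotall.sub(r'\1\2', sample)

print(repr(by_line))
print(repr(whole))
```

The line-by-line version leaves the second cell untouched because neither of
its two lines contains both the opening and the closing tag.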

I think there are two basic approaches to your problem:

1.  Use regular expressions.

2.  Use some library that lets you get at the HTML via an object model.

1 seems easier.  Try this...

#!/usr/bin/env python

import re

def disembowel(html, tag):
    """Return the html with the innards of the specified tag removed."""
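    # \b stops a tag such as 'th' from also matching <thead>;
    # [^>]* keeps the attribute match inside the opening tag.
    template = r'(<%s\b[^>]*>)(.*?)(</%s\s*>)'
    _pattern = template % (re.escape(tag), re.escape(tag))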
    pattern = re.compile(_pattern, re.DOTALL | re.IGNORECASE)
    return pattern.sub(r'\1\3', html)

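# Illustrative stand-in input; the second cell spans two lines.
html = """<html>
<body>
<table>
<tr><td>delete me</td><td class="x">and
me</td></tr>
</table>
</body>
</html>"""

tag = 'td'
print(disembowel(html, tag))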
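
For comparison, approach 2 could look something like the sketch below.  It
assumes Python 3 and the standard library's html.parser; the names
Disemboweler and disembowel_parsed are hypothetical, and for brevity the
sketch drops comments and doctype declarations instead of copying them
through.  Treat it as a starting point, not a finished tool:

```python
from html.parser import HTMLParser


class Disemboweler(HTMLParser):
    """Copy the input through, skipping whatever sits inside `tag`."""

    def __init__(self, tag):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0   # nesting level inside `tag`; 0 means outside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.out.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
        if self.depth == 0:
            self.out.append('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no innards to track.
        if self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append('&#%s;' % name)


def disembowel_parsed(html, tag):
    """Return html with the contents of every `tag` element removed."""
    parser = Disemboweler(tag)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


# Made-up sample input with a cell that spans two lines.
result = disembowel_parsed(
    '<tr><td class="a">one\ntwo</td><td>x &amp; y</td></tr>', 'td')
print(result)
```

The depth counter is there so that a table nested inside a cell is removed
along with the rest of that cell's contents, which the non-greedy regular
expression does not handle.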

