Wildcard for string replacement?!?!
marklists at mceahern.com
Mon Mar 10 20:41:14 CET 2003
> I'm working for over a week on this script but I can't make my
> way out. The whole idea is to replace (better say delete) anything that
> stands between the <td> and </td> tags of an html file.
One thing wrong with your pseudocode is that it assumes the <td>blah</td>
never spans more than one line.
I think there are two basic approaches to your problem:
1. Use regular expressions.
2. Use some library that lets you get at the HTML via an object model.
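For what it's worth, approach 2 might look roughly like the sketch below. It assumes Python 3 and the standard library's html.parser module (a true object-model library would look different). The names TagGutter and gut are made up for illustration. Comments and doctype declarations are not re-emitted, to keep the sketch short.

```python
from html.parser import HTMLParser


class TagGutter(HTMLParser):
    """Re-emit HTML, dropping everything inside the given tag (hypothetical helper)."""

    def __init__(self, tag):
        # convert_charrefs=False so entities such as &amp; reach the handlers below
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0   # > 0 while we are inside the target tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # keep the opening tag exactly as it was written in the source
            self.out.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> never change the depth
        if self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        if self.depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#%s;" % name)


def gut(html, tag):
    """Return html with the contents of every <tag> element removed."""
    parser = TagGutter(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(gut('<tr><td class="a">one\n<b>two</b></td></tr>', 'td'))
```

Because the parser counts depth, a cell that contains a nested table is emptied as a whole, which a non-greedy regular expression cannot do.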
Approach 1 seems easier. Try this:
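import re

def disembowel(html, tag):
    """Return the html with the innards of the specified tag removed."""
    # Capture the opening tag, its contents, and the closing tag separately.
    template = r'(\<%s.*?\>)(.*?)(\<\/%s\>)'
    _pattern = template % (tag, tag)
    # DOTALL lets '.' match newlines, so a cell may span several lines.
    pattern = re.compile(_pattern, re.DOTALL | re.IGNORECASE)
    # Keep groups 1 and 3 (the tags) and drop group 2 (the contents).
    return pattern.sub(r'\1\3', html)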
html = """<html>
tag = 'td'
print disembowel(html, tag)
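Here is a self-contained version you can paste and run. It is only a sketch: it assumes Python 3 (so print is a function), and the sample string is a made-up input, not the HTML from the original message. The second cell spans two lines on purpose, to show what re.DOTALL buys you.

```python
import re


def disembowel(html, tag):
    """Return the html with the innards of the specified tag removed."""
    template = r'(\<%s.*?\>)(.*?)(\<\/%s\>)'
    pattern = re.compile(template % (tag, tag), re.DOTALL | re.IGNORECASE)
    # \1 is the opening tag, \3 the closing tag; group 2 (the contents) is dropped
    return pattern.sub(r'\1\3', html)


# Hypothetical sample input; the second cell spans two lines.
sample = """<table>
<tr><td>first</td><TD class="x">second
cell</TD></tr>
</table>"""

result = disembowel(sample, 'td')
print(result)
```

Two caveats. The non-greedy match stops at the first closing tag, so a cell containing a nested table would be cut short. And the pattern matches any tag whose name merely starts with the given string, so tag = 'b' would also catch <body>.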