re.sub does not replace all occurrences
Neil Cerutti
horpner at yahoo.com
Tue Aug 7 14:18:43 EDT 2007
On 2007-08-07, Christoph Krammer <redtiger84 at googlemail.com> wrote:
> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>     source = re.sub("[\n\r\f]", " ", source)
>     source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>     source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>     return source
>
> But the result still has some tags in it. When I call the
> second line multiple times, all tags disappear, but since HTML
> tags cannot be overlapping, I do not understand this behavior.
> There is even a difference when I omit the re.I (IGNORECASE)
> option. Without this option, some tags containing only capital
> letters (like </FONT>) were kept in the string when doing one
> processing run but removed when doing multiple runs.
>
> Perhaps someone can tell me why this regex is behaving like
> this.
>>> import re
>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used.
And from the Python Library Reference for re.sub:
The pattern may be a string or an RE object; if you need to
specify regular expression flags, you must use a RE object,
or use embedded modifiers in a pattern; for example,
"sub("(?i)b+", "x", "bbbb BBBB")" returns 'x x'.
The optional argument count is the maximum number of pattern
occurrences to be replaced; count must be a non-negative
integer. If omitted or zero, all occurrences will be
replaced. Empty matches for the pattern are replaced only
when not adjacent to a previous match, so "sub('x*', '-',
'abc')" returns '-a-b-c-'.
In other words, the fourth argument to sub is count, not a set of
re flags.
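
This also accounts for the "works after several runs" symptom. The flag
constants are plain integers (re.I is 2, re.M is 8, re.S is 16), so
re.S | re.I | re.M is read as count=26 and each call removes at most 26
tags. Dropping re.I changes that count to 24, which is why the result
differed. Here is a minimal sketch of the problem and of the two fixes
the Library Reference suggests. The sample string and the variable names
are made up for illustration, and the flags= keyword in the last call is
an assumption about your Python version: it exists only in later
releases (2.7 and 3.1 onward), not in the version whose help text is
quoted above.

```python
import re

# Hypothetical sample input, just to have a few tags to strip.
html = "<B>bold</B> and <I>italic</I> and <U>underline</U>"

# The value the original second call actually passed as count.
count_passed = int(re.S | re.I | re.M)

# What a call like re.sub(pattern, "", html, re.I) amounts to:
# re.I is 2, so only the first two matches are replaced.
partial = re.sub("<.*?>", "", html, count=int(re.I))

# Fix 1: compile the pattern with the flags, then use the RE object's sub.
tag_re = re.compile("<.*?>", re.S | re.I | re.M)
stripped_compiled = tag_re.sub("", html)

# Fix 2: embed the flags in the pattern itself.
stripped_inline = re.sub("(?si)<.*?>", "", html)

# Later Python versions only: pass the flags by keyword.
stripped_keyword = re.sub("<.*?>", "", html, flags=re.S)

print(count_passed)       # 26
print(partial)            # bold and <I>italic</I> and <U>underline</U>
print(stripped_compiled)  # bold and italic and underline
print(stripped_inline)    # bold and italic and underline
print(stripped_keyword)   # bold and italic and underline
```

Of the three flags only re.S matters for this pattern, since it lets
"." match a newline inside a tag. re.I and re.M have no effect on
"<.*?>" because it contains no letters and no ^ or $ anchors.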
--
Neil Cerutti