[Tutor] regular expression
Raymond Hettinger
python@rcn.com
Sun, 31 Mar 2002 15:07:24 -0500
If I'm understanding what you want correctly, this should work:
>>> re.sub( r'(<[^>]+)( class[^>]+)(>)' , r'\1\3', t)
'<table><tr><td> class="Three" </td>'
The strategy is to find three consecutive groups and keep only the first and
third.
The first group is the tag start, an open angle bracket followed by
anything except a close bracket.
The second group is the class assignment, a space followed by 'class'
followed by anything other than an angle close bracket.
The third group is the close angle bracket.
When found together, the three groups are a full tag definition containing a
class definition. Drop the middle group (the class definition) and you're
left with a classless tag.
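
To see the three groups individually, here is a small sketch that writes the
same pattern with re.VERBOSE so each group can carry a comment. The names
tag_with_class, m, groups and result are only illustrative. Because
re.VERBOSE ignores whitespace in the pattern, the leading space of the second
group is written as '[ ]'.

```python
import re

# Sample input taken from the original question.
t = r'<table class="One"><tr><td class="Two"> class="Three" </td>'

# Same three groups as above, one per line.
tag_with_class = re.compile(r"""
    (<[^>]+)          # group 1: tag start, '<' then anything except '>'
    ([ ]class[^>]+)   # group 2: a space, 'class', then anything except '>'
    (>)               # group 3: the close angle bracket
""", re.VERBOSE)

# Inspect what each group captured on the first match.
m = tag_with_class.search(t)
groups = m.groups()
print(groups)

# Keep groups 1 and 3, drop group 2.
result = tag_with_class.sub(r'\1\3', t)
print(result)
```

One thing to keep in mind: the second group runs up to the closing bracket, so
any attributes written after the class assignment are dropped along with it.
Attributes written before it stay in the first group.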
If for some reason you want to kill the whole tag, replace r'\1\3' with
r''.
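
A quick side-by-side of the two replacements, as a sketch using the sample
string from the question (the names pattern, stripped and killed are just for
illustration):

```python
import re

# Sample input taken from the original question.
t = r'<table class="One"><tr><td class="Two"> class="Three" </td>'

pattern = r'(<[^>]+)( class[^>]+)(>)'

# Keep the tag, drop only the class assignment.
stripped = re.sub(pattern, r'\1\3', t)
print(stripped)

# Drop every tag that carries a class assignment.
killed = re.sub(pattern, r'', t)
print(killed)
```

In both cases the class="Three" text outside a tag is left alone, and tags
without a class (like <tr>) are never touched.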
Raymond Hettinger
Grasshopper: 'I have a problem I want to solve with regular expressions'
Master: 'Now you have two problems'
----- Original Message -----
From: "ingo" <seedseven@home.nl>
To: <tutor@python.org>
Sent: Sunday, March 31, 2002 10:40 AM
Subject: [Tutor] regular expression
> From an HTML-file I want to strip all css related stuff. Using re.sub
> looks ideal because in some cases the css has to be replaced by
> something else.
> The problem I run into is that I can't find a way to match 'class=".."'
> with one expression, without matching the string when it is outside a
> tag.
>
> in t I don't want to have a match for class="Three"
>
> >>> import re
> >>> t=r'<table class="One"><tr><td class="Two"> class="Three" </td>'
> >>> pat1=re.compile(r'<.*?class=".*?".*?>')
> >>> pat2=re.compile(r'class=".*?"')
> >>> p=pat1.search(t)
> >>> p=pat2.search(t,p.start(),p.end())
> >>> p.group()
> 'class="One"'
> >>>
>
> Doing it in two steps is possible but now re.sub can't be used. Is
> there a way to do it in one go?
>
> Ingo