[Tutor] regular expression

Sun, 31 Mar 2002 15:07:24 -0500

If I'm understanding what you want correctly, this should work:

>>> re.sub(r'(<[^>]+)( class[^>]+)(>)', r'\1\3', t)
'<table><tr><td> class="Three" </td>'

The strategy is to find three consecutive groups and keep only the first
and third.

The first group is the tag start, an open angle bracket followed by
anything except a close angle bracket.

The second group is the class assignment, a space followed by 'class'
followed by anything other than a close angle bracket.

The third group is the close angle bracket.

When found together, the three groups are a full tag definition containing a
class definition.  Drop the middle group (the class definition) and you're
left with a classless tag.
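
To see what each group captures, here is a minimal sketch that runs the
same pattern over the sample string from the original question. The names
pat, matches, stripped and with_id are only illustrative, and the extra
id="x" attribute is made up for the demonstration.

```python
import re

# Sample string from the original question.
t = r'<table class="One"><tr><td class="Two"> class="Three" </td>'

# Same pattern as in the re.sub call, compiled once for reuse.
pat = re.compile(r'(<[^>]+)( class[^>]+)(>)')

# Collect the three groups of every match: tag start, class part, close.
matches = [m.groups() for m in pat.finditer(t)]
for tag_start, class_part, close in matches:
    print(repr(tag_start), repr(class_part), repr(close))

# Keep groups 1 and 3, drop group 2.
stripped = pat.sub(r'\1\3', t)
print(stripped)

# Group 2 runs up to the closing bracket, so attributes that come
# after class (a made-up id here) are dropped along with it.
with_id = pat.sub(r'\1\3', '<td class="Two" id="x">')
print(with_id)
```

Only the two real tags match; the class="Three" text between tags is left
alone because no open angle bracket precedes it inside the same tag. Note
that anything following the class assignment inside the tag is removed
too, which may or may not be what you want.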

If for some reason you want to kill the whole tag, replace r'\1\3' with
r''.
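
A short sketch of both variations, for comparison. The first call uses
the empty replacement. The second passes a function to re.sub, since the
original question mentions that the css sometimes has to be replaced by
something else. The replacements table and the swap function are
hypothetical, and the function assumes the class value is double-quoted.

```python
import re

t = r'<table class="One"><tr><td class="Two"> class="Three" </td>'
pat = re.compile(r'(<[^>]+)( class[^>]+)(>)')

# An empty replacement deletes every tag that carries a class attribute.
no_tags = pat.sub('', t)
print(no_tags)

# Hypothetical table mapping a class name to replacement attribute text.
replacements = {'One': ' border="1"'}

def swap(m):
    # group(2) looks like ' class="One"'; pull out the quoted name.
    # Assumes double quotes around the class value.
    name = m.group(2).split('"')[1]
    # Classes without an entry in the table are simply dropped.
    return m.group(1) + replacements.get(name, '') + m.group(3)

swapped = pat.sub(swap, t)
print(swapped)
```

Because re.sub accepts a function as the replacement, the whole job still
happens in one pass over the string.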

Raymond Hettinger

Grasshopper:  'I have a problem I want to solve with regular expressions'
Master: 'Now you have two problems'

----- Original Message -----
From: "ingo" <seedseven@home.nl>
To: <tutor@python.org>
Sent: Sunday, March 31, 2002 10:40 AM
Subject: [Tutor] regular expression

> From an HTML-file I want to strip all css related stuff. Using re.sub
> looks ideal because in some cases the css has to be replaced by
> something else.
> The problem I run into is that I can't find a way to match 'class=".."'
> with one expression, without matching the string when it is outside a
> tag.
>
> in t I don't want to have a match for class="Three"
>
> >>> import re
> >>> t=r'<table class="One"><tr><td class="Two"> class="Three" </td>'
> >>> pat1=re.compile(r'<.*?class=".*?".*?>')
> >>> pat2=re.compile(r'class=".*?"')
> >>> p=pat1.search(t)
> >>> p=pat2.search(t,p.start(),p.end())
> >>> p.group()
> 'class="One"'
> >>>
>
> Doing it in two steps is possible but now re.sub can't be used. Is
> there a way to do it in one go?
>
> Ingo