Regular expression problem

Wolfgang Grafen wolfgang.grafen at gmx.de
Thu Feb 28 06:05:51 EST 2002


Asheesh Laroia <pan-news at asheeshenterprises.com> wrote in message news:<nxhf8.149057$dh.38723170 at typhoon.nyroc.rr.com>...
> This is great, thanks!
> 
> Only one problem.  I'm having trouble (I did give it a try) making the
> following work:
> 
> 	<@Trap Body text:>Useful Text
> 
> I need to still be able to extract "Useful Text", not delete it.
> 
> Thanks again!
Hope this is it:

import re

rc=re.compile(r"<@Trap Body text\s*"
              r"(?:(?P<assigned>=)|(?P<unassigned>))\s*"
              r"(?P<UselessText>(?:<.*?>[^<>]*)*[^>]*>)\s*"
              r"(?P<UsefulText>.*?)?\s*\Z",
              re.MULTILINE|re.DOTALL).match
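
For anyone who finds the pattern above hard to read in one piece, here is a rough sketch of the same expression spelled out with re.VERBOSE, written for Python 3. The name TRAP_TAG and all comments are additions for illustration, not part of the original code. re.MULTILINE is left out on the assumption that it changes nothing here, since the pattern contains neither ^ nor $.

```python
import re

# Annotated restatement of the pattern above (TRAP_TAG is a hypothetical name).
TRAP_TAG = re.compile(r"""
    <@Trap[ ]Body[ ]text \s*        # tag name; spaces written as [ ] for VERBOSE mode
    (?: (?P<assigned>=)             # either "=" introduces a tag body ...
      | (?P<unassigned>)            # ... or nothing does (this group matches empty)
    ) \s*
    (?P<UselessText>
        (?: < .*? > [^<>]* )*       # nested <...> tags and any text between them
        [^>]* >                     # leftover characters, then the ">" closing the Trap tag
    ) \s*
    (?P<UsefulText> .*? )? \s* \Z   # whatever follows the tag, minus trailing whitespace
""", re.VERBOSE | re.DOTALL)

m = TRAP_TAG.match("<@Trap Body text:>Useful Text")
print(m.group("UselessText"))   # :>
print(m.group("UsefulText"))    # Useful Text
```

Note that the pattern is anchored with .match at the front and \Z at the back, so it expects the string to start with the Trap tag and treats everything after the tag as useful text.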

t1='<@Trap Body text>'

t2='<@Trap Body text=<FONT "Times">>'

t3="""<@Trap Body text=<FONT "Times"><CCOLOR\n   "Black"><SIZE
11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n
58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
0><CLEADING -0.05\n  ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE
0><G+AFTER 0><GALIGNMENT \n  "justify\n  "><GMETHOD "proportional"><G&
"ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n  1><GKORPHAN\n
1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
25>>"""

t4='<@Trap Body text=<FONT "Times"> hello <otto>> and the useful rest'

and you get
>>> rc(t1).groups()
(None, '', '>', '')

>>> rc(t2).groups()
('=', None, '<FONT "Times">>', '')

>>> rc(t3).groups()
('=', None, '<FONT "Times"><CCOLOR\012  
"Black"><SIZE\01211><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE
70><C+SIZE\012\01258.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE
0><CNOBREAK\0120><CLEADING -0.05\012  ><GGRID 0><GLEFT 0><GRIGHT
0><GFIRST 19.2><G+BEFORE\0120><G+AFTER 0><GALIGNMENT \012 
"justify\012  "><GMETHOD "proportional"><G&\012"ENGLISH"><GPAIRS 4><G%
120><GKNEXT 0><GKWIDOW \012  1><GKORPHAN\012\0121><GTABS
$><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0\01225>>', '')

and just to get useful text:
t4 = t1 + t1
t5 = t2 + t2
t6 = t3 + t3

>>> rc(t4).groups()
(None, '', '>', '<@Trap Body text>')

>>> rc(t5).groups()
('=', None, '<FONT "Times">>', '<@Trap Body text=<FONT "Times">>')

>>> rc(t6).groups()
...


def parse_it(text):
    t = rc(text)
    assert t,"Text does not match!:\n%r" % text
    return t.group('UsefulText')

>>> print(parse_it(t5))
<@Trap Body text=<FONT "Times">>
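
Since the Trap tag can contain other tags, an alternative to one big regular expression is to count angle brackets until the opening "<" is balanced again. The following is only a minimal Python 3 sketch of that idea. The names strip_trap_tag and prefix are hypothetical, and it assumes that "<" and ">" never occur unbalanced inside the tag (for example inside a quoted value).

```python
def strip_trap_tag(text, prefix="<@Trap Body text"):
    """Return text with one leading Trap tag removed (hypothetical helper)."""
    if not text.startswith(prefix):
        return text  # no tag at the start: nothing to strip
    depth = 0
    for i, ch in enumerate(text):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth == 0:
                # i is the ">" that balances the very first "<"
                return text[i + 1:].lstrip()
    raise ValueError("unbalanced angle brackets in %r" % text)

print(strip_trap_tag("<@Trap Body text:>Useful Text"))
# Useful Text
print(strip_trap_tag('<@Trap Body text=<FONT "Times"> hello <otto>> and the useful rest'))
# and the useful rest
```

Like parse_it above, this removes only the first tag, so a second Trap tag later in the string is returned as part of the useful text.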

wolfgang.

> -- Asheesh.
> 
> On Wed, 27 Feb 2002 20:19:56 -0500, Wolfgang Grafen wrote:
> 
> > import re
> > 
> > rc=re.compile("<@Trap Body text\s*"
> >                           "(?:(?P<assigned>=)|(?P<unassigned>>))\s*"
> >                           "(?P<rest>.*?)\s*\Z",
> >                          re.MULTILINE|re.DOTALL).match
> > 
> > t1='<@Trap Body text>'
> > t2='<@Trap Body text=<FONT "Times">'
> > t3="""<@Trap Body text=<FONT "Times"><CCOLOR\n   "Black"><SIZE
> > 11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n
> > 58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
> > 0><CLEADING -0.05\n  ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE
> > 0><G+AFTER 0><GALIGNMENT \n  "justify\n  "><GMETHOD "proportional"><G&
> > "ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n  1><GKORPHAN\n
> > 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
> > 25>>"""
> > 
> > rc(t1).groups()
> > (None, '>', '')
> > 
> > rc(t2).groups()
> > ('=', None, '<FONT "Times">')
> > 
> > rc(t3).groups()
> > ('=', None, '<FONT "Times"><CCOLOR\n   "Black"><SIZE 11><HORIZONTAL
> > 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n  58.3><C-POSITION
> > 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK 0><CLEADING -0.05\n
> > ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE 0><G+AFTER
> > 0><GALIGNMENT \n  "justify\n  "><GMETHOD "proportional"><G&
> > "ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n  1><GKORPHAN\n
> > 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
> > 25>>')
> > 
> > cheers
> > 
> > wolfgang
> > 
> > 
> > Asheesh Laroia schrieb:
> > 
> >> I have some SGML input (PageMaker 6.5 tagged text), and I want to be
> >> able to recognize (and delete) a tag.  That tag looks like:
> >>
> >>         <@Trap Body text:>
> >>
> >> It may also look like <@Trap Body text: useless-data>.
> >>
> >> So, I tried the regular expression r"<@.?>".  That doesn't match the
> >> above string.  Nor does r"<@.?Trap Body text.?>".  What RE should I be
> >> using, and why doesn't this work?
> >>
> >> Thanks in advance!
> >>
> >> -- Asheesh Laroia.
> >>
> >> PS: An example of the tag "in the wild" is the following string:
> >>
> >> <@Trap Body text=<FONT "Times"><CCOLOR
> >>  "Black"><SIZE 11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE
> >>  70><C+SIZE
> >> 58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
> >> 0><CLEADING -0.05
> >> ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE 0><G+AFTER
> >> >0><GALIGNMENT "justify
> >> "><GMETHOD "proportional"><G& "ENGLISH"><GPAIRS 4><G% 120><GKNEXT
> >> 0><GKWIDOW 1><GKORPHAN 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75
> >> 100 150><GSPACE -5 0 25>>
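
On the original question of why r"<@.?>" does not match: ".?" means "at most one character", so that pattern can only match "<@>" or "<@" plus a single character plus ">". The quantifier that was probably intended is ".*?" (as few characters as possible). A small Python 3 sketch, using the short tag from the question and a shortened nested tag as sample strings; re.DOTALL is added on the assumption that a tag may span several lines.

```python
import re

tag = "<@Trap Body text:>"
nested = '<@Trap Body text=<FONT "Times">>'

# ".?" allows zero or one character between "<@" and ">", so this finds nothing.
print(re.search(r"<@.?>", tag))                                  # None

# The second pattern from the question does match the short form,
# but not the long form, where more than one character follows "text".
print(re.search(r"<@.?Trap Body text.?>", tag).group())          # <@Trap Body text:>
print(re.search(r"<@.?Trap Body text.?>", nested))               # None

# ".*?" matches the short form, but with nested tags it stops at the first ">".
print(re.search(r"<@.*?>", tag).group())                         # <@Trap Body text:>
print(re.search(r"<@.*?>", nested, re.DOTALL).group())           # <@Trap Body text=<FONT "Times">
```

The last line shows why a plain lazy match is not enough for the nested form: the closing ">" of the inner tag is taken for the end of the Trap tag.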


