Regular expression problem
Wolfgang Grafen
wolfgang.grafen at gmx.de
Thu Feb 28 06:05:51 EST 2002
Asheesh Laroia <pan-news at asheeshenterprises.com> wrote in message news:<nxhf8.149057$dh.38723170 at typhoon.nyroc.rr.com>...
> This is great, thanks!
>
> Only one problem. I'm having trouble (I did give it a try) making the
> following work:
>
> <@Trap Body text:>Useful Text
>
> I need to still be able to extract "Useful Text", not delete it.
>
> Thanks again!
Hope this is it:
import re

rc = re.compile(r"<@Trap Body text\s*"
                r"(?:(?P<assigned>=)|(?P<unassigned>))\s*"
                r"(?P<UselessText>(?:<.*?>[^<>]*)*[^>]*>)\s*"
                r"(?P<UsefulText>.*?)?\s*\Z",
                re.MULTILINE|re.DOTALL).match
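
In case the one-line pattern is hard to follow, here is a sketch of the same expression written with re.VERBOSE, so that each part can carry a comment. It is only an annotated restatement, assuming Python 3. The name TRAP_RE is made up for illustration, and the spaces in the tag name are escaped because verbose mode otherwise ignores whitespace.

```python
import re

# Hypothetical name; the pattern is the one from the post, only spread out.
TRAP_RE = re.compile(r"""
    <@Trap\ Body\ text \s*                  # tag name (spaces escaped for VERBOSE)
    (?: (?P<assigned>=) | (?P<unassigned>) ) \s*    # '=' if a value follows, else empty
    (?P<UselessText>
        (?: <.*?> [^<>]* )*                 # inner <...> tags plus any text between them
        [^>]* >                             # up to the '>' that closes the Trap tag
    ) \s*
    (?P<UsefulText> .*? )? \s* \Z           # whatever remains, up to the end of the string
    """, re.VERBOSE | re.MULTILINE | re.DOTALL)

m = TRAP_RE.match('<@Trap Body text=<FONT "Times"> hello <otto>> and the useful rest')
print(m.group('UselessText'))   # <FONT "Times"> hello <otto>>
print(m.group('UsefulText'))    # and the useful rest
```

The last group only works because the pattern is anchored with \Z: the non-greedy .*? is forced to run to the end of the string instead of stopping at once.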
t1='<@Trap Body text>'
t2='<@Trap Body text=<FONT "Times">'
t3="""<@Trap Body text=<FONT "Times"><CCOLOR\n "Black"><SIZE
11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n
58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
0><CLEADING -0.05\n ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE
0><G+AFTER 0><GALIGNMENT \n "justify\n "><GMETHOD "proportional"><G&
"ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n 1><GKORPHAN\n
1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
25>>"""
t4='<@Trap Body text=<FONT "Times"> hello <otto>> and the useful rest'
and you get
>>> rc(t1).groups()
(None, '', '>', '')
>>> rc(t2).groups()
('=', None, '<FONT "Times">', '')
>>> rc(t3).groups()
('=', None, '<FONT "Times"><CCOLOR\012
"Black"><SIZE\01211><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE
70><C+SIZE\012\01258.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE
0><CNOBREAK\0120><CLEADING -0.05\012 ><GGRID 0><GLEFT 0><GRIGHT
0><GFIRST 19.2><G+BEFORE\0120><G+AFTER 0><GALIGNMENT \012
"justify\012 "><GMETHOD "proportional"><G&\012"ENGLISH"><GPAIRS 4><G%
120><GKNEXT 0><GKWIDOW \012 1><GKORPHAN\012\0121><GTABS
$><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0\01225>>', '')
and just to get useful text:
t4 = t1 + t1
t5 = t2 + '>' + t2 + '>'   # t2 as defined above lacks the '>' closing the outer tag
t6 = t3 + t3
>>> rc(t4).groups()
(None, '', '>', '<@Trap Body text>')
>>> rc(t5).groups()
('=', None, '<FONT "Times">>', '<@Trap Body text=<FONT "Times">>')
>>> rc(t6).groups()
...
def parse_it(text):
    t = rc(text)
    assert t, "Text does not match!:\n%r" % text
    return t.group('UsefulText')
>>> print(parse_it(t5))
<@Trap Body text=<FONT "Times">>
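
The example from the question has a colon after the tag name ('<@Trap Body text:>Useful Text'), which the pattern above does not allow for. A minimal sketch of one way to cover it, assuming the colon is simply optional: the ':?' is an addition that is not in the pattern above, and extract_useful is a made-up helper name. It returns None instead of asserting, so a missing tag is not an error.

```python
import re

# Same pattern as in the post, with one assumed addition: an optional ':'
# after the tag name, as in the '<@Trap Body text:>Useful Text' example.
rc = re.compile(r"<@Trap Body text:?\s*"
                r"(?:(?P<assigned>=)|(?P<unassigned>))\s*"
                r"(?P<UselessText>(?:<.*?>[^<>]*)*[^>]*>)\s*"
                r"(?P<UsefulText>.*?)?\s*\Z",
                re.MULTILINE | re.DOTALL).match

def extract_useful(text):
    """Return the text after the Trap tag, or None if the tag is absent."""
    m = rc(text)
    return m.group('UsefulText') if m else None

print(extract_useful('<@Trap Body text:>Useful Text'))               # Useful Text
print(extract_useful('<@Trap Body text: useless-data>Useful Text'))  # Useful Text
print(extract_useful('no tag here'))                                 # None
```

Because rc is the bound match method, the tag still has to sit at the very start of the string. A tag in the middle of a longer text would need a search-based approach instead.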
wolfgang.
> -- Asheesh.
>
> On Wed, 27 Feb 2002 20:19:56 -0500, Wolfgang Grafen wrote:
>
> > import re
> >
> > rc=re.compile("<@Trap Body text\s*"
> > "(?:(?P<assigned>=)|(?P<unassigned>>))\s*"
> > "(?P<rest>.*?)\s*\Z",
> > re.MULTILINE|re.DOTALL).match
> >
> > t1='<@Trap Body text>'
> > t2='<@Trap Body text=<FONT "Times">'
> > t3="""<@Trap Body text=<FONT "Times"><CCOLOR\n "Black"><SIZE
> > 11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n
> > 58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
> > 0><CLEADING -0.05\n ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE
> > 0><G+AFTER 0><GALIGNMENT \n "justify\n "><GMETHOD "proportional"><G&
> > "ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n 1><GKORPHAN\n
> > 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
> > 25>>"""
> >
> > rc(t1).groups()
> > (None, '>', '')
> >
> > rc(t2).groups()
> > ('=', None, '<FONT "Times">')
> >
> > rc(t3).groups()
> > ('=', None, '<FONT "Times"><CCOLOR\n "Black"><SIZE 11><HORIZONTAL
> > 100><LETTERSPACE 0><CTRACK 127><CSSIZE 70><C+SIZE\n 58.3><C-POSITION
> > 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK 0><CLEADING -0.05\n
> > ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE 0><G+AFTER
> > 0><GALIGNMENT \n "justify\n "><GMETHOD "proportional"><G&
> > "ENGLISH"><GPAIRS 4><G% 120><GKNEXT 0><GKWIDOW \n 1><GKORPHAN\n
> > 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75 100 150><GSPACE -5 0
> > 25>>')
> >
> > cheers
> >
> > wolfgang
> >
> >
> > Asheesh Laroia schrieb:
> >
> >> I have some SGML input (PageMaker 6.5 tagged text), and I want to be
> >> able to recognize (and delete) a tag. That tag looks like:
> >>
> >> <@Trap Body text:>
> >>
> >> It may also look like <@Trap Body text: useless-data>.
> >>
> >> So, I tried the regular expression r"<@.?>". That doesn't match the
> >> above string. Nor does r"<@.?Trap Body text.?>". What RE should I be
> >> using, and why doesn't this work?
> >>
> >> Thanks in advance!
> >>
> >> -- Asheesh Laroia.
> >>
> >> PS: An example of the tag "in the wild" is the following string:
> >>
> >> <@Trap Body text=<FONT "Times"><CCOLOR
> >> "Black"><SIZE 11><HORIZONTAL 100><LETTERSPACE 0><CTRACK 127><CSSIZE
> >> 70><C+SIZE
> >> 58.3><C-POSITION 33.3><C+POSITION 33.3><P><CBASELINE 0><CNOBREAK
> >> 0><CLEADING -0.05
> >> ><GGRID 0><GLEFT 0><GRIGHT 0><GFIRST 19.2><G+BEFORE 0><G+AFTER
> >> >0><GALIGNMENT "justify
> >> "><GMETHOD "proportional"><G& "ENGLISH"><GPAIRS 4><G% 120><GKNEXT
> >> 0><GKWIDOW 1><GKORPHAN 1><GTABS $><GHYPHENATION 2 36 0><GWORDSPACE 75
> >> 100 150><GSPACE -5 0 25>>