Reg Exp: Need advice concerning "greediness"

Franz GEIGER fgeiger at datec.at
Wed Oct 4 10:02:05 EDT 2000


That was definitely what I was looking for! Alex had already mentioned the
SGML parser but supposed that considerable work would remain. In fact, it was
rather painless to fit my code into this framework.

Thank you all again, great community!

Best regards
Franz GEIGER


Robert Roy <rjroy at takingcontrol.com> wrote in news posting:
39da9d50.6517406 at news1.on.sympatico.ca...
> On Sat, 30 Sep 2000 15:08:02 +0200, "Franz GEIGER" <fgeiger at datec.at>
> wrote:
>
> >Hello all,
> >
> >I want to change the font colors of headings of a certain level in HTML
> >files.
> >
> >I have a line containing a heading level 1, e.g.: <h1><font
> >COLOR="#FF0000">Heading Level 1</font></h1>.
> >
> >Now I want to split this into 3 groups: Everything before "COLOR=xyz",
> >"COLOR=xyz" itself, and everything after "COLOR=xyz".
> >
> >I tried:
> >sRslt = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'
> >print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
> >re.S), sRslt)
> >
> >This returns [('<h1><font', '', ' COLOR="#FF0000">Heading Level
> >1</font></h1>')].
> >I'd expected to receive [('<h1><font ', 'COLOR="#FF0000"', '>Heading Level
> >1</font></h1>')].
> >
> >It works if I replace (COLOR=.*?)* by (COLOR=.*?). But I need the '*'
> >because there may be headings without the color attribute but with a face
> >attribute.
> >
> >As I have understood it so far, '*' means 'zero or more of the preceding, but as
> >many as possible'. If a color attribute is present, 'as many as possible'
> >means 'the one that is there', doesn't it? If there is no such attribute,
> >well - then it's 'zero'.
> >
> >What did I miss?
> >
> >Best regards
> >Franz
> >
> >
> >
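
The behaviour asked about above can be reproduced with a short sketch. It uses the two patterns from the question unchanged; the names `optional`, `required` and `split_heading` are hypothetical and only serve the illustration. The likely explanation is that the regex engine returns the first overall match it finds rather than the one with the most repetitions: the lazy `.*?` after FONT first tries to match nothing, `(COLOR=.*?)*` then fails on the space and settles for zero repetitions, and `[ |>]` (space, a literal `|`, or `>`) accepts that same space. Since the whole pattern has now succeeded, the engine never backtracks to look for the attribute.

```python
import re

line = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'

# Pattern from the question: the COLOR group may repeat zero times,
# so the first successful match simply skips it.
optional = re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I | re.S)
print(optional.findall(line))

# Same pattern with the COLOR group required: now the lazy `.*?` in
# group 1 has to expand until COLOR= can actually match.
required = re.compile(r'(.*?FONT.*?)(COLOR=.*?)([ |>].*)', re.I | re.S)
print(required.findall(line))


def split_heading(text):
    """Hypothetical helper: prefer the match that contains COLOR=,
    fall back to the optional pattern when no such attribute exists."""
    match = required.match(text) or optional.match(text)
    return match.groups() if match else None


print(split_heading('<h1><font face="Arial">Heading</font></h1>'))
```

Trying the required pattern first and falling back is only one way around the problem, and it assumes one heading per string; with several tags in the input, the required pattern could pick up a COLOR= that belongs to a later tag.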
>
> Here is some example code using sgmllib. Note that you should make
> sure you are submitting valid HTML. This will also change the case of
> your element and attribute names to lower case.
>
> Make changes and additions where the comments tell you to. Add member
> variables to the class as required to keep track of where you are.
>
> Bob
>
> #######################
>
> from sgmllib import SGMLParser
> import string
>
> class MySGMLParser(SGMLParser):
>     def __init__(self, verbose=0, outfile=None):
>         if not hasattr(outfile, 'write'):
>             raise TypeError("outfile must have attribute write")
>         self.outfile = outfile
>         SGMLParser.__init__(self, verbose)
>
>     def handle_data(self, data):
>         self.outfile.write(data)
>
>     def handle_comment(self, data):
>         self.outfile.write('<!--%s-->' % data)
>
>     def unknown_starttag(self, tag, attrs):
>         if not attrs:
>             self.outfile.write('<' + tag + '>')
>         else:
>             self.outfile.write('<' + tag)
>             for attr in attrs:
>                 self.outfile.write(' %s="%s"' % attr)
>             self.outfile.write('>')
>
>     def unknown_endtag(self, tag):
>         self.outfile.write('</%s>' % tag)
>
>     def unknown_entityref(self, ref):
>         self.outfile.write('&%s;' % ref)
>     # so known refs do not get translated
>     handle_entityref = unknown_entityref
>
>     def unknown_charref(self, ref):
>         self.outfile.write('&#%s;' % ref)
>     # so known refs do not get translated
>     handle_charref = unknown_charref
>
>     def close(self):
>         SGMLParser.close(self)
>
>     ## put tag handlers here,
>     ## for my sample code I took the  www.python.org homepage and
>     ## changed the bgcolor of the wrapper tables
>     ## define start and end tag handlers as start_TAGNAME, end_TAGNAME
>
>     def start_td(self, attrs):
>         if not attrs:
>             self.outfile.write('<td>')
>         else:
>             self.outfile.write('<td')
>             for name, val in attrs:
>                 if string.lower(name) == 'bgcolor':
>                     self.outfile.write(' %s="%s"' % (name, '#ffcc99'))
>                 else:
>                     self.outfile.write(' %s="%s"' % (name, val))
>             self.outfile.write('>')
>
>     def end_td(self):
>         self.outfile.write('</td>')
>
>
>
> if __name__ == "__main__":
>     import sys
>     if len(sys.argv) != 3:
>         print "usage: python changeattr.py infile outfile"
>         raise SystemExit
>     infile = sys.argv[1]
>     outfile = sys.argv[2]
>     ofp = open(outfile, 'w')
>     # this is a one shot parser
>     p = MySGMLParser(outfile=ofp)
>     p.feed(open(infile).read())
>     p.close()
>     ofp.close()
>
>
>
>
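
The sgmllib module used above was removed in Python 3. For readers on a current interpreter, the same pass-through idea can be sketched with the standard library's `html.parser.HTMLParser`. This is a substitute technique, not the code from the post: the class name `FontColorRewriter`, the function `recolor_fonts` and the `new_color` parameter are made up for illustration, and the sketch assumes simple, well-formed input. Like the sgmllib version, it lower-cases element and attribute names.

```python
import io
from html.parser import HTMLParser


class FontColorRewriter(HTMLParser):
    """Echo the input HTML, replacing the color attribute of <font> tags."""

    def __init__(self, outfile, new_color):
        # convert_charrefs=False keeps entity and character references
        # flowing through the handlers below instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.outfile = outfile
        self.new_color = new_color

    def handle_starttag(self, tag, attrs):
        # Note: self-closing tags such as <br/> reach this handler and
        # then handle_endtag, so they are written as <br></br>.
        self.outfile.write('<' + tag)
        for name, value in attrs:
            if tag == 'font' and name == 'color':
                value = self.new_color
            if value is None:
                # Attribute without a value, e.g. <input disabled>
                self.outfile.write(' ' + name)
            else:
                self.outfile.write(' %s="%s"' % (name, value))
        self.outfile.write('>')

    def handle_endtag(self, tag):
        self.outfile.write('</%s>' % tag)

    def handle_data(self, data):
        self.outfile.write(data)

    def handle_comment(self, data):
        self.outfile.write('<!--%s-->' % data)

    def handle_entityref(self, name):
        self.outfile.write('&%s;' % name)

    def handle_charref(self, name):
        self.outfile.write('&#%s;' % name)

    def handle_decl(self, decl):
        self.outfile.write('<!%s>' % decl)


def recolor_fonts(html, new_color):
    """Return html with every <font color=...> value set to new_color."""
    out = io.StringIO()
    parser = FontColorRewriter(out, new_color)
    parser.feed(html)
    parser.close()
    return out.getvalue()


print(recolor_fonts(
    '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>', '#0000FF'))
```

Tags without a color attribute pass through unchanged. To restrict the change to headings of one level, the class would need a member variable that records whether the parser is currently inside that heading, as suggested in the post. Attribute values are written back without re-escaping, so values containing quotes or references would need extra care.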




