Reg Exp: Need advice concerning "greediness"

Robert Roy rjroy at takingcontrol.com
Tue Oct 3 23:08:49 EDT 2000


On Sat, 30 Sep 2000 15:08:02 +0200, "Franz GEIGER" <fgeiger at datec.at>
wrote:

>Hello all,
>
>I want to exchange font colors of headings of a certain level in HTML files.
>
>I have a line containing a heading level 1, e.g.: <h1><font
>COLOR="#FF0000">Heading Level 1</font></h1>.
>
>Now I want to split this into 3 groups: Everything before "COLOR=xyz",
>"COLOR=xyz" itself, and everything after "COLOR=xyz".
>
>I tried:
>sRslt = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>';
>print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
>re.S), sRslt);
>
>This returns [("<h1><font, , COLOR="#FF0000">Heading Level 1</font></h1>)].
>I'd expected to receive [("<h1><font , COLOR="#FF0000", >Heading Level
>1</font></h1>)].
>
>It works if I replace (COLOR=.*?)* by (COLOR=.*?). But I need to keep the '*'
>because there may be headings without the color attribute but with a face
>attribute.
>
>As I understood until now, '*' means 'zero or more of the preceding, but as
>many as possible'. If a color attribute is present, 'as many as possible'
>means 'the one that is there', doesn't it? If there is no such attribute,
>well - then it's 'zero'.
>
>What did I miss?
>
>Best regards
>Franz
>
>
>
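
On the regular expression itself, here is a minimal sketch of why the
optional group comes back empty. It assumes Python 3 and the sample line from
the question. The second pattern and the helper split_color are made up for
illustration and are only one possible fix, not something proposed in the
thread. After FONT, the lazy .*? matches nothing, (COLOR=.*?)* is then tried
at the space, where it is allowed to match zero times, and [ |>] takes the
space. Since that already gives an overall match, the engine never goes back
to extend the lazy part just so that the group could match more.

```python
import re

line = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'

# Pattern from the question. The group followed by '*' matches zero times,
# so findall reports it as an empty string.
original = re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I | re.S)
print(original.findall(line))

# Hypothetical alternative: inside the <font ...> tag, consume characters
# only while the text ahead is not "color=", then make the attribute itself
# optional with '?'.
split_font = re.compile(
    r'(.*?<font\s(?:(?!\bcolor=)[^>])*)'  # everything up to the attribute
    r'(color="[^"]*")?'                   # the color attribute, if present
    r'(.*)',                              # the rest of the line
    re.I | re.S)


def split_color(text):
    """Return (before, color_attribute_or_None, after), or None if no <font>."""
    m = split_font.search(text)
    return m.groups() if m else None


print(split_color(line))
print(split_color('<h1><font face="Arial">X</font></h1>'))
```

When the heading has no color attribute, the middle group is None rather
than an empty string, so the caller can tell the two cases apart.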

Here is some example code using sgmllib. Note that you should make
sure you are submitting valid HTML. The parser will also change the
case of your element and attribute names to lower case.

Make your changes and additions where the comments tell you to.
Add member variables to the class as required to keep track of where
you are in the document.

Bob

#######################

from sgmllib import SGMLParser
import string

class MySGMLParser(SGMLParser):
    def __init__(self, verbose=0, outfile=None):
        # outfile must be a file-like object; everything parsed is echoed to it
        if not hasattr(outfile, 'write'):
            raise TypeError("outfile must have a write() method")
        self.outfile = outfile
        SGMLParser.__init__(self, verbose)

    def handle_data(self, data):
        self.outfile.write(data)

    def handle_comment(self, data):
        self.outfile.write('<!--%s-->' % data)
        
    def unknown_starttag(self, tag, attrs):
        if not attrs:
            self.outfile.write('<' + tag + '>')
        else:
            self.outfile.write('<' + tag)
            for attr in attrs:
                self.outfile.write(' %s="%s"' % attr)
            self.outfile.write('>')

    def unknown_endtag(self, tag):
        self.outfile.write('</%s>' % tag)

    def unknown_entityref(self, ref):
        self.outfile.write('&%s;' % ref)
    # so known refs do not get translated
    handle_entityref = unknown_entityref

    def unknown_charref(self, ref):
        self.outfile.write('&#%s;' % ref)
    # so known refs do not get translated
    handle_charref = unknown_charref

    def close(self):
        SGMLParser.close(self)

    ## Put tag handlers here.
    ## For my sample code I took the www.python.org homepage and
    ## changed the bgcolor of the wrapper tables.
    ## Define start and end tag handlers as start_TAGNAME and end_TAGNAME.

    def start_td(self, attrs):
        if not attrs:
            self.outfile.write('<td>')
        else:
            self.outfile.write('<td')
            for name, val in attrs:
                if name.lower() == 'bgcolor':
                    self.outfile.write(' %s="%s"' % (name, '#ffcc99'))
                else:
                    self.outfile.write(' %s="%s"' % (name, val))
            self.outfile.write('>')

    def end_td(self):
        self.outfile.write('</td>')



if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("usage: python changeattr.py infile outfile")
        raise SystemExit
    infile = sys.argv[1]
    outfile = sys.argv[2]
    ofp = open(outfile, 'w')
    # this is a one shot parser
    p = MySGMLParser(outfile=ofp)
    p.feed(open(infile).read())
    p.close()
    ofp.close()
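
sgmllib was removed in Python 3. A rough equivalent of the parser above can
be sketched with html.parser.HTMLParser from the standard library. This is
only a sketch under assumptions: the names HeadingColorRewriter and
recolor_headings and the replacement color #0000FF are made up for
illustration, and it handles the <font color=...> inside <h1> case from the
question rather than the td/bgcolor sample.

```python
import html
import io
from html.parser import HTMLParser


class HeadingColorRewriter(HTMLParser):
    """Echo HTML to outfile, replacing the color of <font> tags inside <h1>."""

    def __init__(self, outfile, new_color="#0000FF"):
        # convert_charrefs=False so that entity and character references
        # reach handle_entityref/handle_charref and are written back as-is.
        super().__init__(convert_charrefs=False)
        self.outfile = outfile
        self.new_color = new_color  # hypothetical default, pick your own
        self.h1_depth = 0           # greater than 0 while inside an <h1>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_depth += 1
        self.outfile.write("<" + tag)
        for name, value in attrs:
            if tag == "font" and name == "color" and self.h1_depth:
                value = self.new_color
            if value is None:
                self.outfile.write(" " + name)  # attribute without a value
            else:
                # attribute values arrive unescaped, so escape them again
                self.outfile.write(' %s="%s"' % (name, html.escape(value)))
        self.outfile.write(">")

    def handle_endtag(self, tag):
        if tag == "h1" and self.h1_depth:
            self.h1_depth -= 1
        self.outfile.write("</%s>" % tag)

    def handle_data(self, data):
        self.outfile.write(data)

    def handle_comment(self, data):
        self.outfile.write("<!--%s-->" % data)

    def handle_entityref(self, name):
        self.outfile.write("&%s;" % name)

    def handle_charref(self, name):
        self.outfile.write("&#%s;" % name)

    def handle_decl(self, decl):
        self.outfile.write("<!%s>" % decl)


def recolor_headings(text, new_color="#0000FF"):
    """Return text with the font color of level 1 headings replaced."""
    out = io.StringIO()
    parser = HeadingColorRewriter(out, new_color)
    parser.feed(text)
    parser.close()
    return out.getvalue()


print(recolor_headings('<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'))
```

Like the sgmllib version, this lowercases tag and attribute names.
Self-closing tags such as <br/> come out as <br></br> unless
handle_startendtag is overridden as well.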