Buffering HTML as HTMLParser reads it?

Bruno Desthuilliers bdesth.quelquechose at free.quelquepart.fr
Mon Aug 6 09:51:20 CEST 2007


chrispwd at gmail.com wrote:
> Hello,
> 
> I am working on a project where I'm using python to parse HTML pages,
> transforming data between certain tags. Currently the HTMLParser class
> is being used for this. In a nutshell, it's pretty simple -- I'm
> feeding the contents of the HTML page to HTMLParser, then I am
> overriding the appropriate handle_ method to handle this extracted
> data. In that method, I take the found data and I transform it into
> another string based on some logic.
> 
> Now, what I would like to do here is take that transformed string and
> put it "back into" the HTML document. Has anybody ever implemented
> something like this with HTMLParser?

This works the same way with any SAX-style (event-based) parser. First
subclass the parser, add a 'buffer' attribute to it (a file-like object
is best, so you can write to a stream, a file, a StringIO, etc.), and
make every handler write to this buffer. Then subclass your customized
parser and override only the handlers you need.

Q&D example implementation:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

def format_attrs(attrs):
    # attrs is a list of (name, value) pairs; value is None for bare attributes
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )

def format_tag(tag, attrs, formats):
    attrs = format_attrs(attrs)
    # formats[0] is used without attributes, formats[1] with attributes
    return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

class BufferedHTMLParser(HTMLParser):
    START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
    STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

    def __init__(self, buffer):
        HTMLParser.__init__(self)  # the base class must set up its own state
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.START_TAG_FORMATS))

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.STARTEND_TAG_FORMATS))

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    # etc for all handlers
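
The "etc for all handlers" part could look roughly like the sketch
below. It assumes Python 3 (html.parser, io.StringIO), and the class
name PassThroughParser is made up for the illustration. Passing
convert_charrefs=False is my assumption about what you want here: it
makes the parser report entity and character references as separate
events, so they can be written back unchanged instead of being decoded
into the text.

```python
from html.parser import HTMLParser
from io import StringIO


class PassThroughParser(HTMLParser):
    """Hypothetical helper: echoes the non-tag events back to a buffer."""

    def __init__(self, buffer):
        # Keep &amp; and &#169; as distinct events rather than decoded text.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.buffer = buffer

    def handle_data(self, data):
        self.buffer.write(data)

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name == 'amp').
        self.buffer.write('&%s;' % name)

    def handle_charref(self, name):
        # Called for numeric references such as &#169; (name == '169').
        self.buffer.write('&#%s;' % name)

    def handle_comment(self, data):
        self.buffer.write('<!--%s-->' % data)

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        self.buffer.write('<!%s>' % decl)

    def handle_pi(self, data):
        # Called for processing instructions such as <?php ... ?>.
        self.buffer.write('<?%s>' % data)


source = '<!DOCTYPE html><!-- note -->Fish &amp; chips &#169;<?php x ?>'
out = StringIO()
parser = PassThroughParser(out)
parser.feed(source)
parser.close()
print(out.getvalue())
```

These handlers would go into the same base class as the tag handlers,
so that everything you do not override is copied through as-is.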


class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        data = data.replace(
            'Ni',
            "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
        )
        BufferedHTMLParser.handle_data(self, data)
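
To actually get the transformed document back, feed the page to the
parser and read the buffer afterwards. Here is a condensed,
self-contained sketch of the whole round trip. It assumes Python 3
(html.parser, io.StringIO); the transform() helper is a name I made up,
and the replacement string is shortened to keep the example readable.

```python
from html.parser import HTMLParser
from io import StringIO


def format_attrs(attrs):
    # value is None for attributes written without a value
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )


class BufferedHTMLParser(HTMLParser):
    def __init__(self, buffer):
        HTMLParser.__init__(self)
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        attrs = format_attrs(attrs)
        self.buffer.write('<%s %s>' % (tag, attrs) if attrs else '<%s>' % tag)

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)


class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        # Transform the text, then let the base class write it out.
        BufferedHTMLParser.handle_data(self, data.replace('Ni', 'Ekky'))


def transform(html):
    """Hypothetical helper: parse html and return the rewritten markup."""
    buffer = StringIO()
    parser = MyParser(buffer)
    parser.feed(html)
    parser.close()  # flush anything the parser is still holding
    return buffer.getvalue()


print(transform('<p class="knight">Ni!</p>'))
# prints: <p class="knight">Ekky!</p>
```

Note that with the Python 3 defaults the parser decodes character
references in text before calling handle_data, so something like &amp;
comes out as a plain & unless you re-escape it or construct the parser
with convert_charrefs=False.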

HTH


