Buffering HTML as HTMLParser reads it?
Bruno Desthuilliers
bdesth.quelquechose at free.quelquepart.fr
Mon Aug 6 03:51:20 EDT 2007
chrispwd at gmail.com wrote:
> Hello,
>
> I am working on a project where I'm using python to parse HTML pages,
> transforming data between certain tags. Currently the HTMLParser class
> is being used for this. In a nutshell, it's pretty simple -- I'm
> feeding the contents of the HTML page to HTMLParser, then I am
> overriding the appropriate handle_ method to handle this extracted
> data. In that method, I take the found data and I transform it into
> another string based on some logic.
>
> Now, what I would like to do here is take that transformed string and
> put it "back into" the HTML document. Has anybody ever implemented
> something like this with HTMLParser?

It works the same way with any SAX-style (event-based) parser. First, subclass
the parser, adding a 'buffer' attribute to it (a file-like object is best, so
you can write to a stream, a file, a StringIO, etc.) and making every handler
write its event back to this buffer. Then subclass that customized parser and
override only the handlers you need.

Q&D example implementation:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

def format_attrs(attrs):
    # attrs is a list of (name, value) pairs; value is None for bare attributes
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )

def format_tag(tag, attrs, formats):
    attrs = format_attrs(attrs)
    # formats[0] is used when there are no attributes, formats[1] otherwise
    return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

class BufferedHTMLParser(HTMLParser):
    START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
    STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

    def __init__(self, buffer):
        HTMLParser.__init__(self)  # the base class must set up its own state
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.START_TAG_FORMATS))

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.STARTEND_TAG_FORMATS))

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    # etc for all handlers

class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        data = data.replace(
            'Ni',
            "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
        )
        BufferedHTMLParser.handle_data(self, data)

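
To see the whole round trip in one place, here is a rough, self-contained sketch of the same idea. It assumes Python 3 (html.parser and io.StringIO), and the names PassthroughParser, KnightParser and transform are made up for this illustration; they are not part of the example above. It also fills in the remaining handlers in one plausible way, and echoes start tags through get_starttag_text() so that attribute quoting is kept as written in the input.

```python
import io
from html.parser import HTMLParser  # Python 3 module path (assumption)


class PassthroughParser(HTMLParser):
    """Hypothetical pass-through parser: echoes every event to `buffer`."""

    def __init__(self, buffer):
        # convert_charrefs=False makes the parser report &amp; and &#38;
        # through their own handlers, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input
        self.buffer.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    def handle_entityref(self, name):
        self.buffer.write('&%s;' % name)

    def handle_charref(self, name):
        self.buffer.write('&#%s;' % name)

    def handle_comment(self, data):
        self.buffer.write('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.buffer.write('<!%s>' % decl)

    def handle_pi(self, data):
        self.buffer.write('<?%s>' % data)


class KnightParser(PassthroughParser):
    """Only text nodes are transformed; everything else passes through."""

    def handle_data(self, data):
        PassthroughParser.handle_data(self, data.replace('Ni', 'Ekky'))


def transform(html):
    # Feed the document through the parser and collect what it writes back.
    out = io.StringIO()
    parser = KnightParser(out)
    parser.feed(html)
    parser.close()  # flush any text the parser is still holding
    return out.getvalue()


print(transform('<p class="k">Ni! &amp; <br /></p>'))
# prints: <p class="k">Ekky! &amp; <br /></p>
```

Note that the parser may deliver one run of text in several handle_data() calls (for instance on either side of an entity), so a replacement that has to match across such a boundary would need the text to be accumulated first.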
HTH