Buffering HTML as HTMLParser reads it?

Paul McGuire ptmcg at austin.rr.com
Wed Aug 1 16:08:08 EDT 2007


On Aug 1, 1:31 pm, chris... at gmail.com wrote:
<snip>
>
> I'm thinking maybe somehow have HTMLParser append each character it
> reads except for data inside tags in some kind of buffer? This way I
> can have the HTML contents read into a buffer, then when I do my own
> handle_ overrides, I can also append to that buffer with the
> transformed data. Once the HTML page is finished parsing, ideally I
> would be able to print the contents of the buffer and the HTML would
> be identical except for the string transformations.
>
> I also need to make sure that all newlines, tags, spacing, etc are
> kept intact -- this part is a requirement for other reasons.
>
> Thanks!

What you describe is almost exactly how pyparsing implements
transformString.  See below:

from pyparsing import makeHTMLTags, replaceWith

boldStart,boldEnd = makeHTMLTags("B")

# convert <B> to <div class="emphatic"> and </B> to </div>
boldStart.setParseAction(replaceWith('<div class="emphatic">'))
boldEnd.setParseAction(replaceWith('</div>'))
converter = boldStart | boldEnd

html = "Display this in <b>bold</b>"
print(converter.transformString(html))

Prints:

Display this in <div class="emphatic">bold</div>

All text not matched by a pattern in the converter is left as-is.  (My
CSS style/form may not be up to date, but I hope you get the idea.)
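
For a rough feel of the scan-and-copy idea, here is a sketch that uses only
the standard re module.  This is not pyparsing's actual code: the
REPLACEMENTS table and the transform_string function are names made up for
illustration, and the two regexes are crude stand-ins for what makeHTMLTags
builds (they ignore tag attributes, for instance).

```python
import re

# Hypothetical stand-ins for the boldStart / boldEnd expressions above.
# Each entry pairs a pattern with the text that should replace a match.
REPLACEMENTS = [
    (re.compile(r"<b\s*>", re.IGNORECASE), '<div class="emphatic">'),
    (re.compile(r"</b\s*>", re.IGNORECASE), "</div>"),
]


def transform_string(text):
    """Copy text into a buffer, swapping in replacements where a pattern matches."""
    out = []
    pos = 0
    while pos < len(text):
        for pattern, replacement in REPLACEMENTS:
            m = pattern.match(text, pos)
            if m:
                # A pattern matched here: emit the replacement, skip the match.
                out.append(replacement)
                pos = m.end()
                break
        else:
            # Nothing matched at this position: copy one character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transform_string("Display this in <b>bold</b>"))
```

Because unmatched characters are copied one at a time, newlines and spacing
outside the matched tags come through untouched.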

-- Paul



