HTMLParser bug?

Richard Brodie R.Brodie at rl.ac.uk
Mon May 12 13:30:16 EDT 2003


"Anand Pillai" <pythonguy at Hotpop.com> wrote in message
news:84fc4588.0305120626.661020f5 at posting.google.com...

> (Quoted from url http://www.python.org/doc/2.3b1/whatsnew/index.html)
>
> <link rel="first" href="whatsnew23.html" title='What's New in Python 2.3'>
>
> The attribute title contains a single quote inside its value.

So it's malformed; that would be a docutils problem, I suppose. A parser
can make an arbitrary decision about how to recover, but some of the time
that decision will silently hide a real error. For example, suppose the
original text was:

<link rel="first" href='whatsnew23.html title='What's New in Python 2.3'>

I think you greatly underestimate the difficulty of parsing almost-HTML.
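
To make the ambiguity concrete, here is a minimal sketch that feeds both
tags above to the standard-library parser and prints whatever attributes
it reports. It assumes a current Python 3, where the module is named
html.parser and recovers from malformed markup rather than raising; the
names TagRecorder and parse_attrs are made up for this illustration. The
exact attribute lists may differ between Python versions, so no output is
shown, but neither input is expected to yield the title the author
intended.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: record each start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as the parser guessed them.
        self.seen.append((tag, attrs))


def parse_attrs(markup):
    """Feed one chunk of markup and return the recorded start tags."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.seen


# The tag quoted from the What's New page: a stray quote inside title.
quoted_in_post = """<link rel="first" href="whatsnew23.html" title='What's New in Python 2.3'>"""
# The variant from this reply: the closing quote of href is missing.
hypothetical = """<link rel="first" href='whatsnew23.html title='What's New in Python 2.3'>"""

for markup in (quoted_in_post, hypothetical):
    print(markup)
    for tag, attrs in parse_attrs(markup):
        print("   ", tag, attrs)
```

Both inputs are accepted without complaint, which is the point: the
parser cannot tell a stray apostrophe from a missing closing quote, so
any recovery rule gets one of the two cases wrong.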

Since this topic comes up from time to time, maybe the HTMLParser
documentation should have a note about feeding it malformed HTML.





