[Chicago] sgmlparser problem

Ian Bicking ianb at colorstudy.com
Wed Dec 13 16:19:21 CET 2006


Martin Maney wrote:
> On Mon, Dec 11, 2006 at 06:34:20PM -0600, Lukasz Szybalski wrote:
>> Yeah, that is one solution. It does work, but instead of skipping bad HTML 
>> I am fixing it and then trashing it. It seems kind of odd.
> 
> Nah, it's a common technique: you turn the original problem into a
> related problem to which there is a known solution, and Bob's your
> uncle.  Conservation of programmer time at the expense of plentiful CPU
> cycles - of course there are cases where that's not a win, but this
> doesn't appear to be one of them.

I agree with Martin.  It's not worth your time to try to parse 
HTML-in-the-wild with sgmllib's SGMLParser or with HTMLParser.  You'll 
fix this problem, then hit a different one later, and on and on.

Another option for HTML parsing is lxml.etree.HTML(), which is also 
quite tolerant.
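
A rough sketch of what that looks like, assuming lxml is installed. The 
broken markup and the variable names are made up for illustration; only 
etree.HTML() itself comes from the suggestion above.

```python
from lxml import etree

# Deliberately sloppy markup (hypothetical sample): unclosed <p>, <b> and <a>
# tags, an unquoted attribute value, and no closing </html>.
broken = "<html><body><p>First<p>Second <b>bold<a href=page.html>link</body>"

# etree.HTML() parses the string with libxml2's forgiving HTML parser and
# returns the root element of the repaired tree.
root = etree.HTML(broken)

# Once parsed, the tree can be walked like any other ElementTree.
links = [a.get("href") for a in root.iter("a")]

print(root.tag)
print(links)
# Serialize the repaired tree to see how the parser closed the open tags.
print(etree.tostring(root, pretty_print=True).decode())
```

The parser closes the dangling tags itself, so you get a usable tree to 
query without having to clean up the input first.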

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org

