[Chicago] sgmlparser problem
Ian Bicking
ianb at colorstudy.com
Wed Dec 13 16:19:21 CET 2006
Martin Maney wrote:
> On Mon, Dec 11, 2006 at 06:34:20PM -0600, Lukasz Szybalski wrote:
>> Yeah, that is one solution. It does work, but instead of skipping bad HTML
>> I am fixing it and then trashing it. It seems kind of odd.
>
> Nah, it's a common technique: you turn the original problem into a
> related problem to which there is a known solution, and Bob's your
> uncle. Conservation of programmer time at the expense of plentiful CPU
> cycles - of course there are cases where that's not a win, but this
> doesn't appear to be one of them.
I agree with Martin. It's not worth your time to try to parse
HTML-in-the-wild with sgmllib's SGMLParser or with HTMLParser. You'll fix
this problem, then encounter a different one later, and on and on.
Another option for HTML parsing is lxml.etree.HTML(), which is also
quite tolerant.
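For instance, here is a minimal sketch, assuming the third-party lxml
package is installed. The sample markup and the extract_links() helper are
made up for illustration; only etree.HTML() comes from lxml itself.

```python
# Sketch only: requires the third-party lxml package.
from lxml import etree

# Hypothetical sample of sloppy markup: unclosed <p>, <b> and <a> tags,
# plus an unquoted attribute value.
BROKEN = "<html><body><p>First<p>Second <b>bold<a href=page.html>link</body>"


def extract_links(markup):
    """Return the href of every <a> element lxml recovers from markup."""
    # etree.HTML() uses a tolerant parser and builds a tree despite errors.
    root = etree.HTML(markup)
    return [a.get("href") for a in root.iter("a")]


print(extract_links(BROKEN))
```

The point is that you get a tree to walk without repairing the markup
yourself first; whether the recovered tree matches what a browser would
build is a separate question.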
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org