htmllib.py and parsing malformed HTML

Jeremy Bowers jerf at jerf.org
Fri Sep 5 04:44:51 CEST 2003


On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
> As with most organizations,
> changing *our* code is much more acceptable to the powers that be, than
> bringing in a third-party product that will have to be evaluated and have
> countless meetings over its approval.  For many of us, business and policy
> decisions often forge the direction for technology usage within our
> organizations.

If you are having real problems with poor HTML, HTMLTidy may be worth
going to bat over. If you can find a simple solution that works on the
HTML you are processing, great, go with it; that is worth researching
first in your situation. But HTML can go bad in more ways than you can
imagine (which is in fact part of the problem); if you are getting HTML
that's bad in a lot of little ways, you'll find that the "apply a hack to
fix this file, apply a hack to fix that file" approach will start stepping
on its own toes.

HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
functionality that you can *not* replicate in a reasonable amount of time;
it's one of those packages that isn't so much a program that "does
something" as a program that represents many, many man-years of "knowledge
acquired". 

I'm not trying to push anything, since I don't know your situation, but
HTMLTidy is one of those rare projects that you really shouldn't allow NMH
to scuttle unless you *really* need to. (Again, if there's some simple
way you can characterize the bad HTML coming out of one single program,
go ahead and try to fix it; maybe you'll get lucky and a regex will be
enough.)
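
To make the "lucky regex" case concrete, here is a minimal sketch. It
assumes a purely hypothetical defect, since I don't know what your HTML
actually looks like: one generator that writes "</br>" where it means
"<br>". The function name fix_stray_br is made up for illustration.

```python
import re

# Hypothetical defect: a single generator emits "</br>" instead of "<br>".
# The pattern tolerates any letter case and whitespace before the ">".
_BAD_BR = re.compile(r"</br\s*>", re.IGNORECASE)

def fix_stray_br(html):
    """Replace every stray closing </br> tag with a plain <br>."""
    return _BAD_BR.sub("<br>", html)

if __name__ == "__main__":
    print(fix_stray_br("line one</br>line two</BR >"))
```

This only works because the defect is one narrow, well-characterized
pattern. Once you need a second and a third pattern like it, you are in
the "hacks stepping on each other" territory described above.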
