[issue1486713] HTMLParser : A auto-tolerant parsing mode
report at bugs.python.org
Wed Nov 16 13:16:26 CET 2011
kxroberto <kxroberto at users.sourceforge.net> added the comment:
16ed15ff0d7c was not in current stable py3.2 so I missed it..
When the comma is now raised as attribute name, then the problem is anyway moved to the higher level anyway - and is/can be handled easily there by usual methods.
(still I guess locatestarttagend_tolerant matches a free standing comma extra after an attribute)
"should be generic enough to work with all kind of invalid markup": I think we would be rather complete then (->missing space issue)- at least regarding %age of real cases. And it could be improved with few touches over time if something missing. 100% is not the point unless it shall drive the official W3C checker. The call of self.warning, as in old patch, doesn't cost otherwise and I see no real increase of complexity/cpu-time.
"HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes thats of course up to the next level handler (BTDT)- thus the possibilty of error handling is not killed. Its about what HTMLParser _hides_ irrecoverably.
"there should be a valid use case for this": Almost any app which parses HTML (self authored or remote) can have (should have?) a no-fuzz/collateral warn log option. (->no need to make a expensive W3C checker session). I mostly have this in use as said, as it was anyway there.
Well, as for me, I use anyway a private backport to Python2 of this. I try to avoid Python3 as far as possible. (No real plus, too much problems) So for me its just about joining Python4 in the future perhaps - which can do true CPython multithreading, stackless, psyco/static typing ... and print statement again without typing so many extra braces ;-)
I considered extra libs like the HTML tidy binding, but this is all too much fuzz for most cases. And HTMLParser has already quite everything, with the few calls inserted ..
Python tracker <report at bugs.python.org>
More information about the Python-bugs-list