[issue1486713] HTMLParser : A auto-tolerant parsing mode

kxroberto report at bugs.python.org
Wed Nov 16 13:16:26 CET 2011


kxroberto <kxroberto at users.sourceforge.net> added the comment:

16ed15ff0d7c was not in current stable py3.2 so I missed it..

When the comma is now raised as attribute name, then the problem is anyway moved to the higher level anyway - and is/can be handled easily there by usual methods.
(still I guess locatestarttagend_tolerant matches a free standing comma extra after an attribute)

"should be generic enough to work with all kind of invalid markup": I think we would be rather complete then (->missing space issue)- at least regarding %age of real cases. And it could be improved with few touches over time if something missing. 100% is not the point unless it shall drive the official W3C checker. The call of self.warning, as in old patch, doesn't cost otherwise and I see no real increase of complexity/cpu-time.

"HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes thats of course up to the next level handler (BTDT)- thus the possibilty of error handling is not killed. Its about what HTMLParser _hides_ irrecoverably.

"there should be a valid use case for this": Almost any app which parses HTML (self authored or remote) can have (should have?) a no-fuzz/collateral warn log option. (->no need to make a expensive W3C checker session). I mostly have this in use as said, as it was anyway there.

Well, as for me, I use anyway a private backport to Python2 of this. I try to avoid Python3 as far as possible. (No real plus, too much problems) So for me its just about joining Python4 in the future perhaps - which can do true CPython multithreading, stackless, psyco/static typing ... and print statement again without typing so many extra braces ;-)
I considered extra libs like the HTML tidy binding, but this is all too much fuzz for most cases. And HTMLParser has already quite everything, with the few calls inserted ..

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue1486713>
_______________________________________


More information about the Python-bugs-list mailing list