[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
casey at zope.com
Tue Dec 2 23:03:03 EST 2003
On Tue, 2 Dec 2003 22:28:03 -0500
Greg Ward <gward at python.net> wrote:
> On 02 December 2003, Casey Duncan said:
> > I volunteer for phase 1. Actually I will do a phase 0 first which will
> > just be a stupid wrapper that exposes the API and nothing else. From
> > there we can discuss what needs to be done to complete phase 1.
> > This looks like a good job for SWIG, does anyone oppose using it?
> Note that the current Berkeley DB wrapper did not get into the standard
> library until AMK rewrote it by hand with no hint of SWIG. (And even
> then, it took a year or two before the bsddb in 2.3 got in.)
And it still seems to break often due to the API instabilities of bsddb itself. Oh well.
> As I recall, there were Serious Reservations about the quality of code
> generated by SWIG. Grovel through the python-dev archives for more. If
> SWIG has changed much since then, it might be worth revisiting -- but I
> suspect you'd have a selling job to do to get SWIGged code past
Yup, I have reservations of my own about it. But I definitely don't want to write it by hand (and maintain it) if it will see little use, so I think we should discuss our needs a bit more precisely.
From what I understand, we want a DOM parser for real-world (aka broken) HTML. From what I can see, tidylib does this (or at least aspires to). I think some testing is in order; now if only I could find some broken HTML code... ;^)
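As a rough sketch of what such testing might start from: the snippet below feeds a few deliberately broken fragments to the strict XML DOM parser already in the standard library, just to confirm that it rejects them outright. The sample markup and the helper name `strict_parse_ok` are invented for illustration; nothing here touches tidylib.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical samples of "real-world" markup, made up for this sketch.
BROKEN_SAMPLES = {
    "unclosed_tag": "<html><body><p>first<p>second</body></html>",
    "bad_nesting": "<html><body><b><i>text</b></i></body></html>",
    "unquoted_attr": "<html><body><a href=page.html>link</a></body></html>",
}
WELL_FORMED = "<html><body><p>fine</p></body></html>"


def strict_parse_ok(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        # expat refuses anything that is not well-formed XML
        return False
    return True


# Map each sample name to whether the strict parser accepted it.
results = {name: strict_parse_ok(src) for name, src in BROKEN_SAMPLES.items()}
for name, ok in sorted(results.items()):
    print(name, "accepted" if ok else "rejected")
```

A tolerant parser would have to produce a usable tree for every one of these samples, so a corpus like this (ideally collected from real pages) could double as the acceptance test when vetting a candidate.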
Now, the DOM API from tidylib is not W3C-compliant. If we were to use tidylib as the base for a new HTML DOM parser, would we want a W3C-compliant API? As much as I want to say no, it would probably help its credibility as a candidate for the standard library.
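For comparison, here is roughly what a W3C-style API looks like in practice, using xml.dom.minidom from the standard library as a stand-in. This is only a sketch of the API shape: minidom needs well-formed input, so the sample document below is invented and deliberately clean, and none of this reflects how tidylib exposes its tree.

```python
from xml.dom import minidom

# Invented, well-formed sample; minidom cannot parse broken HTML.
doc = minidom.parseString(
    '<html><body>'
    '<a href="one.html">One</a>'
    '<a href="two.html">Two</a>'
    '</body></html>'
)

# W3C DOM calls: look elements up by tag name, then read
# attributes and the data of the first (text) child node.
links = doc.getElementsByTagName("a")
hrefs = [a.getAttribute("href") for a in links]
texts = [a.firstChild.data for a in links]

print(hrefs)
print(texts)
```

The appeal of matching this interface is that code written against minidom could, in principle, be pointed at the new parser with few changes.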
OTOH, if anyone has a better idea, I'm all ears. What kind of API do people want?
So the revised plan A is to vet tidylib as the solution to the HTML parser problem. I will do this, but can anyone already speak more specifically about their experiences with it, good and bad?