[Expat-discuss] Quick crazy question: HTML?
Sam TH
sam@uchicago.edu
Mon, 12 Mar 2001 08:32:53 -0600
On Mon, Mar 12, 2001 at 09:07:37AM -0500, Fred L. Drake, Jr. wrote:
>
> On Mon, Mar 12, 2001 at 02:09:26AM -0800, Dru Nelson wrote:
> > How hard would it be to get expat to handle your typical
> > HTML as well? I'm talking about the typical sloppy
> > HTML out in the wild?
>
> Greg Stein writes:
> > Very hard. Super difficult. Insane work. ... :-)
> >
> > XML was designed *expressly* to get away from the sloppiness of HTML. One of
> > its main design points is to be absolutely rigorous. From that standpoint,
> > there isn't even a desire/motivation to make Expat provide tolerance.
>
> I agree with Greg on this point. Having worked on the Grail browser
> project, I've put a lot of effort into making "HTML as deployed" work
> with some similarity to other browsers -- and believe me, the parser
> that supports all the heuristics needed to do that is very different
> from something that supports "proper" XML.
> If you need a parser for HTML as deployed, you can take a look at
> the source of any of the many HTML parsers out there, but none that
> really work with "as deployed" HTML will be an easy read. Certainly
> the Mozilla sources are available, but the "entry fee" to read the
> sources is probably pretty high. The Grail sources are in Python, so
> it might not be too hard to wrap your head around, but you'll need to
> dig through the "sgml" directory a bit to see the structure of the
> parser and the Grail-specific code in the main directory to figure out
> some of the heuristics used to make it work.
Yes, parsing HTML is horrible. XHTML is much nicer, mostly since
expat parses it. :-)
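To make the strictness concrete, here is a minimal sketch using the
xml.parsers.expat bindings in Python's standard library. The helper name
is_well_formed and both sample strings are made up for illustration; the
only point is that expat accepts well-formed XHTML and rejects the kind of
tag soup browsers tolerate.

```python
from xml.parsers import expat


def is_well_formed(markup: str) -> bool:
    """Return True if expat accepts the markup as well-formed XML."""
    parser = expat.ParserCreate()
    try:
        # Second argument marks this as the final chunk of input.
        parser.Parse(markup, True)
    except expat.ExpatError:
        # expat stops at the first well-formedness error; it does not recover.
        return False
    return True


# Hypothetical samples: the same content as XHTML and as sloppy HTML.
xhtml = "<html><body><p>Hello<br/>world</p></body></html>"
sloppy = "<html><body><p>Hello<br>world</body></html>"

print(is_well_formed(xhtml))   # True
print(is_well_formed(sloppy))  # False: <br> and <p> are never closed
```

There is no "lenient" switch to flip here, which is why a separate HTML
parser is the practical route.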
If you want to parse HTML, you might want to check out libhtmlparse.
It's at http://msalem.translator.cx/libhtmlparse.html
sam th --- sam@uchicago.edu --- http://www.abisource.com/~sam/
OpenPGP Key: CABD33FC --- http://samth.dyndns.org/key
DeCSS: http://samth.dynds.org/decss