[Expat-discuss] Quick crazy question: HTML?

Sam TH sam@uchicago.edu
Mon, 12 Mar 2001 08:32:53 -0600


On Mon, Mar 12, 2001 at 09:07:37AM -0500, Fred L. Drake, Jr. wrote:
>
> On Mon, Mar 12, 2001 at 02:09:26AM -0800, Dru Nelson wrote:
>  > How hard would it be to get expat to handle your typical
>  > HTML as well? I'm talking about the typical sloppy
>  > HTML out in the wild?
>
> Greg Stein writes:
>  > Very hard. Super difficult. Insane work. ... :-)
>  >
>  > XML was designed *expressly* to get away from the sloppiness of HTML. One of
>  > its main design points is to be absolutely rigorous. From that standpoint,
>  > there isn't even a desire/motivation to make Expat provide tolerance.
>
>   I agree with Greg on this point.  Having worked on the Grail browser
> project, I've put a lot of effort into making "HTML as deployed" work
> with some similarity to other browsers -- and believe me, the parser
> that supports all the heuristics needed to do that is very different
> from something that supports "proper" XML.
>   If you need a parser for HTML as deployed, you can take a look at
> the source of any of the many HTML parsers out there, but none that
> really work with "as deployed" HTML will be an easy read.  Certainly
> the Mozilla sources are available, but the "entry fee" to read the
> sources is probably pretty high.  The Grail sources are in Python, so
> they might not be too hard to wrap your head around, but you'll need to
> dig through the "sgml" directory a bit to see the structure of the
> parser and the Grail-specific code in the main directory to figure out
> some of the heuristics used to make it work.

Yes, parsing HTML is horrible.  XHTML is much nicer, mostly since
expat parses it.  :-)
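
To make the contrast concrete, here is a minimal sketch using the expat
binding in Python's standard library (xml.parsers.expat). The helper name
is_well_formed and the two sample documents are made up for illustration;
only ParserCreate, Parse, and ExpatError come from the module itself.

```python
from xml.parsers import expat


def is_well_formed(document: str) -> bool:
    """Return True if expat parses the document, False if it raises."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(document, True)
    except expat.ExpatError:
        return False
    return True


# Illustrative inputs: the same content as XHTML and as "HTML in the wild".
xhtml = "<html><body><p>one<br/></p><p>two</p></body></html>"
sloppy = "<html><body><p>one<br><p>two</body></html>"

print(is_well_formed(xhtml))   # every element is closed, so expat accepts it
print(is_well_formed(sloppy))  # unclosed <br> and <p> are a fatal error
```

expat stops at the first well-formedness error and offers no recovery mode,
which is why tolerating sloppy HTML would mean writing a different parser
rather than tweaking this one.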

If you want to parse HTML, you might want to check out libhtmlparse.
It's at http://msalem.translator.cx/libhtmlparse.html
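
For a feel of what a tolerant parser does with the same kind of input, here
is a rough sketch. It does not use libhtmlparse, whose API is not shown
here; as a stand-in it uses html.parser from Python's standard library, and
the TagCollector class and start_tags helper are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(document: str) -> list:
    """Feed the document to a tolerant parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(document)
    collector.close()
    return collector.tags


# Unclosed <br> and <p> elements do not stop the parse.
sloppy = "<html><body><p>one<br><p>two</body></html>"
print(start_tags(sloppy))
```

The parser reports events for whatever it recognizes and leaves it to the
caller to decide what the missing end tags meant, which is where the
browser-style heuristics come in.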

sam th --- sam@uchicago.edu --- http://www.abisource.com/~sam/
OpenPGP Key: CABD33FC --- http://samth.dyndns.org/key
DeCSS: http://samth.dynds.org/decss

