[Spambayes] Matt Sergeant: Introduction
Tue, 01 Oct 2002 19:29:48 +1000
>>> Matt Sergeant wrote
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
> face=Arial, Helvetica, sans-serif
> as tokens. Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).
With the Graham code, we found that the simple-minded parsing of HTML
actually hurt more than it helped, but that was a _very_ simple
split-on-whitespace. In a case of synchronicity, I'm currently running a
test over my newer, larger monster corpus (35Kh/17Ks) to extract the
attribute-value pairs (avpairs) from HTML tokens.
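
For anyone following along, here is a minimal sketch of the kind of
attribute tokenising Matt describes, written in Python rather than Perl.
It assumes the standard library's html.parser in place of HTML::Parser;
the names AttrTokenizer and attr_tokens are made up for illustration, and
this is not the tokenizer actually used in the Spambayes code.

```python
from html.parser import HTMLParser


class AttrTokenizer(HTMLParser):
    """Collect name=value tokens from tag attributes, plus comment words.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None for bare attributes such as <input disabled>
            self.tokens.append(name if value is None else f"{name}={value}")

    def handle_comment(self, data):
        # Split comment bodies on whitespace so hidden words become tokens.
        self.tokens.extend(data.split())


def attr_tokens(html):
    """Return attribute and comment tokens found in an HTML string."""
    parser = AttrTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Feeding it '<font face="Arial, Helvetica, sans-serif">' should yield the
single token face=Arial, Helvetica, sans-serif, as in Matt's example.
Whether the whole attribute value should stay as one token or be split
further is a design choice this sketch leaves open.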