[Spambayes] New web training interface for pop3proxy
Sat Nov 23 22:10:27 2002
> Make 'hovertips' that display the first few lines of the body
> This is done. The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!
It works very well except when it doesn't <wink>. The chief damned-
whether-you-do-or-don't problem: I've seen several msgs with HTML style
sheets and/or HTML comments exceeding 2K characters. The 2K limit in the
minimal matches serves two purposes:
1. Prevent the C stack from blowing up in the regexp engine. But
François Granger reported a C stack blowup anyway on Mac OS 9,
and I still have no clue how small a limit would prevent that on
2. Prevent it from consuming an arbitrary amount of text in case
we matched a "begin long construct" character sequence by accident.
It's *unlikely* that random text contains "<style" or "<!--"
by accident, though, so I'm not much worried about that one.
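
To make the tradeoff concrete, here is a minimal sketch of a capped minimal match. The pattern is illustrative only and is not the actual regexp from tokenizer.py; the names LIMIT, bounded_re and strip_bounded are made up for this example.

```python
import re

# Illustrative pattern, NOT the real tokenizer.py regexp.
# The minimal (lazy) match ".{0,N}?" is capped at LIMIT characters.
LIMIT = 2048
bounded_re = re.compile(
    r"<\s*style\b.{0,%d}?</style\s*>|<!--.{0,%d}?-->" % (LIMIT, LIMIT),
    re.DOTALL | re.IGNORECASE,
)

def strip_bounded(text):
    """Replace short style sections and comments with a space."""
    return bounded_re.sub(" ", text)

short = "<style>p {color: red}</style>Buy now"
long_ = "<style>" + "x" * 3000 + "</style>Buy now"

# The short style sheet is removed.
print(repr(strip_bounded(short)))
# The 3000-character one exceeds the cap, so nothing matches and
# the text comes back unchanged.
print(strip_bounded(long_) == long_)
```

With the cap in place a stray "<!--" can swallow at most about 2K of text, but any style sheet or comment longer than the cap is left in the message.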
> (Apologies to Tim - it seems to work extremely well.)
Yes, when it works at all <wink>. Fixing it in all cases requires doing
real HTML parsing, and that's expensive, so the current "cheap-ass gimmick"
> Rest assured it's safe from HTML content leaking into the web
> interface - the worst that will happen is that you'll see HTML source
> in the hovertip.
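
A rough sketch of how such a hovertip could be built safely, assuming modern Python and its html.escape(). The function name hovertip, the parameters max_lines and max_chars, and the crude tag_re pattern are assumptions for illustration; this is not the pop3proxy code.

```python
import html
import re

# Crude tag stripper for illustration; not the tokenizer.py regexp.
tag_re = re.compile(r"<[^>]{0,256}>")

def hovertip(body, max_lines=3, max_chars=200):
    """Return the first few non-blank lines of body, escaped for use
    inside an HTML attribute such as title="..."."""
    text = tag_re.sub(" ", body)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    tip = " ".join(lines[:max_lines])[:max_chars]
    # Escaping happens last, so any markup the stripper missed shows
    # up as visible HTML source instead of being interpreted.
    return html.escape(tip, quote=True)

print(hovertip('<b>Hello</b>\nline two\n<script alert("x")'))
```

Because escaping is the final step, a failure of the stripping regexp only affects what the tip looks like, not what the browser executes.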
A giant <style .. </style> section near the start seems the most likely
I have half a mind to replace the comment and style nuking with an
iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
crack_urls(), which only use regexps to help find the right places to poke
at -- they can't blow the C stack). But if you're doing this from
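
One way such an iterative scheme might look, as a sketch only: small regexps find the start and end markers, and a plain loop does the cutting, so no single match ever has to span the whole construct. The helper names strip_spans and strip_style_and_comments are hypothetical and are not taken from tokenizer.py.

```python
import re

def strip_spans(text, start_re, end_re):
    """Remove each start...end span, replacing it with a space.

    The regexps only locate the markers; the loop does the rest, so
    the span length never affects the regexp engine's recursion.
    """
    pieces = []
    pos = 0
    while True:
        m = start_re.search(text, pos)
        if m is None:
            break
        e = end_re.search(text, m.end())
        if e is None:
            break  # unterminated construct: keep the remainder as is
        pieces.append(text[pos:m.start()])
        pos = e.end()
    pieces.append(text[pos:])
    return " ".join(pieces)

style_start = re.compile(r"<\s*style\b", re.IGNORECASE)
style_end = re.compile(r"</style\s*>", re.IGNORECASE)
comment_start = re.compile(r"<!--")
comment_end = re.compile(r"-->")

def strip_style_and_comments(text):
    text = strip_spans(text, style_start, style_end)
    return strip_spans(text, comment_start, comment_end)

big = "<style>" + "x" * 3000 + "</style>Buy now"
print(repr(strip_style_and_comments(big)))
```

This sketch has no length cap, so an accidental "<!--" would consume text up to the next "-->"; a cap could be added by checking e.start() - m.end() against a limit before cutting.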