[Spambayes] New web training interface for pop3proxy
Sat Nov 23 22:51:31 2002
> The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!
> I have half a mind to replace the comment and style nuking with an
> iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
> crack_urls(), which only use regexps to help find the right places to poke
> at -- they can't blow the C stack). But if you're doing this from
I'm using it from Python, but (currently) only in a relatively unimportant
feature. I wouldn't call it worth changing for the sake of the hovertips,
but it's definitely worth changing to fix François' stack explosion.
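
For illustration, here is a minimal sketch of what an iterative scheme along the lines quoted above might look like. It is not the tokenizer.py code: the function name nuke_comments_and_styles and the marker tables are hypothetical, and the choice to drop everything after an unterminated block is an assumption. The regexps only find where a comment or style block starts and ends; the skipping itself is a plain loop, so the work does not grow the stack with the size of the message.

```python
import re

# Hypothetical sketch, not the real tokenizer.py code.
# The regexps only locate markers; a plain loop does the skipping.
_OPENERS = re.compile(r"<!--|<style\b", re.IGNORECASE)
_CLOSERS = {
    "<!--": re.compile(r"-->"),
    "<style": re.compile(r"</style\s*>", re.IGNORECASE),
}

def nuke_comments_and_styles(text):
    """Return text with HTML comments and <style> blocks removed."""
    pieces = []
    pos = 0
    while True:
        opener = _OPENERS.search(text, pos)
        if opener is None:
            # No more blocks: keep the remainder and stop.
            pieces.append(text[pos:])
            break
        # Keep the text before the block.
        pieces.append(text[pos:opener.start()])
        closer = _CLOSERS[opener.group().lower()].search(text, opener.end())
        if closer is None:
            # Assumption: an unterminated block swallows the rest.
            break
        pos = closer.end()
    return "".join(pieces)
```

Each pattern is a short literal-style match, so a very long comment costs one linear scan for its closing marker rather than one level of matching per character.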
Somewhere (I can't find it right now, but I'll have a proper look ASAP) I
have an attempt at rewriting Tom Christiansen's striphtml program as Python
regexps that avoid non-greedy matching - it's mostly a mechanical rewrite,
changing things like "<.*?>" to "<[^>]*>", and so forth. That might work -
I'll try to dig it out.
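
To make the "<.*?>" versus "<[^>]*>" substitution concrete, here is a small sketch. It is not the striphtml rewrite itself: strip_tags and the two pattern names are made up for illustration, and neither pattern copes with a ">" inside a quoted attribute value, which the real striphtml handles more carefully.

```python
import re

# Hypothetical names, for illustration only.
# Non-greedy form: match as little as possible up to the next ">".
NON_GREEDY_TAG = re.compile(r"<.*?>", re.DOTALL)
# Negated-class form: consume anything that is not ">" and then ">".
NEGATED_CLASS_TAG = re.compile(r"<[^>]*>")

def strip_tags(text, pattern=NEGATED_CLASS_TAG):
    """Remove simple <...> tags from text using the given pattern."""
    return pattern.sub("", text)
```

On simple input the two patterns remove the same text. The negated class says directly where the match must stop, instead of relying on the engine to try ever longer matches until it meets a ">".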