[Spambayes] New web training interface for pop3proxy

Sat Nov 23 22:40:40 2002

11/23/2002 4:10:27 PM, Tim Peters <tim.one@comcast.net> wrote:

>[David Ascher]
>> Make 'hovertips' that display the first few lines of the body
>
>[Richie Hindle]
>> This is done.  The code to strip HTML content uses a regular expression
>> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
>> interested to see how well people find it works!
>
>It works very well except when it doesn't <wink>.  The chief damned-
>whether-you-do-or-don't problem:  I've seen several msgs with HTML style
>sheets and/or HTML comments exceeding 2K characters.  The 2K limit in the
>minimal matches serves two purposes:
>
>1. Prevent the C stack from blowing up in the regexp engine.  But
>   François Granger reported a C stack blowup anyway on Mac OS 9,
>   and I still have no clue how small a limit would prevent that on
>   his box.
>
>2. Prevent it from consuming an arbitrary amount of text in case
>   we matched a "begin long construct" character sequence by accident.
>   It's *unlikely* that random text contains "<style" or "<!--"
>   by accident, though, so I'm not much worried about that one.
>
>> (Apologies to Tim - it seems to work extremely well.)
>
>Yes, when it works at all <wink>.  Fixing it in all cases requires doing
>real HTML parsing, and that's expensive, so the current "cheap-ass gimmick"
>is accurate.
>
>> Rest assured it's safe from HTML content leaking into the web
>> interface - the worst that will happen is that you'll see HTML source
>> in the hovertip.
>
>A giant <style .. </style> section near the start seems the most likely
>glitch here.  Are you using this regexp *from* Python, or from Javascript?
>I have half a mind to replace the comment and style nuking with an
>iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
>crack_urls(), which only use regexps to help find the right places to poke
>at -- they can't blow the C stack).  But if you're doing this from
>Javascript, that wouldn't help you.
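
For what it's worth, here is a rough Python sketch of what an iterative,
stack-friendly scheme along those lines might look like.  It is not the code
in tokenizer.py: the name strip_long_constructs and both patterns are made up
for illustration.  The regexps only locate an opener and its closer; the loop
does the cutting, so no single match ever has to span the whole construct.

```python
import re

# Hypothetical patterns, not taken from tokenizer.py.  Each one matches only
# a short literal, so the regexp engine never walks across a whole construct.
OPENER = re.compile(r"<!--|<style\b", re.IGNORECASE)
CLOSERS = {
    "<!--": re.compile(r"-->"),
    "<style": re.compile(r"</style\s*>", re.IGNORECASE),
}

def strip_long_constructs(text):
    """Cut out comments and style sections of any length, iteratively."""
    pieces = []
    pos = 0
    while True:
        opener = OPENER.search(text, pos)
        if opener is None:
            break
        closer = CLOSERS[opener.group().lower()].search(text, opener.end())
        if closer is None:
            # Unterminated construct: leave the remainder untouched.
            break
        pieces.append(text[pos:opener.start()])
        pos = closer.end()
    pieces.append(text[pos:])
    return "".join(pieces)
```

There is no 2K limit here, so an accidental "<!--" would swallow text up to
the next "-->".  A length check on the span between opener and closer could
bring the limit back if that matters.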

A giant <style> section or a giant comment, yes, though neither occurs that
often.  Another tag that is probably huge and worthless is
<script>...</script>, often wrapped in a huge comment.  (But do scripts even
occur in emailed HTML?)  We should probably use another cheap-ass gimmick to
get rid of those constructs first, then use the cheap-ass regex to get rid of
the rest of the HTML.
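
Something like the following two-pass sketch is what I have in mind.  Both
patterns are my own stand-ins, not the ones in tokenizer.py, and the 2048 and
256 limits are only guesses in the spirit of the 2K limit discussed above.

```python
import re

# Hypothetical stand-ins, not the patterns used in tokenizer.py.
# Pass 1: bulky constructs (script, style, comments), bounded minimal match.
BULKY = re.compile(
    r"<(script|style)\b.{0,2048}?</\1\s*>|<!--.{0,2048}?-->",
    re.IGNORECASE | re.DOTALL,
)
# Pass 2: any remaining short tag; '<' followed by whitespace is not a tag.
ANY_TAG = re.compile(r"<(?![\s<>])[^>]{0,256}>")

def strip_html(text):
    """Replace bulky constructs first, then whatever tags are left."""
    text = BULKY.sub(" ", text)
    return ANY_TAG.sub(" ", text)
```

Running the bulky pass first matters: the second pattern stops at the first
'>', so on its own it would strip the <script> tags but leave the script
source behind in the hovertip.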

One other problem I see with the regex is that it doesn't seem to handle
tags with ill-placed whitespace very well, like < a href=...  A substitution
regex that normalizes whitespace might be well advised.  Taking out the
whitespace after a < would change "a < b" to "a <b", which doesn't alter its
meaning from a clue perspective, and would change <  a href=... to
<a href=..., making it recognizable to the cheap-ass gimmick regex.
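
A minimal sketch of that normalization, assuming we only care about a '<'
followed by whitespace and then a tag name (optionally preceded by '/').  The
function name and the pattern are mine, not anything in the current code.

```python
import re

# Hypothetical pattern: whitespace right after '<', but only when what
# follows looks like the start of a tag name or a closing tag.
SPACE_AFTER_LT = re.compile(r"<\s+(?=/?[A-Za-z])")

def normalize_tag_whitespace(text):
    """Drop whitespace between '<' and a following tag name."""
    return SPACE_AFTER_LT.sub("<", text)
```

With this, "<  a href=x>" becomes "<a href=x>" and "a < b" becomes "a <b",
while something like "1 < 2" is left alone because no letter follows.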

There was some talk earlier about gleaning clues from some tags, like
background, font, color, and so on.  Any more thought along those lines?
- Tim
www.fourstonesExpressions.com