[Spambayes] CRM114 in November breaks 99.9%. :-)

Matt Sergeant msergeant@startechgroup.co.uk
Mon Dec 2 16:21:10 2002


Bill Yerazunis said the following on 02/12/02 15:57:
>    From: Matt Sergeant <msergeant@startechgroup.co.uk>
> 
>    CRM114's learn and classify stuff looks really interesting, but it has a 
>    really freaky syntax to someone who is used to regular procedural or OO 
>    languages like Perl, Python, C, etc. 
> 
> It _is_ procedural, it's just extremely high level.  Perhaps higher-level
> than APL if you count statements rather than operators.

Sorry, I meant "procedural like Perl/Python/C" not "procedural, like 
Perl/Python/C". Actually maybe Python shouldn't be in that list since it 
has a weirdass syntax too :-)

>    Is there *any* chance the library 
>    in crm114 for learning and classifying can be extracted into a plain 
>    .so? That would be tremendous, and I'd be willing to build a perl XS 
>    library for it in a heartbeat.
> 
> Yes, it's not difficult to get at the code.  
> 
> Pop the .gz open, emacs the file crm114.c, and look for the case
> headers "CRM_LEARN" and "CRM_CLASSIFY" respectively.  The code there
> is _not_ generated, but executed in-line, so cut and paste will work.
> 
> The current code requires a null-terminated string as input, but
> that's because of the GNU regex library limits (when TRE gives me a
> new library, that requirement will go away).  You _will_ need to link
> it against a regex library (of your choice, CRM114 uses the standard
> ANSI regcomp/regexec calling sequence), and the OS itself needs to
> support stat() [for file existence/length] and mmap() [to map a file
> into virtual memory without actually reading it in a byte at a time-
> this is just for efficiency and can be worked around].

I was thinking of punting on splitting the email into tokens, leaving that 
to the host language. Since Perl and Python both support POSIX regexps (and 
thus [[:graph:]]) it's probably easier that way. Unless there's an 
inherent reason it has to be embedded in the library.
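
A rough sketch of what host-side tokenizing could look like in Python, assuming the tokens are simply runs of [[:graph:]] characters. One caveat: Python's standard re module does not accept the [[:graph:]] bracket syntax (Perl does), so the class is spelled out here as the ASCII range it covers in the C locale. The names GRAPH_RUN and tokenize are hypothetical.

```python
import re

# ASCII printable, non-space characters, '!' (0x21) through '~' (0x7E).
# This is the set POSIX [[:graph:]] matches in the C locale.
GRAPH_RUN = re.compile(r"[!-~]+")


def tokenize(text):
    """Split text into maximal runs of printable non-space characters."""
    return GRAPH_RUN.findall(text)
```

For example, tokenize("Buy now!  Cheap\tpills") gives ['Buy', 'now!', 'Cheap', 'pills']. Non-ASCII bytes act as separators in this sketch, which may or may not match what the library does.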

> How bad do you want it? :-)

What interests me is the hashing technique. It should be reasonably easy 
to extract that, but for me it's just a lack of tuits - it's hard enough 
keeping up with my regular day to day activities, and my todo list never 
gets shorter.

> (*) all in all, I like the way it ended up; one can just type programs
> on the command line and they do useful things.  But hindsight is always
> 20/20, and "less weirdass" might be better in the long run.

I imagine you'd get a few more users with a regular syntax ;-)

Matt.



