[Spambayes] CRM114 in November breaks 99.9%. :-)
Matt Sergeant
msergeant@startechgroup.co.uk
Mon Dec 2 16:21:10 2002
Bill Yerazunis said the following on 02/12/02 15:57:
> From: Matt Sergeant <msergeant@startechgroup.co.uk>
>
> CRM114's learn and classify stuff looks really interesting, but it has a
> really freaky syntax to someone who is used to regular procedural or OO
> languages like Perl, Python, C, etc.
>
> It _is_ procedural, it's just extremely high level. Perhaps higher-level
> than APL if you count statements rather than operators.
Sorry, I meant "procedural like Perl/Python/C", not "procedural, like
Perl/Python/C". Actually, maybe Python shouldn't be in that list, since it
has a weirdass syntax too :-)
> Is there *any* chance the library
> in crm114 for learning and classifying can be extracted into a plain
> .so? That would be tremendous, and I'd be willing to build a perl XS
> library for it in a heartbeat.
>
> Yes, it's not difficult to get at the code.
>
> Pop the .gz open, emacs the file crm114.c, and look for the case
> headers "CRM_LEARN" and "CRM_CLASSIFY" respectively. The code there
> is _not_ generated, but executed in-line, so cut and paste will work.
>
> The current code requires a null-terminated string as input, but
> that's because of the GNU regex library limits (when TRE gives me a
> new library, that requirement will go away). You _will_ need to link
> it against a regex library (of your choice, CRM114 uses the standard
> ANSI regcomp/regexec calling sequence), and the OS itself needs to
> support stat() [for file existence/length] and mmap() [to map a file
> into virtual memory without actually reading it in a byte at a time-
> this is just for efficiency and can be worked around].
I was thinking of punting the splitting of the email into tokens back to
the host language. Since Perl and Python both support POSIX regexps (and
thus [[:graph:]]), it's probably easier that way. Unless there's an
inherent reason it has to be embedded in the library.
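A minimal sketch of host-side tokenizing, in Python. One caveat: Perl accepts [[:graph:]] directly, but Python's re module does not parse POSIX bracket classes, so the sketch assumes ASCII input and spells the class out as the range [!-~] (every printable character except space). The names GRAPH_RUN and tokenize are hypothetical, and this is not necessarily how CRM114 itself splits its input.

```python
import re

# ASCII equivalent of POSIX [[:graph:]]+ : runs of printable,
# non-space characters (0x21 through 0x7E).
GRAPH_RUN = re.compile(r"[!-~]+")

def tokenize(text):
    """Return the runs of printable, non-space ASCII characters in text."""
    return GRAPH_RUN.findall(text)

tokens = tokenize("Subject: Buy now!!!\n\nClick here")
```

The resulting token list could then be handed to whatever learn/classify entry points the extracted library ends up exposing.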
> How bad do you want it? :-)
What interests me is the hashing technique. It should be reasonably easy
to extract that, but for me it's just a lack of tuits - it's hard enough
keeping up with my regular day to day activities, and my todo list never
gets shorter.
> (*) all in all, I like the way it ended up; one can just type programs
> on the command line and they do useful things. But hindsight is always
> 20/20, and "less weirdass" might be better in the long run.
I imagine you'd get a few more users with a regular syntax ;-)
Matt.