[Spambayes] Re: Collecting word lists.. - BUMMER

Tim Peters tim_one at email.msn.com
Sun May 25 23:33:36 EDT 2003


[Brad]
> This is only the first pass at 'analysis'. I had thought that I would
> be saving data and making multiple passes. So, to save on RAM, I felt
> that converting unique sha hashes to ints would use less memory in
> later passes (loading sets from pickles, etc).

Ah, premature optimization <0.5 wink>.  Get it working correctly first.
I'll emphasize again that it's simply impossible for 7 databases not to have
any words in common.  For example, just from this sentence, the tokens
"for", "just", "from", "this", "the" and "should" should be in everyone's
database.  It's even more so for spambayes databases, due to what should be
universally present synthesized tokens like

    'url:www'
and
    'proto:http'

Those should be present in every database even if the source is entirely
non-English, and a lot of synthesized header-line tokens should be present
everywhere too.
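
A quick sanity check along those lines might look like the sketch below. It
assumes each database can be reduced to a plain collection of token strings;
the function name and the sample data are made up for illustration and are
not part of spambayes.

```python
def common_tokens(databases):
    """Return the set of tokens present in every database.

    `databases` is any iterable of token collections (hypothetical
    stand-ins for the real word lists).
    """
    common = None
    for db in databases:
        tokens = set(db)
        # The first database seeds the result; later ones narrow it down.
        common = tokens if common is None else common & tokens
    return common if common is not None else set()


if __name__ == "__main__":
    sample = [
        {"for", "the", "url:www", "proto:http", "viagra"},
        {"for", "the", "url:www", "proto:http", "python"},
    ]
    # An empty result here would point at a bug in how the lists were built.
    print(sorted(common_tokens(sample)))
```

If this comes back empty for real databases, the problem is almost certainly
in how the lists were produced or stored, not in the data.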

When you get beyond this hurdle, Neil Schemenauer's msg will be important
too.

>>>     >>> f = open("key-hash", "w")

>> SHA digests are binary data, so it's necessary to open the output
>> file in "wb" mode (and "w" mode is silently deadly on Windows).

> Is the default 'b' on Linux?

No, but there's no difference between "w" and "wb" on Unix systems.  There
is on Windows and Macs, and more so on Windows.
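
To make the point concrete, here is a minimal sketch of writing and reading
raw digests in binary mode. It uses the modern hashlib module as a stand-in
for the sha module, and it assumes the file is nothing but 20-byte SHA-1
digests written back to back; both the helper names and that layout are
assumptions for illustration, not the format the collection script uses.

```python
import hashlib

DIGEST_SIZE = 20  # a SHA-1 digest is always 20 raw bytes


def write_digests(path, tokens):
    """Write one raw SHA-1 digest per token, back to back."""
    # "wb" matters: a digest can contain any byte value, including 0x0A,
    # and text mode on Windows rewrites that byte as a CR LF pair.
    with open(path, "wb") as f:
        for token in tokens:
            f.write(hashlib.sha1(token.encode("utf-8")).digest())


def read_digests(path):
    """Read the digests back as a set of 20-byte strings."""
    digests = set()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(DIGEST_SIZE)
            if len(chunk) < DIGEST_SIZE:
                break  # end of file (or a truncated trailing record)
            digests.add(chunk)
    return digests
```

With a layout like this, the file size should be an exact multiple of 20
bytes; a file that isn't is a hint that something was translated in text
mode somewhere along the way.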

> Can we get new contributions using "wb"?
>
> I've cleaned out the upload directory, you can use the same names.

Could you post the instructions again, please?  Uploaders should be careful
to ensure binary-mode transfers, too (ftp command "binary" before
uploading).



