[spambayes-dev] modified version of sb_dbexpimp.py
skip at pobox.com
Tue Mar 16 18:21:47 EST 2004
(Not noticing that Tony cc'd spambayes-dev, I originally sent this reply just
to him.)
>> Since this is rather late in the game 1.0-wise I would like a little
>> extra feedback before checking this stuff in.
Tony> I was too late trying it out for this, but it (cvs version) also
Tony> works for me.
Tony> One query (I've used the csv module quite a bit since I moved to
Tony> 2.3, but only reading, never writing, so haven't noticed this
Tony> before): I see that it writes rows with '\r\n' termination, so in
Tony> Excel I get blank lines between every row (with a file as long as
Tony> the spambayes database, this means I miss a lot of data).
The csv file should be opened in "wb" mode. I thought I caught that. Can
you take a quick look? Also, you are talking about using the real csv
module, not the compatcsv thing, right?
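To illustrate why the open mode matters, here is a minimal sketch. It is not
the sb_dbexpimp.py code. It assumes a modern Python 3, where the equivalent of
"wb" for the csv module is text mode with newline="". Passing newline="\r\n"
is only a stand-in for Windows text-mode translation, so the doubled '\r' shows
up on any platform. The rows and file names are made up.

```python
import csv
import os
import tempfile

# Made-up sample rows; not taken from a real spambayes database.
rows = [["token", "1", "2"], ["other token", "3", "4"]]

def write_rows(path, newline):
    """Write rows with csv.writer and return the raw bytes that hit the disk."""
    with open(path, "w", newline=newline) as f:
        csv.writer(f).writerows(rows)
    with open(path, "rb") as f:
        return f.read()

tmpdir = tempfile.mkdtemp()

# newline="" turns off newline translation, so the writer's own '\r\n'
# terminator is stored unchanged (the Python 3 counterpart of "wb").
good = write_rows(os.path.join(tmpdir, "good.csv"), "")

# newline="\r\n" translates every '\n' to '\r\n', imitating Windows text
# mode: the writer's '\r\n' becomes '\r\r\n', which Excel shows as blank rows.
bad = write_rows(os.path.join(tmpdir, "bad.csv"), "\r\n")

print(repr(good))
print(repr(bad))
```

If the file is opened without translation, the default '\r\n' terminator
should need no per-platform override.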
Tony> Should we provide an option to the dbexpimp script to change the
Tony> line terminator to '\n'? (Simple enough to do, if I read the csv
Tony> doc correctly). Or maybe just have a "if sys.platform == "win32":
Tony> lineterminator = '\n'" kinda thing?
No, I don't think so. It seems we have a bug to squash. We control
everything about reading and writing that file. We should be able to make
it work without any hints from the user.
Tony> For example, I'll want to see how often an experimental token gets
Tony> used, or something like that. A lot of the time I could just use
Tony> a shell script (even on Windows <wink>) to get around the long
Tony> pathname, anyway. Forget I mentioned it ;)
Okay. Here's a simple use of spamcounts:
% spamcounts -d ~/tmp/tte.db -r 'long cons word'
long cons word,32,7,0.797764401748
subject:long cons word,9,0,0.97619047619
It says to report on all tokens in tte.db that match the regular expression
(using re.search) 'long cons word'. Without the -r it only matches the
first token. (It also runs a lot faster.)
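The difference between the two modes can be sketched roughly as follows. This
is not the spamcounts source. The dictionary, the function names, and the
reading of "without -r" as a direct lookup of the given token are all
assumptions for illustration. The point is only that a regular-expression
report has to scan every token, which would explain the speed difference.

```python
import re

# Hypothetical stand-in for the token database: token -> opaque record.
token_db = {
    "long cons word": (32, 7),
    "subject:long cons word": (9, 0),
    "short word": (1, 1),
}

def lookup_exact(db, token):
    """Direct lookup of one token: a single dictionary access."""
    return [(token, db[token])] if token in db else []

def lookup_regex(db, pattern):
    """Scan every token and keep those where re.search finds the pattern."""
    rx = re.compile(pattern)
    return [(tok, rec) for tok, rec in sorted(db.items()) if rx.search(tok)]

exact = lookup_exact(token_db, "long cons word")
matched = lookup_regex(token_db, "long cons word")

print(exact)
print(matched)
```

Because re.search matches anywhere in the string, the 'subject:' variant is
picked up as well as the bare token.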