RE: [Spambayes] don't update if you don't want to retrain
From: Tim Stone - Four Stones Expressions
> So... does this lay to rest forever the pickle/dbm debate? Is there any reason left to use a pickle?
Sorry, quite the opposite (IMHO). The patch switches to using shelve, which uses anydbm, which (still) uses the buggy BerkeleyDB 1.85 on Windows. So Windows users should probably still use pickles.

Basically, you're never going to avoid the fact that Windows users don't have a reliable DBM implementation by default (unless you count dumbdbm). So you either use pickles, or ship/require some 3rd party solution.

[Assuming that *in practice* the risk involved with the DBM 1.85 bugs is high enough to be worth worrying about - it's only a Python 2.2 issue, as 2.3 will have a newer DBM implementation included].

Paul.
So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:
> From: Tim Stone - Four Stones Expressions
> > So... does this lay to rest forever the pickle/dbm debate? Is there any reason left to use a pickle?
> Sorry, quite the opposite (IMHO). The patch switches to using shelve, which uses anydbm, which (still) uses the buggy BerkeleyDB 1.85 on Windows. So Windows users should probably still use pickles.
I've just checked in a new anydbm that has a more appropriate list of database back-ends to try on the Windows platform. But it needs someone with a Windows box to fix the dumb test I put in it:

    # XXX: Some windows dude should fix this test
    if sys.platform == "windows":
        # dbm on windows is awful.
        _names = ["dbhash", "gdbm", "dumbdbm"]
    else:
        _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]

So, if you are a Windows dude and feel up to fixing that test, please do so, and remove the first comment while you're at it :)

This should eliminate any dbm concerns for Windows folk.

Neale
[Neale]
> I've just checked in a new anydbm that has a more appropriate list of database back-ends to try on the Windows platform. [...] This should eliminate any dbm concerns for Windows folk.
You left dbhash in the list - that's just another interface to the broken bsddb. And if that gets removed, Windows users will be left with dumbdbm - the name doesn't inspire confidence, and the docstring says "XXX TO DO: - seems to contain a bug when updating..."

As far as I can see there's a complete solution available to these DBM problems. Perhaps I've missed something, but I've been back over all the discussions and I can't see anything wrong with it:

 o We demand bsddb 3 or better on platforms where bsddb is the dbm implementation that gets picked up. So until Python 2.3 is released, Windows users need to install pybsddb. I've just done this and it's trivial. (We already demand a new "email" library and no-one's complained.) Would this cause problems on any other platforms?

 o If training goes slowly, we implement Tim Peters' idea: "Bulk training could be taught to use a new classifier based on an in-memory dict. When that's done, the in-memory dict's ham and spam counts would be added into the persistent DB (rewriting only those WordInfo records corresponding to words that appeared in the bulk training data), and then the in-memory dict could be thrown away."

 o Or (Neale) you were talking about writing a caching front-end for the DBM (regardless of which actual DBM was behind it) - that would work as well.

Wouldn't that solve *everything*? Startup times would be quick, training would be quick, no buggy DBM implementations would be used, and different components wouldn't default to different storage formats (hammie vs. pop3proxy). Installing pybsddb on Windows is trivial, and once Python 2.3 comes out you won't even need to do that.

I've probably missed something - it's hard to keep up!

--
Richie Hindle
richie@entrian.com
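The bulk-training idea quoted from Tim Peters above can be sketched roughly as follows. This is only an illustration, not the Spambayes implementation: the function name `bulk_train`, the `(spam_count, ham_count)` tuple layout, and the choice to count each word once per message are all assumptions made for the example. The `db` argument stands in for the persistent store and can be any dict-like object (a plain dict, a shelve, and so on).

```python
from collections import defaultdict


def bulk_train(db, messages):
    """Merge counts from a batch of messages into db in one pass.

    db       -- hypothetical dict-like store: word -> (spam_count, ham_count)
    messages -- iterable of (words, is_spam) pairs
    Returns the number of records written to db.
    """
    # Step 1: accumulate counts in memory; db is not touched yet.
    pending = defaultdict(lambda: [0, 0])
    for words, is_spam in messages:
        # Assumption: a word counts once per message, however often it occurs.
        for word in set(words):
            pending[word][0 if is_spam else 1] += 1

    # Step 2: rewrite only the records for words seen in this batch.
    for word, (spam, ham) in pending.items():
        old_spam, old_ham = db.get(word, (0, 0))
        db[word] = (old_spam + spam, old_ham + ham)

    # Step 3: the in-memory dict is simply dropped when the function returns.
    return len(pending)
```

Each word seen in the batch costs one read and one write against the persistent store, regardless of how many messages contained it, which is where the speed-up over per-message updates would come from.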
Richie Hindle <richie@entrian.com> writes:
> As far as I can see there's a complete solution available to these DBM problems. Perhaps I've missed something, but I've been back over all the discussions and I can't see anything wrong with it:
>
>  o We demand bsddb 3 or better on platforms where bsddb is the dbm implementation that gets picked up. So until Python 2.3 is released, Windows users need to install pybsddb. I've just done this and it's trivial. (We already demand a new "email" library and no-one's complained.) Would this cause problems on any other platforms?
I'm all in favour of this. However, it's worth pointing out a couple of things:

1. Email is pure python, bsddb is not only in C, but also needs a 3rd party library (Sleepycat DB). No problem on Windows (Python 2.3 will come with it built in, and there's a trivial-to-install binary build for 2.2 users), but might it cause problems on Unix systems?

2. On Unix, as I understand it, it's possible to use the new Sleepycat DB with the old Python module. So Unix users quite possibly don't need to bother with bsddb 3.

The simple answer is to require bsddb 3 on Windows with Python 2.2, and otherwise use it if present, otherwise use the built-in dbhash (and assume that a suitably up to date Berkeley DB is behind it).

But as I said, I'm happy with your approach - I only offer this if Unix users don't like the bsddb 3 requirement...

Paul.
--
This signature intentionally left blank
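The "simple answer" above amounts to a small decision rule, which could be written down along these lines. This is a sketch only: the function `choose_backend` and its parameters are hypothetical, and the availability of pybsddb is passed in as a flag rather than detected by a real import so that the rule can be read and tested on its own.

```python
def choose_backend(platform, python_version, have_bsddb3):
    """Return the name of the DB module to use under the rule described above.

    platform       -- value of sys.platform, e.g. "win32" or "linux2"
    python_version -- (major, minor) tuple, e.g. (2, 2)
    have_bsddb3    -- True if the pybsddb package (bsddb3) is installed
    """
    # Use bsddb 3 wherever it is present.
    if have_bsddb3:
        return "bsddb3"
    # Windows with Python older than 2.3 only has the old, buggy Berkeley DB,
    # so bsddb 3 is a hard requirement there.
    if platform == "win32" and python_version < (2, 3):
        raise RuntimeError("pybsddb (bsddb3) is required on Windows with Python 2.2")
    # Everywhere else, fall back to the built-in dbhash and assume a suitably
    # recent Berkeley DB is behind it.
    return "dbhash"
```

For example, `choose_backend("linux2", (2, 2), False)` falls through to `"dbhash"`, while the same call with `"win32"` raises instead of silently using the 1.85 library.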
    Paul> 1. Email is pure python, bsddb is not only in C, but also needs a 3rd
    Paul>    party library (Sleepycat DB). No problem on Windows (Python 2.3
    Paul>    will come with it built in, and there's a trivial-to-install binary
    Paul>    build for 2.2 users), but might it cause problems on Unix systems?

Unlikely. Most Unixes have had recent versions of Sleepycat's library available for a long time. Versions 3 or 4(.0) are required for pybsddb. Failing that, Version 2 doesn't suffer from the bugs that Version 1 does. The old bsddb will still be available, just not built by default.

    Paul> 2. On Unix, as I understand it, it's possible to use the new Sleepycat
    Paul>    DB with the old Python module. So Unix users quite possibly don't
    Paul>    need to bother with bsddb 3.

Correct. The new module has already been checked into CVS though, so Unix types will get it as the default but be able to fall back to Version 2 (or even 1) if they want.

don't-worry-about-us-we're-just-fine-ly, y'rs,

Skip
Neale Pickett <neale@woozle.org> writes:
> I've just checked in a new anydbm that has a more appropriate list of database back-ends to try on the Windows platform. But it needs someone with a Windows box to fix the dumb test I put in it:
>
>     # XXX: Some windows dude should fix this test
>     if sys.platform == "windows":
>         # dbm on windows is awful.
>         _names = ["dbhash", "gdbm", "dumbdbm"]
>     else:
>         _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
I see someone changed "windows" to "win32". But the other problem is more serious. Windows doesn't *have* gdbm or dbm - the problem lies with "dbhash" (the Berkeley DB implementation). So the Windows branch should be

    if sys.platform == "win32":
        # The Berkeley DB implementation on Windows is out of date
        _names = ["gdbm", "dbm", "dumbdbm"]

(or probably just _names = ["dumbdbm"]).

Paul.
--
This signature intentionally left blank
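Putting the platform test and the back-end list together, the selection logic under discussion might look something like the sketch below. The helper names `candidate_backends` and `first_importable` are made up for illustration and are not part of anydbm. The module names are the Python 2 ones from the thread, so on a modern interpreter none of them will import and `first_importable` would simply return None.

```python
import importlib
import sys


def candidate_backends(platform=sys.platform):
    """Return the back-end module names to try, in order of preference."""
    if platform == "win32":
        # The Berkeley DB behind dbhash on Windows is out of date, so skip it.
        return ["gdbm", "dbm", "dumbdbm"]
    return ["dbhash", "gdbm", "dbm", "dumbdbm"]


def first_importable(names, importer=importlib.import_module):
    """Return the first name in names that imports cleanly, or None."""
    for name in names:
        try:
            importer(name)
        except ImportError:
            continue  # this back-end is not installed; try the next one
        return name
    return None
```

Passing the importer in as a parameter keeps the ordering logic testable without any of the real modules installed.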
participants (5)
- Moore, Paul
- Neale Pickett
- Paul Moore
- Richie Hindle
- Skip Montanaro