
An interesting issue came up today while we were playing with a Bayesian spam classifier. Mailman's archives aren't very clean. Messages are sent to the archiver after various header-munging steps, including adding the List-* headers and the Subject prefix.
We still want to do some munging, e.g. for anonymous lists. This tells me that we may want to move ToArchive up before CookHeaders in the global pipeline.
I don't think we want to move ToDigest or ToUsenet because I think we /want/ those headers munged before the message is sent to the digests or news server. What do you think?
-Barry

Hi,
Barry A. Warsaw wrote:
The headers are in the raw archive and not in the monthly (or quarterly, weekly) text format archive. I would rather stop publicizing the raw archive even if the other archives are publicly accessible. At least it should be configurable (in mm_cfg).
We use a modified version of mailman 2.0.x in Japan and we like a feature of adding numbers in the subject header. The users tend to reference articles by the number not by the archive URL. So, we want the archive to be munged. BTW, I'm preparing a patch for numbering the subject prefix.
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/

"TK" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
>> An interesting issue came up today while we were playing with a
>> Bayesian spam classifier. Mailman's archives aren't very
>> clean. Messages are sent to the archiver after various
>> header-munging steps, including adding the List-*
>> headers and the Subject prefix.
TK> The headers are in the raw archive and not in the monthly (or
TK> quarterly, weekly) text format archive. I would rather stop
TK> publicizing the raw archive even if the other archives are
TK> publicly accessible. At least it should be configurable (in
TK> mm_cfg).
Some headers are stripped before being added to the quarterly/weekly mini-archive, but both see messages /after/ they've been munged.
(On the second point, I'll try to look at patch #594771. That would seem like a good opportunity to make raw archives optional.)
>> We still want to do some munging, e.g. for anonymous lists.
>> This tells me that we may want to move ToArchive up before
>> CookHeaders in the global pipeline.
TK> We use a modified version of mailman 2.0.x in Japan and we
TK> like a feature of adding numbers in the subject header. The
TK> users tend to reference articles by the number not by the
TK> archive URL. So, we want the archive to be munged.
That seems to be the consensus, i.e. the archive should reflect what the members get. Makes sense -- if you want a more pristine archive, you can interpose a tee to a file before the message gets to Mailman, or you could add a different handler module. I'll leave things as they are.
TK> BTW, I'm preparing a patch for numbering the subject prefix.
Cool. But this is likely a new feature that will have to wait until after 2.1 final.
Thanks, -Barry

On Mon, 26 Aug 2002 21:53:44 -0400 Barry A Warsaw <barry@zope.com> wrote:
...
What do you think?
Specifically that I want to archive the exact message, down to the byte, that subscribers receive. Beyond that I don't care.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

J C Lawrence <claw@kanga.nu> wrote:
Specifically that I want to archive the exact message, down to the byte, that subscribers receive. Beyond that I don't care.
I'd love to have this, and in addition (perhaps in a separate file) the exact headers and first ten lines of all incoming messages (postings, admin requests, subscription confirmations, everything).
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

"NB" == Norbert Bollow <nb@cisto.com> writes:
>> Specifically that I want to archive the exact message, down to
>> the byte, that subscribers receive. Beyond that I don't care.
NB> I'd love to have this, and in addition (perhaps in a separate
NB> file) the exact headers and first ten lines of all incoming
NB> messages (postings, admin requests, subscription
NB> confirmations, everything).
If someone were to work up a patch <wink>, the way I'd do this would be to add an IncomingLogger.py handler module that logged the information you want to logs/incoming. Then I'd stick this at the top of GLOBAL_PIPELINE.
-Barry
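
A rough sketch of what such a handler might look like. It assumes the Mailman 2.1 handler convention of a module-level process(mlist, msg, msgdata) function. The summarize() helper, the extra log argument, and the ten-line cutoff are illustrative only, and the writing is done to a plain file object rather than through Mailman's own logging facilities.

```python
import email
import io

BODY_LINES = 10  # how many body lines to keep, per the request above


def summarize(msg, body_lines=BODY_LINES):
    """Return all headers plus the first few body lines of msg."""
    headers = "".join("%s: %s\n" % (name, value) for name, value in msg.items())
    payload = msg.get_payload()
    if not isinstance(payload, str):
        # Multipart message: this sketch only records the headers.
        payload = ""
    body = "\n".join(payload.splitlines()[:body_lines])
    return headers + "\n" + body + "\n"


def process(mlist, msg, msgdata, log=None):
    """Handler entry point: append a summary of msg to the given log."""
    # `log` is any writable text file object (hypothetical extra argument);
    # a real handler would open logs/incoming itself.
    if log is not None:
        log.write(summarize(msg))
        log.write("-" * 40 + "\n")


# Small demonstration with a 20-line message and an in-memory log.
raw = "From: a@example.com\nSubject: test\n\n" + "\n".join(
    "line %d" % i for i in range(1, 21)
)
message = email.message_from_string(raw)
buffer = io.StringIO()
process(None, message, {}, log=buffer)
logged = buffer.getvalue()
```

Placing a module like this first in GLOBAL_PIPELINE would let it see each message before any other handler touches it.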

Barry A. Warsaw wrote:
If someone were to work up a patch <wink>
I think it's not likely for any such patches to come from me anytime soon, as I have bigger fish to fry. Specifically I'm going forward with implementing a MySQL-based archives system which can be used as a drop-in replacement for Pipermail and which will also provide the functionalities of a web board and a search engine optimization system.
(Yes, it'll be 100% Python, and GPL'd Free Software).
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

On Wed, 28 Aug 2002, Norbert Bollow wrote:
Can I request that you build this in an SQL implementation independent manner (so one could drop in PostgreSQL in place of MySQL and have it work)?
I'm building my DB-backed piece against MySQL, but I'm trying to build it in an implementation independent manner as I'd like to change over to Postgres...
(Since MySQL doesn't support transactions, I'm not sure how I'll handle the .Save() piece yet...)
-Dale

On Wednesday, August 28, 2002, at 08:49 AM, Barry A. Warsaw wrote:
It does on InnoDB files, which seem to be their file structure of the future. Not on MyISAM.
But for an archiving system, is that really a big deal? But in any event, using InnoDB files, it's a non-issue.
(frankly, I've found you can do a lot of stuff without transactions quite nicely. For this, I just can't believe you need transactions, the volume of inserts and reads just isn't that large.)
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
Stress is when you wake up screaming and you realize you haven't fallen asleep yet.
-- Chuq Von Rospach, Architech, Apple IS&T E-mail systems chuq@apple.com

"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> (frankly, I've found you can do a lot of stuff without
CVR> transactions quite nicely. For this, I just can't believe you
CVR> need transactions, the volume of inserts and reads just
CVR> isn't that large.)
Right, sorry, I was asking for a different reason. ;)
-Barry

On Wed, Aug 28, 2002 at 09:04:41AM -0700, Chuq Von Rospach wrote:
It does on BerkeleyDB too.
Also:
MyISAM has table-level locking
BDB has page-level locking
InnoDB has row-level locking
-- Adde parvum parvo magnus acervus erit. Simone Piunno, FerraraLUG - http://members.ferrara.linux.it/pioppo

Are there any BerkeleyDB experts on this list? Apologies for the off-topic message, but if anybody has any clues I'd appreciate an *off-list* response.
"SP" == Simone Piunno <pioppo@ferrara.linux.it> writes:
SP> It does on BerkeleyDB too.
SP> Also:
| MyISAM has table-level locking
| BDB has page-level locking
| InnoDB has row-level locking
I've been doing a lot of work with straight BerkeleyDBs for Zope's ZODB, and I've run into a possibly fatal problem for our application.
The basic problem is that a Zope transaction is essentially unbounded in the number of objects it touches. Each object modified translates into updates to one or more BDB tables. I'm using BDB BTrees and transactions, so that translates to one lock per level of the database plus one lock per page. It's /possible/ that transactions can touch a huge number of pages, and because BDB allocates a static number of locks (growable, but only before the environment is opened), it's likely that we'll hit a transaction that exhausts the locks.
There seems to be no way to avoid this. Cranking the BDB locks up just begs the question; eventually we run out of locks anyway, plus the more locks you allocate the more resources you consume. We could 'solve' this if BDB supported table-level locking, because we'd just lock the one or two tables that have unbounded updates and be done with it.
Any suggestions folks have would be greatly appreciated. If more information is needed, contact me directly.
ObMailman: It would be cool if MM3.0 had a ZODB backend, and I'd love to use a Berkeley based storage for that backend. No, this won't be /required/ for MM3.0 so you can stop fretting. ;)
Thanks, -Barry

On Thu, 29 Aug 2002 13:59:54 -0400 Barry A Warsaw <barry@python.org> wrote:
ObMailman: It would be cool if MM3.0 had a ZODB backend...
Agreed. The ZODB semantics are quite nice, tho cleanly implementing purging (some will wish to never purge) could be interesting.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

"JCL" == J C Lawrence <claw@kanga.nu> writes:
>> ObMailman: It would be cool if MM3.0 had a ZODB backend...
JCL> Agreed. The ZODB semantics are quite nice, tho cleanly
JCL> implementing purging (some will wish to never purge) could be
JCL> interesting.
You mean packing? One of the reasons for my question on BerkeleyDB is because I'm working on an autopacking storage based on BerkeleyDB. I think it would be a better choice for the back-back-end, although the insanity of BerkeleyDB on Linux distros might make that problematic.
If we were to go with a FileStorage, then yes, packing is an issue. Of course, with a Berkeley storage you get all the headaches of Berkeley maintenance too.
This is still a long ways off, so there's time to think about it.
-Barry

At 13:59 29/08/2002 -0400, Barry A. Warsaw wrote:
Although I'm no expert, my impression is that the Multi-Version Concurrency Control approach of PostgreSQL is a better fit with the transaction model of ZODB, certainly as it is used in Zope for instance, than record and table locking schemes used by such other databases that even encompass the concept of a transaction. It is also my impression that MVCC gives a better fit for doing transaction rollback and retry to resolve database update contention.
That said, any application doing bulk updates of a database that doesn't split the activity into coherent, commit-able sub-transactions is looking for trouble.
I would venture that if you have to do such an update then it would be best done by locking the whole database and ripping through the job and then letting everybody else back in. Rolling back and retrying a transaction involving very large numbers of database changes is not good stuff and if you get enough contention the big transaction will never succeed.

On Wed, 28 Aug 2002, Dale Newfield wrote:
D'oh! Um, I guess I should've read the rest of the posts before responding.
Now to go read up on that new table type, and how tables of differing types interact with one another.
I'm still concerned about having all of the operations in the lifetime of a mailman session reside in a single (possibly never confirmed) transaction, and whether that provides the interface we want (i.e., set values not actually getting set until .Save() time).
-Dale

"DN" == Dale Newfield <Dale@Newfield.org> writes:
DN> I'm still concerned about having all of the operations in the
DN> lifetime of a mailman session reside in a single (possibly
DN> never confirmed) transaction, and whether that provides the
DN> interface we want (I.E., set values not actually getting set
DN> until .Save() time).
The intent is that .Save() marks the transaction boundary and is equivalent to a transaction commit. The qrunners are careful to only lock the lists when they need to modify list attributes, and they always save the list inside the try clause, but unlock the list in the finally clause, e.g.:
    mlist.Lock()
    try:
        # do some stuff to mlist's attributes
        mlist.Save()
    finally:
        mlist.Unlock()
Thus, if the code in the try causes an exception, the list will still get unlocked (otherwise the list would be hosed), but the transaction is aborted by virtue of not getting saved. A subsequent load of the mlist would begin a new transaction, with the old data.
It's this last bit that's dodgy. There should probably be an explicit abort on exceptions inside the try, but there's no way to spell that with the current, legacy persistence mechanism, so it isn't in any of the code. I /think/ that if your MailList.Load() implicitly aborts any active transaction, you should be okay, but of course, none of that's tested.
-Barry
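
To make the idea of an explicit abort concrete, here is a toy sketch. ToyList is a stand-in written for this illustration, not Mailman's MailList, and the Abort() method is hypothetical: it is the call that, as noted above, the legacy persistence mechanism has no way to spell. The sketch also assumes that Lock() reloads the saved state.

```python
import copy


class ToyList:
    """Stand-in for a mailing list: in-memory state plus a saved copy."""

    def __init__(self):
        self._saved = {"subject_prefix": "[old] "}
        self.data = copy.deepcopy(self._saved)
        self.locked = False

    def Lock(self):
        self.locked = True
        self.Load()  # assumption: locking starts from the last saved state

    def Unlock(self):
        self.locked = False

    def Load(self):
        self.data = copy.deepcopy(self._saved)

    def Save(self):
        self._saved = copy.deepcopy(self.data)  # the commit point

    def Abort(self):
        # Hypothetical: discard unsaved changes. This could be a no-op for
        # old-style persistence; an SQL backend would roll back here.
        self.Load()


def update_prefix(mlist, prefix, fail=False):
    mlist.Lock()
    try:
        mlist.data["subject_prefix"] = prefix
        if fail:
            raise RuntimeError("simulated failure before Save()")
        mlist.Save()
    except Exception:
        mlist.Abort()  # explicit abort, then let the error propagate
        raise
    finally:
        mlist.Unlock()


mlist = ToyList()
try:
    update_prefix(mlist, "[new] ", fail=True)
except RuntimeError:
    pass
after_failure = mlist.data["subject_prefix"]   # unsaved change was discarded

update_prefix(mlist, "[ok] ")
after_success = mlist.data["subject_prefix"]   # saved change is kept
```

With the abort in place, the in-memory object no longer carries the half-applied change after a failure, regardless of which backend is behind it.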

On Thu, 29 Aug 2002, Barry A. Warsaw wrote:
The intent is that .Save() marks the transaction boundary and is equivalent to a transaction commit.
Right.
Right, but since .UnLock doesn't reload the data, any code referring to that mlist after unlocking would have different semantics depending upon which MemberAdaptor implementation is backing the list: With OldStyle, membership data in that mlist object is whatever it's been (temporarily) set to (but not saved); with SQL, membership data in that mlist has reverted to whatever it was when .Lock() was called.
Wouldn't that abort be triggered by a call to .UnLock() without a call to .Save()? I would think that all calls to .Lock() and any calls to .UnLock() without a prior call to .Save() should abort any current transaction.
I /think/ that if your MailList.Load() implicitly aborts any active transaction, you should be okay, but of course, none of that's tested.
MailList.Load() or MailList.Lock()? Having .Load() abort any active transaction means that you cannot load other mlists (even read-only) inside a .Lock();try....Save();finally .UnLock() block and have the transaction succeed...
The place these two models (SQL transactions and mailList load, lock/save/unlock) break down is what happens when there is more than one MailList object in memory at a time. SQL transactions assume only one MailList can be modified at a time, and the lock/save/unlock model doesn't make that assumption.
If we can assume that only one mlist gets locked at a time, the SQL system will work, but I see no way to force Mailman developers to abide by that assumption. (Except to implicitly abort active transactions as described above--and since the MailList.Save() method doesn't have a success/failure return code, there would be no indication that anything went wrong except silently ignoring requested DB changes.)
-Dale
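
The clash between the two models can be shown with a small sketch. It uses the standard library's sqlite3 module purely as a convenient stand-in for MySQL, and the SqlList class with its add_member(), Save() and abort() methods is hypothetical. The assumption being illustrated is that all list objects in a process share one database connection, and therefore one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (list TEXT, addr TEXT)")
conn.commit()


class SqlList:
    """Hypothetical list object whose Save/abort act on a shared connection."""

    def __init__(self, conn, name):
        self.conn = conn
        self.name = name

    def add_member(self, addr):
        self.conn.execute("INSERT INTO members VALUES (?, ?)", (self.name, addr))

    def Save(self):
        self.conn.commit()

    def abort(self):
        self.conn.rollback()


list1 = SqlList(conn, "list1")
list2 = SqlList(conn, "list2")

list1.add_member("a@example.com")   # list1 has a pending change
list2.add_member("b@example.com")   # list2 joins the same transaction
list2.abort()                       # meant to undo only list2's change
list1.Save()                        # nothing is left to commit

rows = conn.execute("SELECT list, addr FROM members").fetchall()
```

Here rows comes back empty: aborting list2 also threw away list1's pending insert, and list1.Save() gave no sign that anything was lost.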

"DN" == Dale Newfield <Dale@Newfield.org> writes:
>> Thus, if the code in the try causes an exception, the list will
>> still get unlocked (otherwise the list would be hosed), but the
>> transaction is aborted by virtue of not getting saved. A
>> subsequent load of the mlist would begin a new transaction,
>> with the old data.
DN> Right, but since .UnLock doesn't reload the data, any code
DN> referring to that mlist after unlocking would have different
DN> semantics depending upon which MemberAdaptor implementation is
DN> backing the list: With OldStyle, membership data in that mlist
DN> object is whatever it's been (temporarily) set to (but not
DN> saved); with SQL, membership data in that mlist has reverted
DN> to whatever it was when .Lock() was called.
You're right, and it should be cleaned up, but in practice it should be okay. For the one-shot scripts (cron, cgi), you're restarting the process each time anyway, so no worries. For the qrunner daemons, I basically had to solve the same problem, i.e. the next iteration through the loop needs to have consistent data or you're screwed. Each runner should either do a .Load() or a .Lock() at the top of their _dispose() methods, depending on if they only need read access to the list data or read-write access.
>> It's this last bit that's dodgy. There should probably be an
>> explicit abort on exceptions inside the try, but there's no way
>> to spell that with the current, legacy persistence mechanism,
>> so it isn't in any of the code.
DN> Wouldn't that abort be triggered by a call to .UnLock()
DN> without a call to .Save()? I would think that all calls to
DN> .Lock() and any calls to .UnLock() without a prior call to
DN> .Save() should abort any current transaction.
What about the qrunners that don't lock the list because they only need read access to the data? That's why I think we need an explicit abort, even if it's no-op'd for the old-style persistence.
>> I /think/ that if your MailList.Load() implicitly aborts any
>> active transaction, you should be okay, but of course, none of
>> that's tested.
DN> MailList.Load() or MailList.Lock()? Having .Load() abort any
DN> active transaction means that you cannot load other mlists
DN> (even read-only) inside a .Lock();try....Save();finally
DN> .UnLock() block and have the transaction succeed...
DN> The place these two models (SQL transactions and mailList
DN> load, lock/save/unlock) break down is what happens when there
DN> is more than one MailList object in memory at a time. SQL
DN> transactions assume only one MailList can be modified at a
DN> time, and the lock/save/unlock model doesn't make that
DN> assumption.
Ah, so the problem probably isn't the transaction boundaries, but that Mailman assumes that each list's persistence is completely independent of other lists. I think the one place where you'll get hosed by this is in the cgi's where "global" operations loop through all the lists (yes, this sucks and is inefficient, but it's the best we can currently do). I'm not sure how to get around this, except through some kind of elaborate nested transaction support.
DN> If we can assume that only one mlist gets locked at a time,
DN> the SQL system will work, but I see no way to force Mailman
DN> developers to abide by that assumption. (Except to implicitly
DN> abort active transactions as described above--and since the
DN> MailList.Save() method doesn't have a success/failure return
DN> code, there would be no indication that anything went wrong
DN> except silently ignoring requested DB changes.)
Hmm. I'm going to have to think about this some more. I'm off line right now so can't look at code details.
-Barry

On Thu, 29 Aug 2002, Barry A. Warsaw wrote:
If it's read-only, I don't see the problem unless you need the data it is reading to remain unchanged (locked) for some span of time. Transactions are only related to changes made to the database. A read-only MailList would always be able to read, and at any given point in time the information returned would be that most recently committed. (I.E., If one process is reading and another modifying, the reading process will see the data from before the modifications until the writing process commits, then the reading process would see *all* those mods (no need to worry about incomplete changes--that's the point of an all-or-nothing transaction).)
Right.
I was hoping that these loops were always read-only, or that they could be serialized so that only one MailList is ever locked at a time.
I'm not sure how to get around this, except through some kind of elaborate nested transaction support.
Which is a road we really don't want to go down--Even if some random SQL implementation supported that, SQL doesn't support it in the standard.
Hmm. I'm going to have to think about this some more. I'm off line right now so can't look at code details.
OK. I'm going to go ahead and continue my development hoping that there's no show-stopper here. We do need to continue this conversation, but I also want to actually get stuff working :-)
-Dale
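
The visibility rule described above (a reader sees nothing of a writer's changes until the commit, then sees all of them) can be tried out with two connections to the same database. This sketch again uses sqlite3 from the standard library as a stand-in; the table name and addresses are made up, and other databases may differ in detail depending on their isolation settings.

```python
import os
import sqlite3
import tempfile

# Two connections need a real file; an in-memory database is per-connection.
path = os.path.join(tempfile.mkdtemp(), "lists.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE members (addr TEXT)")
writer.commit()

reader = sqlite3.connect(path)


def visible(conn):
    """Return the member addresses this connection can currently see."""
    return [row[0] for row in conn.execute("SELECT addr FROM members ORDER BY addr")]


writer.execute("INSERT INTO members VALUES ('a@example.com')")
writer.execute("INSERT INTO members VALUES ('b@example.com')")
before_commit = visible(reader)   # writer has not committed yet

writer.commit()
after_commit = visible(reader)    # both rows appear together

writer.close()
reader.close()
```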

"DN" == Dale Newfield <Dale@newfield.org> writes:
>> What about the qrunners that don't lock the list because they
>> only need read access to the data? That's why I think we need
>> an explicit abort, even if it's no-op'd for the old-style
>> persistence.
DN> If it's read-only, I don't see the problem unless you need the
DN> data it is reading to remain unchanged (locked) for some span
DN> of time.
No, I was just thinking about read consistency here. I think you basically want the read state frozen at the start of the _dispose() method, which will either start with a Load() or a Lock(). It might be a bad thing (unless it isn't <wink>) for the state to change during _dispose(), even if the list is only reading its data.
DN> Transactions are only related to changes made to the database.
DN> A read-only MailList would always be able to read, and at any
DN> given point in time the information returned would be that
DN> most recently committed. (I.E., If one process is reading and
DN> another modifying, the reading process will see the data from
DN> before the modifications until the writing process commits,
DN> then the reading process would see *all* those mods (no need
DN> to worry about incomplete changes--that's the point of an
DN> all-or-nothing transaction).)
Part of the point is also to provide consistent state for read-only data. In ZODB for example, it's possible to get read-conflicts if the state of the objects isn't consistent. E.g. you read obj1 from transaction1, which has a reference to obj2. Before you read obj2, process2 has modified obj2 in transaction2. Now process1 reads obj2. Inconsistent state and a read-conflict occurs. (Aside: ZODB4 will likely have multiversion concurrency control which assures that process1 will read obj2's state as it existed in transaction1).
This works in Mailman by doing a Load/Lock at the top of _dispose to sync the in-memory state with the on-disk state for the duration of the method.
>> I think the one place where you'll get hosed by this is in the
>> cgi's where "global" operations loop through all the lists
>> (yes, this sucks and is inefficient, but it's the best we can
>> currently do).
DN> I was hoping that these loops were always read-only, or that
DN> they could be serialized so that only one MailList is ever
DN> locked at a time.
I believe they're serialized (they do write state), /except/ for the "parent" list for the process. E.g. I visit list1 to change my password, but click on "set globally". list1 remains locked while I cycle through the other lists, locking them in turn and making those changes.
We may have to rewrite a few of these loops.
>> I'm not sure how to get around this, except through some kind
>> of elaborate nested transaction support.
DN> Which is a road we really don't want to go down--Even if some
DN> random SQL implementation supported that, SQL doesn't support
DN> it in the standard.
Ok.
>> Hmm. I'm going to have to think about this some more. I'm off
>> line right now so can't look at code details.
DN> OK. I'm going to go ahead and continue my development hoping
DN> that there's no show-stopper here. We do need to continue
DN> this conversation, but I also want to actually get stuff
DN> working :-)
+1 :)
Let us know how it goes. -Barry

Just stumbled across this in Utils.py:
    # TBD: what other characters should be disallowed?
    _badchars = re.compile('[][()<>|;^,/]')
and thought I'd suggest that " and ' get added to that list...
I recently wound up with a list subscriber of the form "foo@bar.baz" (*with* the quotes!) and had a more difficult time fixing it than you might expect.
-Dale

Dale Newfield <Dale@Newfield.org> writes:
and thought I'd suggest that " and ' get added to that list...
Nope, because both of those characters are valid e-mail address components. Not in the form you mention (i.e. the double quote is not allowed in the domain part), but certainly in the local part.
In fact this is a perfectly valid e-mail address:
"f@,'[& "@example.com
And the parser should be able to cope with them, or any other RFC 2822 compliant address.
Darrell
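
For anyone who wants to poke at this, the following sketch runs both addresses from this exchange through the _badchars pattern quoted above and through the standard library's email.utils.parseaddr. parseaddr is used here only as a readily available RFC-style parser; it is not a claim about how Mailman validates addresses.

```python
import re
from email.utils import parseaddr

# The pattern as quoted from Utils.py earlier in the thread.
_badchars = re.compile('[][()<>|;^,/]')

quoted_local = '"f@,\'[& "@example.com'   # the valid example address above
stray_quotes = '"foo@bar.baz"'            # the subscriber entry with quotes

# Split the parsed address at the last "@" into local part and domain.
realname, addr = parseaddr(quoted_local)
local, _, domain = addr.rpartition("@")

# Does the pattern flag either string?
quoted_hits_badchars = bool(_badchars.search(quoted_local))
stray_hits_badchars = bool(_badchars.search(stray_quotes))
```

The parser keeps the quoted local part intact and reports example.com as the domain. The pattern does not flag the stray-quotes entry, since quote characters are not in it, but it does flag the valid quoted address because of the "," and "[" inside the quotes.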

Dale Newfield <Dale@Newfield.org> wrote:
I have to focus on building this to meet my needs, and my customers' needs... PostgreSQL support isn't among these needs as far as I can see. However, since there is a standard for Python database APIs, it shouldn't be too hard for someone to replace MySQLdb with the PostgreSQL equivalent, or even make this a configuration option. (I'll welcome patches for making this a configuration option :-)
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

"NB" == Norbert Bollow <nb@cisto.com> writes:
>> If someone were to work up a patch <wink>
NB> I think it's not likely for any such patches to come from me
NB> anytime soon
Maybe lodge a feature request.
NB> , as I have bigger fish to fry. Specifically I'm
NB> going forward with implementing a MySQL-based archives system
NB> which can be used as a drop-in replacement for Pipermail and
NB> which will also provide the functionalities of a web board and
NB> a search engine optimization system.
NB> (Yes, it'll be 100% Python, and GPL'd Free Software).
Okay, that will be cool! You're off the hook. :)
-Barry

"JCL" == J C Lawrence <claw@kanga.nu> writes:
>> What do you think?
JCL> Specifically that I want to archive the exact message, down
JCL> to the byte, that subscribers receive.
<wink> Of course, that's literally impossible, but I get the intention. Okay, no changes here.
-Barry

Hi,
Barry A. Warsaw wrote:
The headers are in the raw archive and not in the monthly (or quaterly, weekly) text format archive. I would rather stop publicizing the raw archive even if the other archives are public accessible. At least it should be configurable (in mm_cfg).
We use a modified version of mailman 2.0.x in Japan and we like a feature of adding numbers in the subject header. The users tend to reference articles by the number not by the archive URL. So, we want the archive to be munged. BTW, I'm preparing a patch for numbering the subject prefix.
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/

"TK" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
>> An interesting issue came up today while we were playing with a
>> Bayesian spam classifier. Mailman's archives aren't very
>> clean. Messages are sent to the archiver after various
>> headering munging steps, including the adding of the List-*
>> headers and the Subject prefix.
TK> The headers are in the raw archive and not in the monthly (or
TK> quaterly, weekly) text format archive. I would rather stop
TK> publicizing the raw archive even if the other archives are
TK> public accessible. At least it should be configurable (in
TK> mm_cfg).
Some headers are stripped before being added to the quarterly/weekly mini-archive, but both see messages /after/ they've been munged.
(On the second point, I'll try to look at patch #594771. That would see like a good opportunity to make raw archives optional.)
>> We still want to do some munging, e.g. for anonymous lists.
>> This tells me that we may want to move ToArchive up before
>> CookHeaders in the global pipeline.
TK> We use a modified version of mailman 2.0.x in Japan and we
TK> like a feature of adding numbers in the subject header. The
TK> users tend to reference articles by the number not by the
TK> archive URL. So, we want the archive to be munged.
That seems to be the concensus, i.e. the archive should reflect what the members get. Makes sense -- if you want a more pristine archive, you can interpose a tee to a file before the message gets to Mailman, or you could add a different handler module. I'll leave things as is.
TK> BTW, I'm preparing a patch for numbering the subject prefix.
Cool. But this is likely a new feature that will have to wait until after 2.1 final.
Thanks, -Barry

On Mon, 26 Aug 2002 21:53:44 -0400 Barry A Warsaw <barry@zope.com> wrote:
...
What do you think?
Specifically that I want to archive the exact message, down to the byte, that subscriber's receive. Beyond that I don't care.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

J C Lawrence <claw@kanga.nu> wrote:
Specifically that I want to archive the exact message, down to the byte, that subscriber's receive. Beyond that I don't care.
I'd love to have this, and in addition (perhaps in a separate file) the exact headers and first ten lines of all incoming messages (postings, admin requests, subscription confirmations, everything).
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

"NB" == Norbert Bollow <nb@cisto.com> writes:
>> Specifically that I want to archive the exact message, down to
>> the byte, that subscriber's receive. Beyond that I don't care.
NB> I'd love to have this, and in addition (perhaps in a separate
NB> file) the exact headers and first ten lines of all incoming
NB> messages (postings, admin requests, subscription
NB> confirmations, everything).
If someone where to work up a patch <wink>, the way I'd do this would be to add a IncomingLogger.py handler module that logged the information you want to logs/incoming. Then I'd stick this at the top of GLOBAL_PIPELINE.
-Barry

Barry A. Warsaw <> wrote:
If someone where to work up a patch <wink>
I think it's not likely for any such patches to come from me anytime soon, as I have bigger fish to fry. Specifically I'm going forward with implementing a MySQL-based archives system which can be used as a drop-in replacement for Pipermail which which will also provide the functionalities of a web board and a search engine optimization system.
(Yes, it'll be 100% Python, and GPL'd Free Software).
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

On Wed, 28 Aug 2002, Norbert Bollow wrote:
Can I request that you build this in an SQL implementation independant manner (so one could drop in PostgreSQL in place of MySQL and have it work)?
I'm building my DB-backed piece against MySQL, but I'm trying to build it in an implementation independant manner as I'd like to change over to Postgres...
(Since MySQL doesn't support transactions, I'm not sure how I'll handle the .Save() piece yet...)
-Dale

On Wednesday, August 28, 2002, at 08:49 AM, Barry A. Warsaw wrote:
It does on InnoDB files, which seem to be their file structure of the future. Not on MyISAM.
But for an archiving system, is that really a big deal? But in any event, using InnoDB files, it's a non-issue.
(frankly, I've found you can do a lot of stuff without transactions quite nicely. For this, I just can't believe you need transacations, the volume of inserts and reads just isn't that large.)
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
Stress is when you wake up screaming and you realize you haven't fallen asleep yet.
-- Chuq Von Rospach, Architech, Apple IS&T E-mail systems chuq@apple.com

"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> (frankly, I've found you can do a lot of stuff without
CVR> transactions quite nicely. For this, I just can't believe you
CVR> need transactions, the volume of inserts and reads just
CVR> isn't that large.)
Right, sorry, I was asking for a different reason. ;)
-Barry

On Wed, Aug 28, 2002 at 09:04:41AM -0700, Chuq Von Rospach wrote:
It does on BerkeleyDB too.
Also:
MyISAM has table-level locking
BDB has page-level locking
InnoDB has row-level locking
-- Adde parvum parvo magnus acervus erit. Simone Piunno, FerraraLUG - http://members.ferrara.linux.it/pioppo

Are there any BerkeleyDB experts on this list? Apologies for the off-topic message, but if anybody has any clues I'd appreciate an *off-list* response.
"SP" == Simone Piunno <pioppo@ferrara.linux.it> writes:
SP> It does on BerkeleyDB too.
SP> Also:
| MyISAM has table-level locking
| BDB has page-level locking
| InnoDB has row-level locking
I've been doing a lot of work with straight BerkeleyDBs for Zope's ZODB, and I've run into a possibly fatal problem, for our application.
The basic problem is that a Zope transaction is essentially unbounded in the number of objects it touches. Each object modified translates into updates to one or more BDB tables. I'm using BDB BTrees and transactions, so that translates to one lock per level of the database plus one lock per page. It's /possible/ that transactions can touch a huge number of pages, and because BDB allocates a static number of locks (growable, but only before the environment is opened), it's likely that we'll hit a transaction that exhausts the locks.
There seems to be no way to avoid this. Cranking the BDB locks up just begs the question; eventually we run out of locks anyway, plus the more locks you allocate the more resources you consume. We could 'solve' this if BDB supported table-level locking, because we'd just lock the one or two tables that have unbounded updates and be done with it.
Any suggestions folks have would be greatly appreciated. If more information is needed, contact me directly.
ObMailman: It would be cool if MM3.0 had a ZODB backend, and I'd love to use a Berkeley based storage for that backend. No, this won't be /required/ for MM3.0 so you can stop fretting. ;)
Thanks, -Barry

On Thu, 29 Aug 2002 13:59:54 -0400 Barry A Warsaw <barry@python.org> wrote:
ObMailman: It would be cool if MM3.0 had a ZODB backend...
Agreed. The ZODB semantics are quite nice, tho cleanly implementing purging (some will wish to never purge) could be interesting.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

"JCL" == J C Lawrence <claw@kanga.nu> writes:
>> ObMailman: It would be cool if MM3.0 had a ZODB backend...
JCL> Agreed. The ZODB semantics are quite nice, tho cleanly
JCL> implementing purging (some will wish to never purge) could be
JCL> interesting.
You mean packing? One of the reasons for my question on BerkeleyDB is because I'm working on an autopacking storage based on BerkeleyDB. I think it would be a better choice for the back-back-end, although the insanity of BerkeleyDB on Linux distros might make that problematic.
If we were to go with a FileStorage, then yes, packing is an issue. Of course, with a Berkeley storage you get all the headaches of Berkeley maintenance too.
This is still a long ways off, so there's time to think about it.
-Barry

At 13:59 29/08/2002 -0400, Barry A. Warsaw wrote:
Although I'm no expert, my impression is that the Multi-Version Concurrency Control approach of PostgreSQL is a better fit with the transaction model of ZODB, certainly as it is used in Zope for instance, than the record and table locking schemes used by those other databases that support transactions at all. It is also my impression that MVCC gives a better fit for doing transaction rollback and retry to resolve database update contention.
That said, any application doing bulk updates of a database that doesn't split the activity into coherent, commit-able sub-transactions is looking for trouble.
I would venture that if you have to do such an update then it would be best done by locking the whole database and ripping through the job and then letting everybody else back in. Rolling back and retrying a transaction involving very large numbers of database changes is not good stuff and if you get enough contention the big transaction will never succeed.

On Wed, 28 Aug 2002, Dale Newfield wrote:
D'oh! Um, I guess I should've read the rest of the posts before responding.
Now to go read up on that new table type, and how tables of differing types interact with one another.
I'm still concerned about having all of the operations in the lifetime of a mailman session reside in a single (possibly never confirmed) transaction, and whether that provides the interface we want (I.E., set values not actually getting set until .Save() time).
-Dale

"DN" == Dale Newfield <Dale@Newfield.org> writes:
DN> I'm still concerned about having all of the operations in the
DN> lifetime of a mailman session reside in a single (possibly
DN> never confirmed) transaction, and whether that provides the
DN> interface we want (I.E., set values not actually getting set
DN> until .Save() time).
The intent is that .Save() marks the transaction boundary and is equivalent to a transaction commit. The qrunners are careful to lock the lists only when they need to modify list attributes, and they always save the list inside the try clause but unlock the list in the finally clause, e.g.:
    mlist.Lock()
    try:
        # do some stuff to mlist's attributes
        mlist.Save()
    finally:
        mlist.Unlock()
Thus, if the code in the try causes an exception, the list will still get unlocked (otherwise the list would be hosed), but the transaction is aborted by virtue of not getting saved. A subsequent load of the mlist would begin a new transaction, with the old data.
It's this last bit that's dodgy. There should probably be an explicit abort on exceptions inside the try, but there's no way to spell that with the current, legacy persistence mechanism, so it isn't in any of the code. I /think/ that if your MailList.Load() implicitly aborts any active transaction, you should be okay, but of course, none of that's tested.
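A sketch of how an explicit abort could be spelled, purely as an illustration. ToyList is a hypothetical stand-in for MailList, and Abort() is an assumed method that does not exist in the current code; for the legacy persistence it could simply be a no-op or a reload.

```python
class ToyList:
    """Hypothetical stand-in for MailList, not Mailman's real class."""

    def __init__(self):
        self._saved = {"description": "old"}  # committed state
        self.data = dict(self._saved)         # in-memory working copy
        self.locked = False

    def Lock(self):
        self.locked = True
        self.data = dict(self._saved)  # locking reloads committed state

    def Save(self):
        self._saved = dict(self.data)  # commit

    def Abort(self):
        # Assumed method: throw away unsaved changes.
        self.data = dict(self._saved)

    def Unlock(self):
        self.locked = False


def update(mlist, change):
    mlist.Lock()
    try:
        change(mlist)
        mlist.Save()
    except Exception:
        mlist.Abort()  # explicit abort instead of relying on a later Load()
        raise
    finally:
        mlist.Unlock()
```

With this shape, the in-memory object already holds the old data again by the time the exception propagates, whichever backend is in use.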
-Barry

On Thu, 29 Aug 2002, Barry A. Warsaw wrote:
The intent is that .Save() marks the transaction boundary and is equivalent to a transaction commit.
Right.
Right, but since .UnLock doesn't reload the data, any code referring to that mlist after unlocking would have different semantics depending upon which MemberAdaptor implementation is backing the list: With OldStyle, membership data in that mlist object is whatever it's been (temporarily) set to (but not saved); with SQL, membership data in that mlist has reverted to whatever it was when .Lock() was called.
Wouldn't that abort be triggered by a call to .UnLock() without a call to .Save()? I would think that all calls to .Lock() and any calls to .UnLock() without a prior call to .Save() should abort any current transaction.
I /think/ that if your MailList.Load() implicitly aborts any active transaction, you should be okay, but of course, none of that's tested.
MailList.Load() or MailList.Lock()? Having .Load() abort any active transaction means that you cannot load other mlists (even read-only) inside a .Lock();try....Save();finally .UnLock() block and have the transaction succeed...
The place where these two models (SQL transactions and MailList load, lock/save/unlock) break down is when there is more than one MailList object in memory at a time. SQL transactions assume only one MailList can be modified at a time, and the lock/save/unlock model doesn't make that assumption.
If we can assume that only one mlist gets locked at a time, the SQL system will work, but I see no way to force mailman developers to abide by that assumption. (Except to implicitly abort active transactions as described above--and since the MailList.Save() method doesn't have a success/failure return code, there would be no indication that anything went wrong except silently ignoring requested DB changes.)
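The failure mode can be sketched with the standard library's sqlite3 module standing in for MySQL. This assumes, as an illustration only, that every MailList shares one database connection and therefore one open transaction; the table layout and helper names are made up.

```python
import sqlite3

# One shared connection means one shared transaction for every list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (listname TEXT, address TEXT)")
conn.commit()


def add_member(listname, address):
    conn.execute("INSERT INTO members VALUES (?, ?)", (listname, address))


def count(listname):
    cur = conn.execute(
        "SELECT COUNT(*) FROM members WHERE listname = ?", (listname,)
    )
    return cur.fetchone()[0]


add_member("list1", "a@example.com")  # list1 is locked, change not yet saved
add_member("list2", "b@example.com")  # list2 is modified inside list1's block
conn.rollback()                       # list2 is unlocked without a Save()
conn.commit()                         # list1's Save() arrives too late
```

After the rollback, neither list has a member: the abort meant for list2 also discarded list1's pending change, and the later commit has nothing left to write and reports no error.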
-Dale

"DN" == Dale Newfield <Dale@Newfield.org> writes:
>> Thus, if the code in the try causes an exception, the list will
>> still get unlocked (otherwise the list would be hosed), but the
>> transaction is aborted by virtue of not getting saved. A
>> subsequent load of the mlist would begin a new transaction,
>> with the old data.
DN> Right, but since .UnLock doesn't reload the data, any code
DN> referring to that mlist after unlocking would have different
DN> semantics depending upon which MemberAdaptor implementation is
DN> backing the list: With OldStyle, membership data in that mlist
DN> object is whatever it's been (temporarily) set to (but not
DN> saved); with SQL, membership data in that mlist has reverted
DN> to whatever it was when .Lock() was called.
You're right, and it should be cleaned up, but in practice it should be okay. For the one-shot scripts (cron, cgi), you're restarting the process each time anyway, so no worries. For the qrunner daemons, I basically had to solve the same problem, i.e. the next iteration through the loop needs to have consistent data or you're screwed. Each runner should do either a .Load() or a .Lock() at the top of its _dispose() method, depending on whether it needs only read access to the list data or read-write access.
>> It's this last bit that's dodgy. There should probably be an
>> explicit abort on exceptions inside the try, but there's no way
>> to spell that with the current, legacy persistence mechanism,
>> so it isn't in any of the code.
DN> Wouldn't that abort be triggered by a call to .UnLock()
DN> without a call to .Save()? I would think that all calls to
DN> .Lock() and any calls to .UnLock() without a prior call to
DN> .Save() should abort any current transaction.
What about the qrunners that don't lock the list because they only need read access to the data? That's why I think we need an explicit abort, even if it's no-op'd for the old-style persistence.
>> I /think/ that if your MailList.Load() implicitly aborts any
>> active transaction, you should be okay, but of course, none of
>> that's tested.
DN> MailList.Load() or MailList.Lock()? Having .Load() abort any
DN> active transaction means that you cannot load other mlists
DN> (even read-only) inside a .Lock();try....Save();finally
DN> .UnLock() block and have the transaction succeed...
DN> The place where these two models (SQL transactions and MailList
DN> load, lock/save/unlock) break down is when there is more than
DN> one MailList object in memory at a time. SQL
DN> transactions assume only one MailList can be modified at a
DN> time, and the lock/save/unlock model doesn't make that
DN> assumption.
Ah, so the problem probably isn't the transaction boundaries, but that Mailman assumes that each list's persistence is completely independent of other lists. I think the one place where you'll get hosed by this is in the cgi's where "global" operations loop through all the lists (yes, this sucks and is inefficient, but it's the best we can currently do). I'm not sure how to get around this, except through some kind of elaborate nested transaction support.
DN> If we can assume that only one mlist gets locked at a time,
DN> the SQL system will work, but I see no way to force mailman
DN> developers to abide by that assumption. (Except to implicitly
DN> abort active transactions as described above--and since the
DN> MailList.Save() method doesn't have a success/failure return
DN> code, there would be no indication that anything went wrong
DN> except silently ignoring requested DB changes.)
Hmm. I'm going to have to think about this some more. I'm off line right now so can't look at code details.
-Barry

On Thu, 29 Aug 2002, Barry A. Warsaw wrote:
If it's read-only, I don't see the problem unless you need the data it is reading to remain unchanged (locked) for some span of time. Transactions are only related to changes made to the database. A read-only MailList would always be able to read, and at any given point in time the information returned would be that most recently committed. (I.E., If one process is reading and another modifying, the reading process will see the data from before the modifications until the writing process commits, then the reading process would see *all* those mods (no need to worry about incomplete changes--that's the point of an all-or-nothing transaction).)
Right.
I was hoping that these loops were always read-only, or that they could be serialized so that only one MailList is ever locked at a time.
I'm not sure how to get around this, except through some kind of elaborate nested transaction support.
Which is a road we really don't want to go down--Even if some random SQL implementation supported that, SQL doesn't support it in the standard.
Hmm. I'm going to have to think about this some more. I'm off line right now so can't look at code details.
OK. I'm going to go ahead and continue my development hoping that there's no show-stopper here. We do need to continue this conversation, but I also want to actually get stuff working :-)
-Dale

"DN" == Dale Newfield <Dale@newfield.org> writes:
>> What about the qrunners that don't lock the list because they
>> only need read access to the data? That's why I think we need
>> an explicit abort, even if it's no-op'd for the old-style
>> persistence.
DN> If it's read-only, I don't see the problem unless you need the
DN> data it is reading to remain unchanged (locked) for some span
DN> of time.
No, I was just thinking about read consistency here. I think you basically want the read state frozen at the start of the _dispose() method, which will either start with a Load() or a Lock(). It might be a bad thing (unless it isn't <wink>) for the state to change during _dispose(), even if the list is only reading its data.
DN> Transactions are only related to changes made to the database.
DN> A read-only MailList would always be able to read, and at any
DN> given point in time the information returned would be that
DN> most recently committed. (I.E., If one process is reading and
DN> another modifying, the reading process will see the data from
DN> before the modifications until the writing process commits,
DN> then the reading process would see *all* those mods (no need
DN> to worry about incomplete changes--that's the point of an
DN> all-or-nothing transaction).)
Part of the point is also to provide consistent state for read-only data. In ZODB, for example, it's possible to get read conflicts if the state of the objects isn't consistent. E.g. process1 reads obj1 in transaction1, and obj1 has a reference to obj2. Before process1 reads obj2, process2 modifies obj2 in transaction2. Now process1 reads obj2: the state is inconsistent and a read conflict occurs. (Aside: ZODB4 will likely have multiversion concurrency control, which ensures that process1 will read obj2's state as it existed in transaction1.)
This works in Mailman by doing a Load/Lock at the top of _dispose to sync the in-memory state with the on-disk state for the duration of the method.
>> I think the one place where you'll get hosed by this is in the
>> cgi's where "global" operations loop through all the lists
>> (yes, this sucks and is inefficient, but it's the best we can
>> currently do).
DN> I was hoping that these loops were always read-only, or that
DN> they could be serialized so that only one MailList is ever
DN> locked at a time.
I believe they're serialized (they do write state), /except/ for the "parent" list for the process. E.g. I visit list1 to change my password, but click on "set globally". list1 remains locked while I cycle through the other lists, locking them in turn and making those changes.
We may have to rewrite a few of these loops.
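One possible shape for such a rewrite, sketched under the assumption that the parent list's change can be saved and the list unlocked before the other lists are visited. ToyList, its lock counters, and both function names are hypothetical and exist only to show that no more than one list is locked at any moment.

```python
class ToyList:
    """Hypothetical stand-in for MailList that counts held locks."""

    held = 0      # locks currently held across all lists
    max_held = 0  # high-water mark of simultaneously held locks

    def __init__(self, name):
        self.name = name
        self.passwords = {}

    def Lock(self):
        ToyList.held += 1
        ToyList.max_held = max(ToyList.max_held, ToyList.held)

    def Save(self):
        pass  # persistence is out of scope for this sketch

    def Unlock(self):
        ToyList.held -= 1


def set_password(mlist, address, password):
    mlist.Lock()
    try:
        mlist.passwords[address] = password
        mlist.Save()
    finally:
        mlist.Unlock()


def set_password_globally(parent, others, address, password):
    # Treat the parent like any other list: lock, save, unlock, move on.
    for mlist in [parent] + list(others):
        set_password(mlist, address, password)
```

The trade-off is that the global change is no longer atomic across lists: a failure halfway through leaves some lists updated and others not.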
>> I'm not sure how to get around this, except through some kind
>> of elaborate nested transaction support.
DN> Which is a road we really don't want to go down--Even if some
DN> random SQL implementation supported that, SQL doesn't support
DN> it in the standard.
Ok.
>> Hmm. I'm going to have to think about this some more. I'm off
>> line right now so can't look at code details.
DN> OK. I'm going to go ahead and continue my development hoping
DN> that there's no show-stopper here. We do need to continue
DN> this conversation, but I also want to actually get stuff
DN> working :-)
+1 :)
Let us know how it goes. -Barry

Just stumbled across this in Utils.py:
# TBD: what other characters should be disallowed?
_badchars = re.compile('[][()<>|;^,/]')
and thought I'd suggest that " and ' get added to that list...
I recently wound up with a list subscriber of the form "foo@bar.baz" (*with* the quotes!) and had a more difficult time fixing it than you might expect.
-Dale

Dale Newfield <Dale@Newfield.org> writes:
and thought I'd suggest that " and ' get added to that list...
Nope, because both of those characters are valid e-mail address components. Not in the form you mention (i.e. the double quote is not allowed in the domain part), but certainly in the local part.
In fact this is a perfectly valid e-mail address:
"f@,'[& "@example.com
And the parser should be able to cope with them, or any other RFC 2822 compliant address.
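Both points can be checked against the pattern quoted from Utils.py. This is only a sketch: has_badchars() is a hypothetical helper, and the real validation in Mailman may do more than this one regular expression.

```python
import re

# Pattern as quoted from Utils.py earlier in the thread.
_badchars = re.compile('[][()<>|;^,/]')


def has_badchars(address):
    """Return True if the address contains a disallowed character."""
    return _badchars.search(address) is not None


# The subscriber that slipped through, quotes included.
quoted_whole = '"foo@bar.baz"'
# The valid address with a quoted local part.
quoted_local = '"f@,\'[& "@example.com'

slipped_through = not has_badchars(quoted_whole)
valid_but_rejected = has_badchars(quoted_local)
```

The fully quoted address passes because neither quote character is in the set, while the valid quoted-local-part address is rejected for the comma and bracket inside its quotes. A character blacklist alone cannot tell the two cases apart.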
Darrell

Dale Newfield <Dale@Newfield.org> wrote:
I have to focus on building this to meet my needs, and my customers' needs... PostgreSQL support isn't among these needs as far as I can see. However, since there is a standard for Python database APIs, it shouldn't be too hard for someone to replace MySQLdb with the PostgreSQL equivalent, or even make this a configuration option. (I'll welcome patches for making this a configuration option :-)
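A rough sketch of what such a configuration option might look like, assuming the driver is any module that follows the Python DB-API. The open_archive_db() function and the articles table are hypothetical, and sqlite3 stands in for MySQLdb or a PostgreSQL driver so that the example runs with the standard library alone.

```python
import importlib


def open_archive_db(driver="sqlite3", **connect_args):
    """Import the configured DB-API module and open a connection."""
    module = importlib.import_module(driver)
    return module, module.connect(**connect_args)


module, conn = open_archive_db("sqlite3", database=":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, subject TEXT)")

# Drivers differ in placeholder syntax; the module's paramstyle says which.
placeholder = {"qmark": "?", "format": "%s", "pyformat": "%s"}[module.paramstyle]
cur.execute(
    f"INSERT INTO articles (subject) VALUES ({placeholder})", ("hello",)
)
conn.commit()

cur.execute("SELECT subject FROM articles")
subject = cur.fetchone()[0]
```

The connect() arguments and the SQL dialect still vary between databases, so swapping drivers is rarely a one-line change, but keeping the driver name and placeholder style out of the queries removes the most mechanical part of the work.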
Greetings, Norbert.
-- Founder & Steering Committee member of http://gnu.org/projects/dotgnu/ Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com

"NB" == Norbert Bollow <nb@cisto.com> writes:
>> If someone were to work up a patch <wink>
NB> I think it's not likely for any such patches to come from me
NB> anytime soon
Maybe lodge a feature request.
NB> , as I have bigger fish to fry. Specifically I'm
NB> going forward with implementing a MySQL-based archives system
NB> which can be used as a drop-in replacement for Pipermail which
NB> will also provide the functionalities of a web board and
NB> a search engine optimization system.
NB> (Yes, it'll be 100% Python, and GPL'd Free Software).
Okay, that will be cool! You're off the hook. :)
-Barry

"JCL" == J C Lawrence <claw@kanga.nu> writes:
>> What do you think?
JCL> Specifically that I want to archive the exact message, down
JCL> to the byte, that subscribers receive.
<wink> Of course, that's literally impossible, but I get the intention. Okay, no changes here.
-Barry
participants (10)
- barry@python.org
- barry@zope.com
- Chuq Von Rospach
- Dale Newfield
- Darrell Fuhriman
- J C Lawrence
- Norbert Bollow
- Richard Barrett
- Simone Piunno
- Tokio Kikuchi