I /think/ I've caught up on this thread, but I'm sure I've missed a bunch. As I see it there are really these issues to protecting email addresses in Mailman:
- list admin addresses
- public archives
- private archives
- raw archive
- list rosters
For #1, MM2.1 changes what gets included at the bottom of list pages. The admin's personal address is no longer included in the link's text or in mailto: href. In the mailto: you'll see something like firstname.lastname@example.org and in the text you'll see something like "barry at zope.com". I see no point in trying to obscure the former -- or put it behind a web form -- because it's easily guessed given a probe of existing lists, as is every other list-related email address. More on protecting the -owner from spam below. I claim that the guessability is a feature, btw.
MM3 will likely integrate admin addresses and list memberships into an object called a "roster" (essentially just a list of email addresses). This will let us define a pipeline for each roster, which could include a spam filter that performs an action based on some criteria (e.g. drop it, reject it, mark a header, etc.). So we can do more protection on the -owner address than we can do now (without hacking). Rosters and the improved user database will allow us to actually equate admin email addresses with Real Names, so you could conceivably see something like
List run by <a href="mailto:email@example.com">Barry Warsaw</a>
at the bottom of the pages. You'd be within your rights to argue that end users never even need know who admins the list, but I think it helps to avoid the "faceless droid" syndrome.
Mailman should avoid getting deeply into the spam detection and prevention business, except for some really really basic stuff (probably not much more or less than it does now). It should integrate well with external spam detection programs like SpamAssassin or commercial equivalents. E.g. if we always send the message through SA, and the message gets some score, we could decide to hold messages below say 5.0 on the Spamster Scale, discard anything about 5.0, etc.
As for #2, I'd go for the low-tech approach of simply discarding the hostname part of the email address in all public archives. Certainly this is easy in the headers, and we'll have to decide whether we're going to expend the resources to do body searches for email addresses, and obfuscate those as well. If people want to make contacts based on some public archive message, they can email the list. Until we've got web-posting, I don't think it matters if they lose the full email address in the public archives.
As for #3, I don't mind not obscuring the email addresses since a login will be required. If we think we don't trust the current private archive login procedures to be secure against bots, then we can fix that, but I don't see it as a high priority.
#4 is interesting too. I'm not against putting the raw archive behind a turing-test, since I suspect that very few people will ever want it. It means that we won't be able to write an automated wget-ish script to do off-site backups, but so be it.
Things to note for #'s 2-4:
The Pipermail implementation has lots of well-known problems. I'm personally not willing to spend a lot of time fixing them, and I still recommend Real Sites use a Real Archiver. I've just thrown the majority of the email obfuscation problems over the fence into someone else's back yard <wink>.
Adding public archive obfuscation is fine and dandy for new messages added to the archives but what about all the existing archived messages? Re-running Pipermail (i.e. bin/arch) to regenerate from scratch has two significant drawbacks. 1) Message url's can change, especially if you also fix broken From_ delimiters, and that in turn breaks bookmarks, 2) on large mboxes, you simply can't do bin/arch because of memory problems.
Someone needs to step up and "own" Pipermail if any of these problems are going to be fixed, or if the obfuscation is going to happen.
Remember that Pipermail itself is completely optional. An API is defined between Mailman and the archiver and that's all the interaction they have. Maybe the API needs to be more elaborate to support obfuscation. It definitely needs some changes if we ever want to add some of the features I'd like to add (but that's off-topic here).
I'll note that one of the early design decisions for Pipermail was that public archives should be vended directly from the file system for performance reasons. That decision may not be appropriate for today's operations. Certainly maintaining two static versions of the pages isn't feasible, so I think you have to vend one or the other (probably the obfuscated version) from a cgi.
Anybody who's really interested in archiving (I'm not) should take a look at Zest and think about contributing to it, so that it does stuff like this correctly from the start. I really like the Zest interface and think it ought to at least be a bundled alternative to Pipermail, some day.
Nobody's even mentioned #5, which are available publically via the "Visit Subscriber List" button, or the email command "who" to the -request address. If I were a spam harvester, I wouldn't even bother with scanning the archives if either of these were publically enabled. When you turn them off, especially the former, just remember that you've now made it much harder for Joe User to unsubscribe themselves. Catch 22.