[Mailman-Developers] Opening up a few can o' worms here...

Barry A. Warsaw barry@zope.com
Tue, 16 Jul 2002 17:37:45 -0400


>>>>> "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:

    CVR> First, a minor announcement. I'm no longer in charge of the
    CVR> mailing lists at apple, sort of. We've hired a person
    CVR> full-time, and he's been taking over the lists server as his
    CVR> full-time responsibility, allowing me to go off and work on
    CVR> other projects. I'm still in the loop, just not "it". I'm
    CVR> still going to be heavily involved as we move that box to
    CVR> Mailman 2.1, and after that, probably fade a bit more into
    CVR> the woodwork (I still run my Mailman box at home, however, so
    CVR> I'm not going away. JC, quit jeering)

Congratulations!  I think. ;)

    CVR> One thing we're definitely doing is moving to a cloaked
    CVR> archive. Since we already distribute all archives out of
    CVR> HTTP, not FTP, we're working on a CGI that'll strip all
    CVR> e-mail information out of messages on the fly (among other
    CVR> things, like header cleanup and some trivial formatting
    CVR> fixes). The idea is simple -- we've finally hit the point
    CVR> where you can't put an e-mail address up on a public site
    CVR> under any circumstance safely, so we're having to move to a
    CVR> system where we simply don't do that.

So these are public archives that need to be scrubbed, right?  Until
now, Mailman has taken the approach that public archives are fed
right off the file system by the http server.  We could still do that
if we scrubbed the messages before we archived them, although that
doesn't help with existing archives unless you re-generate them.

So one question is: does the performance trade-off we made 5 years ago
still make sense?  Should we just be vetting all archives through a
CGI, in which we can do fun stuff like cleanse them of email addresses?

We'd obviously have to get rid of the easy access to the raw mbox
file, so another question is whether that's still useful.
Occasionally it's damn handy if you're moving a list or gathering
statistics on it, but on the other hand, it's a rich source of
addresses to mine.  Again, if we scrubbed the messages pre-archiving
we'd likely be ok.

Also, what heuristic do you use to search for email addresses, and
what do you scrub them with?  Do you want to attempt to obscure the
address (e.g. "barry--at--python--dot--org") or replace it altogether
(e.g. "[hidden email address]"), or maybe just replace it with a
truncation (e.g. "[localpart's email address]").
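
To make the three options concrete, here is a minimal sketch of a
scrubber.  Everything in it is an assumption for illustration: the
function name `scrub`, the style names, and especially the regular
expression, which is a deliberately loose heuristic and not a real
RFC 2822 address parser.  It is not what Mailman or Chuq's CGI
actually does.

```python
import re

# Loose heuristic (an assumption, not a full RFC 2822 parser):
# a local part, an "@", then a dotted domain.
ADDR_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def scrub(text, style="obscure"):
    """Return text with anything that looks like an email address scrubbed.

    style is one of:
      "obscure"  -> barry--at--python--dot--org
      "hide"     -> [hidden email address]
      "truncate" -> [barry's email address]
    """
    if style not in ("obscure", "hide", "truncate"):
        raise ValueError("unknown scrub style: %r" % (style,))

    def replace(match):
        local, domain = match.group(1), match.group(2)
        if style == "obscure":
            return local + "--at--" + domain.replace(".", "--dot--")
        if style == "hide":
            return "[hidden email address]"
        return "[%s's email address]" % local

    return ADDR_RE.sub(replace, text)


print(scrub("mail barry@python.org now"))
# -> mail barry--at--python--dot--org now
```

The same function could run either before a message is archived or
inside a CGI at display time; only the first would also clean the
raw mbox file.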

    CVR> I think the Mailman stuff needs to think about this, also. It
    CVR> impacts the archiving setup and other issues, but the
    CVR> harvesters have hit the point where we simply can't risk
    CVR> disclosing that info. It creates other problems -- you can't
    CVR> see a posting in the archive and send email to that person
    CVR> with more questions (or answers), but that seems trivial
    CVR> compared to the problems the spammers are causing.

It kind of plays into Reply-To: munging, doesn't it?  If you won't be
able to reply to the original author, because we're anonymizing
messages, then you might as well munge Reply-To: to go back to the
list because that's the only posting address that makes sense.  And
what if the original poster isn't a member of the list?
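
For reference, munging Reply-To: amounts to something like the
sketch below, written against the standard `email` package.  The
helper name `munge_reply_to` and the list address are made up for
the example; this is not Mailman's actual handler code.

```python
from email import message_from_string


def munge_reply_to(msg, list_address):
    """Point Reply-To: at the list, dropping whatever the poster set."""
    # Deleting a header that is not present is a no-op, so this is
    # safe whether or not the poster supplied a Reply-To: of their own.
    del msg["Reply-To"]
    msg["Reply-To"] = list_address
    return msg


raw = (
    "From: poster@example.com\n"
    "Reply-To: poster@example.com\n"
    "Subject: a question\n"
    "\n"
    "Hello list.\n"
)
msg = munge_reply_to(message_from_string(raw), "mylist@example.org")
print(msg["Reply-To"])
# -> mylist@example.org
```

Note that this throws away the poster's own Reply-To:, which is
exactly the information a non-member poster would need preserved in
order to see any replies.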

Or should Mailman get into the anonymous resender game?  There's
probably a lot we could do here, but given the political risks of
anonymous resenders, do we even want to go there?

    CVR> A secondary issue here is the problem of disclosing admins
    CVR> and admin addresses.

Note that in MM2.1 we go about 1/2 way here.  We include the obscured
email addresses of the list owners as the text in a mailto: tag but we
actually use the list-owner@ address as the mailto: target.  That
might not be enough though.  When we actually have a Real Database
backend we can keep a roster of email+realname and then just include
the realname inside the href:mailto tag.
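
Roughly, the half-way approach produces a link like the one built
below.  This is only a sketch: the helper name `owner_link` and the
" at " obscuring are assumptions for illustration, and the exact
markup MM2.1 emits may well differ.

```python
from html import escape


def owner_link(owner_address, list_owner_address):
    """Build a mailto: link that shows an obscured owner address as its
    text but actually targets the generic list-owner@ address."""
    # Obscuring style is an assumption; any non-mailable rendering works.
    shown = owner_address.replace("@", " at ")
    return '<a href="mailto:%s">%s</a>' % (
        escape(list_owner_address, quote=True),
        escape(shown),
    )


print(owner_link("barry@zope.com", "mailman-owner@python.org"))
# -> <a href="mailto:mailman-owner@python.org">barry at zope.com</a>
```

With a roster of email+realname, `shown` would simply become the
real name and no part of the personal address would appear at all.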
    
    CVR> I know we've hashed that through once, but we've come to the
    CVR> (somewhat reluctant) decision to whitelist all public,
    CVR> non-personal email addresses. We're going to be implementing
    CVR> TMDA to do this, and will be switching all admin to generic
    CVR> addresses that filter through TMDA, as well as things like
    CVR> postmaster@ and the like. While I hate making users jump
    CVR> through hoops to get through to a real person (for those that
    CVR> don't know, TMDA is an overt whitelist. If you're not on the
    CVR> whitelist, you get mail back telling you to take some action,
    CVR> and until you do, the mail isn't delivered), but the abuse by
    CVR> the spammers on admin addresses is now so bad I'm declaring
    CVR> defeat and going to the whitelist.

Have you looked at SpamAssassin, Chuq?  It's really done wonders to
reduce the amount of spam actually getting through any python.org or
zope.org address.  I know 'cause I see the daily reports of
quarantined messages.  Very few false positives too (usually it's
email amongst our postmasters talking about spam or SA ;).  I feel a
lot better about this approach than TMDA'ing essential addresses like
postmaster or mailman-owner.

    CVR> I'm going to look and see if I can interface TMDA to the
    CVR> subscriber databases so that subscribers are by definition
    CVR> whitelisted, but we've hit the point where we have to do
    CVR> this. I'm not happy about it, but the war is lost, I think.

Sigh.

    CVR> So what he did was open up his address book and send his
    CVR> message to everyone in it. And he's running one of these new
    CVR> e-mail clients that happily caches addresses it sees in case
    CVR> you want them again. So all of the addresses of people
    CVR> posting to the mailing lists he subscribed to were in his
    CVR> address book cache, so when he grabbed his address book, he
    CVR> grabbed all of those addresses, too.

Wonderful.  I think this has been presaged by Klez which does
essentially the same thing w/o human intervention or such good
intent. ;)

    CVR> But now we're wondering if we have to go to some sort of
    CVR> address cloaking ON lists, maybe some kind of address
    CVR> remapping through the server for replies, something. And I'm
    CVR> gritting my teeth at the developers who created those
    CVR> @#$@$#@$#23 caches (which are nice in some ways) for not also
    CVR> creating some way to flag addresses as not
    CVR> cacheable. Because, IMHO, that'd solve this problem.

Yup, but of course it implies that the clients play by the rules, and
we know that they don't all, so the question is what we're willing to
give up for the security of our online personas.  Kinda mirrors
today's large questions in the WoT(tm), eh?  Maybe people are more
willing to give up their rights than their conveniences for some added
security.

    CVR> Are we hitting a point where mail list servers have to act as
    CVR> blind front ends for all of the subscribers, where replies
    CVR> are processed by those servers, and the server then takes on
    CVR> the job of acting as a troll-exterminator and spam blocker? 
    CVR> And what does that really mean for things like Mailman?

World domination of course.  Because we /could/ add that stuff fairly
easily if we had the resources to expend on it.  Would it still be
useable?  For some audiences yes, others no.  I'm fairly sure the
kind of anonymizing we're talking about would never fly in the Python
and Zope community, whereas it's probably essential in a less
cloistered environment like lists.apple.com.  Which leads me to
believe that we need to make it much easier to install themes or
styles of lists, from the paranoid anonymizer to the laissez-faire
discussion list.

    CVR> Happy Macworld Expo week, all. If you need me, I'll be in the
    CVR> war room, beating my head against a wall.

Any chance you could make it down to DC for a side trip?  We could
have a Mailman hacking sprint over a few dozen steamed Maryland blue
crabs and some cold ones. :)

-Barry