"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> Actually, yes. I won't be working 65+ hours a week any more,
CVR> so I sort of get my life back, and may actually have time to
CVR> think stuff through and do more than emergency
CVR> patching... (for more, see
CVR> <http://www.chuqui.com/cgi-bin/mwf/topic_show.pl?tid=348>). Also
CVR> means I can actually start some non-Apple hacking again, I
CVR> hope. And what I'll be doing is lots of fun, although the
CVR> next six weeks is going to be a crunch. Still doing email,
CVR> just off building a new custom system for stuff I can't talk
CVR> about...
Very interesting, and congrats on getting your life back. Also, my apologies for not responding earlier, but I think you understand probably as well as anybody. :)
CVR> One thing we're definitely doing is moving to a cloaked
CVR> archive. Since we already distribute all archives out of
So these are public archives that need to be scrubbed, right? Until now, Mailman has taken the approach that public archives are fed right off the file system by the HTTP server. We could still do that if we scrubbed the messages before we archived them, although that doesn't help with existing archives unless you regenerate them.
CVR> Here's why I won't do that. I want to keep ONE set of
CVR> archives. You can't scrub those archives for two
CVR> reasons. What if someone writes looking to get in contact
CVR> with the author of a message? If the archive is scrubbed,
CVR> that info is gone. And (god forbid), you get into a legal
CVR> tangle? That's your legal record of what was said on the mail
CVR> list and who said it. If you scrub it, and someone does
CVR> something actionable or libelous and you get a court order to
CVR> provide that data? You're hosed.
Excellent points all, I completely agree.
CVR> On a more likely note -- I can see where you might want the
CVR> option to show the archives unscrubbed to validated users,
CVR> and only scrub the public archives. As paranoid as I'm being
CVR> today, I'd STILL like to find a way to let subscribed users
CVR> see the archives unscrubbed. Which you could do by setting a
CVR> cookie that the CGI could accept and change its behavior.
Yup, all possible if we give up the notion of vending the public archives from disk. We pay in cpu, but oh well, that's cheap these days, isn't it?
CVR> So I really like leaving the archives unmodified, and doing
CVR> the scrubbing via CGI. It also allows you to do other things,
CVR> like header cleanups (and you could potentially let a user
CVR> set a cookie to define minimal or full headers, say...) and
CVR> some quickie cleanup against unwrapped text and some other
CVR> incidental archive glitches.
CVR> I come from a newspaper family, so I have a bias towards "you
CVR> don't unpublish stuff, you don't change it once it's
CVR> published". But I think there are good reasons to avoid
CVR> sanitizing the archives, and instead sanitizing the delivery
CVR> of those archives -- if only because if your policies change,
CVR> all you need to change is the CGI. And it gives you the
CVR> ability to set up different sets of abilities per user or per
CVR> list if you want, too.
Again, excellent points.
So one question is: does the performance trade-off we made 5 years ago still make sense? Should we just be vetting all archives through a CGI, in which we can do fun stuff like cleanse them of email addresses?
CVR> One of the big things I dislike about Mhonarc is that
CVR> archives are a rather low-usage system, but maintaining the
CVR> Mhonarc index pages is rather intensive use of system
CVR> resources. Sort of like usenet -- you do a lot of work on
CVR> everything, in case someone wants anything. I think simply
CVR> storing the archives and sanitizing on demand is lower
CVR> overhead. It also means pipermail won't need ANY changes --
CVR> you simply feed it out through the CGI instead of directly,
CVR> and everything magically sanitizes...
Yup. Wanna help write the script?
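For the sake of discussion, here is a minimal sketch of the decision such a script would have to make: hand the stored message back untouched to a validated subscriber, and scrub it for everybody else. The cookie name, the token set, and the function names are all made up for illustration; none of this is existing Mailman code, and a real script would validate the cookie against the list's membership data instead of a hard-coded set.

```python
import re
from http.cookies import SimpleCookie

# Hypothetical names: the cookie name and the token store are assumptions.
COOKIE_NAME = "archive_auth"
VALID_TOKENS = {"s3cret-token"}

# Crude address heuristic, only good enough for a sketch.
ADDRESS = re.compile(r"\b[^\s@]+@[^\s@]+\.[^\s@]+\b")


def is_subscriber(cookie_header):
    """Return True if the Cookie header carries a token we recognize."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    return morsel is not None and morsel.value in VALID_TOKENS


def serve_message(stored_text, cookie_header):
    """Return the archived text as-is for subscribers, scrubbed otherwise.

    The text on disk is never modified; only the delivered copy changes.
    """
    if is_subscriber(cookie_header):
        return stored_text
    return ADDRESS.sub("[email address deleted]", stored_text)
```

Because the scrubbing happens only in serve_message(), a policy change means changing this one function and leaving the archive on disk alone.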
We'd obviously have to get rid of the easy access to the raw mbox file, so another question is whether that's still useful.
CVR> Honestly? I don't think so. I find them real kludgy. I ended
CVR> up doing a new archiving system (one file per message) via a
CVR> perl script. We're about to take our new search engine out of
CVR> beta with the thing, finally.
I find the mboxes really handy for gathering statistics, but maybe that's because Python has some really nice tools to troll through them (e.g. we use the python-list mbox to stress ZODB). And it's also handy if you move lists, but I think that's about it. I'm sure "regular users" wouldn't care if we hid the mboxes. BTW, that's all true even if you go to a one-file-per-message layout a la MH.
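As an illustration of the statistics point, here is a small sketch using the standard library's mailbox module to count posts per author in an mbox file. The function name and the choice of statistic are assumptions picked for the example, not a description of any existing script.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def posts_per_author(mbox_path):
    """Count messages per (lowercased) From: address in an mbox file."""
    counts = Counter()
    # create=False: fail loudly rather than create an empty mbox by accident.
    for message in mailbox.mbox(mbox_path, create=False):
        _, address = parseaddr(str(message.get("From", "")))
        counts[address.lower()] += 1
    return counts
```

Calling posts_per_author() on a list's mbox and then .most_common(10) on the result gives the ten most frequent posters.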
Also, what heuristic do you use to search for email addresses, and what do you scrub them with?
CVR> Still being worked on. Right now, I'm basically doing a
CVR> <wordboundary><nonwhitespace>@<nonwhitespaceordot><dot><nonwhitespace><wordboundary>.
CVR> I don't know how strongly we'll refine it.
Cool.
Do you want to attempt to obscure the address (e.g. "barry--at--python--dot--org")?
CVR> Anything you programmatically obscure will be
CVR> programmatically de-obscured. This technique is bogus and
CVR> guaranteed to fail as soon as the spammers care enough. It's
CVR> pretty clear even the "randomized obscuring" of slashdot is a
CVR> failed technique, since spambots don't have to decode ALL of
CVR> those formats, just some of them, and then cycle through the
CVR> site enough times....
CVR> Sorry, I find this is a false security. Makes the users feel
CVR> better, accomplishes nothing useful, so in reality, users get
CVR> lazy and careless. So to some degree, I feel it's worse than
CVR> nothing. I'm planning on replacing email addresses with
CVR> something useful like [email address deleted].
Agreed.
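To make the heuristic and the replacement concrete, here is a rough Python rendering of it. The regular expression is only one reading of the word-boundary / non-whitespace description quoted earlier in this message (it additionally refuses a second "@" inside a match), so treat it as an assumption and not as the pattern Chuq is actually running; the placeholder string is the one suggested above.

```python
import re

# One interpretation of:
#   <wordboundary><nonwhitespace>@<nonwhitespace><dot><nonwhitespace><wordboundary>
# This is an assumption for illustration, not the pattern actually in use.
ADDRESS = re.compile(r"\b[^\s@]+@[^\s@]+\.[^\s@]+\b")
PLACEHOLDER = "[email address deleted]"


def scrub(text):
    """Replace anything that looks like an email address with a placeholder.

    Nothing about the original address survives in the output, so there is
    nothing for a harvester to decode.
    """
    return ADDRESS.sub(PLACEHOLDER, text)
```

For example, scrub("Contact barry@python.org for details") returns "Contact [email address deleted] for details". A pattern this loose will also catch things that merely look like addresses, such as Message-IDs, which may or may not be what you want.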
CVR> disclosing that info. It creates other problems -- you can't
CVR> see a posting in the archive and send email to that person
CVR> with more questions (or answers), but that seems trivial
CVR> compared to the problems the spammers are causing.
It kind of plays into Reply-To: munging, doesn't it? If you won't be able to reply to the original author, because we're anonymizing messages, then you might as well munge Reply-To: to go back to the list because that's the only posting address that makes sense.
CVR> Yes (he says, grimacing).
:)
CVR> If you sanitize the archives, I don't think it affects the
CVR> list. There are simply NO mailtos any more in the archives.
CVR> If you go the step further and anonymize the postings ON the
CVR> list, so subscriber email addresses simply are never shown to
CVR> other subscribers under any circumstances (ugh. Urp. I can't
CVR> believe I'm saying that. This is so anti-community it hurts),
CVR> you have no choice and reply-to has to point to the list,
CVR> since it's the only contact point left.
Yup.
CVR> If you instead turn the list server into a forwarding agent,
CVR> as in:
Or should Mailman get into the anonymous resender game? There's probably a lot we could do here, but given the political risks of anonymous resenders, do we even want to go there?
CVR> Is it an anonymous remailer? We're making no pretense of
CVR> anonymity here. We're acting as a forwarding agent, ala
CVR> hotmail.com or mac.com. You mail to id13194@python.org, and
CVR> it ends up in my mailbox. The fact that we're not explicitly
CVR> denoting the real email address doesn't make us an anonymous
CVR> remailer -- that'd be a policy issue, actually. I suppose you
CVR> could take it that step further, but you could also set it up
CVR> so validated subscribers could get to the real addresses.
CVR> The model I'm thinking of is like many forum systems. If
CVR> you're a guest, you don't get access to email info. If you're
CVR> a subscriber, you log on, and they magically appear. In the
CVR> case of mailing lists, since you lose control of the e-mail
CVR> address once it leaves the site again, you handle this by
CVR> only using the remailer address in mail that leaves the site,
CVR> but a subscriber could go to the list system and look a user
CVR> up. That gets us away from the politics of the anonymous
CVR> stuff.
Hmm, maybe you're right. You've got to keep those random forwarding addresses alive for a long (configurable) time so that replies will continue to work.
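A sketch of the bookkeeping that implies, assuming an in-memory table and a single configurable lifetime. The class name, the method names, and the alias format (modeled loosely on the id13194@python.org example above) are all hypothetical; a real implementation would need persistent storage and some policy for when to renew.

```python
import secrets
import time


class ForwardingTable:
    """Map opaque forwarding addresses to real ones, with an expiry time."""

    def __init__(self, domain, lifetime_seconds):
        self.domain = domain
        self.lifetime = lifetime_seconds
        self._by_alias = {}  # alias -> (real address, expiry timestamp)
        self._by_real = {}   # real address -> most recent alias

    def _new_alias(self):
        # Keep drawing random ids until we find one not already handed out.
        while True:
            alias = "id%s@%s" % (secrets.token_hex(4), self.domain)
            if alias not in self._by_alias:
                return alias

    def alias_for(self, real_address, now=None):
        """Return a live alias for real_address, renewing or creating it."""
        now = time.time() if now is None else now
        alias = self._by_real.get(real_address)
        if alias is None or self._by_alias[alias][1] <= now:
            alias = self._new_alias()
            self._by_real[real_address] = alias
        # Each outgoing use pushes the expiry out by one full lifetime.
        self._by_alias[alias] = (real_address, now + self.lifetime)
        return alias

    def resolve(self, alias, now=None):
        """Return the real address behind alias, or None if unknown or expired."""
        now = time.time() if now is None else now
        entry = self._by_alias.get(alias)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]
```

Outgoing list mail would carry alias_for(sender) in place of the real address, and the incoming side would call resolve() on the recipient before forwarding. Replies keep working for as long as the alias stays within its lifetime.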
CVR> You have nailed it on the head. Which is why I brought it
CVR> up. Not because this is the way it has to be in the future,
CVR> but because all this is making Mailman's job a whole lot more
CVR> complex (we were whining about that at work today, or at
CVR> least I was and everyone was nodding sympathetically and
CVR> looking for an open window -- email used to be pretty easy
CVR> and straightforward. And now.....). But not just because all
CVR> this crap is getting in the way, but also that fixing this
CVR> crap is overkill for some environments, and going to be NOT
CVR> ENOUGH in others.
Exactly. Here's the trick: for those for whom it is not enough, get them to pay enough for it that it could sustain a business. That way, you keep the overkill crowd happy with the free stuff, which the super paranoid help subsidize.
CVR> Damn, that sounds good, but -- I've had to give up crab and
CVR> shellfish (I've developed an intermittent sensitivity to
CVR> it. Sigh!) and I'm staying in Cupertino where I'll be manning
CVR> the war room this week making sure buttons get pushed when
CVR> they need pushed, and not a minute before....
Ah too bad (about both!). The offer of some cold ones (of a liquid of your choice) stands if you ever make it to DC. :)
-Barry