[Mailman-Developers] Proposed: remove address-obfuscation code from Mailman 3
rsk at gsp.org
Thu Aug 27 15:58:17 CEST 2009
On Wed, Aug 26, 2009 at 10:57:06AM +0100, Ian Eiloart wrote:
> There's recently published research which suggests that simple
> obfuscation can be effective. Concealment, presumably, is more effective.
> At <http://www.ceas.cc/> you can download "Spamology: A Study of Spam
> Origins" <http://www.ceas.cc/papers-2009/ceas2009-paper-18.pdf>
I'm composing a combined reply to all of the comments here, but wish to
reply to this single point separately.
This paper seems well-intentioned, but has some very serious problems --
any one of which is sufficient to dismiss its conclusions entirely.
Let me just enumerate a few of them; I'll spare you the entire list.
1. The authors presume that they can tell that an address has been
harvested *and* added to at least one spammer database (or not) by
observing spam sent to it. But that's wrong: we know that many addresses
are harvested and never spammed, or not spammed for a very long time
(as in "years"). Conversely, many addresses are spammed that have
*never* been harvested. And some addresses that are harvested are
spammed, but not because they were harvested.  And some addresses
are picked up by routine/ordinary web crawlers, and then subsequently
spammed, but not by the people running those crawlers. 
This invalidates their measurement technique.
2. There's a major methodology error here:
"We began by registering a dedicated domain for this project,
which we hosted on servers in our department."
We know that some spammers -- the competent ones, who are the ones
that matter -- use suppression lists based not just on domains, but
TLDs, IP addresses, network allocations, ASNs, NS records, MX records,
etc. We further know that anything tracing to a .edu or a network
allocation/ASN associated with a .edu is quite likely to appear on those
suppression lists. (This is an "old tradition" among spammers. Not all
of them follow it, but quite a few do.)
This also invalidates their measurement technique.
3. Statistics from any single domain are often wildly skewed one
way or another. For example: I happen to host three domains which
have the same name, but in three different TLDs. Everything else about
them is exactly the same: NS, MX, web content, valid email addresses,
etc. The spam they receive varies over three orders of magnitude.
4. And then there's this: it doesn't cover use of the single
largest current vector for address harvesting -- zombied systems.
No discussion of contemporary address harvesting techniques can even
be begun without considering this. It's like writing a paper on tides
without factoring in the moon's gravitation. 
(I checked to see if perhaps this paper's publication predated the
rise of the zombies earlier this decade, but it's from 2009.)
To put it another way: yes, there are still address harvesters using
the techniques that these researchers were looking for. But these
harvesters are outdated and unimportant; they're only used by spammers
who don't have the expertise and resources to do better. And not only
is that class of spammer steadily shrinking, that's NOT the class
of spammer we need to worry about, as it's quite easy to block just
about all their traffic whether they have valid addresses or not.
(C'mon, these are people who can't decode rskATgsp.org, do you really
think they constitute a serious threat?)
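To show how little work that decoding takes, here is a minimal Python
sketch. The pattern and the function name are my own illustration, not
taken from any real harvester, and it will misfire on ordinary prose
such as "look at example.com".

```python
import re

# Hypothetical pattern covering a few common "at"-style obfuscations:
# "rsk at gsp.org", "rskATgsp.org", "rsk [at] gsp.org", "rsk (at) gsp.org".
OBFUSCATED = re.compile(
    r"([A-Za-z0-9._%+-]+?)"                       # local part (LHS), shortest match
    r"(?:\s+(?:at|AT)\s+|\s*[\[(]at[\])]\s*|AT)"  # the disguised "@"
    r"([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"        # domain with at least one dot
)

def deobfuscate(text):
    """Rewrite obfuscated addresses in text back to local@domain form."""
    return OBFUSCATED.sub(r"\1@\2", text)

print(deobfuscate("rskATgsp.org"))
print(deobfuscate("Contact rsk at gsp.org today"))
```

A dozen lines undo the obfuscation, which is the point: it only stops
harvesters whose operators can't or won't write them.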
So like I said above, I'll spare you points 5-N, but they're similar.
None of what I've said here is new or novel: it's common knowledge among
experienced people working in the field. I think perhaps in the future
people trying to conduct this kind of research should spend a few
years reading spam-l and other similar lists before diving in.
The bottom line is that (a) the numbers they've produced have no meaning
and (b) their conclusions are all wrong.
 As an example: consider joe at example.com, and let's suppose that
it's been deliberately exposed to one method of harvesting because
it's published at http://www.example.com. If spam arrives, then
it may be because the address was harvested by a web crawler and
added to a spammer database -- or it may be because "joe" is a very
common LHS string and thus one that spammers are very likely to try
in *any* domain. Note that while spammers' lists of such likely LHS
strings were quite limited years ago, they're not any more: spammers now
have the resources to try all known and all plausible LHS strings if
they wish. And they are: check your logs. You may be surprised at
which LHS strings are being tried: what was computationally infeasible
a decade ago is now routine.
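One rough way to do that log check, sketched in Python. It assumes
Postfix-style reject lines that contain "User unknown" and a
to=<local@domain> field; other MTAs log differently, so the marker and
the pattern would need adjusting. The sample lines are made up for
illustration, and count_rejected_lhs is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed recipient field format: to=<local@domain>
RCPT = re.compile(r"to=<([^@<>\s]+)@([^<>\s]+)>")

def count_rejected_lhs(lines, marker="User unknown"):
    """Count the LHS strings seen in rejected-recipient log lines."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue  # not an unknown-recipient rejection
        match = RCPT.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

# Made-up sample data in the assumed format.
TEMPLATE = ("Aug 27 03:12:01 mx postfix/smtpd[111]: NOQUEUE: reject: "
            "RCPT from unknown[192.0.2.10]: 550 5.1.1 <{rcpt}>: "
            "Recipient address rejected: User unknown in local recipient "
            "table; from=<x@example.net> to=<{rcpt}> proto=ESMTP")
sample = [TEMPLATE.format(rcpt=r) for r in
          ("joe@example.com", "Joe@example.com", "zzyzx1987@example.com")]
sample.append("Aug 27 03:12:05 mx postfix/qmgr[222]: 4F2A1: removed")

for lhs, n in count_rejected_lhs(sample).most_common():
    print(lhs, n)
```

LHS strings that were never valid or published anywhere, yet show up
in the counts, are the guessed ones.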
 It's not difficult to figure out who's running a web crawler:
just setting up a web site, making sure it's linked to, waiting,
and then analyzing logs will reveal a candidate list. It's somewhat
more work to figure out which of those crawler operations can be
broken into, but it has significant advantages: it allows one to
mine all their data without the expense/hassle of collecting it,
and it conceals the source/use of that data.
There are a lot of crawler operations out there. It would be silly
to think that they're *all* secure.
 Harvesting addresses on zombies has quite a few advantages over
other techniques: It uses the host's own resources. It's unlikely
to be detected. It won't be stopped by firewalls or rate-limiting
at the network level. It provides social graph information. It provides
timestamp information. It provides MUA information. It may yield
useful phishing information. It may yield useful identity theft
information. It may yield useful blackmail information. And all of
this can be bundled up by suitable extraction software and delivered
as a package back to a C&C node.
For example, from a single email message sitting on Fred's computer:
Fred last received email from Barney at 2009-08-11 07:32:12 UTC,
thus Barney's address is known-good as of that time, Fred will
probably accept suitably-forged mail from Barney, and vice-versa.
And of course since Fred's computer is now owned by spammers,
no anti-forgery mechanism of any kind will detect the latter.
And maybe an appropriate malware payload from Fred to Barney
will yield another zombie, where "appropriate" may be partially
inferred by checking the headers and seeing what MUA Barney
is using. Maybe those headers will also identify what MTA
and associated anti-malware software Barney's site is using, so
that the payload can be appropriately chosen. Phishing bonus if
Barney's address is barney at some-bank or similar. Blackmail bonus
if Barney's address is on an "adult dating" or "escort" site.
Identity theft bonus if regexp matching on message-body turns
up NNN-NN-NNNN (US social security number) and the like. &etc.
Now multiply this by a billion. At least -- because there are at least
a hundred million zombies and estimating only 10 stored messages per zombie
gets us to a billion. This is why the serious/"professional" address
harvesting operations have shifted from some of the older and less
efficient techniques to this one, and why defending against those methods
is now pointless.