On 11 Nov 2002 at 21:16, Robert Woodhead wrote:

> My hunch, based on things I've done in the past, is that as the total 
> volume of mail increases, the rate of increase in the number of 

> analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 
> 300,000 distinct tokens (including the aforementioned gibberish).

My training/testing set of ... 13,000 messages resulted in pickles with 320,000 words.
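
For anyone who wants to check how the distinct-token count grows on their own corpus, here is a minimal sketch. It is not the project's tokenizer: the function name vocabulary_growth and the lowercase-plus-whitespace split are assumptions made for illustration, and the sample messages are made up.

```python
def vocabulary_growth(messages):
    """Return the running count of distinct tokens after each message."""
    seen = set()
    growth = []
    for text in messages:
        # Assumption: lowercasing and splitting on whitespace stands in
        # for a real tokenizer, which would produce different counts.
        seen.update(text.lower().split())
        growth.append(len(seen))
    return growth


# Made-up sample messages, only to show the shape of the result.
sample = [
    "cheap pills buy now",
    "meeting moved to monday",
    "buy cheap pills now now",
]
print(vocabulary_growth(sample))
```

Plotting the returned list against the message index would show whether the number of new tokens per message tapers off as volume increases.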

