Suggestions for handling archive growth

Hi Folks,
As our archives approach a terabyte in size, I was wondering if anyone had suggestions or tips for handling archive growth and storage. I've got some ideas, but am wondering what others might be doing. Just as background, we have a few thousand lists, and support a mid-sized university population, with list creation open to faculty, staff, and students.
If there is a better place to ask this question, please point me there.
Thanks!
Gretchen Beck
Carnegie Mellon

On 4/14/2016 10:36 AM, Gretchen R Beck wrote:
As our archives approach a terabyte in size, I was wondering if anyone had suggestions or tips for handling archive growth and storage. [...]
My first thoughts involve purging old/unused lists, message retention policies, whether attachments have been scrubbed - that sort of thing. Are there single lists that have -huge- archives, or are there just a lot of lists with small archives?
For instance, if none of a list's members are still active accounts, does it make sense to keep the archive online, or to retain it at all?
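If it helps to answer the huge-vs-many question, here's a rough sketch, assuming the stock Mailman 2 layout with one directory per list under archives/private/ (ARCHIVE_ROOT is an assumption; adjust it for your install), that totals each list's archive and prints the largest first:

    #!/usr/bin/env python
    # Rough per-list archive sizes, largest first. Both the HTML
    # tree (LISTNAME) and the raw mailbox (LISTNAME.mbox) show up
    # as separate entries, which is useful for this question.
    import os

    ARCHIVE_ROOT = '/var/lib/mailman/archives/private'  # assumption: adjust

    def tree_size(path):
        total = 0
        for dirpath, dirnames, filenames in os.walk(path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # broken symlink or file removed mid-walk
        return total

    sizes = [(tree_size(os.path.join(ARCHIVE_ROOT, d)), d)
             for d in os.listdir(ARCHIVE_ROOT)
             if os.path.isdir(os.path.join(ARCHIVE_ROOT, d))]

    for size, name in sorted(sizes, reverse=True)[:25]:
        print('%8d MB  %s' % (size // (1024 * 1024), name))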
z!

On 04/14/2016 10:36 AM, Gretchen R Beck wrote:
As our archives approach a terabyte in size, I was wondering if anyone had suggestions or tips for handling archive growth and storage. [...]
It won't help a lot, but you can remove all the periodic .txt.gz files and drop the cron/nightly_gzip job from Mailman's crontab. While the .txt.gz files conceivably save bandwidth when they are downloaded, they serve no other useful purpose, and the .txt files they are generated from are all still there.
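If you want to script that cleanup, here's a minimal sketch, assuming the usual pipermail layout under archives/private/ (ARCHIVE_ROOT is an assumption; adjust it for your install):

    #!/usr/bin/env python
    # Delete the periodic .txt.gz files from every list's pipermail
    # archive. The .txt files they were gzipped from remain in place,
    # so nothing is lost; remember to also drop cron/nightly_gzip
    # from Mailman's crontab or they will be regenerated.
    import os

    ARCHIVE_ROOT = '/var/lib/mailman/archives/private'  # assumption: adjust

    for dirpath, dirnames, filenames in os.walk(ARCHIVE_ROOT):
        for name in filenames:
            if name.endswith('.txt.gz'):
                path = os.path.join(dirpath, name)
                print('removing %s' % path)
                os.remove(path)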
If you want to 'prune' older messages from the archives, there is a script at <https://www.msapiro.net/scripts/prune_arch> (mirrored at https://fog.ccsf.edu/~msapiro/scripts/prune_arch) that can help with that.
Depending on list configuration, but with the normal defaults, there will be two copies of each scrubbed attachment in the archives/private/LISTNAME/attachments/ directory. This is because when scrub_nondigest is No and the list is digestable, the non-plain-text attachments are scrubbed both from the archived message and from the plain text digest. After a while, the copies whose links were only in a plain text digest are probably not needed any more, since few, if any, copies of the original digests still exist, and the attachment can always be found via the archive link.
The trick here is to identify which attachments were scrubbed from a digest and can therefore be removed.
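One unofficial way to approximate that: an attachment that some archive HTML page links to was scrubbed from the archived message, while one that no page references was presumably scrubbed only into a plain text digest. A rough sketch along those lines (heuristic only; the paths assume the usual pipermail layout, LISTNAME is a placeholder, and you should verify against a copy before deleting anything):

    #!/usr/bin/env python
    # For one list, report attachment files that no archive HTML page
    # links to -- presumably the copies scrubbed from digests.
    import os

    LIST_DIR = '/var/lib/mailman/archives/private/LISTNAME'  # placeholder
    ATTACH_DIR = os.path.join(LIST_DIR, 'attachments')

    # Collect every attachments/ path mentioned in the archive's HTML.
    referenced = set()
    for dirpath, dirnames, filenames in os.walk(LIST_DIR):
        if dirpath.startswith(ATTACH_DIR):
            continue
        for name in filenames:
            if not name.endswith('.html'):
                continue
            with open(os.path.join(dirpath, name), 'rb') as f:
                text = f.read().decode('latin-1')  # decode permissively
            for frag in text.split('attachments/')[1:]:
                for delim in ('"', "'", '<', '>', ' ', '\n', '\r', '\t'):
                    frag = frag.split(delim)[0]  # trim to the bare path
                referenced.add(frag)

    # Report anything under attachments/ that no HTML page mentions.
    for dirpath, dirnames, filenames in os.walk(ATTACH_DIR):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.relpath(path, ATTACH_DIR) not in referenced:
                print('unreferenced: %s' % path)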
On the other hand, these days you can buy a couple of terabytes' worth of HDD for $100 US, so maybe that's an easier way to go.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
participants (3)
- Carl Zwanzig
- Gretchen R Beck
- Mark Sapiro