Re: [Mailman-Developers] Requirements for a new archiver

On Thu, 30 Oct 2003 04:45:37 +0100 Brad Knowles <brad.knowles@skynet.be> wrote:
At 10:27 PM -0500 2003/10/29, J C Lawrence wrote:
Actually the two cases are considerably different. In the delete case I have to do pool management, with some eye toward fragmentation control and optimisations of average latency for free heap searches, as well as heap integrity audits. In the write-only case I just build on the end and need pay no mind to prior data once it is allocated.
Not really. You still have to maintain all the indexes, make sure that if things get moved around that all the links get updated, etc....
With a write-once system you don't actually need to ever move anything. At its core it is: Open one file, repetitively append to end until file size exceeds size N, create new file, repeat. You can do object size clustering across files or other optimisation techniques, but the basic pattern remains the same. For the few cases you have to support delete you either just NULL the byte stream for the pointed-to object, or you invalidate the key. As the frequency and number of such deletes is infinitesimal, they require no special management complexity. You can afford to just swallow the lost free space as the cost of attempting to manage it is simply never rewarded.
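The pattern described above (append until a size limit, roll to a new file, invalidate keys rather than reclaim space) can be sketched roughly like this. All names here (`AppendOnlyStore`, `SEGMENT_LIMIT`, the segment file naming) are illustrative assumptions, not from any actual archiver:

```python
# Minimal sketch of a write-once, append-only store: one segment file is
# appended to until it exceeds SEGMENT_LIMIT bytes, then a new segment is
# started. A delete only drops the key from the index; the stored bytes
# are never moved or reclaimed.
import os

SEGMENT_LIMIT = 64 * 1024 * 1024  # hypothetical 64 MiB rollover size


class AppendOnlyStore:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self.index = {}    # key -> (segment number, offset, length)
        self.segment = 0
        self.offset = 0

    def _segment_path(self, n):
        return os.path.join(self.directory, "segment.%06d" % n)

    def append(self, key, data):
        # Roll over to a fresh file once the current one exceeds the limit.
        if self.offset >= SEGMENT_LIMIT:
            self.segment += 1
            self.offset = 0
        with open(self._segment_path(self.segment), "ab") as f:
            f.write(data)
        self.index[key] = (self.segment, self.offset, len(data))
        self.offset += len(data)

    def read(self, key):
        seg, off, length = self.index[key]  # KeyError if key was invalidated
        with open(self._segment_path(seg), "rb") as f:
            f.seek(off)
            return f.read(length)

    def delete(self, key):
        # "Invalidate the key": drop the index entry; the bytes stay put
        # and the lost space is simply swallowed, not managed.
        self.index.pop(key, None)
```

Object-size clustering or per-segment indexes could be layered on top, but the core write path stays a blind append.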
True, you don't have to worry about fragmentation control or other more complex aspects of heap management, but that's a further cost savings over other techniques and not a "drawback" to using this technique for this purpose.
True. I'm not labeling it a drawback, just a boon of dubious advantage.
Now, if you want to consider what would happen to you if the Scientologists ever came after you, or if you had court orders to remove postings that linked to bomb-making instructions, you'd probably want to keep all those other tools related to heap management around anyway.
Not really. The percentage of such deleted posts over the lifetime of the store can be generally assumed to be less than 1 in 10^5, and is probably considerably lower, if not in the 1:10^8 range. Add a simple invalid key semantic and you're done.
Caveat: Continual addition and deletion of SPAM from an archive would change this balance.
They'd be less likely to be used, but at least you wouldn't have to take the entire site down while you went and wrote the tools from scratch to handle a situation that you had not foreseen.
You're going to need tools when the percentage of such deleted postings is sufficiently high that the cost of the lost free space and its overhead exceeds the cost of managing that free space. That's not a quick thing.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

At 11:01 PM -0500 2003/10/29, J C Lawrence wrote:
With a write-once system you don't actually need to ever move anything.
Depends on how you manage the storage of those large files. If you have an infinitely large filesystem that is guaranteed 100% reliable in all possible circumstances, you're right. Otherwise, you might find that the filesystem is getting full and things need to be moved around, or you suffer a disk or storage system crash and you have to restore from backups, or you use an HSM solution to move older files to slower/higher capacity storage, or you have issues with too many large files in a single directory and need to implement your own directory hashing scheme, etc....
Not really. The percentage of such deleted posts over the lifetime of the store can be generally assumed to be less than 1 in 10^5, and is probably considerably lower, if not in the 1:10^8 range. Add a simple invalid key semantic and you're done.
It depends on whether or not the court order allows you to just mark things as "deleted" and be done with it. If they force you to actually expunge all copies of that data from your systems, you will have to do more work.
You're going to need tools when the percentage of such deleted postings is sufficiently high that the cost of the lost free space and its overhead exceeds the cost of managing that free space. That's not a quick thing.
True enough, but as you've pointed out, there have been a number of implementations of this sort of solution, and you've worked on at least a couple yourself. These sorts of tools should already be reasonably well understood and not too difficult to write or "borrow" from other sources.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)