What strategy for random accession of records in massive FASTA file?

Thu Jan 13 13:36:21 EST 2005

Chris Lasher wrote:

>>Given that the information content is 2 bits per character
>>that is taking up 8 bits of storage, there must be a good reason
>>for storing and/or transmitting them this way? I.e., it is easy
>>to think up a count-prefixed compressed format packing 4:1 in
>>subsequent data bytes (except for the last byte, which may have
>>fewer than four 2-bit codes).
> 
> My guess for the inefficiency in storage size is because it is
> human-readable, and because most in-silico molecular biology is just a
> bunch of fancy string algorithms. This is my limited view of these
> things at least.

Yeah, that pretty much matches my guess (not that I'm involved in 
anything related to computational molecular biology or genetics). 
Given the current technology, the cost of the extra storage size is 
presumably lower than the cost of translating into/out of a packed 
format.  Heck, hard drives cost less than $1/GB now.
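
For reference, the count-prefixed 4:1 packing described in the quoted 
text might look roughly like this. It is only a sketch under some 
assumptions of mine: the sequence contains nothing but A, C, G and T 
(real FASTA data also has N and other ambiguity codes, which would not 
fit in 2 bits), the count is stored as a 4-byte big-endian integer, and 
the names pack_dna and unpack_dna are made up for illustration.

```python
import struct

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_dna(seq):
    """Pack an A/C/G/T string as a 4-byte base count plus 2 bits per base."""
    out = bytearray(struct.pack(">I", len(seq)))  # count prefix
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_dna(data):
    """Reverse pack_dna; the count prefix says how many codes are real."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])  # drop padding codes from the last byte
```

Every read and write goes through loops like these, which is the 
translation cost weighed against the storage saving.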

And besides, for long-term archiving purposes, I'd expect that zip et 
al on a character-stream would provide significantly better 
compression than a 4:1 packed format, and that zipping the packed 
format wouldn't be all that much more efficient than zipping the 
character stream.
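
That expectation is easy enough to test rather than guess at. The 
following is a rough sketch, assuming an A/C/G/T-only sequence with 
headers and newlines already stripped; pack_2bit and compare_sizes are 
hypothetical helper names, and zlib stands in for "zip et al".

```python
import zlib

def pack_2bit(seq):
    """Pack A/C/G/T into 2 bits per base (no count prefix; size test only)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | codes[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)

def compare_sizes(seq):
    """Return byte counts for the four storage options being discussed."""
    text = seq.encode("ascii")
    packed = pack_2bit(seq)
    return {
        "text": len(text),
        "packed": len(packed),
        "zlib(text)": len(zlib.compress(text, 9)),
        "zlib(packed)": len(zlib.compress(packed, 9)),
    }
```

The numbers will depend heavily on how repetitive the sequence is, so 
this is best run on a sample of the real data before drawing any 
conclusion.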

Jeff Shannon
Technician/Programmer
Credit International