What strategy for random accession of records in massive FASTA file?
jeff at ccvcorp.com
Thu Jan 13 19:36:21 CET 2005
Chris Lasher wrote:
>>Given that the information content is 2 bits per character
>>that is taking up 8 bits of storage, there must be a good reason
>>for storing and/or transmitting them this way? I.e., it is easy
>>to think up a count-prefixed compressed format packing 4:1 in
>>subsequent data bytes (except for the last byte, which may have
>>less than 4 2-bit codes).
> My guess for the inefficiency in storage size is because it is
> human-readable, and because most in-silico molecular biology is just a
> bunch of fancy string algorithms. This is my limited view of these
> things at least.
Yeah, that pretty much matches my guess (not that I'm involved in
anything related to computational molecular biology or genetics).
Given the current technology, the cost of the extra storage size is
presumably lower than the cost of translating into/out of a packed
format. Heck, hard drives cost less than $1/GB now.
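For concreteness, here is a minimal sketch of the count-prefixed 4:1 packing described in the quoted text. The code table, the 4-byte big-endian count prefix, and the names pack and unpack are all assumptions made for illustration, not an existing format. It also assumes the sequence contains only A, C, G and T; real FASTA records can contain N and other ambiguity codes, which a 2-bit code cannot represent.

```python
import struct

# Hypothetical 2-bit code table; any fixed assignment would work.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an ACGT-only string: 4-byte big-endian base count, then 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the unused low bits are zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Reverse pack(): read the count prefix, then decode 4 bases per byte."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how many codes in the last byte are real.
    return "".join(bases[:count])
```

Even in this toy form, every read and write goes through a per-base loop, which is the translation cost weighed against the storage savings above.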
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.
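The compression claim is easy to check empirically. The sketch below compares zlib output sizes for a sequence stored as plain text and as 2-bit packed bytes. The helper names pack_2bit and compare_sizes are made up for this example, and the sample is a uniformly random ACGT string, which is close to a worst case for a general-purpose compressor and not representative of real genomes. To test the claim properly, substitute a real sequence for the sample.

```python
import random
import zlib


def pack_2bit(seq):
    """Pack an ACGT-only string, 4 bases per byte (length assumed to be a multiple of 4)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | codes[base]
        out.append(byte)
    return bytes(out)


def compare_sizes(seq):
    """Return byte counts for the text and packed forms, raw and zlib-compressed."""
    text = seq.encode("ascii")
    packed = pack_2bit(seq)
    return {
        "raw text": len(text),
        "zlib(text)": len(zlib.compress(text, 9)),
        "packed": len(packed),
        "zlib(packed)": len(zlib.compress(packed, 9)),
    }


# Stand-in data only: a seeded random sequence, not a real genome.
random.seed(0)
sample = "".join(random.choice("ACGT") for _ in range(40000))
sizes = compare_sizes(sample)
for label, size in sizes.items():
    print(f"{label:>12}: {size} bytes")
```

The numbers depend heavily on the input: repetitive real-world sequences give a general-purpose compressor much more to work with than random data does.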