What strategy for random accession of records in massive FASTA file?
steve at holdenweb.com
Fri Jan 14 23:48:43 CET 2005
Jeff Shannon wrote:
> Chris Lasher wrote:
>>> And besides, for long-term archiving purposes, I'd expect that zip et
>>> al on a character-stream would provide significantly better
>>> compression than a 4:1 packed format, and that zipping the packed
>>> format wouldn't be all that much more efficient than zipping the
>>> character stream.
>> This 105MB FASTA file is 8.3 MB gzip-ed.
> And a 4:1 packed-format file would be ~26MB. It'd be interesting to see
> how that packed-format file would compress, but I don't care enough to
> write a script to convert the FASTA file into a packed-format file to
> experiment with... ;)
If your compression algorithm's any good then both files, when compressed,
should come out approximately equal in size, since the compressed size is
determined by the information content rather than the representation.
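
For anyone who does care enough to try the experiment, here is a minimal
sketch. It assumes a synthetic, uniformly random ACGT sequence rather than
the real 105MB file, and uses zlib as a stand-in for gzip; pack_2bit is a
hypothetical helper, not something from this thread. Random data is the
worst case for any compressor, so real genomic data with its repeats will
give different numbers, and deflate only approximates the ideal.

```python
import random
import zlib

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack an ACGT string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so the first base sits in the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

random.seed(0)  # fixed seed so runs are repeatable
seq = "".join(random.choices("ACGT", k=400_000))

text = seq.encode("ascii")   # one byte per base, as in a FASTA body
packed = pack_2bit(seq)      # 4:1 packed representation

text_z = zlib.compress(text, 9)
packed_z = zlib.compress(packed, 9)

print("text:  ", len(text), "->", len(text_z))
print("packed:", len(packed), "->", len(packed_z))
```

On random input the packed form is already at about 2 bits per base, so
zlib cannot shrink it further, while the character stream should compress
to something in the same neighbourhood. How close the two land on a real
FASTA file is exactly what the experiment would tell you.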
> Short version, then, is that yes, size concerns (such as they may be)
> are outweighed by speed and conceptual simplicity (i.e. avoiding a huge
> mess of bit-masking every time a single base needs to be examined, or a
> human-(semi-)readable display is needed).
> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That gets
> ugly real fast.)
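
To make the bit-masking cost concrete, here is a rough sketch of what
reading one base out of a 4:1 packed buffer could look like. The names
CODES, pack_2bit and base_at are hypothetical, and the layout (first base
in the two high bits of each byte) is an assumption, not a standard format.

```python
CODES = "ACGT"  # assumed 2-bit encoding: A=0, C=1, G=2, T=3

def pack_2bit(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES.index(base)
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)

def base_at(packed, i):
    """Return base number i (0-based) from a packed buffer."""
    byte = packed[i // 4]        # which byte holds the base
    shift = 6 - 2 * (i % 4)      # how far its two bits sit from the low end
    return CODES[(byte >> shift) & 0b11]

seq = "GATTACA"
packed = pack_2bit(seq)
# The padding bits in the last byte decode as "A", so the true sequence
# length has to be stored separately from the packed bytes.
unpacked = "".join(base_at(packed, i) for i in range(len(seq)))
print(unpacked)
```

With plain text the same lookup is just seq[i]. A 3-bit code for five or
more symbols is worse still: 8 is not a multiple of 3, so either two bits
per byte are wasted or some bases straddle a byte boundary.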
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119