What strategy for random accession of records in massive FASTA file?

Michael Maibaum mike at maibaum.org
Fri Jan 14 05:02:11 EST 2005

On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote:
>Jeff Shannon wrote:
>>(Plus, if this format might be used for RNA sequences as well as DNA 
>>sequences, you've got at least a fifth base to represent, which means 
>>you need at least three bits per base, which means only two bases per 
>>byte (or else base-encodings split across byte-boundaries).... That gets 
>>ugly real fast.)
>Not to mention all the IUPAC symbols for incompletely specified bases 
>(e.g. R = A or G).

Or, for those of us working with proteins as well, all the single-letter codes for proteins: lots more bits.
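
To put rough numbers on the bit counts discussed above, here is a minimal sketch of the arithmetic. The alphabets are my own illustrative choices (the IUPAC set here is the 16 nucleotide letters including U and N, with no gap character, and the protein set is only the 20 standard amino acids), and bits_per_symbol is a hypothetical helper, not something from any library mentioned in this thread.

```python
# Illustrative alphabets; the exact symbol sets are assumptions.
ALPHABETS = {
    "DNA (ACGT)": "ACGT",
    "DNA + RNA (ACGTU)": "ACGTU",
    "IUPAC nucleotides": "ACGTURYSWKMBDHVN",
    "20 standard amino acids": "ACDEFGHIKLMNPQRSTVWY",
}


def bits_per_symbol(alphabet):
    """Smallest whole number of bits for a fixed-width code over alphabet."""
    n = len(set(alphabet))
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, with no float rounding.
    return max(1, (n - 1).bit_length())


for name, symbols in ALPHABETS.items():
    bits = bits_per_symbol(symbols)
    # 8 // bits counts symbols that fit in a byte without straddling a boundary.
    print(f"{name}: {len(set(symbols))} symbols, {bits} bits each, "
          f"{8 // bits} whole symbols per byte")
```

With these assumed alphabets, plain DNA fits four bases per byte, adding a fifth base or the IUPAC codes drops that to two, and a fixed-width protein code leaves room for only one residue per byte, so packing buys nothing over one character per byte there.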

I have a db with approx 3 million proteins in it and would not want to be using a pure-Python approach :)


