What strategy for random accession of records in massive FASTA file?

Neil Benn benn at cenix-bioscience.com
Fri Jan 14 10:26:05 CET 2005

Jeff Shannon wrote:

> Chris Lasher wrote:
>>> And besides, for long-term archiving purposes, I'd expect that zip et
>>> al on a character-stream would provide significantly better
>>> compression than a 4:1 packed format, and that zipping the packed
>>> format wouldn't be all that much more efficient than zipping the
>>> character stream.
>> This 105MB FASTA file is 8.3 MB gzip-ed.
> And a 4:1 packed-format file would be ~26MB.  It'd be interesting to 
> see how that packed-format file would compress, but I don't care 
> enough to write a script to convert the FASTA file into a 
> packed-format file to experiment with... ;)
> Short version, then, is that yes, size concerns (such as they may be) 
> are outweighed by speed and conceptual simplicity (i.e. avoiding a 
> huge mess of bit-masking every time a single base needs to be 
> examined, or a human-(semi-)readable display is needed).
> (Plus, if this format might be used for RNA sequences as well as DNA 
> sequences, you've got at least a fifth base to represent, which means 
> you need at least three bits per base, which means only two bases per 
> byte (or else base-encodings split across byte-boundaries).... That 
> gets ugly real fast.)
> Jeff Shannon
> Technician/Programmer
> Credit International
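The packed-format compression question quoted above is easy to test empirically. Here is a minimal sketch (the `pack` function is illustrative only: ACGT-only, no ambiguity codes, and the trailing partial byte is not padded, so the sequence length must be stored separately):

```python
import gzip
import random

# Pack an ACGT-only sequence at four bases per byte (2 bits each).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))
packed = pack(seq)

print(len(seq), len(packed))   # raw 4:1 ratio, as discussed above
print(len(gzip.compress(seq.encode())), len(gzip.compress(packed)))
```

On random sequence the packed form barely compresses further (the redundancy is already squeezed out), which is the intuition behind the quoted argument; real genomic data has more structure, so results will vary.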

             Just to clear up a few things on the topic:

    If the file denotes DNA sequences, there are five basic identifiers: 
A, G, C, T and X (where X means 'dunno!').

    If the file denotes RNA sequences, you will still only need five 
basic identifiers; the issue is that the T is replaced by a U.

    One very good way I have found to parse large files of this nature 
(I've done it in many a use case) is to write a SAX-style parser for the 
file.  You register a content handler, receive events from the parser, 
and do whatever you like with them.  If you write the parser carefully, 
you stream the file and discard old lines from memory, so you have a 
scalable solution rather than keeping everything in memory.
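The event-driven approach above might look like this in Python. The names (`FastaHandler`, `parse_fasta`, `LengthCounter`) are illustrative, not a standard API; the point is that only the current record is ever in memory:

```python
class FastaHandler:
    """Content handler; override these callbacks to do whatever you like."""
    def start_record(self, header): pass
    def sequence_line(self, line): pass
    def end_record(self): pass

def parse_fasta(path, handler):
    """Stream a FASTA file line by line, firing events at the handler."""
    in_record = False
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):       # new record header
                if in_record:
                    handler.end_record()
                handler.start_record(line[1:])
                in_record = True
            elif line:                     # sequence data line
                handler.sequence_line(line)
        if in_record:
            handler.end_record()

class LengthCounter(FastaHandler):
    """Example handler: tally each record's sequence length."""
    def __init__(self):
        self.lengths = {}
    def start_record(self, header):
        self._id, self._n = header.split()[0], 0
    def sequence_line(self, line):
        self._n += len(line)
    def end_record(self):
        self.lengths[self._id] = self._n
```

Usage: `counter = LengthCounter(); parse_fasta("big.fasta", counter)`, then read `counter.lengths`. Memory use stays flat no matter how large the file is.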

    As an aside, I would seriously consider parsing your files and 
putting this information in a small local DB - it's really not much work 
to do, and the 'pure' Python concern is a red herring: whichever 
persistence mechanism you use (file, DB, etching it on the floor with a 
small robot accepting Logo commands, etc.) is unlikely to be pure Python.

    The advantage of putting it in a DB will show up later, when you 
need fast and powerful retrieval.
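A minimal sketch of that idea with Python's bundled sqlite3 module (table and column names are made up for the example; use a filename instead of ":memory:" for a persistent archive):

```python
import sqlite3

# Load parsed records into a local SQLite DB so later lookups by id
# are a single indexed query instead of a scan over a 105 MB file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (id TEXT PRIMARY KEY, descr TEXT, seq TEXT)")

records = [("seq1", "example record", "ACGTACGT"),
           ("seq2", "another record", "TTTT")]
conn.executemany("INSERT INTO seqs VALUES (?, ?, ?)", records)
conn.commit()

# Random access by record id, backed by the primary-key index.
row = conn.execute("SELECT seq FROM seqs WHERE id = ?", ("seq1",)).fetchone()
print(row[0])
```

In practice you would feed the INSERTs from the streaming parser rather than from an in-memory list, so the load step is also scalable.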




Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47

Tel : +49 (0)351 4173 154
e-mail : benn at cenix-bioscience.com
Cenix Website : http://www.cenix-bioscience.com
