What strategy for random accession of records in massive FASTA file?
Neil Benn
benn at cenix-bioscience.com
Fri Jan 14 04:26:05 EST 2005
Jeff Shannon wrote:
> Chris Lasher wrote:
>
>>> And besides, for long-term archiving purposes, I'd expect that zip et
>>> al on a character-stream would provide significantly better
>>> compression than a 4:1 packed format, and that zipping the packed
>>> format wouldn't be all that much more efficient than zipping the
>>> character stream.
>>
>>
>> This 105MB FASTA file is 8.3 MB gzip-ed.
>
>
> And a 4:1 packed-format file would be ~26MB. It'd be interesting to
> see how that packed-format file would compress, but I don't care
> enough to write a script to convert the FASTA file into a
> packed-format file to experiment with... ;)
>
> Short version, then, is that yes, size concerns (such as they may be)
> are outweighed by speed and conceptual simplicity (i.e. avoiding a
> huge mess of bit-masking every time a single base needs to be
> examined, or a human-(semi-)readable display is needed).
>
> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That
> gets ugly real fast.)
>
> Jeff Shannon
> Technician/Programmer
> Credit International
>
Hello,
Just to clear up a few things on the topic:
If the file denotes DNA sequences, there are five basic identifiers:
AGCT and X (where X means 'dunno!').
If the file denotes RNA sequences, you will still only need five
basic identifiers; the issue is that the T is replaced by a U.
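To make Jeff's bit-masking point concrete, here is a tiny sketch (my
own illustration, not code from anyone in this thread): a 2-bit code
gives you four bases per byte, but every single-base lookup costs a
shift and a mask - and the fifth identifier (X) has no slot at all in
a 2-bit code.

    # Illustrative 2-bit packing for the four unambiguous bases only;
    # the fifth identifier (X) cannot be represented in 2 bits.
    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    BASES = 'ACGT'

    def pack(seq):
        """Pack an A/C/G/T string into a bytearray, 4 bases per byte."""
        packed = bytearray()
        for i in range(0, len(seq), 4):
            chunk = seq[i:i + 4]
            byte = 0
            for ch in chunk:
                byte = (byte << 2) | CODE[ch]
            byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
            packed.append(byte)
        return packed

    def get_base(packed, n):
        """Fetch base n - the shift-and-mask dance on every lookup."""
        byte = packed[n // 4]
        shift = 2 * (3 - n % 4)
        return BASES[(byte >> shift) & 0x3]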
One very good way I have found to parse large files of this nature
(I've done it with many a use case) is to write a SAX-style parser for
the file. You can then register a content handler, receive events from
the parser, and do whatever you like with them. Basically, you use the
SAX approach to read the file - if you write the parser carefully, you
stream the file and discard old lines from memory, so you have a
scalable solution (rather than keeping everything in memory). A rough
sketch of what I mean is below.
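Here is a minimal sketch in the SAX spirit (the handler interface and
the names FastaHandler/parse_fasta are my own invention, not any
standard API): the parser streams the file line by line and fires
events, so memory use stays flat no matter how big the file is.

    class FastaHandler:
        """Subclass and override these to receive parse events."""
        def start_record(self, header):
            pass
        def sequence_line(self, line):
            pass
        def end_record(self):
            pass

    def parse_fasta(path, handler):
        """Stream the file, firing events as records start and end;
        only the current line is ever held in memory."""
        in_record = False
        f = open(path)
        try:
            for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                    if in_record:
                        handler.end_record()
                    handler.start_record(line[1:])
                    in_record = True
                elif line and in_record:
                    handler.sequence_line(line)
            if in_record:
                handler.end_record()
        finally:
            f.close()

For example, a handler that just reports record lengths:

    class LengthHandler(FastaHandler):
        def start_record(self, header):
            self.header, self.length = header, 0
        def sequence_line(self, line):
            self.length += len(line)
        def end_record(self):
            print("%s: %d bases" % (self.header, self.length))

Registering a different handler (say, one that records the file offset
of each ID) gives you random access without touching the parser.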
As an aside, I would seriously consider parsing your files and
putting this information in a small local DB - it's really not much
work to do, and the 'pure' Python thing is a misnomer: whichever
persistence mechanism you use (file, DB, etching it on the floor with
a small robot accepting Logo commands, etc.) is unlikely to be pure
Python. The advantage of putting it in a DB will show up later, when
you have fast and powerful retrieval capability - see the sketch
below.
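Purely as a sketch of that suggestion - assuming SQLite through the
sqlite3 module (pysqlite on older Pythons), with illustrative table
and function names - you pay the parse cost once and get indexed
random access afterwards:

    import sqlite3

    def load_fasta(fasta_path, db_path):
        """Parse the FASTA file once, storing each record by ID."""
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS records"
                     " (id TEXT PRIMARY KEY, sequence TEXT)")
        def store(header, chunks):
            # Use the first whitespace-delimited header token as the key.
            conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)",
                         (header.split()[0], ''.join(chunks)))
        header, chunks = None, []
        for line in open(fasta_path):
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    store(header, chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            store(header, chunks)
        conn.commit()
        return conn

    def get_sequence(conn, record_id):
        """Random access to any record is now one indexed lookup."""
        row = conn.execute("SELECT sequence FROM records WHERE id = ?",
                           (record_id,)).fetchone()
        return row and row[0]

Once load_fasta() has run, get_sequence(conn, '<some id>') pulls any
record straight from the index instead of rescanning the 105MB file.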
Cheers,
Neil
--
Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany
Tel : +49 (0)351 4173 154
e-mail : benn at cenix-bioscience.com
Cenix Website : http://www.cenix-bioscience.com