What strategy for random accession of records in massive FASTA file?

John Machin sjmachin at lexicon.net
Wed Jan 12 19:34:04 EST 2005


Chris Lasher wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard
> format for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
[snip]
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming
> and algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for
> help once again. :-) If you could help me figure out how to code a
> solution that won't be a resource whore, I'd be _very_ grateful. (I'd
> prefer to keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)
> Thanks very much in advance,
> Chris

Before you get too carried away: how often do you want to do this, and
how grunty is the box you will be running on? Will the data be on a
server? If the server is on a WAN or at the other end of a radio link
between buildings, you definitely need an index so that you can access
the data randomly!

By way of example, to read all of a 157MB file into memory from a local
(i.e. not networked) disk using readlines() takes less than 4 seconds
on a 1.4GHz Athlon processor (see the timings below). The average new
corporate desktop box is about twice as fast as that. Note that Windows
Task Manager showed 100% CPU utilisation for both read() and
readlines().
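
In fact, at that size you could skip the index entirely and hold the
whole lot in a dict keyed by record name. An untested sketch (the
function and variable names are my own invention, not anyone's API):

def load_fasta(filename):
    # Build a dict mapping record name -> sequence, whole file in RAM.
    records = {}
    name = None
    chunks = []
    for line in file(filename):
        line = line.rstrip()
        if line.startswith('>'):
            if name is not None:
                records[name] = ''.join(chunks)
            name = line[1:]      # header text without the '>'
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = ''.join(chunks)
    return records

records = load_fasta('big.fasta')
print records['CW127_A01']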

My guess is that you don't need anything much fancier than the effbot's
index method -- which by now you have probably found works straight out
of the box and is more than fast enough for your needs.
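
The gist of that index method, as I understand it, is one pass over the
file recording the byte offset of each '>' line, then a seek() per
lookup. A rough, untested sketch of the idea (my own paraphrase and
names, not the effbot's actual code):

def build_index(f):
    # One pass over the file, noting where each '>' header starts.
    index = {}
    while 1:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith('>'):
            index[line[1:].strip()] = offset
    return index

def get_record(f, index, name):
    # Seek straight to the record and read until the next header.
    f.seek(index[name])
    lines = [f.readline()]       # the '>' header line itself
    while 1:
        line = f.readline()
        if not line or line.startswith('>'):
            break
        lines.append(line)
    return ''.join(lines)

f = file('big.fasta', 'rb')      # binary mode, so offsets are honest
index = build_index(f)
print get_record(f, index, 'CW127_A01')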

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good free
software that is freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop



