What strategy for random accession of records in massive FASTA file?

Michael Hoffman cam.ac.uk at mh391.invalid
Sat Jan 15 12:52:11 EST 2005


Chris Lasher wrote:

> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order.

I just came across this thread today, and I don't understand why you are
trying to reinvent the wheel instead of using Biopython, which already
has a solution to this problem, among others.
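
For anyone who wants to see what the wheel looks like anyway: the usual
approach is to scan the file once, remember the byte offset of each
defline, and seek() back to it later. What follows is only a minimal
Python 3 sketch of that idea, not what Biopython or formatdb do
internally. The names build_offset_index and fetch_record are made up
for illustration, and the record ID is assumed to be the first word of
the defline.

```python
import io


def build_offset_index(handle):
    """Map each record ID (first word of the defline) to its byte offset."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            words = line[1:].split()
            if words:
                index[words[0].decode("ascii")] = offset
    return index


def fetch_record(handle, index, record_id):
    """Seek to a record and return (defline, sequence) as text."""
    handle.seek(index[record_id])
    defline = handle.readline()[1:].rstrip().decode("ascii")
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record starts here
        chunks.append(line.strip().decode("ascii"))
    return defline, "".join(chunks)


# Demo on an in-memory file; a real file would be opened with open(path, "rb").
demo = io.BytesIO(b">seq1 first\nACGT\nTTAA\n>seq2 second\nGGCC\n")
demo_index = build_offset_index(demo)
print(fetch_record(demo, demo_index, "seq2"))  # prints ('seq2 second', 'GGCC')
```

The file is read in binary mode so that tell() and seek() deal in plain
byte offsets. For a 100+ MB file the index stays small, since it holds
one integer per record rather than the sequences themselves.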

But actually, I usually use formatdb, which comes with NCBI-BLAST, to
create blastdb files that can also be used for BLAST. Individual records
are then fetched by ID with fastacmd, which I wrap in a small module:

mh5 at ecs4a /data/blastdb/Users/mh5
$ python
Python 2.3.3 (#1, Jan 20 2004, 17:39:36) [C] on osf1V5
Type "help", "copyright", "credits" or "license" for more information.
>>> import blastdb
>>> from tools2 import LightIterator
>>> temp_file = blastdb.Database("mammals.peptides.faa").fetch_to_tempfile("004/04/m00404.peptide.faa")
>>> LightIterator(temp_file).next()
('lcl|004/04/m00404.peptide.faa ENSMUSG00000022297 peptide', 'MERSPFLLACILLPLVRGHSLFTCEPITVPRCMKMTYNMTFFPNLMGHYDQGIAAVEMGHFLHLANLECSPNIEMFLCQAFIPTCTEQIHVVLPCRKLCEKIVSDCKKLMDTFGIRWPEELECNRLPHCDDTVPVTSHPHTELSGPQKKSDQVPRDIGFWCPKHLRTSGDQGYRFLGIEQCAPPCPNMYFKSDELDFAKSFIGIVSIFCLCATLFTFLTFLIDVRRFRYPERPIIYYSVCYSIVSLMYFVGFLLGNSTACNKADEKLELGDTVVLGSKNKACSVVFMFLYFFTMAGTVWWVILTITWFLAAGRKWSCEAIEQKAVWFHAVAWGAPGFLTVMLLAMNKVEGDNISGVCFVGLYDLDASRYFVLLPLCLCVFVGLSLLLAGIISLNHVRQVIQHDGRNQEKLKKFMIRIGVFSGLYLVPLVTLLGCYVYELVNRITWEMTWFSDHCHQYRIPCPYQANPKARPELALFMIKYLMTLIVGISAVFWVGSKKTCTEWAGFFKRNRKRDPISESRRVLQESCEFFLKHNSKVKHKKKHGAPGPHRLKVISKSMGTSTGATTNHGTSAMAIADHDYLGQETSTEVHTSPEASVKEGRADRANTPSAKDRDCGESAGPSSKLSGNRNGRESRAGGLKERSNGSEGAPSEGRVSPKSSVPETGLIDCSTSQAASSPEPTSLKGSTSLPVHSASRARKEQGAGSHSDA')

tools2 has this in it:

class LightIterator(object):
     def __init__(self, handle):
         self._handle = handle
         self._defline = None

     def __iter__(self):
         return self

     def next(self):
         lines = []
         defline_old = self._defline

         while 1:
             line = self._handle.readline()
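             if not line:
                 # End of file: stop if nothing is pending, otherwise
                 # return the final record (even one with no defline,
                 # which would previously have looped forever).
                 if not defline_old and not lines:
                     raise StopIteration
                 self._defline = None
                 break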
             elif line[0] == '>':
                 self._defline = line[1:].rstrip()
                 if defline_old or lines:
                     break
                 else:
                     defline_old = self._defline
             else:
                 lines.append(line.rstrip())

         return defline_old, ''.join(lines)
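
The class above uses the Python 2 iterator protocol (a next() method).
As a rough Python 3 equivalent, the same parsing can be sketched as a
generator. The name iter_fasta is made up for illustration, and this is
a sketch of the same idea rather than a drop-in replacement for tools2.

```python
import io


def iter_fasta(handle):
    """Yield (defline, sequence) tuples from a FASTA text handle."""
    defline = None
    lines = []
    for line in handle:
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if defline is not None or lines:
                yield defline, "".join(lines)
            defline = line[1:].rstrip()
            lines = []
        else:
            lines.append(line.rstrip())
    # Emit the final record at end of file.
    if defline is not None or lines:
        yield defline, "".join(lines)


demo = io.StringIO(">a desc\nMERS\nPFLL\n>b\nACIL\n")
records = list(iter_fasta(demo))
print(records)  # prints [('a desc', 'MERSPFLL'), ('b', 'ACIL')]
```

As in LightIterator, the leading '>' is dropped from the defline and the
sequence lines are joined into one string. Sequence lines that appear
before any defline come back with a defline of None.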

blastdb.py:

#!/usr/bin/env python
"""
blastdb.py

access blastdb files
Copyright 2005 Michael Hoffman
License: GPL
"""

from __future__ import division

__version__ = "$Revision: 1.3 $"

import os
import sys

try:
     from poly import NamedTemporaryFile # http://www.ebi.ac.uk/~hoffman/software/poly/
except ImportError:
     from tempfile import NamedTemporaryFile

FASTACMD_CMDLINE = "fastacmd -d %s -s %s -o %s"

class Database(object):
     def __init__(self, filename):
         self.filename = filename

     def fetch_to_file(self, query, filename):
         status = os.system(FASTACMD_CMDLINE % (self.filename, query, filename))
         if status:
             raise RuntimeError, "fastacmd returned %d" % os.WEXITSTATUS(status)

     def fetch_to_tempfile(self, query):
         temp_file = NamedTemporaryFile()
         self.fetch_to_file(query, temp_file.name)
         return temp_file
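
One caveat with the os.system() call above: the command line is built by
string interpolation, so a database name or query containing spaces or
shell metacharacters would be misparsed by the shell. A minimal Python 3
sketch of the same fetch using an argument list is shown here. It
assumes the same fastacmd flags as above (-d, -s, -o); the helper names
and the injectable run parameter are hypothetical and exist only to make
the example easy to try without fastacmd installed.

```python
import subprocess
import tempfile


def fastacmd_args(database, query, output):
    """Build the argument list for the fastacmd call used in blastdb.py."""
    return ["fastacmd", "-d", database, "-s", query, "-o", output]


def fetch_to_tempfile(database, query, run=subprocess.run):
    """Fetch one record into a named temporary file and return the file.

    `run` is a hypothetical hook; by default it is subprocess.run.
    """
    temp_file = tempfile.NamedTemporaryFile()
    # check=True raises CalledProcessError on a nonzero exit status.
    run(fastacmd_args(database, query, temp_file.name), check=True)
    return temp_file


print(fastacmd_args("mammals.peptides.faa", "004/04/m00404.peptide.faa", "out.faa"))
```

Because the arguments are passed as a list, no shell is involved and no
quoting is needed.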
-- 
Michael Hoffman


