[Tutor] FASTA parsing, biological sequence analysis

Danny Yoo dyoo at hashcollision.org
Tue Apr 1 05:04:39 CEST 2014

On Tue, Mar 25, 2014 at 8:36 AM, Sydney Shall <s.shall at virginmedia.com> wrote:
> I did not know about biopython, but then I am a debutant.
> I tried to import biopython and I get the message that the name is unknown.

No problem.  It is an external library; I hope that you were able to
find it!  I just want to make sure no one else tries to write yet
another FASTA parser badly.  It's all too easy to code something
quick-and-dirty that almost solves the issue.  The devil's in the
details.
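
On the import error itself: as far as I know, the package is installed
under the name "biopython" but imported under the name "Bio", so
"import biopython" fails even after a successful install.  Here is a
minimal sketch that works whether or not Biopython is present.  The
helper name read_ids and the sample data are made up for illustration:

```python
import io


def read_ids(handle):
    """Return the record IDs in a FASTA handle, or None without Biopython."""
    try:
        # Installed as "biopython", but the importable package is "Bio".
        from Bio import SeqIO
    except ImportError:
        return None
    return [record.id for record in SeqIO.parse(handle, "fasta")]


# Hypothetical two-record input, held in memory instead of a real file.
sample = io.StringIO(">seq1 demo\nACGT\n>seq2\nGGCC\n")
ids = read_ids(sample)
print(ids)
```

If this prints None, the library is not installed in the Python you are
running.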

It might be instructive to look at source code.  You can look at:


and see all the implementation details the Biopython community has had
to consider in the real world.

These include things like skipping crazy garbage at the beginning of files,


and providing a stream-like interface by using generators (using the
"yield" command):
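
To make those two ideas concrete without reading the Biopython source,
here is a toy sketch of my own.  It is not their implementation and it
ignores many of the details they handle.  The name iter_fasta and the
sample data are assumptions for illustration only:

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) pairs one record at a time."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header: hand back the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence line: drop embedded spaces and keep accumulating.
            chunks.append(line.replace(" ", ""))
        # Anything before the first ">" is skipped as leading junk.
    if title is not None:
        yield title, "".join(chunks)


sample = io.StringIO("junk line\n\n>seq1 first\nACGT\nTTAA\n>seq2\nGG CC\n")
records = list(iter_fasta(sample))
```

Because the function yields records as it goes, a caller can loop over
a very large file without holding all of it in memory.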


But also consider data validation facilities.  At least, the Biopython
folks have.  They provide a way to declare the genomic alphabet to be
used,


where if the input data doesn't match the allowed alphabet, you'll get
a good warning about it ahead of time.  This is checked in places
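
The general idea of checking data against a declared alphabet can be
sketched in plain Python.  This is only an illustration of the concept,
not Biopython's API.  The names DNA_LETTERS and check_alphabet, and the
choice of a four-letter alphabet, are assumptions:

```python
import warnings

# Assumed alphabet for illustration; real data often needs more letters.
DNA_LETTERS = frozenset("ACGT")


def check_alphabet(sequence, allowed=DNA_LETTERS):
    """Return the set of unexpected letters, warning if any are found."""
    unexpected = set(sequence.upper()) - allowed
    if unexpected:
        warnings.warn("unexpected letters: " + "".join(sorted(unexpected)))
    return unexpected


clean = check_alphabet("acgtACGT")

# Capture the warning so that it can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dirty = check_alphabet("ACGTXN")
```

Warning rather than raising lets a caller decide whether odd letters
are fatal for their particular analysis.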


In short, in the presence of potentially messy data, the developers
have thought about these sorts of issues and have programmed for those
situations.

As the commit history demonstrates:


they started work in the last century or so (since at least
1999-12-07), and continue to work on it even now.  So taking advantage
of their generous and hard work is a good idea.

