[Tutor] FASTA parsing, biological sequence analysis

Tue Apr 1 05:04:39 CEST 2014

On Tue, Mar 25, 2014 at 8:36 AM, Sydney Shall <s.shall at virginmedia.com> wrote:
> I did not know about biopython, but then I am a debutant.
> I tried to import biopython and I get the message that the name is unknown.

No problem.  It is an external library, so it has to be installed
separately, and once installed the package is imported as "Bio" rather
than "biopython".  I hope that you were able to find it!  I just want
to make sure no one else tries to write yet another FASTA parser
badly.  It's all too easy to code something quick-and-dirty that
almost solves the issue.  The devil's in the details.

It might be instructive to look at source code.  You can look at:

    https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py

and see all the implementation details the Biopython community has had
to consider in the real world.

These include things like skipping crazy garbage at the beginning of files,

    https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L40-L45

and providing a stream-like interface by using generators (using the
"yield" statement):

    https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L65
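
To make those two points concrete, here is a minimal sketch of the
idea.  It is not Biopython's code: the name read_fasta and the sample
data are made up for illustration, and it ignores many of the details
the real parser handles.  It skips any lines before the first ">"
header and hands back one record at a time with "yield":

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from an open text handle.

    Hypothetical helper for illustration only; not Biopython's API.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence line: drop embedded spaces and collect it.
            chunks.append(line.replace(" ", ""))
        # Anything before the first ">" header is silently skipped.
    if title is not None:
        yield title, "".join(chunks)


# Made-up sample input with a junk line before the first record.
sample = io.StringIO(
    "junk before the first record\n"
    ">seq1 demo\n"
    "ACGT\n"
    "AC GT\n"
    ">seq2\n"
    "TTGA\n"
)
records = list(read_fasta(sample))
```

Because read_fasta is a generator, a caller can loop over a very large
file without holding every record in memory at once.  With Biopython
installed, Bio.SeqIO.parse(handle, "fasta") gives you the same style of
iteration, with far more care taken over the edge cases.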

But also consider data validation facilities.  At least, the Biopython
folks have.  They provide a way to declare the genomic alphabet to be
used:

    https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L73
    https://github.com/biopython/biopython/blob/master/Bio/Alphabet/

where if the input data doesn't match the allowed alphabet, you'll get
a good warning about it ahead of time.  This is checked in places
like:

    https://github.com/biopython/biopython/blob/master/Bio/Alphabet/__init__.py#L375
    https://github.com/biopython/biopython/blob/master/Bio/Seq.py#L336
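
As a rough sketch of what such a check amounts to, assuming a plain
four-letter DNA alphabet: the names DNA_LETTERS and check_alphabet
below are hypothetical and are not part of Biopython, whose own
alphabet machinery is more elaborate than this.

```python
# Hypothetical alphabet for illustration: unambiguous DNA only.
DNA_LETTERS = frozenset("ACGT")


def check_alphabet(sequence, allowed=DNA_LETTERS):
    """Return the sequence unchanged, or raise ValueError on bad letters.

    Illustrative helper only; not Biopython's API.
    """
    # Compare case-insensitively and report every offending letter.
    bad = sorted(set(sequence.upper()) - allowed)
    if bad:
        raise ValueError("unexpected letters in sequence: " + ", ".join(bad))
    return sequence
```

Running a check like this on each record as it is read means that a
stray "U" or a digit in the data is reported right away, rather than
surfacing later as a confusing result further down the analysis.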

In short, in the presence of potentially messy data, the developers
have thought about these sorts of issues and have programmed for those
situations.

As the commit history demonstrates:

    https://github.com/biopython/biopython/commits/master

they started work in the last century (since at least 1999-12-07) and
continue to work on it even now.  So taking advantage of their
generous, hard work is a good idea.