[Tutor] sorting and editing large data files

Thu Dec 16 20:40:02 CET 2004

On Thu, 16 Dec 2004, Scott Melnyk wrote:

> I recently suffered a loss of programming files (and I had been
> putting off my backups...)

Hi Scott,

[Side note that's not really related to Python: if you don't use a version
control system to manage your software yet, please learn to use one.
There's a very good one called Subversion:

    http://subversion.tigris.org/

A good version control system is invaluable to programmers, since many
programs are not really written once, but are rather maintained and
revised for a long time.]

Rich Krauter already caught the bug that was occurring: the intersection()
method of sets produces a brand new set, rather than do a mutation on the
old set.

Here are some more comments on the programs you've shown us:

> A sample of the file format is:
>
> >ENSE00001384652.1|ENSG00000166157.5|ENST00000359693.1
> assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons
> plus upstream and downstream regions for exon
> GCGGCCGTTCAAGGCAGGGGGCGGGGCGTCTCCGAGCGGCGGGGCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGCCCCGGCGGCTCTAGGGACCCCCGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG

Ah, ok, so this is a FASTA file.

(For others on the list, see:

    http://ngfnblast.gbf.de/docs/fasta.html

for a description of the BLAST format.)

> ENSExxxxxxx is an exon id tag  followed by a ENSGxxxxxgene id tag then
> a ENSTxxxxxxx transcript id tag followed by information about the
> location of exon.

[text cut]

Ok, so it sounds like your program will mostly pay attention to each
description line in the FASTA file.

> In order to visually understand the data better I made a script to
> organize it into sections with Gene ID, then Trancript ID followed by
> the different Exon IDs like so:

[lots of text cut]

There's one big assumption in the code to OrganizeData.py that may need to
be explicitely stated: the code appears to assume that the sequences are
already sorted and grouped in order, since the code maintains a variable
named 'sOldGene' and maintains some state on the last FASTA sequence that
has been seen.  Is your data already sorted?

As a long term suggestion: you may find it easier to write the extracted
data out as something that will be easy to work with for the next stages
of your pipeline.  Human readability is important too, of course, so
there's a balance necessary between machine and numan convenience.

If possible, you may want to make every record a single line, rather than
have a record spread across several lines.  Your program does do this in
some part, but it also adds other output, like announcements to signal the
start of new genes.  That can complicates later stages of your analysis.

Eric Raymond has summarized the rules-of-thumb that the Unix utitilies try
to follow:

    http://www.faqs.org/docs/artu/ch05s02.html#id2907428

As a concrete counterexample, the 'BLAST' utility that bioinformaticians
use has a default output that's very human readable, but so ad-hoc that
programs that try to use BLAST output often have to resort to fragile,
hand-crafted BLAST parsers.  The situation's a lot better now, since newer
versions of BLAST finally support a structured format.

So I'd strongly recommend dropping the "NEW GENE", "end of gene group",
and "NEW TRANSCRIPT" lines out of your output.  And if you really want to
keep them, you can write a separate program that adds those notes back
into the output of OrganizeData.py for a human reader.

If you drop those decorations out, then the other parts of your pipeline
can be simplified, since you can assume that each line of the input file
is data.  The other stages of your analysis pipeline appear to try to
ignore those lines anyway, so why put them in?  *grin*