[Tutor] sorting and editing large data files

Thu Dec 16 12:15:18 CET 2004

Hello!

I recently suffered a loss of programming files (and I had been
putting off my backups...)

The project I am working on involved the use of programs which I had
been assisted with by some of the group.

I am rewritten the programs however my results are not what I expected.

I would appreciate anyone who could follow through the steps and
programs below and tell me if it seems correct, which it does to me. 
It could be the error is based on a flaw that I had made before.

I will give very simplified explanations of the data to help.

Starting a file with genetic data about 95 mb

A sample of the file format is:

>ENSE00001384652.1|ENSG00000166157.5|ENST00000359693.1
assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons
plus upstream and downstream regions for exon
GCGGCCGTTCAAGGCAGGGGGCGGGGCGTCTCCGAGCGGCGGGGCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGCCCCGGCGGCTCTAGGGACCCCCGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG

It is hard to tell due to email formating however  from the ">" to the
words "regions for exon" is one line, all the GCATs are on one line.

ENSExxxxxxx is an exon id tag  followed by a ENSGxxxxxgene id tag then
a ENSTxxxxxxx transcript id tag followed by information about the
location of exon.

the letters GCTA are the actually exon sequence.

1 gene can have different versions, these are called transcripts and
each transcript contains a number of exons

In order to visually understand the data better I made a script to
organize it into sections with Gene ID, then Trancript ID followed by
the different  Exon IDs like so:

NEW GENE
ENSG00000166157.5 ENST00000359693.1 ENSE00001384652.1
ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1
ENSE00001371166.1 ENSE00001369770.1 ENSE00001122873.3
ENSE00001156126.1 ENSE00001156118.1 ENSE00001378322.1
ENSE00001156105.1 ENSE00001156099.1 ENSE00001156092.1
ENSE00001156083.1 ENSE00001156075.1 ENSE00001156069.1
ENSE00001156063.1 ENSE00001156057.1 ENSE00001100323.1
ENSE00001156046.1 ENSE00001126180.1 ENSE00001100365.1
ENSE00001156031.1 ENSE00001231719.5

	NEW TRANSCRIPT
ENSG00000166157.5 ENST00000298232.4 ENSE00001384652.1
ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1
ENSE00001371166.1 ENSE00001369770.1 ENSE00001156126.1
ENSE00001156118.1 ENSE00001378322.1 ENSE00001156105.1
ENSE00001156099.1 ENSE00001156092.1 ENSE00001156083.1
ENSE00001156075.1 ENSE00001156069.1 ENSE00001156063.1
ENSE00001156057.1 ENSE00001100323.1 ENSE00001156046.1
ENSE00001126180.1 ENSE00001100365.1 ENSE00001156031.1
ENSE00001231719.5

	NEW TRANSCRIPT
ENSG00000166157.5 ENST00000342420.2 ENSE00001384652.1
ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1
ENSE00001371166.1 ENSE00001369770.1 ENSE00001156118.1
ENSE00001378322.1 ENSE00001156105.1 ENSE00001156099.1
ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1
ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1
ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1
ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5

	NEW TRANSCRIPT
ENSG00000166157.5 ENST00000339704.2 ENSE00001384652.1
ENSE00001364671.1 ENSE00001387876.1 ENSE00001384389.1
ENSE00001156111.1 ENSE00001156105.1 ENSE00001156099.1
ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1
ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1
ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1
ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5
end of gene group

NEW GENE
ENSG00000187172.4 ENST00000335369.2 ENSE00001428001.1
ENSE00001331994.3 ENSE00001398768.1 ENSE00001429773.1
etc.

This was accomplished with the following program I called OrganizeData.py

#########################################################3
#execute with python datafile outputfile

import re,sys,string

#regular expression to pull out gene, transcript and exon ids

info=re.compile('^>(ENSE\d+\.\d).+(ENSG\d+\.\d).+(ENST\d+\.\d)')
#	sample from the file #>>ENSE00001373825.1|ENSG00000187908.1|ENST00000344467.1

file = open( sys.argv[1], 'r' )			#file to read from	
WriteFile=open(sys.argv[2], 'w')		#file to write to

#initialize some variables  GENEID is the name of the gene, TRANSID is
name of the transcript etc
sOldGene=""
sGENEID=""
sOldTrans=""
sTRANSID=""
sEXONID=""

for line in file:
	if info.match(line):
		Matched= info.match(line)

		sEXONID=Matched.group(1)
		sGENEID=Matched.group(2)
		sTRANSID=Matched.group(3)

		if sOldGene==sGENEID:	#if current line is the same gene as last line
			if sOldTrans == sTRANSID:  #if current line is same transcript as last line
				print >> WriteFile, sEXONID,	#add new exon to the current transcript line

			else:
				print >> WriteFile,"\n\n\tNEW TRANSCRIPT\n", sGENEID, sTRANSID,
sEXONID,   #new transcript of same gene print 2 													   #lines
down

		else:
			print  >> WriteFile,"\nend of gene group\n\nNEW GENE\n", sGENEID ,
sTRANSID, sEXONID, 	#new gene drop down 4 lines for
																#readability
		sOldGene=sGENEID
		sOldTrans=sTRANSID

##########################################################

Within each gene if an exon is present in each transcript it is non
informative for my purposes and with help I wrote a script to create
two files, a list 1 exonID per line of the exons that are present in
each transcript and another that let me look at the data in a
different way, namely the gene with the set of redundant exons.

THe script is as follows:
##########################################################3

# exon editing script 
#use set to use calculus style editing
import sys, string
#

#split it up so each exon is checked against the other transcripts for
the same gene
#if the exon occurs in all transcripts write it to file
#
#use the format on a line in the file of 

#  NEW GENE
#  ENSG00000184895.2 ENST00000327563.1 ENSE00001299380.1 

# 	 	NEW TRANSCRIPT
#  ENSG00000184895.2 ENST00000341596.1 ENSE00001366589.1 ENSE00001364470.1 
#  end of gene group
#  NEW GENE
#  ENSG00000099715.5 ENST00000333703.3 ENSE00001324247.1
ENSE00001300317.1 ENSE00001317536.1 ENSE00000981568.3
ENSE00000652232.1 ENSE00001091626.2

#  	NEW TRANSCRIPT
#  ENSG00000099715.5 ENST00000342434.1 ENSE00001365337.1
ENSE00001387281.1 ENSE00001368624.1 ENSE00001389148.1
ENSE00000981568.3 ENSE00000652232.1 ENSE00001377109.1

#regular expression to pull out gene, transcript and exon ids

info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
#above is match gene, transcript, then one or more exons

#TFILE = open(sys.argv[1], 'r' )			    #read the various transcripts from
WFILE=open(sys.argv[1], 'w')			    # file to write 2 careful with 'w'
will overwrite old info in file
W2FILE=open(sys.argv[2], 'w')			    #this file will have the names of
redundant exons
import sets
def getintersections(fname='Z:\datasets\h35GroupedDec15b.txt'):
	exonSets = {}
	f = open(fname)
	for line in f:
	    if line.startswith('ENS'):
	        parts = line.split()
	        gene = parts[0]
	        transcript = parts[1]
	        exons = parts[2:]
	        exonSets.setdefault(gene,
	                 sets.Set(exons)).intersection(sets.Set(exons))

	return exonSets

if __name__ == '__main__':
    es = getintersections('Z:\datasets\h35GroupedDec15b.txt')
    for k,v in es.items():
       print >> WFILE, k,v,"\n\n"
       for Rexon in v:
	       print >> W2FILE, Rexon

####################################################################

The ouput files look like so:
First is:
ENSG00000160818.4 Set(['ENSE00001054700.1', 'ENSE00001054696.4',
'ENSE00001054703.1', 'ENSE00001377097.1', 'ENSE00001376166.1',
'ENSE00001363386.1', 'ENSE00001054698.3', 'ENSE00001054701.1'])

ENSG00000102081.4 Set(['ENSE00000677348.1', 'ENSE00000677356.1',
'ENSE00000677362.1', 'ENSE00000677344.1', 'ENSE00000677352.1',
'ENSE00001256572.2', 'ENSE00001378539.2', 'ENSE00001378572.1',
'ENSE00000677358.1', 'ENSE00001385529.1', 'ENSE00000677354.1',
'ENSE00000677346.1', 'ENSE00000677350.1', 'ENSE00000677366.2',
'ENSE00000677364.1', 'ENSE00000677360.1', 'ENSE00001023649.1'])

ENSG00000124159.4 Set(['ENSE00001401662.1', 'ENSE00001366493.2',
'ENSE00000844941.1', 'ENSE00001306994.2', 'ENSE00000844932.1',
'ENSE00000844938.1', 'ENSE00000844930.1', 'ENSE00000906910.1',
'ENSE00000906908.1', 'ENSE00001430846.1', 'ENSE00001224295.6'])

etc

the gene with the set of what should be exons found in each
transcript, ie if there are 4 transcripts the exon must be in all four
to be included in the set

the second output file should be exons only found in each transcript:

ENSE00001054700.1
ENSE00001054696.4
ENSE00001054703.1
ENSE00001377097.1
ENSE00001376166.1
ENSE00001363386.1
ENSE00001054698.3
ENSE00001054701.1
etc.

First problem.  When I look at the dataset from which these two
outputs are created the exons are not present in all transcripts. 
When I look at the first gene listed in the first output file the
first exon ENSE00001054700.1 is present in both the transcripts listed
in Z:\datasets\h35GroupedDec15b.txt however the second exon in that
list (ENSE00001054696.4) is only found in the first of the two
transcripts.

as below:
NEW GENE
ENSG00000160818.4 ENST00000292338.3 ENSE00001363386.1
ENSE00001054698.3 ***ENSE00001054700.1*** ENSE00001054701.1
ENSE00001054703.1 ENSE00001377097.1 ENSE00001376166.1
***ENSE00001054696.4****

	NEW TRANSCRIPT
ENSG00000160818.4 ENST00000334588.2 ENSE00001434292.1
ENSE00001054698.3 ***ENSE00001054700.1*** ENSE00001054701.1
ENSE00001054703.1 ENSE00001336475.3
end of gene group

Any suggestions on where I am going wrong?

Scott