[Tutor] sorting and editing large data files

Thu Dec 16 14:53:11 CET 2004

Scott Melnyk wrote:
> Hello!
> 
> I recently suffered a loss of programming files (and I had been
> putting off my backups...)
> 
[snip]
>
> #regular expression to pull out gene, transcript and exon ids
> 
> info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
> #above is match gene, transcript, then one or more exons
> 
> 
> #TFILE = open(sys.argv[1], 'r' )			    #read the various transcripts from
> WFILE=open(sys.argv[1], 'w')			    # file to write 2 careful with 'w' will overwrite old info in file
> W2FILE=open(sys.argv[2], 'w')			    #this file will have the names of redundant exons
> import sets
> def getintersections(fname='Z:\datasets\h35GroupedDec15b.txt'):
> 	exonSets = {}
> 	f = open(fname)
> 	for line in f:
> 	    if line.startswith('ENS'):
> 	        parts = line.split()
> 	        gene = parts[0]
> 	        transcript = parts[1]
> 	        exons = parts[2:]
> 	        exonSets.setdefault(gene,
> 	                 sets.Set(exons)).intersection(sets.Set(exons))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	return exonSets

Hi Scott,

There may be other problems, but here's one thing I noticed:

exonSets.setdefault(gene,
     sets.Set(exons)).intersection(sets.Set(exons))

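should be the following, because intersection() returns a new set and
leaves the stored one unchanged, whereas intersection_update() modifies
the stored set in place: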

exonSets.setdefault(gene,
    sets.Set(exons)).intersection_update(sets.Set(exons))
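
To see the difference concretely, here is a minimal sketch. It assumes a
newer Python where the built-in set type stands in for sets.Set (the two
methods have the same names); the sample rows and the helper
common_exons() are made up for illustration and are not from your data.

```python
# Hypothetical sample rows: gene id, transcript id, then exon ids.
rows = [
    "ENSG1.1 ENST1.1 ENSE1.1 ENSE2.1 ENSE3.1",
    "ENSG1.1 ENST2.1 ENSE2.1 ENSE3.1 ENSE4.1",
]


def common_exons(lines, in_place):
    """Map each gene to the exons shared by all of its transcripts."""
    exon_sets = {}
    for line in lines:
        parts = line.split()
        gene = parts[0]
        exons = set(parts[2:])
        # setdefault returns the set already stored for this gene,
        # or stores and returns `exons` the first time the gene is seen.
        stored = exon_sets.setdefault(gene, exons)
        if in_place:
            # Shrinks the stored set to the exons also in this transcript.
            stored.intersection_update(exons)
        else:
            # Builds a new set and throws it away; `stored` is untouched.
            stored.intersection(exons)
    return exon_sets


print(sorted(common_exons(rows, in_place=False)["ENSG1.1"]))
# -> ['ENSE1.1', 'ENSE2.1', 'ENSE3.1']  (just the first transcript's exons)
print(sorted(common_exons(rows, in_place=True)["ENSG1.1"]))
# -> ['ENSE2.1', 'ENSE3.1']  (exons common to both transcripts)
```

With intersection() the dictionary only ever holds the exons of the first
transcript seen for each gene, which is probably why the output looked
wrong.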

Hope that helps.

Rich