[Tutor] sorting and editing large data files
Rich Krauter
rmkrauter at yahoo.com
Thu Dec 16 14:53:11 CET 2004
Scott Melnyk wrote:
> Hello!
>
> I recently suffered a loss of programming files (and I had been
> putting off my backups...)
>
[snip]
>
> #regular expression to pull out gene, transcript and exon ids
>
> info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
> #above is match gene, transcript, then one or more exons
>
>
> #TFILE = open(sys.argv[1], 'r' ) #read the various transcripts from
> WFILE=open(sys.argv[1], 'w') # file to write 2 careful with 'w'
> # will overwrite old info in file
> W2FILE=open(sys.argv[2], 'w') #this file will have the names of
> # redundant exons
> import sets
> def getintersections(fname='Z:\datasets\h35GroupedDec15b.txt'):
>     exonSets = {}
>     f = open(fname)
>     for line in f:
>         if line.startswith('ENS'):
>             parts = line.split()
>             gene = parts[0]
>             transcript = parts[1]
>             exons = parts[2:]
>             exonSets.setdefault(gene,
>             sets.Set(exons)).intersection(sets.Set(exons))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     return exonSets
Hi Scott,
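There may be other problems, but here's one thing I noticed:

    exonSets.setdefault(gene,
        sets.Set(exons)).intersection(sets.Set(exons))

should be

    exonSets.setdefault(gene,
        sets.Set(exons)).intersection_update(sets.Set(exons))

intersection() returns a new set and leaves the set stored in exonSets
unchanged, so its result is simply thrown away. intersection_update()
modifies the stored set in place, which is what the loop needs.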
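
For what it's worth, here is a minimal sketch of the same loop showing the in-place behaviour. It is only an illustration: it assumes the built-in set type rather than sets.Set, the name common_exons is my own, and the sample lines are made up and not taken from your file.

```python
def common_exons(lines):
    """Map each gene ID to the exons shared by all of its transcripts."""
    exon_sets = {}
    for line in lines:
        if not line.startswith('ENS'):
            continue
        parts = line.split()
        gene = parts[0]
        exons = set(parts[2:])  # parts[1] is the transcript ID
        # setdefault returns the set stored under this gene (storing this
        # line's exons first if the gene is new); intersection_update then
        # narrows that stored set in place.
        exon_sets.setdefault(gene, exons).intersection_update(exons)
    return exon_sets


# Made-up records in "gene transcript exon..." form, for illustration only.
sample = [
    'ENSG1.1 ENST1.1 ENSE1.1 ENSE2.1 ENSE3.1',
    'ENSG1.1 ENST2.1 ENSE2.1 ENSE3.1 ENSE4.1',
]
result = common_exons(sample)

# By contrast, intersection() leaves the original set untouched.
stored = set(['ENSE1.1', 'ENSE2.1'])
unused = stored.intersection(set(['ENSE2.1']))
```

With intersection() in the loop instead, each gene would keep the exons of whichever transcript was seen first, since nothing ever writes the narrowed set back into the dictionary.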
Hope that helps.
Rich