[Tutor] variation of Unique items question

Scott Melnyk melnyk at gmail.com
Fri Feb 4 12:38:51 CET 2005


Hello once more.

I am stuck on how best to tie the "finding unique items in lists" ideas to my file.

I am stuck at the level below: what I have here, taken from the unique
items thread, does not work because I need to separate each grouping by
the hg chain it is in (see below for examples).

import sys

WFILE = open(sys.argv[1], 'w')   # output file named on the command line

def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2):
    a_list = open(fname, 'r')
    #print "beginning get_list_dup"
    items_dict, dup_dict = {}, {}

    for i in a_list:
        i = i.strip()                    # drop the trailing newline before counting
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in items_dict.iteritems():
        if v >= threshold:               # keep anything seen at least threshold times
            dup_dict[k] = v

    return dup_dict

def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2):
    #print "Beginning report generation"
    dup_dict = get_list_dup_dict(fname=fname, threshold=threshold)
    for k, v in sorted(dup_dict.iteritems()):
        # 'print >> WFILE' redirects the print into the file;
        # 'print WFILE, ...' would just print the file object to stdout
        print >> WFILE, '%s occurred %s times' % (k, v)

if __name__ == '__main__':
    print_list_dup_report()


My issue is that my file is as follows (a rough sketch of how I picture splitting it into sections follows the sample):
hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST00000339563.1
ENST00000342196.1
ENST00000339563.1
ENST00000344055.1

hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST00000279800.3
ENST00000309556.3

hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST00000321157.3
ENST00000256324.4
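
To make the grouping concrete, here is a rough sketch of how I picture
reading the file into sections, one entry per hg17 header.  The
read_sections name, the assumption that every header starts with 'hg17',
and the blank-line handling are all mine, not anything from the thread:

# Sketch only: collect the ENST lines under the hg17 header they follow.
# Assumes a header is any line starting with 'hg17' and that blank lines
# do nothing but separate sections.
def read_sections(fname='Z:/datasets/fooyoo.txt'):
    sections = {}          # header line -> list of ENST ids
    order = []             # remember header order for printing later
    current = None
    for line in open(fname, 'r'):
        line = line.strip()
        if not line:
            continue
        if line.startswith('hg17'):
            current = line
            sections[current] = []
            order.append(current)
        elif current is not None:
            sections[current].append(line)
    return order, sections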
  
I need a printout that gives the hg17... line and then any ENST
instances that occur more than once, but only within that chain
section.  Even better, the hg17 line would only be printed if it is
followed by an ENST instance that occurs more than once.

I am hoping for something that gives me an output file roughly like this (a rough sketch of how the pieces might fit together follows the example):

hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST00000339563.1 occurs 2 times

hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST00000279800.3 occurs 2 times
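
Roughly, I imagine combining the counting idea with the sections like
the sketch below; the threshold of 2, the blank line between sections,
and writing to the file named on the command line are just my guesses,
and it relies on the read_sections() sketch above:

import sys

# Sketch only: per-section duplicate report.  Counts the ENST ids under
# each hg17 header and prints the header only when at least one id repeats.
def print_section_dup_report(fname='Z:/datasets/fooyoo.txt',
                             threshold=2, out=sys.stdout):
    order, sections = read_sections(fname)
    for header in order:
        counts = {}
        for enst in sections[header]:
            counts[enst] = counts.get(enst, 0) + 1
        dups = [(k, v) for k, v in sorted(counts.iteritems())
                if v >= threshold]
        if dups:                          # skip sections with no repeats
            print >> out, header
            for k, v in dups:
                print >> out, '%s occurs %s times' % (k, v)
            print >> out, ''              # blank line between sections

if __name__ == '__main__':
    print_section_dup_report(out=open(sys.argv[1], 'w'))

Run like that, it should write something close to the example above into
whatever file I name on the command line.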
 

All help and ideas are appreciated. I am trying to get this finished as
soon as possible; the output file will be used to go back to my 2 GB
file and pull out the rest of the data I need.

Thanks,
Scott
