[Tutor] Huge list comprehension

Peter Otten __peter__ at web.de
Tue Jun 6 03:58:00 EDT 2017


syed zaidi wrote:

> 
> hi,
> 
> I would appreciate it if you could help me by suggesting a quick and
> efficient strategy for comparing multiple lists with one principal list
> 
> I have about 125 lists containing about 100,000 numerical entries in each
> 
> my principal list contains about 6 million entries.
> 
> I want to compare each small list with main list and append yes/no or 0/1
> in each new list corresponding to each of 125 lists
> 
> 
> The program is working but it takes ages to process huge files.
> Can someone please tell me how I can make this process faster? Right now
> it takes around 2 weeks to complete this task
> 
> 
> the code I have written, which is working, is below:
> 
> 
> sample_name = []
> 
> main_op_list,principal_list = [],[]
> dictionary = {}
> 
> with open("C:/Users/INVINCIBLE/Desktop/T2D_ALL_blastout_batch.txt", 'r') as f:
>     reader = csv.reader(f, dialect = 'excel', delimiter='\t')
>     list2 = filter(None, reader)
>     for i in range(len(list2)):
>         col1 = list2[i][0]
>         operon = list2[i][1]
>         main_op_list.append(operon)
>         col1 = col1.strip().split("_")
>         sample_name = col1[0]
>         if dictionary.get(sample_name):
>             dictionary[sample_name].append(operon)
>         else:
>             dictionary[sample_name] = []
>             dictionary[sample_name].append(operon)
> locals().update(dictionary) ## converts dictionary keys to variables

Usually I'd refuse to go beyond the line above.
DO NOT EVER WRITE CODE LIKE THAT.
You have your data in a nice dict -- keep it there where it belongs.
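
To illustrate, here is a minimal sketch of keeping everything in the dict. It assumes the first column looks like SAMPLE_something, as the quoted script implies. The function name group_operons and the sample rows are made up for the example, and collections.defaultdict merely stands in for the get()/else branch.

```python
from collections import defaultdict


def group_operons(rows):
    """Group operons by sample name (the part of column 1 before the first "_")."""
    operons_by_sample = defaultdict(list)
    for row in rows:
        if not row:  # skip empty rows, like filter(None, reader) does
            continue
        sample_name = row[0].strip().split("_")[0]
        operons_by_sample[sample_name].append(row[1])
    return operons_by_sample


# Hypothetical rows standing in for the tab-separated file
rows = [
    ["DLF002_a", "op1"],
    ["DLF004_b", "op2"],
    [],
    ["DLF002_c", "op3"],
]
operons_by_sample = group_operons(rows)

# Look a sample up by key instead of turning it into a variable
print(operons_by_sample["DLF002"])
```

Any sample is then reachable as operons_by_sample[name], so no names have to be injected into the module namespace.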

> ##print DLF004
> dict_values = dictionary.values()
> dict_keys = dictionary.keys()
> print dict_keys
> print len(dict_keys)
> main_op_list_np = np.array(main_op_list)
> 
> 
> DLF002_1,DLF004_1,DLF005_1,DLF006_1,DLF007_1,DLF008_1,DLF009_1,DLF010_1,DLF012_1,DLF013_1,DLF014_1,DLM001_1,DLM002_1,DLM003_1,DLM004_1,DLM005_1,DLM006_1,DLM009_1,DLM011_1,DLM012_1,DLM018_1,DOF002_1,DOF003_1 = [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]

This is mind-numbing...

> for i in main_op_list_np:
>     if i in DLF002: DLF002_1.append('1')
>     else:DLF002_1.append('0')
>     if i in DLF004: DLF004_1.append('1')
>     else:DLF004_1.append('0')
>     if i in DLF005: DLF005_1.append('1')
>     else:DLF005_1.append('0')
>     if i in DLF006: DLF006_1.append('1')
>     else:DLF006_1.append('0')

... and this is, too. 

Remember that we are volunteers, so please keep your code samples small. Whether there
are three if-else checks or one hundred -- the logic remains the same.
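
As a rough sketch of "the same logic", one loop over a dict can stand in for every if-else pair. Note two assumptions of mine that are not in the original script: the function name membership_flags and the toy data are hypothetical, and each sample's list is converted to a set first, because `in` on a list scans the whole list while `in` on a set is a hash lookup. Whether that is where the two weeks go is something only profiling the real script can tell.

```python
def membership_flags(main_op_list, operons_by_sample):
    """For every sample, return a list of '1'/'0' flags, one per entry of main_op_list."""
    flags_by_sample = {}
    for sample_name, operons in operons_by_sample.items():
        lookup = set(operons)  # convert once per sample, not once per test
        flags_by_sample[sample_name] = [
            "1" if operon in lookup else "0" for operon in main_op_list
        ]
    return flags_by_sample


# Hypothetical toy data in place of the real files
main_op_list = ["op1", "op2", "op3"]
operons_by_sample = {"DLF002": ["op1", "op3"], "DLF004": ["op2"]}

flags_by_sample = membership_flags(main_op_list, operons_by_sample)
print(flags_by_sample["DLF002"])
```

Adding a sample then means adding a key to the dict, not another pair of if-else lines.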

Give us a small sample script and a small dataset to go with it, use dicts 
instead of dumping everything into the module namespace, explain the 
script's purpose in plain English, and identify the parts that take too long 
-- then I'll take another look.


