Creating Dict of Dict of Lists with joblib and Multiprocessing
Sims, David (NIH/NCI) [C]
david.sims2 at nih.gov
Wed Apr 20 10:43:05 EDT 2016
Hi,
Cross-posted at http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing, but I thought I'd try here too, as there have been no responses there so far.
I'm a bit new to Python and very new to parallel processing in Python. I have a script that processes a data file and generates a dict of dicts. Since I need to run this task on hundreds to thousands of these files and ultimately collate the data, parallel processing seemed to make a lot of sense, but I can't figure out how to build the resulting data structure. Here is a minimal script, without the helper functions:
#!/usr/bin/python
import sys
import os
import re
import subprocess
import multiprocessing
from joblib import Parallel, delayed
from collections import defaultdict
from pprint import pprint
def proc_vcf(vcf, results):
    # os.path.splitext() drops the extension cleanly; note that
    # vcf.rstrip('.vcf') would instead strip any trailing '.', 'v',
    # 'c', or 'f' characters from the name.
    sample_name = os.path.splitext(vcf)[0]
    results.setdefault(sample_name, {})

    # Run helper functions run_cmd() and parse_variant_data() (omitted
    # here) to generate a list of entries. Expect a dict of dicts of lists.
    all_vars = run_cmd('vcfExtractor', vcf)
    results[sample_name]['all_vars'] = parse_variant_data(all_vars, 'all')

    # Run the same helpers to generate a different list of data based on
    # a different set of criteria.
    mois = run_cmd('moi_report', vcf)
    results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
    return results
def main():
    input_files = sys.argv[1:]

    # collected_data = defaultdict(lambda: defaultdict(dict))
    collected_data = {}

    # Parallel processing version:
    # num_cores = multiprocessing.cpu_count()
    # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf, collected_data) for vcf in input_files)

    # Serial version:
    # for vcf in input_files:
    #     proc_vcf(vcf, collected_data)

    pprint(dict(collected_data))
    return

if __name__ == "__main__":
    main()
It's hard to provide source data as it's very large, but basically the script generates a dict of dicts of lists containing two sets of data for each input, keyed by sample and data type:
{ 'sample1' : {
      'all_vars' : [
          'data_val1',
          'data_val2',
          'etc'],
      'mois' : [
          'data_val_x',
          'data_val_y',
          'data_val_z']
      },
  'sample2' : {
      'all_vars' : [
          .
          .
          .
      ]
  }
}
If I run it serially, without trying to multiprocess, there's no problem. I just can't figure out how to parallelize it and build the same data structure. I've tried creating a defaultdict in main() to pass along, as well as a few other variations, but I can't seem to get it right (I get key errors, pickle errors, etc.). Can anyone help me with the proper way to do this? I suspect I'm not creating / initializing / working with the data structure correctly, but maybe my whole approach is ill-conceived?
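[Editor's note: one common pattern that fits this situation, offered as a sketch rather than a definitive fix. joblib's multiprocessing backend pickles the function and its arguments and runs each call in a separate worker process, so in-place updates to a dict passed into proc_vcf() happen on a copy in the worker and never propagate back to the parent. The usual workaround is to have each worker return its own piece of the result and merge the returned values in main(). A minimal sketch of that restructuring, reusing the poster's (omitted) run_cmd() and parse_variant_data() helpers, which would need to be picklable module-level functions:

def proc_vcf(vcf):
    # Build and return this file's results instead of mutating a shared
    # dict; each worker process only ever sees its own copy of arguments.
    sample_name = os.path.splitext(vcf)[0]
    data = {
        'all_vars': parse_variant_data(run_cmd('vcfExtractor', vcf), 'all'),
        'mois': parse_variant_data(run_cmd('moi_report', vcf), 'moi'),
    }
    return sample_name, data

def main():
    input_files = sys.argv[1:]
    num_cores = multiprocessing.cpu_count()
    # Parallel() returns the workers' return values as a list, here one
    # (sample_name, data) pair per input file; merge them in the parent.
    pairs = Parallel(n_jobs=num_cores)(
        delayed(proc_vcf)(vcf) for vcf in input_files)
    collected_data = dict(pairs)
    pprint(collected_data)

Since each sample is keyed by a unique name, a plain dict() over the returned pairs is enough; no defaultdict or shared state between processes is needed.]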