[Tutor] R:reformatting data and traspose dictionary

Wed Apr 20 10:05:07 EDT 2016

jarod_v6--- via Tutor wrote:

> Dear All,
> sorry for my poor presentation of the code.
> 
> I read a txt file and I prepare a dictionary
> 
> files = os.listdir(".")
> tutto={}
> annotatemerge = {}
> for i in files:

By the way, `i` is one of the worst choices to denote a filename, only to 
be beaten by `this_is_not_a_filename` ;)

>     with open(i,"r") as f:
>         for it in f:
>             lines = it.rstrip("\n").split("\t")
> 
>             if len(lines) >2 and lines[0] != '#CHROM':
> 
>                 conte = [lines[0],lines[1],lines[3],lines[4]]
> 
>                 tutto.setdefault(i+"::"+"-".join(conte)+"::"+str(lines),[]).append(1)
>                 annotatemerge.setdefault("-".join(conte),set()).add(i)
> 
> 
> 
> I create two dictionaries. The first one,
> annotatemerge, uses coordinates (chr3-195710967-C-CG) as keys and
> maps each to a set of file names:
> 'chr3-195710967-C-CG': {'M8.vcf'},
>  'chr17-29550645-T-C': {'M8.vcf'},
>  'chr7-140434541-G-A': {'M8.vcf'},
>  'chr14-62211578-CGTGT-C': {'M8.vcf', 'R76.vcf'},
>  'chr3-197346770-GA-G': {'M8.vcf', 'R76.vcf'},
>  'chr17-29683975-C-T': {'M8.vcf'},
>  'chr13-48955585-T-A': {'R76.vcf'},
> 
> The other dictionary reports more information; its key is a string of
> fields separated by this symbol "::"
> 
> 
>   {"M8.vcf::chr17-29665680-A-G::['chr17', '29665680', '.', 'A', 'G',
>   '70.00',
> 'PASS', 'DP=647;TI=NM_001042492,NM_000267;GI=NF1,NF1;FC=Silent,Silent',
> 'GT:GQ: AD:VF:NL:SB:GQX', '0/1:70:623,24:0.0371:20:-38.2744:70']": [1],...}
> 
> 
> What I want to obtain is a list with this format:
> 
> coordinate\tM8.vcf\tR76.vcf\n
> chr3-195710967-C-CG\t1\t0\n
> chr17-29550645-T-C\t1\t0\n
> chr3-197346770-GA-G\t1\t1\n
> chr13-48955585-T-A\t0\t1\n
> 
> 
> When I have that file I want to transpose that table, so that the
> coordinates are in the columns and the sample names in the rows

(1) Here's a generic way to create a pivot table:

def add(x, y):
    return x + y

def pivot(
        data,
        get_column, get_row, get_value=lambda item: 1,
        accu=add,
        default=0, empty="-/-"):
    rows = {}
    columnkeys = set()
    for item in data:
        rowkey = get_row(item)
        columnkey = get_column(item)
        value = get_value(item)
        column = rows.setdefault(rowkey, {})
        column[columnkey] = accu(column.get(columnkey, default), value)
        columnkeys.add(columnkey)

    columnkeys = sorted(columnkeys)
    result = [
        [""] + columnkeys
    ]
    for rowkey in sorted(rows):
        row = rows[rowkey]
        result.append([rowkey] + [row.get(ck, empty) for ck in columnkeys])
    return result

if __name__ == "__main__":
    import csv
    import sys
    from operator import itemgetter

    data = [
        ("alpha", "one"),
        ("beta", "two"),
        ("gamma", "three"),
        ("alpha", "one"),
        ("gamma", "one"),
    ]

    csv.writer(sys.stdout, delimiter="\t").writerows(
        pivot(
            data, itemgetter(0), itemgetter(1)))
    print("")
    csv.writer(sys.stdout, delimiter="\t").writerows(
        pivot(
            data, itemgetter(1), itemgetter(0)))

As you can see when you run the above code, transposing the table is done by 
swapping the get_column() and get_row() arguments.
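
If the table already exists as a list of rows (the shape pivot() returns), 
another way to transpose it is zip(*table). A small sketch with made-up 
sample rows; it assumes all rows have the same length, because zip() stops at 
the shortest one:

```python
# Sample table in the same shape pivot() returns: header row first.
# The values are made up for illustration.
table = [
    ["", "M8.vcf", "R76.vcf"],
    ["chr13-48955585-T-A", 0, 1],
    ["chr3-195710967-C-CG", 1, 0],
]

# zip(*table) groups the n-th cell of every row, i.e. it yields the columns.
transposed = [list(column) for column in zip(*table)]

for row in transposed:
    print("\t".join(str(cell) for cell in row))
```

This only rearranges cells that are already there; it does no counting of its 
own.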

Instead of the sample data you can feed it something like

# Untested. This is basically a copy of the code you posted wrapped into a 
# generator. I used csv.reader() instead of splitting the lines manually.

import csv
import os

def gen_data():
    for filename in os.listdir():
        with open(filename, "r") as f:
            for fields in csv.reader(f, delimiter="\t"):
                if len(fields) > 2 and fields[0] != '#CHROM':
                    conte = "-".join(
                        [fields[0], fields[1], fields[3], fields[4]])
                    yield conte, filename
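
For the 0/1 table you describe, the annotatemerge dict you already build may 
be enough on its own. A minimal sketch, not run against your real files: 
presence_table() is a name I made up, and the sample dict holds a few entries 
copied from your post.

```python
import csv
import sys


def presence_table(annotatemerge):
    """Return a header row, then one row per coordinate with 1/0 per file."""
    # Every filename that occurs in any of the sets, in a stable order.
    filenames = sorted(set().union(*annotatemerge.values()))
    rows = [["coordinate"] + filenames]
    for coordinate in sorted(annotatemerge):
        found_in = annotatemerge[coordinate]
        rows.append(
            [coordinate] + [1 if name in found_in else 0 for name in filenames])
    return rows


# Sample data: a few entries taken from the original post.
annotatemerge = {
    "chr3-195710967-C-CG": {"M8.vcf"},
    "chr3-197346770-GA-G": {"M8.vcf", "R76.vcf"},
    "chr13-48955585-T-A": {"R76.vcf"},
}

table = presence_table(annotatemerge)
csv.writer(sys.stdout, delimiter="\t").writerows(table)
```

The rows are sorted as plain strings, so chr13 comes before chr3; if you need 
genomic order you would have to supply your own sort key.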

(2) What you want to do with the other dict is still unclear to me.