[Tutor] R:reformatting data and traspose dictionary
Peter Otten
__peter__ at web.de
Wed Apr 20 10:05:07 EDT 2016
jarod_v6--- via Tutor wrote:
> Dear All,
> sorry for the poor presentation of my code.
>
> I read a text file and build a dictionary
>
> files = os.listdir(".")
> tutto={}
> annotatemerge = {}
> for i in files:
By the way, `i` is one of the worst choices to denote a filename, only to
be beaten by `this_is_not_a_filename` ;)
>     with open(i,"r") as f:
>         for it in f:
>             lines = it.rstrip("\n").split("\t")
>
>             if len(lines) >2 and lines[0] != '#CHROM':
>
>                 conte = [lines[0],lines[1],lines[3],lines[4]]
>
>                 tutto.setdefault(i+"::"+"-".join(conte)+"::"+str(lines),[]).append(1)
>                 annotatemerge.setdefault("-".join(conte),set()).add(i)
>
>
>
> I create two dictionaries. The first, annotatemerge, uses a coordinate
> (chr3-195710967-C-CG) as key and maps it to a set of file names:
> 'chr3-195710967-C-CG': {'M8.vcf'},
> 'chr17-29550645-T-C': {'M8.vcf'},
> 'chr7-140434541-G-A': {'M8.vcf'},
> 'chr14-62211578-CGTGT-C': {'M8.vcf', 'R76.vcf'},
> 'chr3-197346770-GA-G': {'M8.vcf', 'R76.vcf'},
> 'chr17-29683975-C-T': {'M8.vcf'},
> 'chr13-48955585-T-A': {'R76.vcf'},
>
> The other dictionary carries more information. Its key is built from
> several values separated by the symbol "::":
>
>
> {["M8.vcf::chr17-29665680-A-G::['chr17', '29665680', '.', 'A', 'G',
> '70.00',
> 'PASS', 'DP=647;TI=NM_001042492,NM_000267;GI=NF1,NF1;FC=Silent,Silent',
> 'GT:GQ:AD:VF:NL:SB:GQX', '0/1:70:623,24:0.0371:20:-38.2744:70']": [1],...}
>
>
> What I want to obtain is a list with this format:
>
> coordinate\tM8.vcf\tR76.vcf\n
> chr3-195710967-C-CG\t1\t0\n
> chr17-29550645-T-C\t1\t0\n
> chr3-197346770-GA-G\t1\t1\n
> chr13-48955585-T-A\t0\t1\n
>
>
> When I have that file I want to transpose the table so that the
> coordinates are in the columns and the sample names are in the rows.
(1) Here's a generic way to create a pivot table:

def add(x, y):
    return x + y


def pivot(
        data,
        get_column, get_row, get_value=lambda item: 1,
        accu=add,
        default=0, empty="-/-"):
    rows = {}
    columnkeys = set()
    for item in data:
        rowkey = get_row(item)
        columnkey = get_column(item)
        value = get_value(item)

        column = rows.setdefault(rowkey, {})
        column[columnkey] = accu(column.get(columnkey, default), value)
        columnkeys.add(columnkey)

    columnkeys = sorted(columnkeys)
    result = [
        [""] + columnkeys
    ]
    for rowkey in sorted(rows):
        row = rows[rowkey]
        result.append([rowkey] + [row.get(ck, empty) for ck in columnkeys])
    return result


if __name__ == "__main__":
    import csv
    import sys
    from operator import itemgetter

    data = [
        ("alpha", "one"),
        ("beta", "two"),
        ("gamma", "three"),
        ("alpha", "one"),
        ("gamma", "one"),
    ]

    csv.writer(sys.stdout, delimiter="\t").writerows(
        pivot(
            data, itemgetter(0), itemgetter(1)))
    print("")
    csv.writer(sys.stdout, delimiter="\t").writerows(
        pivot(
            data, itemgetter(1), itemgetter(0)))

As you can see when you run the above code, transposing the table is done by
swapping the get_column() and get_row() arguments.
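
If the table already exists as a list of rows, for instance the return value
of pivot(), it can also be transposed after the fact with zip(). This is only
a sketch: the transpose() helper and the sample table are made up for
illustration, and the approach assumes that every row has the same length.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list of lists."""
    # zip(*table) yields one tuple per column; turn each back into a list.
    return [list(column) for column in zip(*table)]


# Hypothetical sample in the same shape that pivot() returns:
# a header row followed by one row per key.
table = [
    ["", "M8.vcf", "R76.vcf"],
    ["chr13-48955585-T-A", 0, 1],
    ["chr3-195710967-C-CG", 1, 0],
]

transposed = transpose(table)
for row in transposed:
    print("\t".join(str(cell) for cell in row))
```

Swapping the getter arguments avoids building the table twice; zip() is handy
when only the finished table is at hand.
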
Instead of the sample data you can feed pivot() something like this:

# Untested. This is basically a copy of the code you posted wrapped into a
# generator. I used csv.reader() instead of splitting the lines manually.

import csv
import os


def gen_data():
    for filename in os.listdir():
        with open(filename, "r") as f:
            for fields in csv.reader(f, delimiter="\t"):
                if len(fields) > 2 and fields[0] != '#CHROM':
                    conte = "-".join(
                        [fields[0], fields[1], fields[3], fields[4]])
                    yield conte, filename

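
To get the 0/1 table from the question, the (coordinate, filename) pairs can
be passed to pivot() with the filename as column and the coordinate as row.
The following is a sketch under some assumptions: the pairs are typed in by
hand from the annotatemerge example instead of being read from VCF files,
pivot() is repeated in compact form so that the snippet runs on its own, and
passing empty=0 is my guess at how you want missing combinations shown.

```python
import csv
import sys
from operator import itemgetter


def pivot(data, get_column, get_row, get_value=lambda item: 1,
          accu=lambda x, y: x + y, default=0, empty="-/-"):
    # Same logic as the pivot() function from this mail, repeated so that
    # the example is self-contained.
    rows = {}
    columnkeys = set()
    for item in data:
        column = rows.setdefault(get_row(item), {})
        columnkey = get_column(item)
        column[columnkey] = accu(
            column.get(columnkey, default), get_value(item))
        columnkeys.add(columnkey)
    columnkeys = sorted(columnkeys)
    result = [[""] + columnkeys]
    for rowkey in sorted(rows):
        row = rows[rowkey]
        result.append([rowkey] + [row.get(ck, empty) for ck in columnkeys])
    return result


# Stand-in for gen_data(): pairs copied from the annotatemerge example.
pairs = [
    ("chr3-195710967-C-CG", "M8.vcf"),
    ("chr3-197346770-GA-G", "M8.vcf"),
    ("chr3-197346770-GA-G", "R76.vcf"),
    ("chr13-48955585-T-A", "R76.vcf"),
]

# Coordinates as rows, files as columns; empty=0 shows 0 instead of "-/-".
by_coordinate = pivot(
    pairs, get_column=itemgetter(1), get_row=itemgetter(0), empty=0)

# Swapping the two getters gives the transposed table.
by_file = pivot(
    pairs, get_column=itemgetter(0), get_row=itemgetter(1), empty=0)

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerows(by_coordinate)
print()
writer.writerows(by_file)
```

Note that the top-left cell is left empty here; if you need the word
"coordinate" in the header you have to overwrite result[0][0] yourself.
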
(2) What you want to do with the other dict is still unclear to me.