[Tutor] Iterating over columns in a tab-delimited text and writing SQL statements

Kent Johnson kent_johnson at skillsoft.com
Sat Nov 6 06:33:21 CET 2004


You need two nested loops to do this. zip() is also very handy, it lets you 
iterate two lists at the same time. I think this does what you want:

f1 = open('GeneName.txt', 'r')
header = f1.readline().split()[3:]

for line in f1:
     data = line.split()
     for sample, expr in zip(header, data[2:]):
         sql = '''INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('%s', '%s', '%s', %s)''' % (data[0], data[1], sample, 
expr)
         print sql

For your sample data, it prints
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('xyz', '1000_at', 'AL21', 272.3)
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('xyz', '1000_at', 'AL11', 345.2)
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('xyz', '1000_at', 'AL12', -32.45)
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('abc', '1001_at', 'AL21', 134.4)
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('abc', '1001_at', 'AL11', 45.3)
INSERT into expression (gene_name,probe_name,sample,exprs)
             VALUES ('abc', '1001_at', 'AL12', 342.59)


By the way, why not update the database directly instead of generating a 
file with ~4,000,000 insert statements? (OK, I can think of one reason - it 
may well be faster to execute the inserts in batches rather than one at a 
time...)

Kent

At 08:34 AM 11/5/2004 -0800, kumar s wrote:
>Dear group,
>  I have a file with ~200 Columns and 20,000 rows.
>
>Every column designates a patient sample.  Each such
>patient sample data should be made as an experiment in
>my database. So,
>I have to write SQL insert statements on these.
>
>My data resembles likes this in its simplest form:
>
>Gene Name  probe_name   AL21     AL11   AL12
>xyz         1000_at     272.3   345.2   -32.45
>abc         1001_at     134.4    45.3   342.59
>...............................................
>
>
>up to 20K rows.
>
>
>Now I want to create an sql statement like the
>following:
>
>INSERT into expression (gene_name,probe_name,sample,
>exprs) VALUES ('xyz', '1000_at','AL21',272.3)
>
>
>
>
>
>import string
> >>> from string import split
> >>> f1 = open('insert_test.txt','r')
> >>> lines = f1.read()
> >>> rows = split(lines,'\n')
> >>> rows
>['Gene Name\tprobe_name\tAL10\tAL21\tAL23',
>'xyz\t1000_at\t272.3\t345.2\t-32.45',
>'abc\t1001_at\t134.4\t45.3\t342.59', '']
> >>> f2 = open('sql.txt','w')
> >>> for i in range(len(lst1)):
>         cols = split(lst1[i],'\t')
>         f2.write('(INSERT INTO
>expr(gene_name,probe_name,sample,expr)
>VALUES('+cols[0]+','+cols[1]+','+sample+','+cols[2]+
>')''\n')
>
>
>
>Traceback (most recent call last):
>   File "<pyshell#63>", line 3, in -toplevel-
>     f2.write('(INSERT INTO
>expr(gene_name,probe_name,sample,expr)
>VALUES('+cols[0]+','+cols[1]+','+sample+','+cols[2]+
>')''\n')
>IndexError: list index out of range
> >>> f2.close()
>
>Result from sql.txt:
>
>
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(Gene Name,probe_name,AL21,AL10)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(xyz,1000_at,AL21,272.3)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(abc,1001_at,AL21,134.4)
>
>
>
>My problem:
>
>I am unable to write a function that will take my
>matrix and write statements for every sample such as
>AL21, AL11,AL12.
>
>I know I am not good at looping over columns here.
>Take column 3 first and column 4 and then column 5
>like that... However
>column 1 and 2 remains same for every sample, because
>they are common for evert sample column.
>Something like:
>
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(Gene Name,probe_name,AL21,AL10)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(xyz,1000_at,AL21,272.3)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(abc,1001_at,AL21,134.4)
>
>
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(xyz,1000_at,AL11,345.2)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(abc,1001_at,AL11,45.3)
>
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(xyz,1000_at,AL12,-32.5)
>(INSERT INTO expr(gene_name,probe_name,sample,expr)
>VALUES(abc,1001_at,AL12,345.59)
>
>
>
>2. Also why am I getting INdex out of range ,
>IndexError: ? I have no explanation to it.
>
>
>
>
>Can any one please help me.
>
>Thank you.
>
>Kumar.
>
>
>
>
>
>__________________________________
>Do you Yahoo!?
>Check out the new Yahoo! Front Page.
>www.yahoo.com
>
>
>_______________________________________________
>Tutor maillist  -  Tutor at python.org
>http://mail.python.org/mailman/listinfo/tutor



More information about the Tutor mailing list