Organize large DNA txt files

MRAB google at mrabarnett.plus.com
Fri Mar 20 11:36:25 EDT 2009


thomasvangurp at gmail.com wrote:
> Dear Fellow programmers,
> 
> I'm using Python scripts to organize some rather large datasets
> describing DNA variation. Information is read, processed and written
> to a file in a sequential order, like this:
> 1+
> 1-
> 2+
> 2-
> 
> etc. The files that I created contain positional information
> (nucleotide position) and some other info, like this:
> 
> file 1+:
> --------------------------------------------
> 1	73	0	1	0	0
> 1	76	1	0	0	0
> 1	77	0	1	0	0
> --------------------------------------------
> file 1-:
> --------------------------------------------
> 1	74	0	0	6	0
> 1	78	0	0	4	0
> 1	89	0	0	0	2
> 
> Now the trick is that I want this:
> 
> File 1+ AND File 1-
> --------------------------------------------
> 1	73	0	1	0	0
> 1	74	0	0	6	0
> 1	76	1	0	0	0
> 1	77	0	1	0	0
> 1	78	0	0	4	0
> 1	89	0	0	0	2
> -------------------------------------------
> 
> So the information should be sorted by position. Right now I've
> written some very complicated scripts that read a number of lines from
> file 1- and 1+ and then combine this output. The problem is of course
> that the running number of file 1- can be lower than that of 1+,
> resulting in an incorrect order. Since both files are too large to
> load into a dictionary at once (both are 100 MB+), I need some sort of
> alternative that can quickly sort everything without crashing my PC.
> 
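Here's my attempt. It assumes each file is already sorted by position,
so the two can be merged one line at a time without holding either in
memory: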

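# The file names below are placeholders; substitute the real paths.
with open("1+.txt") as input_1, open("1-.txt") as input_2, \
        open("merged.txt", "w") as output:
    line_1 = input_1.readline()
    line_2 = input_2.readline()
    # While both files have lines left, write the one with the lower
    # position (second column) and advance only that file.
    while line_1 and line_2:
        pos_1 = int(line_1.split(None, 2)[1])
        pos_2 = int(line_2.split(None, 2)[1])
        if pos_1 < pos_2:
            output.write(line_1)
            line_1 = input_1.readline()
        else:
            output.write(line_2)
            line_2 = input_2.readline()
    # One file is exhausted; copy whatever remains of the other.
    while line_1:
        output.write(line_1)
        line_1 = input_1.readline()
    while line_2:
        output.write(line_2)
        line_2 = input_2.readline()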
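
If it helps, the same merge can probably be written more compactly with
heapq.merge from the standard library. This is only a sketch, not a
tested drop-in: the helper names position and merge_sorted_files are made
up for illustration, and it assumes each input file is already sorted by
position and that every line has at least two whitespace-separated columns.

```python
import heapq


def position(line):
    """Return the nucleotide position (second whitespace-separated column)."""
    return int(line.split(None, 2)[1])


def merge_sorted_files(path_1, path_2, out_path):
    """Merge two position-sorted files into out_path.

    Hypothetical helper: the paths are supplied by the caller, and both
    inputs are assumed to be sorted by position already.
    """
    with open(path_1) as input_1, open(path_2) as input_2, \
            open(out_path, "w") as output:
        # heapq.merge is lazy: it keeps only one pending line per input,
        # so memory use stays small regardless of file size.
        output.writelines(heapq.merge(input_1, input_2, key=position))
```

The key argument to heapq.merge needs Python 3.5 or later. On equal
positions the line from the first file is written first.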



