[Tutor] Merging Text Files

Wed Oct 13 23:14:59 CEST 2010

On 10/13/2010 1:16 PM Ara Kooser said...
> Hello all,
>
>    I am working on merging two text files with fields separated by commas.
> The files are in this format:
>
> File ONE:
> *Species, Protein ID, E value, Length*
> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
>
> File TWO:
> *Protein ID, Locus Tag, Start/Stop*
> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
>
> I looked around for other posts about merging text files and I have this
> program:
> one = open("final.txt",'r')
> two = open("final_gen.txt",'r')
>
> merge = open("merged.txt",'w')
> merge.write("Species,  Locus_Tag,  E_value,  Length, Start/Stop\n")
>
> for line in one:
>       print(line.rstrip() + two.readline().strip())
>       merge.write(str([line.rstrip() + two.readline().strip()]))
>       merge.write("\n")
> merge.close()
>
> inc = file("merged.txt","r")
> outc = open("final_merge.txt","w")
> for line in inc:
>      line = line.replace('[','')
>      line = line.replace(']','')
>      line = line.replace('{','')
>      line = line.replace('}','')
>      outc.write(line)
>
> inc.close()
> outc.close()
> one.close()
> two.close()
>
> This does merge the files.
> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140,
> 5256,ZP_05482482, StAA4_010100030484,
> complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138,
> 5256,ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
>
> But file one has multiple instances of the same Protein ID such as
> ZP_05482482. So the data doesn't line up anymore.  I would like the program
> to search for each Protein ID number and write the entry from file 2 in each
> place and then move on to the next ID number.
>
> Example of desired output:
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>
> I was thinking about writing the text files into a dictionary and then
> searching for each ID and then insert the content from file TWO into where
> the IDs match. But I am not sure how to start. Is there a more pythony way
> to go about doing this?
>

I would read in file two and build a dict keyed on the Protein IDs, then
make a pass over file one, break out the Protein ID from each line, and
write the concatenated result out.  Something like:

[pyseudocode]

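PIDs = {}
for proteinVals in FileTwo:
   # the Protein ID is the first comma-separated field in file two
   ID = proteinVals.split(",")[0].strip()
   PIDs[ID] = proteinVals.strip()

for eachline in FileOne:
   # the Protein ID is the second comma-separated field in file one
   ID = eachline.split(",")[1].strip()
   rslt = "%s,%s\n" % (eachline.strip(), PIDs[ID])
   outfile.write(rslt)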
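
For what it's worth, here is a slightly fuller sketch of the same dict-lookup idea that also puts the fields in the order shown in your desired output. The function names (build_lookup, merge_lines) are made up for illustration, and I am assuming that header or blank lines can simply be skipped and that a file-one line whose Protein ID is missing from file two should be dropped rather than raise an error. Adjust to taste.

```python
def build_lookup(two_lines):
    """Map Protein ID -> (locus tag, start/stop) from file-two lines."""
    lookup = {}
    for line in two_lines:
        # Split on the first two commas only, so a start/stop field
        # that itself contains commas stays in one piece.
        fields = [f.strip() for f in line.split(",", 2)]
        if len(fields) < 3:
            continue  # blank or malformed line
        lookup[fields[0]] = (fields[1], fields[2])
    return lookup


def merge_lines(one_lines, two_lines):
    """Return file-one lines with the matching file-two fields merged in."""
    lookup = build_lookup(two_lines)
    merged = []
    for line in one_lines:
        fields = [f.strip() for f in line.split(",")]
        # Assumption: skip headers, blanks, and IDs not found in file two.
        if len(fields) < 4 or fields[1] not in lookup:
            continue
        species, pid, evalue, length = fields[:4]
        locus_tag, start_stop = lookup[pid]
        merged.append(", ".join(
            [species, pid, locus_tag, evalue, length, start_stop]))
    return merged


if __name__ == "__main__":
    # Sample rows copied from the original post.
    one = [
        "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,",
        "Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,",
    ]
    two = [
        "ZP_05482482, StAA4_010100030484, "
        "complement(NZ_ACEV01000078.1:25146..40916)",
    ]
    for row in merge_lines(one, two):
        print(row)
```

Both functions only iterate over lines, so passing open file objects, e.g. merge_lines(open("final.txt"), open("final_gen.txt")), should work as well as passing lists. Because the lookup is keyed on the Protein ID, repeated IDs in file one each pick up the same file-two entry, which is what kept the line-by-line merge from lining up.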

HTH,

Emile