[Tutor] Merging Text Files
Emile van Sebille
emile at fenx.com
Wed Oct 13 23:14:59 CEST 2010
On 10/13/2010 1:16 PM Ara Kooser said...
> Hello all,
>
> I am working on merging two text files with fields separated by commas.
> The files are in this format:
>
> File ONE:
> *Species, Protein ID, E value, Length*
> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
>
> File TWO:
> *Protein ID, Locus Tag, Start/Stop*
> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
>
> I looked around for other posts about merging text files and I have this
> program:
> one = open("final.txt",'r')
> two = open("final_gen.txt",'r')
>
> merge = open("merged.txt",'w')
> merge.write("Species, Locus_Tag, E_value, Length, Start/Stop\n")
>
> for line in one:
>     print(line.rstrip() + two.readline().strip())
>     merge.write(str([line.rstrip() + two.readline().strip()]))
>     merge.write("\n")
> merge.close()
>
> inc = file("merged.txt","r")
> outc = open("final_merge.txt","w")
> for line in inc:
>     line = line.replace('[','')
>     line = line.replace(']','')
>     line = line.replace('{','')
>     line = line.replace('}','')
>     outc.write(line)
>
> inc.close()
> outc.close()
> one.close()
> two.close()
>
> This does merge the files.
> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140,
> 5256,ZP_05482482, StAA4_010100030484,
> complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138,
> 5256,ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
>
> But file one has multiple instances of the same Protein ID such as
> ZP_05482482. So the data doesn't line up anymore. I would like the program
> to search for each Protein ID number and write the entry from file 2 in each
> place and then move on to the next ID number.
>
> Example of desired output:
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>
> I was thinking about writing the text files into a dictionary and then
> searching for each ID and then insert the content from file TWO into where
> the IDs match. But I am not sure how to start. Is there a more pythony way
> to go about doing this?
>
I would read in file two and build a dict keyed on the Protein IDs, then
pass over file one, break out each line's Protein ID, and write the
concatenated result out. Something like:
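[pyseudocode]
PIDs = {}
for proteinVals in FileTwo:
    # key each file TWO line on its Protein ID (first comma-separated field)
    ID = proteinVals.split(',')[0].strip()
    PIDs[ID] = proteinVals.strip()
for eachline in FileOne:
    # the Protein ID is the second field of a file ONE line
    ID = eachline.split(',')[1].strip()
    # file ONE lines end in a comma; drop it so the join adds only one
    rslt = "%s,%s\n" % (eachline.strip().rstrip(','), PIDs[ID])
    outfile.write(rslt)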
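
For what it's worth, the same idea can be fleshed out into something runnable. This is only a sketch: the helper names merge_records and merge_files are made up for illustration, the column order follows the desired output in the question, and it assumes that headers, blank lines, and rows whose Protein ID is missing from file two should simply be skipped.

```python
def merge_records(one_lines, two_lines):
    """Join file ONE rows to file TWO rows on Protein ID (hypothetical helper)."""
    # Build a lookup: Protein ID -> (locus tag, start/stop) from file TWO.
    by_id = {}
    for line in two_lines:
        # Split at most twice so any commas inside Start/Stop stay intact.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) == 3:
            by_id[parts[0]] = (parts[1], parts[2])

    merged = []
    for line in one_lines:
        # File ONE rows end with a trailing comma; drop it before splitting.
        fields = [f.strip() for f in line.strip().rstrip(",").split(",")]
        if len(fields) != 4 or fields[1] not in by_id:
            continue  # assumption: skip headers, blanks, and unmatched IDs
        species, pid, e_value, length = fields
        locus, location = by_id[pid]
        merged.append(", ".join([species, pid, locus, e_value, length, location]))
    return merged


def merge_files(one_path, two_path, out_path):
    """Read both inputs and write the merged rows; paths come from the caller."""
    with open(one_path) as one, open(two_path) as two:
        rows = merge_records(one, two)
    with open(out_path, "w") as out:
        out.write("Species, Protein ID, Locus Tag, E value, Length, Start/Stop\n")
        for row in rows:
            out.write(row + "\n")
```

With the file names from the question, this would be called as merge_files("final.txt", "final_gen.txt", "final_merge.txt"). Because every file one row looks its ID up in the dict, repeated Protein IDs each get the matching file two entry, regardless of row order.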
HTH,
Emile