[Tutor] Merging Text Files

Matt Williams mhw at doctors.net.uk
Wed Oct 13 22:56:39 CEST 2010


Dear Ara,

I have been working on something similar.

In the end I used a dictionary for each line in the file, and kept the 
dictionaries from each file in a separate collection. I then matched 
them using one (or more) elements from each dictionary. This is really 
very close to doing a join in a database, though, and with more time 
that route would be worth exploring (csv -> sqlite, then manipulate 
using SQLObject, SQLAlchemy, Django, etc.).
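
As a rough sketch of the database route, the following uses only the 
standard-library csv and sqlite3 modules rather than an ORM. The table 
names (one, two), the column names, the in-memory database and the 
read_rows helper are assumptions made for illustration; the sample rows 
are the ones from Ara's message.

```python
import csv
import io
import sqlite3

# Sample rows copied from the original message; real code would read the files.
FILE_ONE = """\
Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
"""
FILE_TWO = """\
ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
"""


def read_rows(text, width):
    """Parse comma-separated text, keeping only the first `width` fields.

    Slicing drops the empty field produced by the trailing comma in file ONE.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [tuple(field.strip() for field in row[:width]) for row in reader if row]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: every column is stored as text to keep the values as written.
conn.execute("CREATE TABLE one (species TEXT, protein_id TEXT, e_value TEXT, length TEXT)")
conn.execute("CREATE TABLE two (protein_id TEXT, locus_tag TEXT, start_stop TEXT)")
conn.executemany("INSERT INTO one VALUES (?, ?, ?, ?)", read_rows(FILE_ONE, 4))
conn.executemany("INSERT INTO two VALUES (?, ?, ?)", read_rows(FILE_TWO, 3))

# An inner join repeats the file TWO entry for every matching row of file ONE.
joined = conn.execute(
    "SELECT one.species, one.protein_id, two.locus_tag, "
    "one.e_value, one.length, two.start_stop "
    "FROM one JOIN two ON one.protein_id = two.protein_id "
    "ORDER BY one.rowid"
).fetchall()
conn.close()

for row in joined:
    print(", ".join(row))
```

Note that an inner join silently drops rows of file ONE whose Protein ID 
does not appear in file TWO; a LEFT JOIN would keep them with empty 
values instead.
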

The csv module has some good facilities for reading and writing csv 
files. However, as yet I don't think it, or csvutilities, lets you do 
the sort of merging you describe.
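
A minimal sketch of the dictionary approach, with the csv module doing 
the parsing and a plain dict doing the matching. The function name 
merge_on_protein_id and the choice to leave the extra fields empty for 
unmatched IDs are assumptions for illustration, not something the csv 
module provides; the sample rows are the ones from Ara's message.

```python
import csv
import io

# Sample rows copied from the original message.
FILE_ONE = """\
Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
"""
FILE_TWO = """\
ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
"""


def merge_on_protein_id(one_lines, two_lines):
    """Join the rows of file ONE with file TWO on the Protein ID column."""
    # Build a lookup table: Protein ID -> (Locus Tag, Start/Stop).
    lookup = {}
    for row in csv.reader(two_lines, skipinitialspace=True):
        if row:
            protein_id, locus_tag, start_stop = [f.strip() for f in row[:3]]
            lookup[protein_id] = (locus_tag, start_stop)

    merged = []
    for row in csv.reader(one_lines, skipinitialspace=True):
        if not row:
            continue
        # row[:4] ignores the empty field left by the trailing comma.
        species, protein_id, e_value, length = [f.strip() for f in row[:4]]
        # IDs missing from file TWO get empty fields instead of a KeyError.
        locus_tag, start_stop = lookup.get(protein_id, ("", ""))
        merged.append([species, protein_id, locus_tag, e_value, length, start_stop])
    return merged


merged_rows = merge_on_protein_id(io.StringIO(FILE_ONE), io.StringIO(FILE_TWO))
for row in merged_rows:
    print(", ".join(row))
```

With real files, the two io.StringIO objects could be replaced by file 
objects such as open("final.txt", newline="") and 
open("final_gen.txt", newline=""), skipping the header line of each 
first; csv.reader accepts any iterable of lines.
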

HTH,

Matt

Robert Jackiewicz wrote:
> On Wed, 13 Oct 2010 14:16:21 -0600, Ara Kooser wrote:
>
>   
>> Hello all,
>>
>>   I am working on merging two text files with fields separated by
>> commas. The files are in this format:
>>
>> File ONE:
>> *Species, Protein ID, E value, Length*
>> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
>> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
>> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
>> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
>>
>> File TWO:
>> *Protein ID, Locus Tag, Start/Stop*
>> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
>> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
>>
>> I looked around for other posts about merging text files and I have this
>> program:
>> one = open("final.txt",'r')
>> two = open("final_gen.txt",'r')
>>
>> merge = open("merged.txt",'w')
>> merge.write("Species,  Locus_Tag,  E_value,  Length, Start/Stop\n")
>>
>> for line in one:
>>      # Read file two once per iteration; calling readline() twice
>>      # would skip every other line of file two.
>>      merged = line.rstrip() + two.readline().strip()
>>      print(merged)
>>      merge.write(str([merged]))
>>      merge.write("\n")
>> merge.close()
>>
>> inc = open("merged.txt","r")
>> outc = open("final_merge.txt","w")
>> for line in inc:
>>     line = line.replace('[','')
>>     line = line.replace(']','')
>>     line = line.replace('{','')
>>     line = line.replace('}','')
>>     outc.write(line)
>>
>> inc.close()
>> outc.close()
>> one.close()
>> two.close()
>>
>> This does merge the files.
>> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
>> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
>>
>> But file one has multiple instances of the same Protein ID such as
>> ZP_05482482. So the data doesn't line up anymore.  I would like the
>> program to search for each Protein ID number and write the entry from
>> file 2 in each place and then move on to the next ID number.
>>
>> Example of desired output:
>> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>>
>> I was thinking about reading the text files into a dictionary, then
>> searching for each ID and inserting the content from file TWO
>> wherever the IDs match. But I am not sure how to start. Is there a more
>> pythony way to go about doing this?
>>
>> Thank you for your time and help.
>>
>> Regards,
>> Ara
>>     
>
> Why don't you try using the csv module, which is part of the Python 
> standard library, to parse your files? It allows simple and efficient 
> manipulation of comma-separated value files.
>
> -Rob Jackiewicz
>


