[Tutor] Merging Text Files
Ara Kooser
ghashsnaga at gmail.com
Thu Oct 14 16:48:19 CEST 2010
Morning all,
I took the pseudocode that Emile provided and tried to write a Python
program. I may have taken the pseudocode too literally.
So what I wrote was this:
xml = open("final.txt",'r')
gen = open("final_gen.txt",'r')
PIDS = {}
for proteinVals in gen:
    ID = proteinVals.split()[0]
    PIDS[ID] = proteinVals
print PIDS
for line in xml:
    ID = proteinVals.split()[1]
    rslt = "%s,%s"% (line,PIDS[ID])
    print rslt
The first part I understand. It reads in gen, a text file with this format:
*Protein ID, Locus Tag, Start/Stop*
ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
...
Each line is put into a dictionary whose key is the Protein ID, the field at
position 0 of the line.
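A minimal sketch of that dictionary-building step, assuming the fields are
separated by commas as in the sample above. The name build_pids and the
choice to store a (locus tag, start/stop) tuple are illustrative assumptions,
not part of Emile's pseudocode. Splitting on "," at most twice keeps any
commas inside the start/stop field intact.

```python
# Sample lines copied from the final_gen.txt excerpt above.
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)",
    "ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047",
]


def build_pids(lines):
    """Map each Protein ID to a (locus tag, start/stop) tuple.

    build_pids is a hypothetical helper name used only for this sketch.
    """
    pids = {}
    for raw in lines:
        # Split on the first two commas only, then trim surrounding whitespace.
        fields = [f.strip() for f in raw.split(",", 2)]
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        protein_id, locus_tag, location = fields
        pids[protein_id] = (locus_tag, location)
    return pids


pids = build_pids(gen_lines)
print(pids["ZP_07281899"])
```

Because the keys are stripped of commas and whitespace, a lookup such as
pids["ZP_05482482"] works with the bare Protein ID.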
The second part reads in the file xml which has this format:
*Species, Protein ID, E Value, Length*
Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
Streptomyces sp. AA4, ZP_07281899, 8.2369599999999995e-138, 5260,
....
*The same protein ID appears in multiple entries.
The program splits each line and does something with position 1, which is
the protein ID in the xml file. After that I am not really sure what is
happening. I can't remember what the %s means. Something with a string?
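For reference, %s is a placeholder in Python's printf-style string
formatting: each %s is replaced, in order, by the string form of the
corresponding value on the right of the % operator. A small sketch, using
values from the sample data (the variable names are only illustrative):

```python
protein_id = "ZP_05482482"
locus_tag = "StAA4_010100030484"

# Each %s is replaced, in order, by str() of the matching value in the tuple.
rslt = "%s,%s" % (protein_id, locus_tag)
print(rslt)  # ZP_05482482,StAA4_010100030484

# Non-string values are converted with str() as well.
length_note = "length=%s" % 5256
print(length_note)  # length=5256
```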
When this runs I get the following error:
Traceback (most recent call last):
File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
rslt = "%s,%s"% (line,PIDS[ID])
KeyError: 'StAA4_010100017400,'
From what I can tell it's not happy about the dictionary key.
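One way to see where a key like 'StAA4_010100017400,' could come from is to
look at what split() returns for lines in these formats. The sketch below
uses one line of each kind from this message. With no argument, split()
breaks on whitespace, so the commas stay attached to the tokens and the
spaces inside the species name shift the positions.

```python
gen_line = "ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983"
xml_line = "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,"

# split() with no argument splits on whitespace, so commas stay attached.
gen_tokens = gen_line.split()
print(gen_tokens[0])  # ZP_05479896,
print(gen_tokens[1])  # StAA4_010100017400,

# The species name contains spaces, so index 1 is not the Protein ID.
xml_tokens = xml_line.split()
print(xml_tokens[1])  # sp.

# Splitting on commas and stripping whitespace gives clean fields.
xml_fields = [f.strip() for f in xml_line.split(",")]
print(xml_fields[1])  # ZP_05482482
```

The dictionary keys built with split()[0] therefore end in a comma, and
neither 'StAA4_010100017400,' nor 'sp.' would match any of them.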
In the end I am looking for a way to merge these two files and for each
protein ID add the locus tag and start/stop like this:
*Species, Protein ID, Locus Tag, E Value, Length, Start/Stop*
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256, complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.9253900000000001e-140, 5260, complement(NZ_GG657746.1:6565974..6581756)
Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.2369599999999995e-138, 5260, complement(NZ_GG657746.1:6565974..6581756)
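A rough sketch of one way such a merge might be written, not a definitive
solution. It assumes both files are comma-separated as in the samples, that
neither file contains a header line, that the species name contains no
commas, and that each xml line ends with a trailing comma. The function name
merge and the choice to leave the locus tag and start/stop blank for an
unknown Protein ID are assumptions made for illustration.

```python
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)",
]
xml_lines = [
    "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,",
    "Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,",
]


def merge(xml_lines, gen_lines):
    """Return merged lines: species, ID, locus tag, E value, length, start/stop."""
    # First pass: Protein ID -> (locus tag, start/stop).
    pids = {}
    for raw in gen_lines:
        fields = [f.strip() for f in raw.split(",", 2)]
        if len(fields) == 3:
            pids[fields[0]] = (fields[1], fields[2])

    # Second pass: look up each xml line's own Protein ID.
    merged = []
    for raw in xml_lines:
        # Drop the newline and trailing comma before splitting into fields.
        fields = [f.strip() for f in raw.strip().rstrip(",").split(",")]
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        species, protein_id, e_value, length = fields
        # Assumption: unknown IDs get empty locus tag and start/stop fields.
        locus_tag, location = pids.get(protein_id, ("", ""))
        merged.append(", ".join(
            [species, protein_id, locus_tag, e_value, length, location]))
    return merged


merged = merge(xml_lines, gen_lines)
for row in merged:
    print(row)
```

Since file objects iterate over their lines, the open files for final.txt
and final_gen.txt could be passed to merge in place of the sample lists.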
Do you have any suggestions for how to proceed? It feels like I am getting
closer. :)
Note:
When I change the index in this part of the code to 0:
for line in xml:
    ID = proteinVals.split()[0]
    rslt = "%s,%s"% (line,PIDS[ID])
    print rslt
I get the following output:
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
Which seems closer but all it's doing is repeating the same Locus Tag and
Start/Stop for each entry.
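A possible explanation for the repetition, sketched below: after a for loop
finishes, its loop variable still holds the last item, so proteinVals in the
second loop is always the final line read from gen, whatever line is. The
sample lists stand in for the two files, and it is an assumption here that
the ZP_05479896 entry is the last line of final_gen.txt.

```python
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)",
    "ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983",
]
xml_lines = [
    "Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,",
]

for proteinVals in gen_lines:
    pass  # after this loop, proteinVals still holds the last gen line

stale_ids = []
line_ids = []
for line in xml_lines:
    # Uses the leftover variable: the same value on every iteration.
    stale_ids.append(proteinVals.split()[0])
    # Uses the current xml line: the ID changes with each line.
    line_ids.append(line.split(",")[1].strip())

print(stale_ids)
print(line_ids)
```

Taking the ID from line rather than proteinVals is what lets each xml entry
find its own record.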
Thank you!
Ara
--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub
cardine glacialis ursae.