[Tutor] Merging Text Files

Ara Kooser ghashsnaga at gmail.com
Thu Oct 14 16:48:19 CEST 2010


Morning all,

  I took the pseudocode that Emile provided and tried to write a Python
program. I may have taken the pseudocode too literally.

So what I wrote was this:
xml = open("final.txt",'r')
gen = open("final_gen.txt",'r')

PIDS = {}
for proteinVals in gen:
    ID = proteinVals.split()[0]
    PIDS[ID] = proteinVals

print PIDS

for line in xml:
    ID = proteinVals.split()[1]
    rslt = "%s,%s"% (line,PIDS[ID])
    print rslt

So the first part I get. It reads in gen, a text file with this format:
*Protein ID, Locus Tag, Start/Stop*
ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
...
Each line then goes into a dictionary, keyed by the Protein ID at position 0
of the split line.
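
A small sketch of that dictionary step, using the sample lines above rather than the real file. One thing it shows: split() with no argument splits on whitespace, so the key at position 0 keeps its trailing comma. Splitting on "," and stripping each field is one possible alternative; the name build_pids is made up for this illustration, and the sketch assumes no field contains a comma of its own.

```python
def build_pids(gen_lines):
    """Map each Protein ID (without the comma) to its remaining fields."""
    pids = {}
    for protein_vals in gen_lines:
        # Split on commas, then trim the spaces and newline around each field.
        fields = [field.strip() for field in protein_vals.split(",")]
        pids[fields[0]] = fields[1:]
    return pids


# Sample lines copied from the gen format shown above.
sample = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n",
]

# Whitespace splitting leaves the comma attached to the key.
print(sample[0].split()[0])               # ZP_05482482,

# Comma splitting gives a clean key and clean values.
print(build_pids(sample)["ZP_07281899"])
```

With the whitespace version, a lookup for ZP_05482482 (no comma) would not find anything, because the stored key is 'ZP_05482482,'.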

The second part reads in the file xml which has this format:
*Species, Protein ID, E Value, Length*
Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
Streptomyces sp. AA4, ZP_07281899, 8.2369599999999995e-138, 5260,
....
*the same protein ID appears in multiple entries

The program splits the line and does something with position 1, which is
the protein ID in the xml file. After that I am not really sure what is
happening. I can't remember what the %s means. Something with a string?
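
For reference, a short sketch of what %s does. In the % formatting operator, each %s is a placeholder that is replaced by the string form of the matching value in the tuple on the right. The values below are sample data from this message, not output of the real program.

```python
line = "Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,"
locus = "StAA4_010100030484"

# Each %s is replaced, in order, by the values in the tuple.
rslt = "%s,%s" % (line, locus)
print(rslt)

# Non-string values are converted with str() first.
length_text = "Length: %s" % 5256
print(length_text)                        # Length: 5256
```

A line read from a file still ends with a newline character, so formatting it this way puts the comma and the second value on the next line unless the line is stripped first.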

When this runs I get the following error:
Traceback (most recent call last):
  File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
    rslt = "%s,%s"% (line,PIDS[ID])
KeyError: 'StAA4_010100017400,'

From what I can tell, it's not happy about the dictionary key.
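
One possible reading of this error, sketched with sample data. The second loop splits proteinVals rather than line, and proteinVals still holds the last line read from gen. The sketch assumes that last line is the ZP_05479896 entry that shows up in the output in the Note further down; position 1 of that line is the locus tag with its comma, which matches the key in the KeyError.

```python
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    # Assumed to be the last line of final_gen.txt.
    "ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983\n",
]
xml_lines = ["Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,\n"]

PIDS = {}
for proteinVals in gen_lines:
    PIDS[proteinVals.split()[0]] = proteinVals

missing = []
for line in xml_lines:
    # As in the program above: proteinVals is split here, not line.
    ID = proteinVals.split()[1]
    if ID not in PIDS:
        missing.append(ID)
print(missing)                            # ['StAA4_010100017400,']

# Splitting line on whitespace would not help either, because the
# species name itself contains spaces.
whitespace_id = xml_lines[0].split()[1]
print(whitespace_id)                      # sp.

# Splitting on commas and stripping gives the protein ID.
comma_id = xml_lines[0].split(",")[1].strip()
print(comma_id)                           # ZP_05482482
```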

In the end I am looking for a way to merge these two files and for each
protein ID add the locus tag and start/stop like this:
*Species, Protein ID, Locus Tag, E Value, Length*, *Start/Stop*
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256,
complement(NZ_ACEV01000078.1:25146..40916)
Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.9253900000000001e-140,
5260, complement(NZ_GG657746.1:6565974..6581756)
Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.2369599999999995e-138,
5260, complement(NZ_GG657746.1:6565974..6581756)
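
A minimal sketch of one way the merge described above could be done, under some assumptions: fields are separated by commas, the species name and IDs contain no commas, and every protein ID in the xml file also appears in gen (otherwise the lookup raises KeyError). The function names build_lookup and merge_line are hypothetical, and the print calls are written so they also run under Python 2.

```python
def build_lookup(gen_lines):
    """Map Protein ID -> (locus tag, start/stop) from gen-style lines."""
    lookup = {}
    for row in gen_lines:
        # Split at the first two commas only, so any comma inside the
        # start/stop field stays in that field.
        parts = [part.strip() for part in row.split(",", 2)]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        protein_id, locus_tag, start_stop = parts
        lookup[protein_id] = (locus_tag, start_stop)
    return lookup


def merge_line(line, lookup):
    """Return one merged line in the order:
    Species, Protein ID, Locus Tag, E Value, Length, Start/Stop."""
    fields = [field.strip() for field in line.split(",")]
    species, protein_id, e_value, length = fields[:4]
    locus_tag, start_stop = lookup[protein_id]
    return ", ".join([species, protein_id, locus_tag,
                      e_value, length, start_stop])


# Sample data copied from the two formats shown in this message.
gen_sample = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n",
]
xml_sample = [
    "Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,\n",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,\n",
]

lookup = build_lookup(gen_sample)
merged = [merge_line(line, lookup) for line in xml_sample]
for row in merged:
    print(row)
```

With the real files, the sample lists could be replaced by the open file objects for final_gen.txt and final.txt, since iterating over a file yields one line at a time.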

Do you have any suggestions for how to proceed? It feels like I am getting
closer. :)


Note:
When I change the index in this part of the code to 0:
for line in xml:
    ID = proteinVals.split()[0]
    rslt = "%s,%s"% (line,PIDS[ID])
    print rslt

I get the following output:
Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983

Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983

Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983

This seems closer, but all it's doing is repeating the same Locus Tag and
Start/Stop for each entry.

Thank you!
Ara


-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub
cardine glacialis ursae.