[Tutor] Merging Text Files
Adam Lucas
ademlookes at gmail.com
Thu Oct 14 20:43:42 CEST 2010
Whoops:
1) dictionary.has_key() ???
2) I don't know if it's a typo or an oversight, but there's a comma in your
dictionary key. Split on the comma instead, e.g. line.split(',')[0].
3) Forget the database if it's part of a larger workflow unless your job is
to adapt a biological workflow database for your lab.
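
For what it's worth, here is a minimal sketch of the whole merge with the key
fixed. It assumes both files are comma-separated exactly as in your samples.
The name merge_records and the in-memory sample lists are made up for
illustration; with real files you would pass the open file objects instead.
Rows whose protein ID has no match are simply skipped, which is my assumption
about what you want. It uses "in" rather than has_key().

```python
def merge_records(gen_lines, xml_lines):
    """Join each xml row with the locus tag and start/stop from gen."""
    pids = {}
    for line in gen_lines:
        # split on commas (at most twice) and trim the whitespace
        fields = [f.strip() for f in line.split(',', 2)]
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # key on the protein ID in column 0
        pids[fields[0]] = (fields[1], fields[2])

    merged = []
    for line in xml_lines:
        # rows end with a trailing comma, so drop the empty last field
        fields = [f.strip() for f in line.split(',') if f.strip()]
        if len(fields) < 4:
            continue
        species, pid, evalue, length = fields[:4]
        if pid in pids:  # 'in' instead of dictionary.has_key()
            locus, start_stop = pids[pid]
            merged.append(', '.join(
                [species, pid, locus, evalue, length, start_stop]))
    return merged


# Hypothetical sample rows copied from the message below.
gen_sample = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n",
]
xml_sample = [
    "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,\n",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,\n",
]

for row in merge_records(gen_sample, xml_sample):
    print(row)
```

Since the dictionary is built once from gen, every xml row with the same
protein ID picks up the same locus tag and start/stop.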
On Thu, Oct 14, 2010 at 09:48, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Morning all,
>
> I took the pseudocode that Emile provided and tried to write a Python
> program. I may have taken the pseudocode too literally.
>
> So what I wrote was this:
> xml = open("final.txt",'r')
> gen = open("final_gen.txt",'r')
>
> PIDS = {}
> for proteinVals in gen:
>
>     ID = proteinVals.split()[0]
>     PIDS[ID] = proteinVals
>
> print PIDS
>
> for line in xml:
>     ID = proteinVals.split()[1]
>     rslt = "%s,%s"% (line,PIDS[ID])
>     print rslt
>
> So the first part I get. I read in gen that has this format as a text file:
>
> *Protein ID, Locus Tag, Start/Stop*
> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
> ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
> ...
> I put that into a dictionary keyed on the Protein ID, which is at position 0
> of each line.
>
> The second part reads in the file xml which has this format:
>
> *Species, Protein ID, E Value, Length*
> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
> Streptomyces sp. AA4, ZP_07281899, 8.2369599999999995e-138, 5260,
> ....
> *same protein id multiple entries
>
> The program splits each line and does something with position 1, which is
> the protein ID in the xml file. After that I am not really sure what is
> happening. I can't remember what the %s means. Something with a string?
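
On the %s question: it is a placeholder in Python's string formatting. Each %s
is replaced, in order, by the string form of the matching item in the tuple
after the % operator. A small sketch with made-up values (the variable extra
is only for illustration):

```python
line = "Streptomyces sp. AA4, ZP_05482482"
extra = "StAA4_010100030484"

# the first %s takes line, the second takes extra
rslt = "%s,%s" % (line, extra)
print(rslt)
```

So your rslt line just glues the xml line and the dictionary value together
with a comma in between.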
>
> When this runs I get the following error:
> Traceback (most recent call last):
>   File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
>     rslt = "%s,%s"% (line,PIDS[ID])
> KeyError: 'StAA4_010100017400,'
>
> From what I can tell it's not happy about the dictionary key.
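
The trailing comma in that key comes from split() with no argument, which
splits on whitespace only and leaves the commas attached to each field. A
quick sketch using one row from your gen file (the names by_space and
by_comma are made up):

```python
row = "ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983"

by_space = row.split()                           # whitespace only
by_comma = [f.strip() for f in row.split(',')]   # commas, then trim

print(repr(by_space[1]))  # comma still attached
print(repr(by_comma[1]))  # clean field
```

The keys stored in PIDS have the same problem, so both the dictionary build
and the lookup need the comma split.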
>
> In the end I am looking for a way to merge these two files and for each
> protein ID add the locus tag and start/stop like this:
> *Species, Protein ID, Locus Tag, E Value, Length, Start/Stop*
>
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
> 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256,
> complement(NZ_ACEV01000078.1:25146..40916)
> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.9253900000000001e-140,
> 5260, complement(NZ_GG657746.1:6565974..6581756)
> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.2369599999999995e-138,
> 5260, complement(NZ_GG657746.1:6565974..6581756)
>
> Do you have any suggestions for how to proceed? It feels like I am getting
> closer. :)
>
>
> Note:
> When I change this part of the code to 0
> for line in xml:
>     ID = proteinVals.split()[0]
>     rslt = "%s,%s"% (line,PIDS[ID])
>     print rslt
>
> I get the following output:
> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>
>
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>
>
> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>
> Which seems closer but all it's doing is repeating the same Locus Tag and
> Start/Stop for each entry.
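
The repetition happens because the second loop still reads proteinVals, which
is left over from the first loop and stays bound to the last line of gen. The
ID has to come from line instead. A toy sketch with made-up rows (A1, B2 and
the tags are placeholders, not real data):

```python
gen = ["A1, tagA, 1..2", "B2, tagB, 3..4"]
xml = ["sp, A1, 1e-5, 10,", "sp, B2, 1e-9, 20,"]

for proteinVals in gen:
    pass
# after the loop, proteinVals is still the last gen line

stale, fresh = [], []
for line in xml:
    stale.append(proteinVals.split(',')[0])   # same ID every time
    fresh.append(line.split(',')[1].strip())  # ID from the current xml line

print(stale)
print(fresh)
```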
>
> Thank you!
>
> Ara
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
--
Data is not information, information is not knowledge, knowledge is not
understanding, understanding is not wisdom.
--Clifford Stoll