[Tutor] Merging Text Files

Thu Oct 14 22:12:53 CEST 2010

I sent both emails and may have confused things:

1. PIDS.has_key(ID) returns True or False. You need to make sure the
dictionary has the key before you fetch it; PIDS[NotAKey] raises a KeyError.
2. line.split() splits at and removes whitespace, leaving the commas attached.
line.split(",") splits at and removes commas, leaving the whitespace.
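
A minimal sketch of both points, using one line copied from the final_gen.txt sample quoted below. It uses the "in" operator, which performs the same membership test as has_key() and is the only spelling available in Python 3. The names by_space, by_comma and fields are introduced here for illustration only.

```python
# One line copied from the final_gen.txt sample in the quoted message.
line = "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n"

# split() with no argument breaks on whitespace, so the commas stay attached:
# by_space[0] is 'ZP_07281899,' (note the trailing comma).
by_space = line.split()

# split(",") breaks on commas, so the whitespace stays attached instead:
# by_comma[1] is ' SSMG_05939' (note the leading space).
by_comma = line.split(",")

# Stripping each piece gives clean field values.
fields = [field.strip() for field in by_comma]

# Key the dictionary on the clean Protein ID.
PIDS = {fields[0]: fields}

# Test for the key before fetching it, so a missing ID cannot raise KeyError.
# "%s" is a placeholder that is filled with str() of the matching value.
for ID in ("ZP_07281899", "ZP_07281899,"):
    if ID in PIDS:
        print("%s -> %s" % (ID, PIDS[ID][1]))
    else:
        print("%s is not a key" % ID)
```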

On Thu, Oct 14, 2010 at 13:43, Adam Lucas <ademlookes at gmail.com> wrote:

> Whoops:
>
> 1) dictionary.has_key() ???
> 2) I don't know if it's a typo or an oversight, but there's a comma in your
> dictionary key; try line.split(',')[0].
> 3) Forget the database if it's part of a larger workflow unless your job is
> to adapt a biological workflow database for your lab.
>
>
>
> On Thu, Oct 14, 2010 at 09:48, Ara Kooser <ghashsnaga at gmail.com> wrote:
>
>> Morning all,
>>
>>   I took the pseudocode that Emile provided and tried to write a Python
>> program. I may have taken the pseudocode too literally.
>>
>> So what I wrote was this:
>> xml = open("final.txt",'r')
>> gen = open("final_gen.txt",'r')
>>
>> PIDS = {}
>> for proteinVals in gen:
>>
>>     ID = proteinVals.split()[0]
>>     PIDS[ID] = proteinVals
>>
>> print PIDS
>>
>> for line in xml:
>>     ID = proteinVals.split()[1]
>>     rslt = "%s,%s"% (line,PIDS[ID])
>>     print rslt
>>
>> So the first part I get. I read in gen, a text file that has this
>> format:
>>
>> *Protein ID, Locus Tag, Start/Stop*
>> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
>> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
>> ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
>> ...
>> The loop puts each line into a dictionary whose key is the Protein ID,
>> which is at position 0 of the split line.
>>
>> The second part reads in the file xml which has this format:
>>
>> *Species, Protein ID, E Value, Length*
>> Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
>> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
>> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
>> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
>> Streptomyces sp. AA4, ZP_07281899, 8.2369599999999995e-138, 5260,
>> ....
>> *the same Protein ID appears in multiple entries
>>
>> The program splits each line and does something with position 1, which
>> is the Protein ID in the xml file. After that I am not really sure what is
>> happening. I can't remember what the %s means. Something with a string?
>>
>> When this runs I get the following error:
>> Traceback (most recent call last):
>>   File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
>>     rslt = "%s,%s"% (line,PIDS[ID])
>> KeyError: 'StAA4_010100017400,'
>>
>> From what I can tell it's not happy about the dictionary key.
>>
>> In the end I am looking for a way to merge these two files and for each
>> protein ID add the locus tag and start/stop like this:
>> *Species, Protein ID, Locus Tag, E Value, Length, Start/Stop*
>>
>> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
>> 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484,
>> 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
>> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256,
>> complement(NZ_ACEV01000078.1:25146..40916)
>> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.9253900000000001e-140,
>> 5260, complement(NZ_GG657746.1:6565974..6581756)
>> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.2369599999999995e-138,
>> 5260, complement(NZ_GG657746.1:6565974..6581756)
>>
>> Do you have any suggestions for how to proceed? It feels like I am getting
>> closer. :)
>>
>>
>> Note:
>> When I change the index in this part of the code to 0:
>> for line in xml:
>>     ID = proteinVals.split()[0]
>>     rslt = "%s,%s"% (line,PIDS[ID])
>>     print rslt
>>
>> I get the following output:
>> Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
>> ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>>
>>
>> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
>>  ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>>
>>
>> Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
>> ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
>>
>> Which seems closer but all it's doing is repeating the same Locus Tag and
>> Start/Stop for each entry.
>>
>> Thank you!
>>
>> Ara
>>
>>
>> --
>> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
>> sub cardine glacialis ursae.
>>
>>
>
>
> --
> Data is not information, information is not knowledge, knowledge is not
> understanding, understanding is not wisdom.
> --Clifford Stoll
>
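
Putting both points together: in the posted code the second loop calls proteinVals.split() rather than line.split(), so ID is always taken from the last line of final_gen.txt, which would explain both the KeyError and the repeated Locus Tag. Below is a rough sketch of the merge, assuming one record per line in each file. The names merge_lines and parse_xml_line and the in-memory sample lists are stand-ins of mine for the real files, and the ZP_00000000 row is made up to show the missing-key path.

```python
def parse_xml_line(line):
    """Split on commas, strip whitespace, drop the empty piece after the trailing comma."""
    return [field.strip() for field in line.split(",") if field.strip()]


def merge_lines(xml_lines, gen_lines):
    """Return rows of: Species, Protein ID, Locus Tag, E Value, Length, Start/Stop."""
    PIDS = {}
    for proteinVals in gen_lines:
        # Split at most twice so a comma inside Start/Stop would stay intact.
        parts = [part.strip() for part in proteinVals.split(",", 2)]
        if len(parts) == 3:
            # key: Protein ID; value: (Locus Tag, Start/Stop)
            PIDS[parts[0]] = (parts[1], parts[2])

    merged = []
    for line in xml_lines:
        fields = parse_xml_line(line)  # split the current line, not proteinVals
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        species, ID, evalue, length = fields
        if ID not in PIDS:
            continue  # no matching record, so skip it instead of raising KeyError
        locus, start_stop = PIDS[ID]
        merged.append(", ".join([species, ID, locus, evalue, length, start_stop]))
    return merged


# Sample records copied from the quoted message.
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n",
]
xml_lines = [
    "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,\n",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,\n",
    # Made-up row (not from the real data) to exercise the missing-key path.
    "Streptomyces sp. AA4, ZP_00000000, 1.0e-10, 100,\n",
]

for row in merge_lines(xml_lines, gen_lines):
    print(row)
```

With the real files, the open file objects can be passed in place of the lists, for example merge_lines(open("final.txt"), open("final_gen.txt")), since iterating over a file yields one line at a time.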
