[Baypiggies] string to list question

Glen Jarvis glen at glenjarvis.com
Fri Aug 6 07:44:01 CEST 2010


Vikram,

    Thank you for this. I really appreciate it. I didn't catch that this was
SNP data. And I do see SNP data represented as, for example, 'C/G'.

    I see the IUPAC extended genetic alphabet used when we are uncertain in
DNA modeling. For example, if I were to see 'ACGRC', I would know that this
means either 'ACGGC' or 'ACGAC', since R has this built-in meaning
according to this extended genetic alphabet:

Symbol   Meaning
G        G
A        A
T        T
C        C
R        G or A
Y        T or C
M        A or C
K        G or T
S        G or C
W        A or T
H        A, C or T
B        G, T or C
V        G, C or A
D        G, A or T
N        any of the four bases
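
    As a minimal sketch of how this table can be used in Python (the names
IUPAC and expand are just illustrative, not from any library), each symbol
maps to the bases it may stand for, and an ambiguous string expands to every
concrete sequence it could represent:

```python
from itertools import product

# IUPAC nucleotide ambiguity codes, transcribed from the table above.
# The dictionary name and layout are illustrative assumptions.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "ACGT",
}


def expand(seq):
    """Return every unambiguous sequence that `seq` could stand for."""
    # One string of candidate bases per position; a KeyError here means
    # the input contains a symbol that is not in the table.
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand("ACGRC"))  # ['ACGGC', 'ACGAC']
```

Note that the number of expansions grows multiplicatively with each
ambiguous position, so this is only practical for short strings such as
single codons.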

    With that said, I don't see SNP data represented this way. I don't yet
understand why not. I don't see why this notation can be used for the data
that I see in FASTA files, but not for SNPs.

    From a computer science perspective, it makes much more sense to me to
store this in a form that is already tokenized and easy to parse (that is,
each letter in the string is already a single token).

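    Using your previous example, you want to represent 'A/G' as a single
character. That is 'R' as defined by the IUPAC extended genetic alphabet.
Thus, the string 'ARG' stands for either 'AAG' or 'AGG'. Likewise, your
'AT' followed by 'C/G' becomes 'ATS', which covers both ATC and ATG; ATG maps
to codon AUG (mRNA) and thus to a methionine residue.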
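
    To make that concrete, here is a rough sketch that handles the slash
notation from your examples, assuming every alternative is written as exactly
two single bases separated by '/' (the helper names tokenize, codons and
to_iupac are my own, made up for illustration). It treats 'C/G' as one token,
groups tokens into codons, and can rewrite the string with IUPAC symbols:

```python
import re

# Unordered pair of alternative bases -> IUPAC symbol (two-base codes only).
PAIR_TO_IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
}

# One token is a base, optionally followed by "/" and an alternative base.
# Characters that do not match are silently skipped by findall().
TOKEN = re.compile(r"[ACGT](?:/[ACGT])?")


def tokenize(seq):
    """Split a sequence into tokens, keeping 'X/Y' together as one token."""
    return TOKEN.findall(seq)


def codons(seq):
    """Group the tokens of `seq` into codons of three tokens each."""
    toks = tokenize(seq)
    return ["".join(toks[i:i + 3]) for i in range(0, len(toks), 3)]


def to_iupac(seq):
    """Rewrite slash notation using one IUPAC symbol per position."""
    out = []
    for tok in tokenize(seq):
        if "/" in tok:
            a, b = tok.split("/")
            # 'A/A' is not really ambiguous, so keep the plain base.
            out.append(a if a == b else PAIR_TO_IUPAC[frozenset((a, b))])
        else:
            out.append(tok)
    return "".join(out)


z = 'ATC/GACTGAGC/TAG'
print(tokenize('AT/CG'))  # ['A', 'T/C', 'G']
print(codons(z))          # ['ATC/G', 'ACT', 'GAG', 'C/TAG']
print(to_iupac(z))        # ATSACTGAGYAG
```

Once the string is in the one-letter form, plain slicing in steps of three
is enough to get the codons, which is the main reason I prefer it.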

    For what it's worth, I work at a phylogenomics lab at UC Berkeley and we
deal with proteins instead of DNA. So, I may be missing something. Our
mapping is similar (for example, our amino acid symbol X for any amino acid,
or B for either asparagine or aspartic acid). I totally don't get how to read
SNPs, so, I admit, I could be missing something big.

    I'm very curious why the IUPAC extended genetic alphabet is not
applicable to the relevant portions of SNPs. Do you know why this is the
case? Could you explain what I'm missing on why this couldn't be represented
this way?


Cheers,


Glen

On Thu, Aug 5, 2010 at 9:50 PM, Vikram K <kpguy1975 at gmail.com> wrote:

> Hi Glen,
> thanks for your response. I am afraid i did not present the problem with
> clarity in my original query. The more generalized query is what if:
>
> z = 'ATC/GACTGAGC/TAG'
>
> and  i want
> zlist = ['ATC/G','ACT','GAG','C/TAG']
>
> The biology behind this is not quite as you have understood it. Here is the
> problem for you and others interested (i am simplifying this as much as i can
> since i don't know your biological background):
> 'C/G' and 'C/T' are SNPs (single nucleotide polymorphisms, which can be
> thought of simply as 'change') in a particular genome being studied when
> compared to the NCBI reference genome. A specific nucleotide (say 'A') is
> being represented by two alternative nucleotides (say 'A/G') in the genome
> being investigated. The alternative nucleotides could occur because at that
> position there is a difference in the coding and complementary DNA strands
> (think of this as a difference between the paternal and maternal DNA strands
> at that position).
>
> When i take the exon regions of a gene (that are making proteins)  in the
> genome being studied i need to break up the dna string corresponding to the
> exon region in groups of  three to get the codons and then find the
> corresponding amino acid sequence using the genetic code. In doing this i
> want something like 'A/G' to be taken as a single character. ['AT/CG'] will
> then correspond to two alternative amino acids, corresponding to ATC and
> ATG. [ATG (DNA) corresponds to AUG (mRNA).]
>
> On Thu, Aug 5, 2010 at 9:31 PM, Glen Jarvis <glen at glenjarvis.com> wrote:
>
>> Vikram,
>>
>>     I recognize this domain in many of the questions that have been asked.
>> There are several times where I've thought, "That *so* isn't the most ideal
>> 'Computer Science' way to do something." But, I also recognize that,
>> especially in the Biological world, we have no control how we receive the
>> data and thus, we still have to solve problems like those reviewed.
>>
>>    So, I normally don't challenge the base assumption in the question
>> because I know from experience, we don't always get the most ideal inputs to
>> work with. HOWEVER, I do want to challenge this one because I know there's a
>> standard way that this is represented in the Biological community without
>> using three characters for a single base. I recognize your original question
>> of z = 'AT/CG' to mean, in biological terms, that:
>>
>> "Zee equals the string of three nucleotide bases. The first base is
>> Adenine. The second base is either Thymine or Cytosine. The third base is
>> Guanine."
>>
>> There's a *much* better (and commonly accepted) way to represent this.
>>
>> The way this is traditionally represented is with the extended
>> genetic alphabet (
>> http://www.hrbc-genomics.net/training/bcd/Curric/PrwAli/node7.html). In
>> this case, the middle base would be represented by the letter Y as that
>> means either Thymine or Cytosine.
>>
>> I feel it's much better to represent this as:
>>
>> z = 'AYG'
>>
>> Then, the string will work without any extra manipulations. I would
>> always work with the alphabet and not put the three character string back in
>> as this alphabet is defined and accepted in the community. However, if one
>> wanted to, they could still later represent this in a 'lookup dictionary'
>> such as follows, if the output ever needed to be in the format in question.
>>
>> lookup = {'R': 'G/A',
>>           'Y': 'T/C',
>>           'M': 'A/C',....}
>>
>> Cheers,
>>
>>
>> Glen
>>
>>
>> On Wed, Aug 4, 2010 at 9:37 PM, Vikram K <kpguy1975 at gmail.com> wrote:
>>
>>> Suppose i have this string:
>>> z = 'AT/CG'
>>>
>>> How do i get this list:
>>>
>>> zlist = ['A','T/C','G']
>>>
>>>
>>> _______________________________________________
>>> Baypiggies mailing list
>>> Baypiggies at python.org
>>> To change your subscription options or unsubscribe:
>>> http://mail.python.org/mailman/listinfo/baypiggies
>>>
>>
>>
>>
>> --
>> Whatever you can do or imagine, begin it;
>> boldness has beauty, magic, and power in it.
>>
>> -- Goethe
>>
>
>

