[Baypiggies] string to list question

Vikram K kpguy1975 at gmail.com
Fri Aug 6 08:48:31 CEST 2010


Because if you represent a SNP with a single character you would not be able
to distinguish between homozygous and heterozygous SNPs.

On Fri, Aug 6, 2010 at 11:14 AM, Glen Jarvis <glen at glenjarvis.com> wrote:

> Vikram,
>
>     Thank you for this. I really appreciate it. I didn't catch that this
> was SNP data. And, I do see SNP data represented as, for example, 'C/G'.
>
>     I see the IUPAC extended genetic alphabet used when we are uncertain in
> DNA modeling. For example, if I were to see 'ACGRC', I would know that this
> means either 'ACGGC' or 'ACGAC', since R has this built-in meaning
> according to this extended genetic alphabet:
>
> Symbol   Meaning
> G        G
> A        A
> T        T
> C        C
> R        G or A
> Y        T or C
> M        A or C
> K        G or T
> S        G or C
> W        A or T
> H        A, C or T
> B        G, T or C
> V        G, C or A
> D        G, A or T
> N        any of the four bases
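
As an aside on the table above, here is a minimal Python sketch of how such ambiguity codes could be expanded into the concrete sequences they stand for. The names IUPAC and expand are made up for illustration; the mapping is simply copied from the table, and nothing here comes from an existing library.

```python
from itertools import product

# Ambiguity codes copied from the table above (hypothetical helper names).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "ACGT",
}

def expand(seq):
    """Return every concrete sequence that an ambiguous sequence can mean."""
    choices = (IUPAC[base] for base in seq)
    return ["".join(combo) for combo in product(*choices)]

print(expand("ACGRC"))  # ['ACGGC', 'ACGAC']
```

The number of results multiplies with every ambiguous position, so this is only reasonable for short stretches such as a single codon.
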
>
>     With that said, I don't see SNP data represented this way. I don't yet
> understand why not. I don't see how this symbology can be used for data that
> I see in FASTA files, but not in SNPs.
>
>     From a computer science perspective, it makes much more sense to me to
> store this in an 'already tokenized form where the tokens are easy to parse'
> (that is, each letter in a string representing a token already).
>
>     Using your previous example, you want to represent 'A/G' as a single
> character. That is 'R' as defined by the IUPAC extended genetic alphabet.
> Similarly, 'C/G' is 'S', so the codon 'ATC/G' becomes 'ATS': either ATC or
> ATG, the latter mapping to codon AUG (mRNA) and thus to a Methionine residue.
>
>     For what it's worth, I work at a phylogenomic lab at UC Berkeley and we
> deal with proteins instead of DNA. So, I may be missing something. Our
> mapping (our amino acid symbol of X for any amino acid, or B for
> either Asparagine or aspartic acid, for example) is similar. I totally don't
> get how to read SNPs, so, I admit, I could be missing something big.
>
>     I'm very curious why the IUPAC extended genetic alphabet is not
> applicable to the relevant portions of SNPs. Do you know why this is the
> case? Could you explain what I'm missing on why this couldn't be represented
> this way?
>
>
> Cheers,
>
>
> Glen
>
> On Thu, Aug 5, 2010 at 9:50 PM, Vikram K <kpguy1975 at gmail.com> wrote:
>
>> Hi Glen,
>> thanks for your response. I am afraid i did not present the problem with
>> clarity in my original query. The more generalized query is what if:
>>
>> z = 'ATC/GACTGAGC/TAG'
>>
>> and  i want
>> zlist = ['ATC/G','ACT','GAG','C/TAG']
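
One possible way to get that list is a regular expression that treats a base plus an optional '/X' alternative as one unit and then matches three such units at a time. This is only a sketch: split_codons is a made-up name, and it assumes uppercase A, C, G, T with at most two alternatives per position.

```python
import re

# One position: a base, optionally followed by "/" and an alternative base.
BASE = r"[ACGT](?:/[ACGT])?"
# One codon: exactly three positions.
CODON = re.compile(rf"(?:{BASE}){{3}}")

def split_codons(seq):
    """Split seq into codons, counting 'X/Y' as a single position."""
    codons = CODON.findall(seq)
    # findall skips anything it cannot match, so check nothing was dropped.
    if "".join(codons) != seq:
        raise ValueError(f"{seq!r} is not a whole number of codons")
    return codons

z = 'ATC/GACTGAGC/TAG'
zlist = split_codons(z)
print(zlist)  # ['ATC/G', 'ACT', 'GAG', 'C/TAG']
```

The length check matters because a trailing partial codon or an unexpected character would otherwise be silently ignored.
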
>>
>> The biology behind this is not quite what you have understood. Here is the
>> problem for you and others interested (I am simplifying this as much as I can
>> since I don't know your biological background):
>> 'C/G' and 'C/T' are SNPs (single nucleotide polymorphisms, which can be
>> thought of simply as 'change') in a particular genome being studied when
>> compared to the NCBI reference genome. A specific nucleotide (say 'A') is
>> being represented by two alternative nucleotides (say 'A/G') in the genome
>> being investigated. The alternative nucleotides could occur because at that
>> position there is a difference in the coding and complementary DNA strands
>> (think of this as a difference between the paternal and maternal DNA strands
>> at that position).
>>
>> When I take the exon regions of a gene (that are making proteins) in the
>> genome being studied, I need to break up the DNA string corresponding to the
>> exon region into groups of three to get the codons and then find the
>> corresponding amino acid sequence using the genetic code. In doing this I
>> want something like 'A/G' to be taken as a single character. ['ATC/G'] will
>> then correspond to two alternative amino acids corresponding to ATC and
>> ATG. [ATG (DNA) corresponds to AUG (mRNA).]
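
The translation step described here could look roughly like the sketch below. It assumes the standard genetic code, written with DNA letters (T instead of U) so no separate mRNA conversion is needed; CODON_TABLE and translate_alternatives are illustrative names, not part of any library.

```python
import re
from itertools import product

# Standard genetic code in the usual T, C, A, G ordering; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_alternatives(codon):
    """Map each concrete codon hidden in e.g. 'ATC/G' to its amino acid."""
    positions = re.findall(r"[ACGT](?:/[ACGT])?", codon)
    choices = [p.split("/") for p in positions]
    if len(choices) != 3:
        raise ValueError(f"{codon!r} does not have three positions")
    return {"".join(c): CODON_TABLE["".join(c)] for c in product(*choices)}

print(translate_alternatives("ATC/G"))  # {'ATC': 'I', 'ATG': 'M'}
```

So the codon 'ATC/G' gives isoleucine (I) for ATC and methionine (M) for ATG, using one-letter amino acid codes.
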
>>
>> On Thu, Aug 5, 2010 at 9:31 PM, Glen Jarvis <glen at glenjarvis.com> wrote:
>>
>>> Vikram,
>>>
>>>     I recognize this domain in many of the questions that have been
>>> asked. There are several times where I've thought, "That *so* isn't the most
>>> ideal 'Computer Science' way to do something." But, I also recognize that,
>>> especially in the Biological world, we have no control how we receive the
>>> data and thus, we still have to solve problems like those reviewed.
>>>
>>>    So, I normally don't challenge the base assumption in the question
>>> because I know from experience, we don't always get the most ideal inputs to
>>> work with. HOWEVER, I do want to challenge this one because I know there's a
>>> standard way that this is represented in the Biological community without
>>> using three characters for a single base. I recognize your original question
>>> of z = 'AT/CG' to mean, in biological terms, that:
>>>
>>> "Zee equals the string of three nucleotide bases. The first base is
>>> Adenine. The second base is either Thymine or Cytosine. The third base is
>>> Guanine."
>>>
>>> There's a *much* better (and commonly accepted) way to represent this.
>>>
>>> The way this is traditionally represented is with the extended
>>> genetic alphabet (
>>> http://www.hrbc-genomics.net/training/bcd/Curric/PrwAli/node7.html). In
>>> this case, the middle base would be represented by the letter Y as that
>>> means either Thymine or Cytosine.
>>>
>>> I feel it's much better to represent this as:
>>>
>>> z = 'AYG'
>>>
>>> Then, the string will work without any expected manipulations. I would
>>> always work with the alphabet and not put the three character string back in
>>> as this alphabet is defined and accepted in the community. However, if one
>>> wanted to, they could still later represent this in a 'lookup dictionary'
>>> such as follows if the output ever needed to be in the format in question.
>>>
>>> lookup = {'R': 'G/A',
>>>           'Y': 'T/C',
>>>           'M': 'A/C', ...}
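
To make the lookup idea concrete, here is a hedged sketch of converting in both directions between the slash form and the single-letter form. It covers only the six two-base codes; the function and dictionary names are invented for this example.

```python
import re

# Two-base codes only, following the extended genetic alphabet.
IUPAC_TO_PAIR = {"R": "G/A", "Y": "T/C", "M": "A/C",
                 "K": "G/T", "S": "G/C", "W": "A/T"}
# Reverse map keyed by an unordered pair, so 'C/T' and 'T/C' both give 'Y'.
PAIR_TO_IUPAC = {frozenset(pair.split("/")): code
                 for code, pair in IUPAC_TO_PAIR.items()}

def to_iupac(seq):
    """Replace every 'X/Y' in seq with its single-letter code."""
    def repl(match):
        a, b = match.groups()
        # 'A/A' is not ambiguous, so it collapses to the plain base.
        return a if a == b else PAIR_TO_IUPAC[frozenset((a, b))]
    return re.sub(r"([ACGT])/([ACGT])", repl, seq)

def from_iupac(seq):
    """Replace every two-base code in seq with its 'X/Y' form."""
    return "".join(IUPAC_TO_PAIR.get(base, base) for base in seq)

print(to_iupac("AT/CG"))  # AYG
print(from_iupac("AYG"))  # AT/CG
```

Note that the round trip is not exact: the order of the two bases around the slash is lost, and a same-base pair such as 'A/A' comes back as a plain 'A'.
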
>>>
>>> Cheers,
>>>
>>>
>>> Glen
>>>
>>>
>>> On Wed, Aug 4, 2010 at 9:37 PM, Vikram K <kpguy1975 at gmail.com> wrote:
>>>
>>>> Suppose i have this string:
>>>> z = 'AT/CG'
>>>>
>>>> How do i get this list:
>>>>
>>>> zlist = ['A','T/C','G']
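
Taken literally, this can be answered with a short regular expression. The sketch below assumes the string contains only uppercase A, C, G, T and single 'X/Y' alternatives; tokenize is just an illustrative name.

```python
import re

# One token is a base, optionally followed by "/" and an alternative base.
TOKEN = re.compile(r"[ACGT](?:/[ACGT])?")

def tokenize(seq):
    """Split a DNA string into bases, keeping 'X/Y' alternatives together."""
    tokens = TOKEN.findall(seq)
    # findall silently skips characters it cannot match, so verify coverage.
    if "".join(tokens) != seq:
        raise ValueError(f"unexpected characters in {seq!r}")
    return tokens

z = 'AT/CG'
zlist = tokenize(z)
print(zlist)  # ['A', 'T/C', 'G']
```
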
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Whatever you can do or imagine, begin it;
>>> boldness has beauty, magic, and power in it.
>>>
>>> -- Goethe
>>>
>>
>>
>
>