[Baypiggies] string to list question

Vikram K kpguy1975 at gmail.com
Fri Aug 6 09:11:29 CEST 2010

I thought a bit more about this and have a further comment to make. The
genome data files I am working with represent a SNP in the form A/T, A/G,
C/G, etc. It makes more sense *not* to convert a T/C to Y, because
eventually I really want the amino acid. In the end I would want AT/CG
(that is, A, then T/C, then G) to be expanded into ATG and ACG, and then
ATG and ACG converted into the two corresponding amino acids using the
genetic code.
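
A minimal sketch of that expansion, assuming the codon arrives as a list
with one entry per position and the SNP alternatives separated by '/'. The
function name expand_snp_codon and the two-entry CODON_TABLE are made up
for illustration; a real script would need all 64 codons.

```python
from itertools import product

# Partial codon table, just enough for this example (hypothetical name;
# the full genetic code has 64 entries).
CODON_TABLE = {"ATG": "Met", "ACG": "Thr"}


def expand_snp_codon(positions):
    """Expand e.g. ['A', 'T/C', 'G'] into every concrete codon it covers."""
    # Each position becomes a list of alternatives: 'T/C' -> ['T', 'C'].
    choices = [p.split("/") for p in positions]
    # The Cartesian product enumerates one base per position.
    return ["".join(combo) for combo in product(*choices)]


codons = expand_snp_codon(["A", "T/C", "G"])
print(codons)                            # ['ATG', 'ACG']
print([CODON_TABLE[c] for c in codons])  # ['Met', 'Thr']
```

Keeping the alleles as 'T/C' until this step means nothing has to be
decoded from a single-letter ambiguity code first.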

On Fri, Aug 6, 2010 at 12:18 PM, Vikram K <kpguy1975 at gmail.com> wrote:

> Because if you represent a SNP with a single character you would not be
> able to distinguish between homozygous and heterozygous SNPs.
> On Fri, Aug 6, 2010 at 11:14 AM, Glen Jarvis <glen at glenjarvis.com> wrote:
>> Vikram,
>>     Thank you for this. I really appreciate it. I didn't catch that this
>> was SNP data. And I do see SNP data represented as, for example, 'C/G'.
>>     I see the IUPAC extended genetic alphabet used when we are uncertain
>> in DNA modeling. For example, if I were to see 'ACGRC', I would know that
>> this means either 'ACGGC' or 'ACGAC', because R has this built-in meaning
>> according to the extended genetic alphabet:
>> Symbol   Meaning
>> G        G
>> A        A
>> T        T
>> C        C
>> R        G or A
>> Y        T or C
>> M        A or C
>> K        G or T
>> S        G or C
>> W        A or T
>> H        A, C or T
>> B        G, T or C
>> V        G, C or A
>> D        G, A or T
>> N        any of the four bases
>>     With that said, I don't see SNP data represented this way, and I don't
>> yet understand why not. I don't see why this notation can be used for the
>> data I see in FASTA files, but not for SNPs.
>>     From a computer science perspective, it makes much more sense to me to
>> store this in an 'already tokenized form where the tokens are easy to parse'
>> (that is, each letter in a string representing a token already).
>>     Using your previous example, you want to represent 'A/G' as a single
>> character. That is 'R' as defined by the IUPAC extended genetic alphabet.
>> Thus, the string 'ARG' stands for the codons AAG and AGG, and so for a
>> lysine or an arginine residue.
>>     For what it's worth, I work at a phylogenomic lab at UC Berkeley and
>> we deal with proteins instead of DNA. So, I may be missing something. Our
>> mapping (our amino acid symbol X for any amino acid, or B for
>> either asparagine or aspartic acid, for example) is similar. I totally don't
>> get how to read SNPs, so, I admit, I could be missing something big.
>>     I'm very curious why the IUPAC extended genetic alphabet is not
>> applicable to the relevant portions of SNPs. Do you know why this is the
>> case? Could you explain what I'm missing on why this couldn't be represented
>> this way?
>> Cheers,
>> Glen
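
For comparison, the IUPAC table quoted above can be written as a lookup
and used to expand an ambiguous sequence. This is only a sketch: the names
IUPAC and expand_iupac are made up for illustration, and the number of
results grows quickly when a sequence contains many ambiguity codes.

```python
from itertools import product

# Symbol -> possible bases, in the order given in the quoted table.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "ACGT",
}


def expand_iupac(seq):
    """Return every plain A/C/G/T sequence that an IUPAC string can stand for."""
    # One string of candidate bases per symbol, combined position by position.
    return ["".join(combo) for combo in product(*(IUPAC[s] for s in seq))]


print(expand_iupac("ACGRC"))  # ['ACGGC', 'ACGAC']
```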