[Baypiggies] string to list question
Vikram K
kpguy1975 at gmail.com
Fri Aug 6 09:11:29 CEST 2010
I thought a bit more about this and have a further comment to make. The
genome data files I am working with represent a SNP in the form A/T, A/G,
C/G, etc. It makes more sense *not* to convert a T/C to Y, because what I
eventually want is the amino acid: I want AT/CG to be expanded into ATG and
ACG, and then ATG and ACG converted into the two corresponding amino acids
using the genetic code.
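
A minimal sketch of that conversion, under some assumptions: each string holds exactly one SNP, the two alleles are the single bases immediately before and after the '/', and the standard genetic code applies. The names expand_snp and translate_snp are made up for illustration and do not come from any library.

```python
# Standard genetic code, with codons ordered by base in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon -> one-letter amino acid table ('*' is a stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def expand_snp(snp):
    """Expand a string such as 'AT/CG' into its two alleles, ['ATG', 'ACG'].

    Assumes exactly one '/' with a single base on each side of it.
    """
    slash = snp.index("/")  # raises ValueError if there is no '/'
    if slash == 0 or slash == len(snp) - 1:
        raise ValueError("expected a base on both sides of '/': %r" % snp)
    prefix = snp[:slash - 1]
    first_allele = snp[slash - 1]
    second_allele = snp[slash + 1]
    suffix = snp[slash + 2:]
    return [prefix + first_allele + suffix, prefix + second_allele + suffix]


def translate_snp(snp):
    """Return (codon, amino acid) pairs for both alleles of a codon-length SNP."""
    return [(codon, CODON_TABLE[codon]) for codon in expand_snp(snp)]


print(translate_snp("AT/CG"))  # [('ATG', 'M'), ('ACG', 'T')]
```

Because both alleles are kept as separate strings, a homozygous call written as 'AT/TG' would come back as two identical codons rather than being collapsed into one.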
On Fri, Aug 6, 2010 at 12:18 PM, Vikram K <kpguy1975 at gmail.com> wrote:
> Because if you represent a SNP with a single character you would not be
> able to distinguish between homozygous and heterozygous SNPs.
>
> On Fri, Aug 6, 2010 at 11:14 AM, Glen Jarvis <glen at glenjarvis.com> wrote:
>
>> Vikram,
>>
>> Thank you for this. I really appreciate it. I didn't catch that this
>> was SNP data. And I do see SNP data represented as, for example, 'C/G'.
>>
>> I see the IUPAC extended genetic alphabet used when we are uncertain
>> in DNA modeling. For example, if I were to see 'ACGRC', I would know that
>> this means either 'ACGGC' or 'ACGAC', since R has this built-in meaning
>> according to the extended genetic alphabet:
>>
>> Symbol  Meaning
>> G       G
>> A       A
>> T       T
>> C       C
>> R       G or A
>> Y       T or C
>> M       A or C
>> K       G or T
>> S       G or C
>> W       A or T
>> H       A, C or T
>> B       G, T or C
>> V       G, C or A
>> D       G, A or T
>> N       any of the four bases
>>
>> With that said, I don't see SNP data represented this way, and I don't
>> yet understand why not. I don't see why this notation can be used for the
>> data I see in FASTA files, but not for SNPs.
>>
>> From a computer science perspective, it makes much more sense to me to
>> store this in an 'already tokenized form where the tokens are easy to parse'
>> (that is, each letter in a string representing a token already).
>>
>> Using your previous example, you want to represent 'A/G' as a single
>> character. That is 'R' as defined by the IUPAC extended genetic alphabet.
>> Thus, the string 'ARG' stands for 'AAG' or 'AGG', that is, codons AAG and
>> AGG (mRNA), and thus for a lysine or an arginine residue.
>>
>> For what it's worth, I work at a phylogenomics lab at UC Berkeley and
>> we deal with proteins instead of DNA, so I may be missing something. Our
>> mapping is similar (for example, the amino acid symbol X for any amino
>> acid, or B for either asparagine or aspartic acid). I don't really
>> understand how to read SNPs, so I admit I could be missing something big.
>>
>> I'm very curious why the IUPAC extended genetic alphabet is not
>> applicable to the relevant portions of SNPs. Do you know why this is the
>> case? Could you explain what I'm missing about why SNPs couldn't be
>> represented this way?
>>
>>
>> Cheers,
>>
>>
>> Glen
>>
>>
>>
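
For reference, the IUPAC table quoted above can be turned into a small expansion routine. This is only a sketch: the function name expand_iupac is made up for illustration, and the option order for each symbol simply follows the order given in the quoted table.

```python
from itertools import product

# IUPAC extended genetic alphabet, as listed in the quoted table.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "ACGT",
}


def expand_iupac(seq):
    """Return every plain A/C/G/T sequence that an IUPAC-coded string can stand for."""
    choices = [IUPAC[symbol] for symbol in seq]  # KeyError on an unknown symbol
    return ["".join(bases) for bases in product(*choices)]


print(expand_iupac("ACGRC"))  # ['ACGGC', 'ACGAC']
```

Note that the number of expansions grows multiplicatively with each ambiguous symbol, so this is only practical for short strings such as single codons.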