I thought a bit more about this and have a further comment. The genome data files I am working with represent a SNP in the form A/T, A/G, C/G, etc. It makes more sense *not* to convert a T/C to Y, because what I ultimately want is the amino acid: I want AT/CG (that is, A, then T/C, then G) expanded into ATG and ACG, and then ATG and ACG converted into the two corresponding amino acids using the genetic code. <br>
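<br>A rough sketch of that expansion, assuming the codon has already been split into one entry per position (the list form and the names expand_alleles and CODON_TABLE are made up for illustration, and the standard genetic code is assumed):<br>

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the 64 codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def expand_alleles(positions):
    """Turn e.g. ["A", "T/C", "G"] into every concrete codon it can stand for."""
    choices = [p.split("/") for p in positions]
    return ["".join(combo) for combo in product(*choices)]

# Hypothetical input: the codon A, T/C, G from the example above.
for codon in expand_alleles(["A", "T/C", "G"]):
    print(codon, CODON_TABLE[codon])
```

For this input the loop visits ATG and ACG, which the table maps to M (methionine) and T (threonine).<br>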

<br><br><div class="gmail_quote">On Fri, Aug 6, 2010 at 12:18 PM, Vikram K <span dir="ltr">&lt;<a href="mailto:kpguy1975@gmail.com">kpguy1975@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">Because if you represent a SNP with a single character you would not be able to distinguish between homozygous and heterozygous SNPs. <br><br></div><div><div></div><div class="h5"><div class="gmail_quote">

On Fri, Aug 6, 2010 at 11:14 AM, Glen Jarvis <span dir="ltr">&lt;<a href="mailto:glen@glenjarvis.com" target="_blank">glen@glenjarvis.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Vikram,<div><br><div>    Thank you for this. I really appreciate it. I didn&#39;t catch that this was SNP data. And I do see SNP data represented as, for example, &#39;C/G&#39;.  </div>


<div><br></div><div>    I see the IUPAC extended genetic alphabet used when we are uncertain in DNA modeling. For example, if I were to see &#39;ACGRC&#39;, I would know that this means either &#39;ACGGC&#39; or &#39;ACGAC&#39;, since R has this built-in meaning according to the extended genetic alphabet:</div>


<div><br></div><div><font face="&#39;courier new&#39;, monospace">Symbol   Meaning</font></div><div><font face="&#39;courier new&#39;, monospace">G        G</font></div><div>

<font face="&#39;courier new&#39;, monospace">A        A</font></div><div><font face="&#39;courier new&#39;, monospace">T        T</font></div><div><font face="&#39;courier new&#39;, monospace">C        C</font></div>

<div><font face="&#39;courier new&#39;, monospace">R        G or A</font></div><div><font face="&#39;courier new&#39;, monospace">Y        T or C</font></div><div><font face="&#39;courier new&#39;, monospace">M        A or C</font></div>


<div><font face="&#39;courier new&#39;, monospace">K        G or T</font></div><div><font face="&#39;courier new&#39;, monospace">S        G or C</font></div><div><font face="&#39;courier new&#39;, monospace">W        A or T</font></div>


<div><font face="&#39;courier new&#39;, monospace">H        A, C or T</font></div><div><font face="&#39;courier new&#39;, monospace">B        G, T or C</font></div><div><font face="&#39;courier new&#39;, monospace">V        G, C or A</font></div>


<div><font face="&#39;courier new&#39;, monospace">D        G, A or T</font></div><div><font face="&#39;courier new&#39;, monospace">N        any of the four bases</font></div>
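<div><br></div><div>    The table above can be written down as a small lookup, shown here only as a sketch (the names IUPAC and expand_iupac are illustrative, not from any particular library):</div>

```python
from itertools import product

# IUPAC extended alphabet: each symbol maps to the bases it may stand for,
# in the same order as the table above.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC",
    "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA",
    "D": "GAT", "N": "ACGT",
}

def expand_iupac(seq):
    """List every plain A/C/G/T sequence an IUPAC-coded sequence can mean."""
    return ["".join(p) for p in product(*(IUPAC[s] for s in seq))]

print(expand_iupac("ACGRC"))  # ['ACGGC', 'ACGAC']
```

<div>    Note that the number of expansions multiplies with every ambiguous symbol, so this is only practical for short stretches such as a single codon.</div>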

<div><br></div><div>    With that said, I don&#39;t see SNP data represented this way. I don&#39;t yet understand why not. I don&#39;t see why this symbology can be used for the data I see in FASTA files but not for SNPs. </div>


<div><br></div><div>    From a computer science perspective, it makes much more sense to me to store this in an &#39;already tokenized form where the tokens are easy to parse&#39; (that is, each letter in a string representing a token already).</div>


<div><br></div><div>    Using your previous example, you want to represent &#39;A/G&#39; as a single character. That is &#39;R&#39; as defined by the IUPAC extended genetic alphabet. Thus, the string &#39;ARG&#39; stands for either &#39;AAG&#39; or &#39;AGG&#39;, which map to the mRNA codons AAG and AGG and thus to a lysine or an arginine residue.</div>
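<div><br></div><div>    A minimal sketch of that &#39;A/G&#39; to &#39;R&#39; step, assuming the SNP arrives as a slash-separated string with one or two alleles (snp_to_iupac and IUPAC_CODE are hypothetical names, and collapsing &#39;A/A&#39; to &#39;A&#39; is an assumption of this sketch):</div>

```python
# Two-base IUPAC codes, keyed by the unordered pair of alleles.
IUPAC_CODE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
}

def snp_to_iupac(snp):
    """Collapse a slash-separated SNP such as 'A/G' into one IUPAC symbol."""
    alleles = frozenset(snp.upper().split("/"))
    if len(alleles) == 1:
        # Both alleles are the same base, so no ambiguity code is needed.
        return next(iter(alleles))
    return IUPAC_CODE[alleles]

print(snp_to_iupac("A/G"))  # R
print(snp_to_iupac("T/C"))  # Y
```

<div>    Keying on a frozenset makes &#39;A/G&#39; and &#39;G/A&#39; give the same symbol. SNPs with more than two alleles are not handled here and raise a KeyError.</div>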


<div><br></div><div>    For what it&#39;s worth, I work at a phylogenomic lab at UC Berkeley and we deal with proteins instead of DNA. So, I may be missing something. Our mapping (our amino acid symbol of X for any amino acid, or B for either Asparagine or aspartic acid, for example) is similar. I totally don&#39;t get how to read SNPs, so, I admit, I could be missing something big.</div>


<div><br></div><div>    I&#39;m very curious why the IUPAC extended genetic alphabet is not applicable to the relevant portions of SNPs. Do you know why this is the case? Could you explain what I&#39;m missing on why this couldn&#39;t be represented this way?</div>


<div><br></div><div><br></div><div>Cheers,</div><div><br></div><font color="#888888"><div><br></div><div>Glen</div></font><div><div></div><div><div>   </div><br></div></div></div></blockquote></div></div></div></blockquote>

</div><br>