[Tutor] Character counting again, was Re: Tutor Digest, Vol 121, Issue 56

Peter Otten __peter__ at web.de
Mon Mar 24 22:21:35 CET 2014


Jumana yousef wrote:

[Please don't reply to the digest. At the very least change the subject to 
its original text. Thank you.]

> just a reminder of my data:
> it cossets of multiple sequences of DNA that I need to count the 
bases(characters) and calculate the percentage of C+G and calculate the 
entropy.
> before each sequence there is a header or identifier (lets say ID)
> so it is like
> >ID 1…etc
> AAGGTAACCATATATACCGGG….etc (up to or even more than 3000 characters)
> >ID 2…etc
> AAATTTTTAAATTTTTTAAAATATATATACGCGCGCATGCCCCGGGGG….. etc
> … etc
> I need the out pu to be like this:
> > ID…1.. etc
> sequence length = a value
> G & G content: a value
> Entropy = a value
> > ID…2.. etc
> sequence length = a value
> G & G content: a value
> Entropy = a value
> ….etc
> 
> 
> I wrote a program close to what Denis suggested , however it works only if 
I have one sequence (one header and one sequence), I can not modify it to 
work if I have several sequences (like above). I also get an incorrect value 
for entropy (H) 
> 
> #!/usr/bin/python

If you put the following into a function, say show_stats(seq)

> print ' Sequence length : ', len(seq)
> counters = {}
> for char in seq:
>     char = char.strip()
>     if counters.has_key(char):
>         counters[char] += 1
>     else:
>         counters[char] = 1
> c_g = 100*(counters['C']+counters['G'])/len(seq)
> print ' The C & G content: ' '%.1f'%  c_g, '%'
> import math
> all = len(seq)
> Pa = (counters['A'])/all
> Pc = counters['C']/all
> Pg = counters['G']/all
> Pt = counters['T']/all
> 
> H =-1*(Pa*math.log(Pa,2) + Pc*math.log(Pc,2) + Pg*math.log(Pg,2) + 
Pt*math.log(Pt,2))
> 
> print ' H = ' , H

you can invoke that function in and after the while loop like so:

> seq = ''
> while True:
>     try:
>         line = raw_input()
>         index = line.find('>')
>         if index > -1:
              if seq:
                  show_stats(seq)
                  seq = ""
>             print line
>         else:
>             line = line.rstrip()
>             line = line.upper()
>             seq = seq + line
>     except:
>         break

if seq:
    show_stats()

> I do not know why Pa, Pc, Pg, Pt give me a value of 0, although when I 
type counters['A'] or counters['C']. counters[T'] , counters['G'] or all I 
get values > 0.

When you divide an integer by an integer Python 2 gives you an integer by 
default:

>>> 1/3
0

You can avoid that by converting at least one operand to float

>>> float(1)/3
0.3333333333333333
>>> 1/float(3)
0.3333333333333333

or by putting the following magic import at the beginning of every module 
where you want float or "true" division rather than integer division:

>>> from __future__ import division
>>> 1/3
0.3333333333333333

> So please how I can fix this calculations and how I modify this program to 
read each sequence, print the results then read the second one and print the 
results and so on..
> 
> Many thanks for your help and support.

 





More information about the Tutor mailing list