[Tutor] Character counting again, was Re: Tutor Digest, Vol 121, Issue 56
Peter Otten
__peter__ at web.de
Mon Mar 24 22:21:35 CET 2014
Jumana yousef wrote:
[Please don't reply to the digest. At the very least change the subject to
its original text. Thank you.]
> just a reminder of my data:
> it cossets of multiple sequences of DNA that I need to count the
bases(characters) and calculate the percentage of C+G and calculate the
entropy.
> before each sequence there is a header or identifier (lets say ID)
> so it is like
> >ID 1…etc
> AAGGTAACCATATATACCGGG….etc (up to or even more than 3000 characters)
> >ID 2…etc
> AAATTTTTAAATTTTTTAAAATATATATACGCGCGCATGCCCCGGGGG….. etc
> … etc
> I need the out pu to be like this:
> > ID…1.. etc
> sequence length = a value
> G & G content: a value
> Entropy = a value
> > ID…2.. etc
> sequence length = a value
> G & G content: a value
> Entropy = a value
> ….etc
>
>
> I wrote a program close to what Denis suggested , however it works only if
I have one sequence (one header and one sequence), I can not modify it to
work if I have several sequences (like above). I also get an incorrect value
for entropy (H)
>
> #!/usr/bin/python
If you put the following into a function, say show_stats(seq)
> print ' Sequence length : ', len(seq)
> counters = {}
> for char in seq:
> char = char.strip()
> if counters.has_key(char):
> counters[char] += 1
> else:
> counters[char] = 1
> c_g = 100*(counters['C']+counters['G'])/len(seq)
> print ' The C & G content: ' '%.1f'% c_g, '%'
> import math
> all = len(seq)
> Pa = (counters['A'])/all
> Pc = counters['C']/all
> Pg = counters['G']/all
> Pt = counters['T']/all
>
> H =-1*(Pa*math.log(Pa,2) + Pc*math.log(Pc,2) + Pg*math.log(Pg,2) +
Pt*math.log(Pt,2))
>
> print ' H = ' , H
you can invoke that function in and after the while loop like so:
> seq = ''
> while True:
> try:
> line = raw_input()
> index = line.find('>')
> if index > -1:
if seq:
show_stats(seq)
seq = ""
> print line
> else:
> line = line.rstrip()
> line = line.upper()
> seq = seq + line
> except:
> break
if seq:
show_stats()
> I do not know why Pa, Pc, Pg, Pt give me a value of 0, although when I
type counters['A'] or counters['C']. counters[T'] , counters['G'] or all I
get values > 0.
When you divide an integer by an integer Python 2 gives you an integer by
default:
>>> 1/3
0
You can avoid that by converting at least one operand to float
>>> float(1)/3
0.3333333333333333
>>> 1/float(3)
0.3333333333333333
or by putting the following magic import at the beginning of every module
where you want float or "true" division rather than integer division:
>>> from __future__ import division
>>> 1/3
0.3333333333333333
> So please how I can fix this calculations and how I modify this program to
read each sequence, print the results then read the second one and print the
results and so on..
>
> Many thanks for your help and support.
More information about the Tutor
mailing list