[Tutor] binning data and calculating means of classes
Wolfgang Maier
wolfgang.maier at biologie.uni-freiburg.de
Wed Jul 23 11:28:15 CEST 2014
On 07/23/2014 03:36 AM, LN A-go-go wrote:
>
> with your help. I have been working the last few days, I am sorry to
> say, unsuccessfully, to calculate the mean (that's easy), split the data
> into sub-groups or secondary means - which are the break values between
> 4 classes. Create data-sets with incursive values. I can do it with
> brute force (copy and paste) but need to rise to the pythonic way and
> use a while loop and a nested if-else structure. My attempts have been
> lame enough that I don't even want to put them here.
A while loop with an if inside is indeed a very plausible solution, so
it would be interesting to see your attempts.
> int_list
> [36, 39, 39, 45, 61, 54, 61, 93, 62, 51, 47, 72, 54, 36, 62, 50, 41, 41,
> 40, 62, 62, 58, 57, 54, 49, 43, 47, 50, 45, 41, 54, 57, 57, 55, 62, 51,
> 34, 57, 55, 63, 45, 45, 42, 44, 34, 53, 67, 58, 56, 43, 33]
> >>> int_list.sort()
> >>> int_list
> [33, 34, 34, 36, 36, 39, 39, 40, 41, 41, 41, 42, 43, 43, 44, 45, 45, 45,
> 45, 47, 47, 49, 50, 50, 51, 51, 53, 54, 54, 54, 54, 55, 55, 56, 57, 57,
> 57, 57, 58, 58, 61, 61, 62, 62, 62, 62, 62, 63, 67, 72, 93]
> >>> flo_list = [float(integral) for integral in int_list]
While this last line shows that you've started using list
comprehensions, which is a good thing, converting your data to floating
point is not a good idea. It is completely unnecessary and (though
probably not relevant here) can compromise the accuracy of calculations
due to inherent rounding errors.
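A small illustration of the rounding issue, using made-up values rather than your data (with your small whole numbers the effect would not show up at all): integers add up exactly, whereas floats can pick up tiny errors.

```python
# Illustrative values only, not taken from the data in this thread.
tenths = [0.1] * 10          # ten floats that "should" add up to 1.0
print(sum(tenths) == 1.0)    # False: each 0.1 is stored only approximately

ints = [1] * 10              # the same idea with integers
print(sum(ints) == 10)       # True: integer arithmetic is exact

# Very large integers cannot even be represented exactly as floats.
big = 2 ** 53 + 1
print(float(big) == big)     # False
```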
I guess you are doing this to prevent subsequent rounding of the result
of sum(int_list)/len(int_list).
This is a Python2-specific issue and, personally, I think that as a
beginner you should use Python3, where (among other things) this is not
a problem.
If you want to stick to Python2 for whatever reason, then do:
    from __future__ import division
after which dividing two integers with / always returns a float, just as
in Python3.
>>> sum(int_list)/len(int_list)
51.31372549019608
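To make the difference between the two kinds of division visible, here is a minimal sketch written for Python3, with made-up numbers. The // operator gives the rounded-down result that Python2 produces by default when both operands are integers:

```python
# Written for Python 3. In Python 2, the "/" below would behave like "//"
# for integers unless "from __future__ import division" comes first.
values = [1, 2, 4]  # made-up example data

true_mean = sum(values) / len(values)    # true division: keeps the fraction
floor_mean = sum(values) // len(values)  # floor division: drops it

print(true_mean)
print(floor_mean)
```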
> >>> flo_list
> [33.0, 34.0, 34.0, 36.0, 36.0, 39.0, 39.0, 40.0, 41.0, 41.0, 41.0, 42.0,
> 43.0, 43.0, 44.0, 45.0, 45.0, 45.0, 45.0, 47.0, 47.0, 49.0, 50.0, 50.0,
> 51.0, 51.0, 53.0, 54.0, 54.0, 54.0, 54.0, 55.0, 55.0, 56.0, 57.0, 57.0,
> 57.0, 57.0, 58.0, 58.0, 61.0, 61.0, 62.0, 62.0, 62.0, 62.0, 62.0, 63.0,
> 67.0, 72.0, 93.0]
> >>> sum(flo_list)
> 2617.0
> >>> totalnum = sum(flo_list)
Stop creating references (like totalnum) if you're not going to use them
later! It only confuses you and others.
> >>> len(flo_list)
> 51
> >>> mean = sum(flo_list)/len(flo_list)
> >>> mean
> 51.31372549019608
So, you know how to calculate the total mean. For the means of
subsamples, you apply that same logic to subsamples of the data, which
you first have to generate.
If you do not want to go through the list of values several times,
however, I cannot think of a simple implementation that does not
involve plenty of novel concepts.
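If going through the list several times is acceptable, nothing new is needed. The following is only a sketch for Python3, and it assumes that your "secondary means" are the means of the values below and above the overall mean. The helper name mean_of and the variables lower and upper are made up for this example:

```python
int_list = [33, 34, 34, 36, 36, 39, 39, 40, 41, 41, 41, 42, 43, 43, 44,
            45, 45, 45, 45, 47, 47, 49, 50, 50, 51, 51, 53, 54, 54, 54,
            54, 55, 55, 56, 57, 57, 57, 57, 58, 58, 61, 61, 62, 62, 62,
            62, 62, 63, 67, 72, 93]

def mean_of(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

mean = mean_of(int_list)

# Each comprehension walks the whole list once.
lower = [value for value in int_list if value < mean]
upper = [value for value in int_list if value >= mean]

# Assumed meaning of "secondary means": the means of the two halves.
breakpoints = [mean_of(lower), mean, mean_of(upper)]
print(breakpoints)
```

The three numbers in breakpoints would then separate the four classes.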
One fairly simple approach would be through a while loop as you
suggested, but as said before, for loops are often more elegant in
Python. I guess the following code is roughly what you had in mind?
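breakpoints = [your_list_of_breakpoints]  # in ascending order
large_value_buffer = []
int_list_iter = iter(int_list) # see comment below
for breakpoint in breakpoints:
    # start each bin with the value that ended the previous one
    sublist = large_value_buffer
    for value in int_list_iter:
        if value < breakpoint:
            sublist.append(value)
            if large_value_buffer:
                # sublist is still the buffer itself; give the
                # buffer a fresh list so the two are separate
                large_value_buffer = []
        else:
            if sublist:
                print(sum(sublist)/len(sublist))
            # a new list, so the values in sublist are not carried over
            large_value_buffer = [value]
            break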
You should know all elements of this small program except
iter(int_list). Essentially, this gives you a one-time iterator,
which cannot be reused or reset, to use in the inner for loop. This
prevents the inner loop from starting at the beginning of the list
every time.
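To see what a one-time iterator does, here is a tiny sketch with made-up numbers. Note that the value which triggers the break has already been taken from the iterator, so it is gone unless you store it somewhere:

```python
values = [1, 2, 3, 4, 5]     # made-up example data
values_iter = iter(values)   # one-time iterator over the list

small = []
for value in values_iter:
    if value >= 3:
        break                # 3 has been consumed by the iterator here
    small.append(value)

# A second loop over the same iterator continues where the first stopped.
rest = [value for value in values_iter]

print(small)   # [1, 2]
print(rest)    # [4, 5]
```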
Since this is probably too complicated for you to work out by
yourself at this stage, I decided to give you the complete code, but
make sure you understand what it does. In particular, think about what
the large_value_buffer is doing.
One problem with this code is that it does not handle empty bins
properly, and it never reports the values left over after the last
breakpoint. Maybe that's something for you to work on?
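If you get stuck, one possible way to make empty bins visible is sketched here for Python3, using bisect from the standard library rather than the loop above. The function name bin_means, the use of None for an empty class, and the example numbers are all just choices made for this illustration; the breakpoints have to be sorted in ascending order:

```python
from bisect import bisect_right

def bin_means(values, breakpoints):
    """Return one mean per class, or None for a class without values.

    Class 0 holds values below the first breakpoint, the last class
    holds values at or above the last breakpoint.
    """
    bins = [[] for _ in range(len(breakpoints) + 1)]
    for value in values:
        # number of breakpoints <= value gives the index of the class
        bins[bisect_right(breakpoints, value)].append(value)
    return [sum(b) / len(b) if b else None for b in bins]

print(bin_means([1, 2, 10, 11, 30], [5, 20, 25]))
# [1.5, 10.5, None, 30.0]
```

With this approach the values themselves do not need to be sorted, only the breakpoints.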
Best,
Wolfgang