[Tutor] binning data and calculating means of classes

Wed Jul 23 11:28:15 CEST 2014

On 07/23/2014 03:36 AM, LN A-go-go wrote:
 >
 > with your help.  I have been working the last few days, I am sorry to
 > say, unsuccessfully, to calculate the mean (that's easy), split the data
 > into sub-groups or secondary means - which are the break values between
 > 4 classes.  Create data-sets with incursive values.  I can do it with
 > brute force (copy and paste) but need to rise to the pythonic way and
 > use a while loop and a nested if-else structure.  My attempts have been
 > lame enough that I don't even want to put them here.

A while loop with an if inside is indeed a very plausible solution, so 
it would be interesting to see your attempts.

 > int_list
 > [36, 39, 39, 45, 61, 54, 61, 93, 62, 51, 47, 72, 54, 36, 62, 50, 41, 41,
 > 40, 62, 62, 58, 57, 54, 49, 43, 47, 50, 45, 41, 54, 57, 57, 55, 62, 51,
 > 34, 57, 55, 63, 45, 45, 42, 44, 34, 53, 67, 58, 56, 43, 33]
 >>>> int_list.sort()
 >>>> int_list
 > [33, 34, 34, 36, 36, 39, 39, 40, 41, 41, 41, 42, 43, 43, 44, 45, 45, 45,
 > 45, 47, 47, 49, 50, 50, 51, 51, 53, 54, 54, 54, 54, 55, 55, 56, 57, 57,
 > 57, 57, 58, 58, 61, 61, 62, 62, 62, 62, 62, 63, 67, 72, 93]
 >>>> flo_list = [float(integral) for integral in int_list]

While this last line shows that you've started using list 
comprehensions, which is a good thing, converting your data to floating 
point is not a good idea. It is completely unnecessary and (though 
probably not relevant here) can compromise the accuracy of calculations 
due to inherent rounding errors.
I guess you are doing this to prevent the result of 
sum(int_list)/len(int_list) from being truncated to an integer.
This is a Python 2-specific issue and, personally, I think that as a 
beginner you should use Python 3, where (among other things) this is not 
a problem.
If you want to stick to Python 2 for whatever reason, then do:

from __future__ import division

after which dividing two integers with / always returns a float, just 
as in Python 3.

 >>> sum(int_list)/len(int_list)
51.31372549019608
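
As a small illustration of the two points above, here is a sketch for 
Python 3. The short list and the name big are just examples I made up; 
they are not part of your data set.

```python
# Python 3: / is true division, // is floor division
values = [36, 39, 39, 45]          # the first few values of int_list
total = sum(values)                # 159, still an exact int
true_mean = total / len(values)    # 39.75
floor_mean = total // len(values)  # 39, what Python 2's / gives for ints

# floats cannot represent every large integer exactly
big = 2**53 + 1
lost_precision = float(big) != big  # True: the conversion changed the value
```

With numbers as small as yours the conversion to float is harmless, but 
it is still unnecessary.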

 >>>> flo_list
 > [33.0, 34.0, 34.0, 36.0, 36.0, 39.0, 39.0, 40.0, 41.0, 41.0, 41.0, 42.0,
 > 43.0, 43.0, 44.0, 45.0, 45.0, 45.0, 45.0, 47.0, 47.0, 49.0, 50.0, 50.0,
 > 51.0, 51.0, 53.0, 54.0, 54.0, 54.0, 54.0, 55.0, 55.0, 56.0, 57.0, 57.0,
 > 57.0, 57.0, 58.0, 58.0, 61.0, 61.0, 62.0, 62.0, 62.0, 62.0, 62.0, 63.0,
 > 67.0, 72.0, 93.0]
 >>>> sum(flo_list)
 > 2617.0
 >>>>  totalnum = sum(flo_list)

Stop generating references (like totalnum here) if you're not going to 
use them later! It confuses you and others.

 >>>> len(flo_list)
 > 51
 >>>> mean = sum(flo_list)/len(flo_list)
 >>>> mean
 > 51.31372549019608

So, you know how to calculate the total mean. For the means of 
subsamples, you have to apply that same logic to subsamples of the 
data, which you have to generate first.
However, I cannot think of any simple implementation that avoids going 
through the list of values several times and does not involve plenty of 
novel concepts.
One fairly simple approach would be a while loop, as you suggested, 
but, as said before, for loops are often more elegant in Python. I 
guess the following code is roughly what you had in mind?

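breakpoints = [your_list_of_breakpoints]  # placeholder: ascending break values
large_value_buffer = []  # first value that was too large for the previous bin
int_list_iter = iter(int_list)  # see comment below; int_list must be sorted
for breakpoint in breakpoints:
    if large_value_buffer and large_value_buffer[0] >= breakpoint:
        # the carried-over value is too large for this bin as well,
        # so this bin is empty
        continue
    # start the new bin with the value carried over from the previous one
    sublist = large_value_buffer
    large_value_buffer = []
    for value in int_list_iter:
        if value < breakpoint:
            sublist.append(value)
        else:
            # value belongs to a later bin: keep it for the next round
            large_value_buffer.append(value)
            break
    if sublist:
        print(sum(sublist)/len(sublist))
# note: values >= the last breakpoint are never reported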

You should know all elements of this small program except 
iter(int_list). Essentially, this gives you a one-time iterator, 
which cannot be reused or reset, to use in the inner for loop. Because 
the iterator remembers its position, the inner loop continues where it 
left off instead of starting from the beginning of the list every time.
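
To see this behaviour in isolation, here is a minimal sketch (Python 3) 
with a made-up list; the names numbers, first_part and rest are only 
for illustration.

```python
numbers = [1, 2, 3, 4, 5]
numbers_iter = iter(numbers)

first_part = []
for n in numbers_iter:
    if n >= 3:
        break              # note: the 3 has already been taken from the iterator
    first_part.append(n)

# a second loop over the same iterator continues after the 3
rest = list(numbers_iter)

print(first_part)  # [1, 2]
print(rest)        # [4, 5]
```

Note that the 3 appears in neither list: the value that triggers the 
break is used up, unless you store it somewhere.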

Since this is probably too complicated for you to work out by 
yourself at this stage, I decided to give you the complete code, but 
make sure you understand what it does. In particular, think about what 
the large_value_buffer is doing.

One problem with this code is that it silently skips empty bins. Maybe 
that's something for you to work on?
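
For comparison, here is a rough sketch (Python 3) of the simpler 
alternative that goes through the list of values once per bin. The 
function name bin_means, the convention that each breakpoint is the 
exclusive upper bound of its bin, and the values and breakpoints in the 
usage line are all assumptions of mine, chosen only for illustration. 
It does not need sorted input and reports an empty bin as None.

```python
def bin_means(values, breakpoints):
    """Return the mean of each bin, or None for an empty bin.

    breakpoints must be in ascending order; each one is the exclusive
    upper bound of its bin. Values >= the last breakpoint are ignored.
    """
    means = []
    lower = float('-inf')  # the first bin has no lower limit
    for upper in breakpoints:
        # one full pass over values per bin: simple, but not efficient
        sublist = [v for v in values if lower <= v < upper]
        if sublist:
            means.append(sum(sublist) / len(sublist))
        else:
            means.append(None)
        lower = upper
    return means

# hypothetical values and breakpoints, for illustration only
print(bin_means([33, 35, 41, 45, 62, 93], [40, 50, 60, 100]))
# prints [34.0, 43.0, None, 77.5]
```

For a list as short as yours the repeated passes do not matter; the 
single-pass version only pays off with much larger data.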

Best,
Wolfgang