[Numpy-discussion] Data standardizing

Wes McKinney wesmckinn at gmail.com
Wed Apr 13 21:39:49 EDT 2011


On Wed, Apr 13, 2011 at 9:50 AM, Jonathan Rocher <jrocher at enthought.com> wrote:
> Hi,
>
> I assume you have this data in a txt file, correct? You can load up all of
> it in a numpy array using
> import numpy as np
> data = np.loadtxt("climat_file.txt", skiprows = 1)
>
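> Then you can compute the mean you want by taking it on a slice of the data
> array. For example, if you want to compute the mean of your data in Jan for
> 1950-1970 (say including 1970), select the rows by the year column, since
> row indices are positions in the array, not years:
> in_period = (data[:,0] >= 1950) & (data[:,0] <= 1970)
> mean1950_1970 = data[in_period,1].mean()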
>
> Then the std deviation you want could be computed using
> my_std = np.sqrt(np.mean((data[:,1]-mean1950_1970)**2))
>
> Hope this helps,
> Jonathan
>
> On Tue, Apr 12, 2011 at 1:48 PM, Climate Research <climateforu at gmail.com>
> wrote:
>>
>> Hi,
>> I am completely new to Python and NumPy. I am using Python to do
>> statistical calculations on climate data.
>>
>> I have a data set in the following format:
>>
>> Year  Jan   Feb   Mar   Apr   ...   Dec
>> 1900  1000  1001  ...
>> 1901  1011  1012  ...
>> 1902  1009  1007  ...
>> ...
>> 2010  1008  1002  ...
>>
>> I want to standardize each of these values using the standard deviation
>> of the corresponding monthly column.
>> I have already found the standard deviation of each column, but now I need
>> the standard deviation computed about a prescribed mean value.
>> That is, when I compute the standard deviation for the January column,
>> the mean should be calculated only from the January data for, say,
>> 1950-1970. With this mean I want to calculate the SD for the entire column.
>> Any help will be appreciated.
>>
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
>
> --
> Jonathan Rocher, PhD
> Scientific software developer
> Enthought, Inc.
> jrocher at enthought.com
> 1-512-536-1057
> http://www.enthought.com
>

To standardize the data over each column you'll want to do:

(data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

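Note the broadcasting behavior of the (matrix - vector) operation:
data.mean(axis=0) has one entry per column and is subtracted from every
row. See the NumPy documentation on broadcasting for more details. The
ddof=1 makes std divide by N - 1 instead of N, which gives you the sample
standard deviation (based on the unbiased variance estimate).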
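
As a rough sketch of how this could fit the original question, the
snippet below standardizes the monthly columns and then repeats the
calculation with the mean taken over 1950-1970 only. Everything in it is
an assumption for illustration: the data is random numbers standing in
for the real file, column 0 is taken to be the year, and the names
(values, in_base, base_mean, sd_about_base) are made up here. The year
column is dropped first, since otherwise it would be standardized along
with the months.

```python
import numpy as np

# Hypothetical stand-in for the climate file: column 0 is the year,
# columns 1..12 are the monthly values (random numbers, not real data).
rng = np.random.default_rng(0)
years = np.arange(1900, 2011)
monthly = 1000 + 10 * rng.standard_normal((years.size, 12))
data = np.column_stack([years, monthly])

# Drop the year column before standardizing.
values = data[:, 1:]

# Full-record standardization: the length-12 mean and std vectors are
# broadcast against every row of the (111, 12) array.
z_full = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

# Baseline variant: the mean is taken over 1950-1970 (inclusive) only.
# Rows are selected by the year column, not by row position.
in_base = (data[:, 0] >= 1950) & (data[:, 0] <= 1970)
base_mean = values[in_base].mean(axis=0)

# Root-mean-square deviation of each whole column about the baseline mean
# (this divides by N, as in the formula quoted earlier in the thread).
sd_about_base = np.sqrt(np.mean((values - base_mean) ** 2, axis=0))

# Values standardized relative to the baseline mean.
z_base = (values - base_mean) / sd_about_base
```

Selecting the baseline rows with a boolean mask on the year column keeps
the code correct even if the file does not start at 1900 or has missing
years.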

<shameless plug>

If you're looking for data structures to carry around your metadata
(dates and month labels), look to pandas (my project:
http://pandas.sourceforge.net/) or larry
(http://larry.sourceforge.net/).

</shameless plug>

- Wes


