chi squared (X2) in Python

ts8807385 at gmail.com
Mon Feb 16 12:55:41 EST 2009


I was wondering if anyone has done this in Python. I wrote two
functions that do it (I think... see below), but I do not understand
how to interpret the results. As an experiment, I'm implementing ent
in Python. ent tests the randomness of files, and chi squared is
probably the best of its tests for this purpose.
Many of the statistical tests are easy (like Arithmetic Mean, etc.) and
I have no problems interpreting the results from those, but chi
squared has stumped me. Here are my two simple functions; run them if
you'd like to better understand the output:

import os
import os.path

def observed(f):

    # argument f is a filepath/filename
    #
    # Return a list of observed characters in decimal ord(char).
    # Decimal value of characters may be 0 through 255.
    # [43, 54, 0, 255, 4, etc.]

    chars = []
    #print f

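    # Read only the first 13312 bytes (13 KB) of the file.
    # The buffer is called data so it does not shadow the built-in bytes.
    fd = open(f, 'rb')
    data = fd.read(13312)
    fd.close()

    # ord() converts each one-character string to its 0-255 value
    for byte in data:
        chars.append(ord(byte))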

    #print chars

    if len(chars) != 13312:
        print "Wait... chars does not equal 13312 in observed!!!"
        return None
    else:
        return chars

def chi(char_list):

    # Expected frequency of characters. I arrived at this like so:
    # expected = number of observations/number of possibilities
    # 52 = 13312/256

    expected = 52.0

    print "observed\texpected\tx2"

    # 0 - 255
    for x in range(0,256):
        observed = 0
        for char in char_list:
            if x == char:
                observed +=1

        # The three chi squared calculations
        # one = observed - expected
        # two = one squared
        # x2 = two/expected

        # x2 = (observed - expected) squared
        #       ----------------------------
        #                expected

        one = observed - expected
        two = one * one
        x2 = two/expected

        print observed, "\t", expected, "\t", x2


chi(observed("filepath"))

The output looks similar to this:

observed	expected	x2
62 	52.0 	1.92307692308
46 	52.0 	0.692307692308
60 	52.0 	1.23076923077
68 	52.0 	4.92307692308
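
As a quick check of the arithmetic, the four rows above can be recomputed by hand from the formula in the comments. This is only a sketch: the helper name contribution is made up for illustration, and the print calls are written so they run on both Python 2 and Python 3.

```python
def contribution(observed, expected):
    # One byte value's share of chi squared:
    # (observed - expected) squared, divided by expected
    diff = observed - expected
    return diff * diff / expected

expected = 13312 / 256.0  # 52.0 expected occurrences per byte value

# Observed counts copied from the sample output above
for count in (62, 46, 60, 68):
    print("%d\t%s\t%s" % (count, expected, contribution(count, expected)))
```

The printed values agree with the x2 column up to rounding, for example (62 - 52)^2 / 52 = 100 / 52, which is about 1.923.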

I know this is a bit off-topic here; I'm just hoping someone can help me
interpret the x2 variable. After that, I'll be OK. I still need to sum
the per-byte values to get an overall x2 for the bytes I've read, but
I wanted to post this note before doing that. Please feel free to comment
on any aspect of this. If I've got something entirely wrong, let me know.
BTW, I selected 13 KB (13,312 bytes) because it seems efficient and is a
decent size to test. The sample could be any larger amount, up to and
including the whole file.
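
The summing step might look roughly like the sketch below. It is not taken from ent itself: the function name chi_squared and the two test inputs are made up for illustration, and it tallies all 256 byte values in a single pass rather than rescanning the list once per value. Using bytearray is an assumption that Python 2.6 or later is available; it yields integers on both Python 2 and Python 3.

```python
def chi_squared(data):
    # data: raw bytes read from the file (a str in Python 2, bytes in Python 3).
    # bytearray() yields integers 0-255 on both, so no ord() call is needed.
    counts = [0] * 256
    for value in bytearray(data):
        counts[value] += 1

    # Expected count per byte value if all 256 values were equally likely
    expected = len(data) / 256.0

    # Overall statistic: the sum of the 256 per-value contributions
    total = 0.0
    for count in counts:
        diff = count - expected
        total += diff * diff / expected
    return total

# Hypothetical sanity checks, not measurements from real files:
uniform = bytes(bytearray(range(256)) * 52)  # every value exactly 52 times
constant = b"\x00" * 13312                   # one value repeated 13312 times
print(chi_squared(uniform))   # 0.0
print(chi_squared(constant))  # 3394560.0
```

With 256 possible byte values the total has 255 degrees of freedom, and a chi squared distribution has a mean equal to its degrees of freedom. So data that really is random should tend to give totals somewhere around 255, while totals far above or far below that point toward non-random data. Converting the total into a probability would need the chi squared distribution function, which this sketch does not attempt.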

Thanks,

Tiff



