in place list modification necessary? What's a better idiom?

MooMaster ntv1534 at gmail.com
Tue Apr 7 05:17:43 CEST 2009


A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/b806ada0732643d/5dff55826a199928?lnk=gst&q=list+in+place#5dff55826a199928

Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example-based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.

The dataset is general enough to hold any data type (strings, floats,
booleans, etc.) in each of a record's columns. For example, here are a
couple of records from the famous iris dataset:

5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor

Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (e.g. Iris-setosa
= 1, Iris-versicolor = 2).
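
As a minimal sketch of that quantization scheme (the helper name
build_codes is hypothetical, numbering starts at 1 to match the example
above, and sorting the distinct values is an assumption made here so the
numbering is reproducible):

```python
def build_codes(values):
    """Map each distinct string in a column to a float code, starting at 1."""
    # sorted() fixes the order so the same input always yields the same codes
    return dict((value, float(code))
                for code, value in enumerate(sorted(set(values)), 1))

# Hypothetical column of class labels
species = ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]
codes = build_codes(species)
quantized = [codes[s] for s in species]
```

The codes carry no real ordering between the labels; they only make the
column usable in a numeric distance calculation.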

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

<code>
def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
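    for point in points:
        #enumerate gives the column index directly; index(i) would return
        #the first matching column whenever a row repeats a value
        dimensions = [quantize(value, points, column)
                      for column, value in enumerate(point.split(","))]
        print(dimensions)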
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #remove duplicates; sort so the numbering is reproducible
    domain = sorted(set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True

</code>

It works, but it feels a little ugly and not exactly Pythonic. It needs
three lists: the original points list to read in the data, the
dimensions list to hold each processed point, and the fixedPoints list
to make objects out of the processed data. If my dataset is on the order
of millions of records, this'll nuke the memory. I tried something like:

for point in points:
   point = Point([quantize(i, points, point.split(",").index(i))
                  for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the
changes.
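
The loop above only rebinds the local name point on each pass; the list
slot still refers to the original string. A minimal sketch of the
difference, using str.upper as a stand-in for the Point(...)
construction (the stand-in and the sample data are assumptions for
illustration):

```python
points = ["a,b", "c,d"]

# Rebinding the loop variable leaves the list untouched
for point in points:
    point = point.upper()
unchanged = list(points)

# Assigning through the index replaces each element in place
for index, point in enumerate(points):
    points[index] = point.upper()
```

A slice assignment such as points[:] = [p.upper() for p in points] also
replaces the contents in place, but it builds a temporary list first.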

What's a more efficient way of doing this?
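
One possible direction, sketched under assumptions rather than offered
as the definitive answer: discover each column's domain once in a first
pass, then quantize rows lazily with a generator in a second pass. The
names is_numeric, column_codes and quantize_rows are hypothetical, and
the rows are held in a small list here only to keep the sketch
self-contained; with a real file each pass could iterate over the file
again instead.

```python
def is_numeric(text):
    try:
        float(text)
    except ValueError:
        return False
    return True

def column_codes(rows):
    """First pass: collect the distinct non-numeric values of each column."""
    domains = {}
    for row in rows:
        for index, value in enumerate(row):
            if not is_numeric(value):
                domains.setdefault(index, set()).add(value)
    # One lookup table per column: value -> float code (starting at 0)
    return dict((index, dict((v, float(n))
                             for n, v in enumerate(sorted(values))))
                for index, values in domains.items())

def quantize_rows(rows, codes):
    """Second pass: yield one list of floats per row, lazily."""
    for row in rows:
        yield [float(v) if is_numeric(v) else codes[index][v]
               for index, v in enumerate(row)]

lines = ["5.1,3.5,1.4,0.2,Iris-setosa", "6.4,3.2,4.5,1.5,Iris-versicolor"]
rows = [line.split(",") for line in lines]
codes = column_codes(rows)
result = list(quantize_rows(rows, codes))
```

Each value is then quantized with a dictionary lookup instead of
rebuilding the column's domain for every non-numeric field, and only one
quantized row needs to exist at a time if the generator is consumed
directly.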



