in place list modification necessary? What's a better idiom?

MooMaster ntv1534 at
Tue Apr 7 05:17:43 CEST 2009

A similar discussion has already occurred, over 4 years ago:

Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example-based overview: ). Given a
CSV dataset of n records, we are to cluster them accordingly.

The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc) for each of the record's columnar
values. For example, here are a couple of records from the famous iris


Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (e.g. Iris-setosa
= 1, Iris-versicolor = 2).
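
A minimal sketch of that quantization scheme, assuming the values of one
column have already been collected in a list. The helper name build_codes
and the sample values are made up for illustration. Sorting the distinct
values keeps the numbering reproducible between runs, and the codes start
at 0 here rather than 1.

```python
def build_codes(column_values):
    """Map each distinct value in a column to a float code (hypothetical helper)."""
    # sorted() fixes the order, so the same data always yields the same codes
    return {value: float(code)
            for code, value in enumerate(sorted(set(column_values)))}

# illustrative values only, not read from the actual file
species = ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]
codes = build_codes(species)
quantized = [codes[value] for value in species]
print(quantized)  # [0.0, 1.0, 0.0]
```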

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.strip())
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print(dimensions)
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if isNumeric(stringToQuantize):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True


It works, but it feels a little ugly and not exactly Pythonic. It uses
three lists: the original points list to read in the data, the
dimensions list to hold each processed point, and the fixedPoints list
to hold the objects made from the processed data. If my dataset is on
the order of millions of records, this'll nuke the memory. I tried
something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the changes.
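
The effect can be reproduced with plain strings. Assigning to the loop
variable only rebinds that name and never touches the list itself. The
sketch below uses made-up sample rows, with str.split standing in for the
Point construction, and shows two ways that do change the list in place:
assigning through the index from enumerate, or slice assignment.

```python
rows = ["1,2", "3,4"]  # made-up sample data

# Rebinding the loop variable leaves the list untouched
for row in rows:
    row = row.split(",")
unchanged = list(rows)

# Assigning through the index replaces each element in place
for index, row in enumerate(rows):
    rows[index] = row.split(",")
by_index = list(rows)

# Slice assignment replaces the contents of the same list object,
# so every other reference to that list sees the new elements too
other = ["5,6", "7,8"]
alias = other
other[:] = [row.split(",") for row in other]
print(unchanged, by_index, alias)
```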

What's a more efficient way of doing this?
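
One possible direction, sketched under assumptions rather than offered as
the definitive answer: discover each column's domain once up front, then
quantize rows lazily with dictionary lookups instead of rebuilding the
domain for every value. The names is_numeric, column_codes and
quantize_rows and the sample lines are made up for illustration, and the
Point and Cluster classes are left out.

```python
def is_numeric(text):
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True

def column_codes(rows):
    """Build one {value: code} dict per column that holds non-numeric values."""
    domains = {}
    for row in rows:
        for column, value in enumerate(row):
            if not is_numeric(value):
                domains.setdefault(column, set()).add(value)
    # sorted() keeps the numbering stable from run to run
    return {column: {value: float(code)
                     for code, value in enumerate(sorted(values))}
            for column, values in domains.items()}

def quantize_rows(rows, codes):
    """Yield each row as a list of floats, one row at a time."""
    for row in rows:
        yield [float(value) if is_numeric(value) else codes[column][value]
               for column, value in enumerate(row)]

# made-up sample standing in for the lines of the real CSV file
lines = ["5.1,3.5,red", "4.9,3.0,blue", "6.2,2.9,red"]
rows = [line.split(",") for line in lines]
codes = column_codes(rows)
result = list(quantize_rows(rows, codes))
print(result)
```

Because quantize_rows is a generator, each processed row can be handed
straight to whatever consumes it without building a second full list
first. Using enumerate for the column position also avoids looking the
value up with index(), which returns the first match and so picks the
wrong column when the same string appears twice in a row.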
