in place list modification necessary? What's a better idiom?
MooMaster
ntv1534 at gmail.com
Thu Apr 9 00:17:41 EDT 2009
On Apr 6, 10:43 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> MooMaster wrote:
> > So I'm reading in values from a file, and for each column I need to
> > dynamically discover the range of possible values it can take and
> > quantize if necessary. This is the solution I've come up with:
>
> > <code>
> > def createInitialCluster(fileName):
> > #get the data from the file
> > points = []
> > with open(fileName, 'r') as f:
> > for line in f:
> > points.append(line.rstrip('\n'))
> > #clean up the data
> > fixedPoints = []
> > for point in points:
> > dimensions = [quantize(i, points, point.split(",").index(i))
> > for i in point.split(",")]
> > print dimensions
> > fixedPoints.append(Point(dimensions))
> > #return an initial cluster of all the points
> > return Cluster(fixedPoints)
>
> > def quantize(stringToQuantize, pointList, columnIndex):
> > #if it's numeric, no need to quantize
> > if(isNumeric(stringToQuantize)):
> > return float(stringToQuantize)
> > #first we need to discover all the possible values of this column
> > domain = []
> > for point in pointList:
> > domain.append(point.split(",")[columnIndex])
> > #make it a set to remove duplicates
> > domain = list(Set(domain))
> > #use the index into the domain as the number representing this
> > value
> > return float(domain.index(stringToQuantize))
>
> > #harvested fromhttp://www.rosettacode.org/wiki/IsNumeric#Python
> > def isNumeric(string):
> > try:
> > i = float(string)
> > except ValueError:
> > return False
> > return True
>
> Big problem with this. I'm guessing you never ran it on a really big
> file yet. Your quantize function does a lot of unnecessary work: it
> rebuilds the list of indices for every single line, every single non-
> numeric entry, in fact. (Tech-speak: that is N^2 behavior in both the
> number of lines and number of non-numeric columns. Not good.) This
> will work ok for a small file, but will take forever on a large file
> (perhaps literally).
>
> So, forgetting about column indexing for a moment, we can improve this
> vastly simply by generating the list of indices once. Do that in a
> separete function, have it return the list, and then pass that list to
> the quantize function. So replace the midportion of your
> createInitialCluster function with something like this:
>
> ....
> for i in xrange(len(points[0])): # number of columns
> column_indices.append(quantize_column(points,i))
> fixedPoints = []
> for point in points:
> dimensions = [quantize(s, column_indices[i], point.split
> (",").index(i))
> for (i,s) in enumerate(point.split(","))] # need index
> as well as entry here
> print dimensions
> fixedPoints.append(Point(dimensions))
> ....
>
> And the two functions would be something like this:
>
> def quantize_columns(point_list,column_index):
> # this assumes if the first column is numeric the whole column
> would be
> if(isNumeric(point_list[0][column_index])):
> return None # don't quantize this column
> #first we need to discover all the possible values of this column
> domain = []
> for point in point_list:
> domain.append(point.split(",")[column_index])
> #make it a set to remove duplicates
> return list(set(domain))
>
> def quantize(i,domain,s):
> if domain is None:
> return float(s)
> return float(domain.index(s))
>
> This (once debugged :) will run much, much faster on a large dataset.
>
> Now back to your question.
>
> > It works, but it feels a little ugly, and not exactly Pythonic. Using
> > two lists I need the original point list to read in the data, then the
> > dimensions one to hold the processed point, and a fixedPoint list to
> > make objects out of the processed data. If my dataset is in the order
> > of millions, this'll nuke the memory. I tried something like:
>
> > for point in points:
> > point = Point([quantize(i, points, point.split(",").index(i)) for i
> > in point.split(",")])
> > but when I print out the points afterward, it doesn't keep the
> > changes.
>
> It's because the order of items in a set is undefined. The order of
> the items in list(set(["a","b","c","d"])) might be very different from
> the order in list(set(["a","b","c","d","e"])). You were passing
> quantize incomplete an incomplete list of points, so as the points
> grew, the items in the set changed, and it messed up the order. In
> fact, you should never rely on the order being the same, even if you
> created the set with the very same arguments.
>
> What you are trying to do should be done with dictionaries: create a
> dict that maps a value to a number.
>
> Now, based on your quantize function, it would seem that the number
> associated with the value is arbitrary (doesn't matter what it is, as
> long as it's distinct), so it isn't necessary to read the whole csv
> file in before assigning numbers; just build the dict as you go.
>
> I suggest collections.defaultdict for this task. It's like a regular
> dict, but it creates a new value any time you access a key that
> doesn't exist, a perfect solution to your task. We'll pass it a
> function that generates a different index each time it's called.
> (This is probably too advanced for you, but oh well, it's such a cool
> trick.)
>
> import collections
> import itertools
>
> def createInitialCluster(fileName):
> fixedPoints = []
> # quantization is a dict that assigns sequentially-increasing
> numbers
> # to values when reading keys that don't yet exit
> quantization = defaultdict.collections(itertools.count().next)
> with open(fileName, 'r') as f:
> for line in f:
> dimensions = []
> for s in line.rstrip('\n').split(","):
> if isNumeric(s):
> dimensions.append(float(s))
> else:
> dimensions.append(float(quantization[s]))
> fixedPoints.append(Point(dimensions))
> return Cluster(fixedPoints)
>
> A couple general pointers:
>
> * Don't ever use i to represent a string. Programmers expect i to be
> an integer. i,j,k,l,m, and n should be integers, in fact. u,v,w,x,y,
> and z should be floats. You should stick to this convention whenever
> possible, but definitely never use i for anything but an integer.
>
> * set is built into Python now; unless you're using an older version
> (2.3 I think) you should use set instead of Set.
>
> * The Python style guide (q.g.) recommends that variables use names
> such as column_index rather than columnIndex. The world won't end if
> you don't follow it but if you want to be Pythonic that's how.
>
> Carl Banks
Nice explanation! Very detailed and thorough, thanks! I haven't used
defaultdict before, so that was a very useful and efficient trick!
More information about the Python-list
mailing list