in place list modification necessary? What's a better idiom?

Thu Apr 9 00:17:41 EDT 2009

On Apr 6, 10:43 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> MooMaster wrote:
> > So I'm reading in values from a file, and for each column I need to
> > dynamically discover the range of possible values it can take and
> > quantize if necessary. This is the solution I've come up with:
>
> > <code>
> > def createInitialCluster(fileName):
> >     #get the data from the file
> >     points = []
> >     with open(fileName, 'r') as f:
> >         for line in f:
> >             points.append(line.rstrip('\n'))
> >     #clean up the data
> >     fixedPoints = []
> >     for point in points:
> >         dimensions = [quantize(i, points, point.split(",").index(i))
> >                       for i in point.split(",")]
> >         print dimensions
> >         fixedPoints.append(Point(dimensions))
> >     #return an initial cluster of all the points
> >     return Cluster(fixedPoints)
>
> > def quantize(stringToQuantize, pointList, columnIndex):
> >     #if it's numeric, no need to quantize
> >     if(isNumeric(stringToQuantize)):
> >         return float(stringToQuantize)
> >     #first we need to discover all the possible values of this column
> >     domain = []
> >     for point in pointList:
> >         domain.append(point.split(",")[columnIndex])
> >     #make it a set to remove duplicates
> >     domain = list(Set(domain))
> >     #use the index into the domain as the number representing this
> >     #value
> >     return float(domain.index(stringToQuantize))
>
> > #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> > def isNumeric(string):
> >     try:
> >         i = float(string)
> >     except ValueError:
> >         return False
> >     return True
>
> Big problem with this.  I'm guessing you never ran it on a really big
> file yet.  Your quantize function does a lot of unnecessary work: it
> rebuilds the list of indices for every single line, every single non-
> numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
> number of lines and number of non-numeric columns.  Not good.)  This
> will work ok for a small file, but will take forever on a large file
> (perhaps literally).
>
> So, forgetting about column indexing for a moment, we can improve this
> vastly simply by generating the list of indices once.  Do that in a
> separate function, have it return the list, and then pass that list to
> the quantize function.  So replace the midportion of your
> createInitialCluster function with something like this:
>
>     ....
>     column_indices = []
>     num_columns = len(points[0].split(",")) # number of columns
>     for i in xrange(num_columns):
>         column_indices.append(quantize_column(points, i))
>     fixedPoints = []
>     for point in points:
>         # need the index as well as the entry here
>         dimensions = [quantize(i, column_indices[i], s)
>                       for (i, s) in enumerate(point.split(","))]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     ....
>
> And the two functions would be something like this:
>
> def quantize_column(point_list, column_index):
>     # this assumes that if the first row's entry is numeric, the whole
>     # column is
>     if isNumeric(point_list[0].split(",")[column_index]):
>         return None # don't quantize this column
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in point_list:
>         domain.append(point.split(",")[column_index])
>     #make it a set to remove duplicates
>     return list(set(domain))
>
> def quantize(i,domain,s):
>     if domain is None:
>         return float(s)
>     return float(domain.index(s))
>
> This (once debugged :) will run much, much faster on a large dataset.
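
For readers who want to try the idea, here is a minimal Python 3 sketch of the "work out each column's domain once" approach described above. It is an illustration under assumptions, not the code from the post: it takes an in-memory list of comma-separated strings instead of a file, returns plain lists of floats instead of the thread's Point and Cluster objects, and sorts each domain so the numbering is reproducible. The names is_numeric, build_domains and quantize_rows are hypothetical.

```python
def is_numeric(text):
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def build_domains(rows):
    """Compute, once per column, the distinct values of that column.

    As in the post, a column is assumed numeric if its first entry is;
    numeric columns get None instead of a domain.
    """
    domains = []
    for col in range(len(rows[0])):
        if is_numeric(rows[0][col]):
            domains.append(None)
        else:
            # sorted() gives a reproducible order; a bare set does not
            domains.append(sorted({row[col] for row in rows}))
    return domains


def quantize_rows(lines):
    """Turn comma-separated lines into lists of floats."""
    rows = [line.split(",") for line in lines]
    domains = build_domains(rows)  # built once, reused for every row
    return [
        [float(s) if domain is None else float(domain.index(s))
         for s, domain in zip(row, domains)]
        for row in rows
    ]


lines = ["1.5,red,small", "2.0,blue,large", "3.5,red,large"]
print(quantize_rows(lines))
# [[1.5, 1.0, 1.0], [2.0, 0.0, 0.0], [3.5, 1.0, 0.0]]
```

Each lookup still scans the domain list with index(); for columns with many distinct values, a dict from value to number would likely be quicker.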
>
> Now back to your question.
>
> > It works, but it feels a little ugly, and not exactly Pythonic. Using
> > two lists I need the original point list to read in the data, then the
> > dimensions one to hold the processed point, and a fixedPoint list to
> > make objects out of the processed data. If my dataset is in the order
> > of millions, this'll nuke the memory. I tried something like:
>
> > for point in points:
> >    point = Point([quantize(i, points, point.split(",").index(i))
> >                   for i in point.split(",")])
> > but when I print out the points afterward, it doesn't keep the
> > changes.
>
> It's because the order of items in a set is undefined.  The order of
> the items in list(set(["a","b","c","d"])) might be very different from
> the order in list(set(["a","b","c","d","e"])).  You were passing
> quantize an incomplete list of points, so as the points
> grew, the items in the set changed, and it messed up the order.  In
> fact, you should never rely on the order being the same, even if you
> 
> created the set with the very same arguments.
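
A related point, separate from set ordering, bears on the "doesn't keep the changes" symptom: in Python, assigning to the loop variable only rebinds that name and leaves the list itself untouched. The sketch below (Python 3, with plain strings standing in for the thread's Point objects) contrasts that with assigning through an index.

```python
points = ["a,b", "c,d"]

# Assigning to the loop variable rebinds the name `point` only;
# the list still holds the original strings afterwards.
for point in points:
    point = point.split(",")
unchanged = list(points)

# Assigning through an index replaces the list's elements in place.
for index, point in enumerate(points):
    points[index] = point.split(",")

print(unchanged)  # ['a,b', 'c,d']
print(points)     # [['a', 'b'], ['c', 'd']]
```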
>
> What you are trying to do should be done with dictionaries: create a
> dict that maps a value to a number.
>
> Now, based on your quantize function, it would seem that the number
> associated with the value is arbitrary (doesn't matter what it is, as
> long as it's distinct), so it isn't necessary to read the whole csv
> file in before assigning numbers; just build the dict as you go.
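
To make "build the dict as you go" concrete, here is a small Python 3 sketch using only a plain dict. The helper name assign_code and the sample values are made up for illustration.

```python
def assign_code(codes, value):
    """Return the integer code for value, assigning the next free one if new."""
    # len(codes) is evaluated before the key is inserted, so the first
    # new value gets 0, the second gets 1, and so on.
    return codes.setdefault(value, len(codes))


codes = {}
seen = [assign_code(codes, v) for v in ["red", "blue", "red", "green", "blue"]]
print(seen)  # [0, 1, 0, 2, 1]
```

Because each number is fixed the first time a value appears, it does not depend on what is read later.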
>
> I suggest collections.defaultdict for this task.  It's like a regular
> dict, but it creates a new value any time you access a key that
> doesn't exist, a perfect solution to your task.  We'll pass it a
> function that generates a different index each time it's called.
> (This is probably too advanced for you, but oh well, it's such a cool
> trick.)
>
> import collections
> import itertools
>
> def createInitialCluster(fileName):
>     fixedPoints = []
>     # quantization is a dict that assigns sequentially increasing
>     # numbers to values when reading keys that don't yet exist
>     quantization = collections.defaultdict(itertools.count().next)
>     with open(fileName, 'r') as f:
>         for line in f:
>             dimensions = []
>             for s in line.rstrip('\n').split(","):
>                 if isNumeric(s):
>                     dimensions.append(float(s))
>                 else:
>                     dimensions.append(float(quantization[s]))
>             fixedPoints.append(Point(dimensions))
>     return Cluster(fixedPoints)
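
The code above is written for Python 2. As a hedged sketch of the same trick in Python 3, where iterators have no .next method, the bound method itertools.count().__next__ can serve as the default factory. The version below reads from an in-memory list of lines and returns plain lists, since Point and Cluster are not defined in this thread; is_numeric and quantize_lines are illustrative names.

```python
import collections
import itertools


def is_numeric(text):
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def quantize_lines(lines):
    """Convert comma-separated lines to lists of floats in one pass."""
    # Each missing key triggers one call to the counter, so new values
    # are numbered 0, 1, 2, ... in order of first appearance.
    quantization = collections.defaultdict(itertools.count().__next__)
    rows = []
    for line in lines:
        dimensions = []
        for s in line.rstrip("\n").split(","):
            if is_numeric(s):
                dimensions.append(float(s))
            else:
                dimensions.append(float(quantization[s]))
        rows.append(dimensions)
    return rows


print(quantize_lines(["1.5,red\n", "2.0,blue\n", "3.5,red\n"]))
# [[1.5, 0.0], [2.0, 1.0], [3.5, 0.0]]
```

As in the post, a single mapping is shared by all columns, so the numbers are distinct across the whole file, not per column.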
>
> A couple general pointers:
>
> * Don't ever use i to represent a string.  Programmers expect i to be
> an integer.  i,j,k,l,m, and n should be integers, in fact.  u,v,w,x,y,
> and z should be floats.  You should stick to this convention whenever
> possible, but definitely never use i for anything but an integer.
>
> * set is built into Python now; unless you're using an older version
> (2.3 I think) you should use set instead of Set.
>
> * The Python style guide, PEP 8 (q.g.), recommends that variables use names
> such as column_index rather than columnIndex.  The world won't end if
> you don't follow it but if you want to be Pythonic that's how.
>
> Carl Banks

Nice explanation! Very detailed and thorough, thanks! I haven't used
defaultdict before, so that was a very useful and efficient trick!