in place list modification necessary? What's a better idiom?

Tue Apr 7 04:10:56 EDT 2009

On Apr 7, 12:38 am, Peter Otten <__pete... at web.de> wrote:
> MooMaster wrote:
> > Now we can't calculate a meaningful Euclidean distance for something
> > like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> > distance or something overly complicated, so instead we'll use a
> > simple quantization scheme of enumerating the set of values within the
> > column domain and replacing the strings with numbers (i.e. Iris-setosa
> > = 1, iris-versicolor=2).
>
> I'd calculate the distance as
>
> def string_dist(x, y, weight=1):
>     # 0 when the values match, weight when they differ
>     return weight * (x != y)
>
> You don't get a high resolution in that dimension, but you don't introduce
> an element of randomness, either.

Does the algorithm require well-ordered data along each dimension?
I've never heard of it, but the name "bisecting Kmeans" suggests to
me that it does, in which case an equality-only distance like this
wouldn't work.

However, the OP had better be sure to set the scales for the quantized
dimensions high enough that no clusters form containing points with
different discrete values.
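
As a rough sketch of what "high enough" could mean: treat a mismatch in a
discrete column as a fixed penalty and fold it into an otherwise Euclidean
distance. The name mixed_dist and its discrete_cols and weight parameters are
made up for illustration, and this is not the OP's code. If weight exceeds the
largest numeric distance in the data, any two points that disagree on a
discrete value end up farther apart than any two that agree.

```python
import math

def mixed_dist(a, b, discrete_cols, weight):
    """Euclidean distance over numeric columns, plus a fixed penalty
    (weight) for each discrete column whose values differ."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in discrete_cols:
            # Discrete column: contributes 0 on a match, weight on a mismatch.
            penalty = weight if x != y else 0.0
            total += penalty ** 2
        else:
            # Numeric column: ordinary squared difference.
            total += (x - y) ** 2
    return math.sqrt(total)

same = mixed_dist([0.0, 0.0, "Iris-setosa"],
                  [3.0, 4.0, "Iris-setosa"], (2,), 100.0)
diff = mixed_dist([0.0, 0.0, "Iris-setosa"],
                  [3.0, 4.0, "Iris-versicolor"], (2,), 100.0)
```

Whether a given clustering implementation accepts a custom distance like this
is a separate question. Prescaling the quantized columns is the other way to
get a similar effect.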

That, in turn, suggests he might as well not bother sending the
discrete values to the clustering algorithm at all, and instead call
it once for each unique combination of discrete values.  (However, I
could imagine the marginal cost of more dimensions being less than
that of multiple runs; I've been dealing with such a case at work.)
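
A minimal sketch of the per-group approach, assuming each row is a flat list
with the discrete values at known column indices. The helper name
split_by_discretes and the sample rows are hypothetical. Each resulting group
holds only numeric columns and would be handed to the clustering routine on
its own.

```python
from collections import defaultdict

def split_by_discretes(rows, discrete_cols):
    """Group rows by their tuple of discrete values, keeping only the
    numeric columns in each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in discrete_cols)
        numeric = [v for i, v in enumerate(row) if i not in discrete_cols]
        groups[key].append(numeric)
    return dict(groups)

rows = [
    [5.1, 3.5, "Iris-setosa"],
    [4.9, 3.0, "Iris-setosa"],
    [7.0, 3.2, "Iris-versicolor"],
]
groups = split_by_discretes(rows, discrete_cols=(2,))
# Each groups[key] is a list of purely numeric points, ready to be
# clustered separately from the other groups.
```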

I'll leave it to the OP to decide.

Carl Banks