clusters of numbers
avigross at verizon.net
Sat Dec 15 23:43:44 EST 2018
From: Avi Gross <avigross at verizon.net>
Sent: Saturday, December 15, 2018 11:27 PM
To: 'Marc Lucke' <marc at marcsnet.com>
Subject: RE: clusters of numbers
There are k-means implementations in Python, R and other places. Most uses have two or more dimensions. You specify how many clusters to look for; the algorithm then starts from randomly chosen existing points, assigns every point to the nearest of them, moves each center to the mean of its cluster, and repeats until things stabilize.
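The loop described above can be sketched in plain Python for 1-D data. This is only an illustration, not what any particular library does: the function name kmeans_1d, the restarts and seed parameters, and the choice to keep the tightest of several random starts are all assumptions made for the example.

```python
import random

def kmeans_1d(values, k, restarts=50, seed=0):
    """Cluster 1-D numbers into k groups; return the groups sorted by value."""
    rng = random.Random(seed)
    distinct = sorted(set(values))
    if k > len(distinct):
        raise ValueError("k cannot exceed the number of distinct values")
    best_cost, best_groups = None, None
    for _ in range(restarts):
        # Start from k randomly chosen existing points.
        centers = rng.sample(distinct, k)
        while True:
            # Assign every value to its nearest center.
            groups = [[] for _ in range(k)]
            for v in values:
                nearest = min(range(k), key=lambda c: abs(v - centers[c]))
                groups[nearest].append(v)
            # Move each center to the mean of its group (unchanged if the group is empty).
            new_centers = [sum(g) / len(g) if g else centers[i]
                           for i, g in enumerate(groups)]
            if new_centers == centers:
                break  # nothing moved, so the clustering has stabilized
            centers = new_centers
        # Total squared distance from each value to its center; lower is tighter.
        cost = sum((v - centers[i]) ** 2 for i, g in enumerate(groups) for v in g)
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_groups = sorted((sorted(g) for g in groups if g), key=lambda g: g[0])
    return best_groups

# The second illustrative example from the original question.
print(kmeans_1d([17, 22, 20, 45, 47, 51, 82, 84, 83], 3))
```

A single random start can settle on a poor grouping, which is why the sketch repeats the run and keeps the result with the smallest total squared distance.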
Your data is 1-D. Something simpler like a bar chart makes sense. But that may not show underlying patterns.
I am more familiar with doing graphics in R, but here is a tabular view of your data. The first row is each distinct score and the second row is how many times that score occurs:
score: 1 2 3 5 6 7 8 10 11 12 14 15 16 17 19 20 21 23 24 25 26 29 35 43
count: 124 116 97 95 89 74 57 73 48 49 38 35 20 33 21 19 14 5 4 4 3 1 1 1
There are clear gaps and a bar chart (which I cannot attach but could send in private email) does show clusters visibly.
But those may largely be an artifact of the missing info.
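Since a real chart cannot be attached here, a rough text version can be printed from the table above. This is a sketch only: the helper name text_bars and the bar width of 50 are arbitrary choices for the example, and it also lists the scores that never occur, which is where the gaps show up.

```python
# Distinct scores and how often each occurs, copied from the table above.
scores = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 19, 20, 21,
          23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33, 21,
          19, 14, 5, 4, 4, 3, 1, 1, 1]

def text_bars(scores, counts, width=50):
    """Return one line per score, scaled so the largest count fills `width`."""
    biggest = max(counts)
    lines = []
    for score, count in zip(scores, counts):
        # Every score gets at least one mark so small counts stay visible.
        bar = "#" * max(1, round(count * width / biggest))
        lines.append(f"{score:3d} {bar} {count}")
    return lines

print("\n".join(text_bars(scores, counts)))

# Scores between the smallest and largest that never occur at all.
missing = [s for s in range(scores[0], scores[-1] + 1) if s not in scores]
print("missing scores:", missing)
```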
If you tell us more, we might be able to provide a better statistical answer. I assume you know how to get means and so on.
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 1021 7.82 6.01 6 7.12 5.93 1 43 42 1.04 1.23 0.19
Yes, the above is hard to read as I cannot use tables or a constant width font in this forum.
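For the means and so on, the standard library statistics module is enough. The sketch below rebuilds the full list from the score/count table and computes a few of the same figures; the variable names are just for illustration.

```python
import statistics

# Distinct scores and how often each occurs, copied from the table above.
scores = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 19, 20, 21,
          23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33, 21,
          19, 14, 5, 4, 4, 3, 1, 1, 1]

# Expand the table back into one entry per observed score.
data = [s for s, c in zip(scores, counts) for _ in range(c)]

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "sd": statistics.stdev(data),        # sample standard deviation
    "median": statistics.median(data),
    "min": min(data),
    "max": max(data),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```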
I ran a kmeans asking for 3 clusters:
The three clusters had these scores in them:
Cluster 1: 5 6 7 8 10 11
Cluster 2: 1 2 3
Cluster 3: 12 14 15 16 17 19 20 21 23 24 25 26 29 35 43
If I run it asking for say 5 clusters:
And here are your five clusters:
5 6 7 8
10 11 12 14
15 16 17 19 20 21 23 24 25 26 29 35 43
If you ran this for various numbers, you might see one that makes more sense to you. Or, maybe not.
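One way to try various numbers of clusters is to loop over k and print the groups along with how tight they are. The sketch below is not the code used for the results above and may group the scores differently. It works on the distinct scores weighted by their counts, and, as an assumption made to keep it repeatable, it starts from centers spread evenly through the sorted scores instead of random points. The name weighted_kmeans_1d is hypothetical.

```python
# Distinct scores (sorted) and how often each occurs, copied from the table above.
scores = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 19, 20, 21,
          23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33, 21,
          19, 14, 5, 4, 4, 3, 1, 1, 1]

def weighted_kmeans_1d(values, weights, k, max_iter=100):
    """k-means on sorted distinct 1-D values, each weighted by its count."""
    n = len(values)
    # Assumption: deterministic start, centers spread evenly through the values.
    centers = [float(values[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Label each value with the index of its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            total = sum(w for w, lab in zip(weights, labels) if lab == c)
            if total == 0:
                new_centers.append(centers[c])  # empty cluster: leave its center alone
            else:
                weighted_sum = sum(v * w for v, w, lab
                                   in zip(values, weights, labels) if lab == c)
                new_centers.append(weighted_sum / total)
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

for k in range(2, 7):
    labels, centers = weighted_kmeans_1d(scores, counts, k)
    groups = [[v for v, lab in zip(scores, labels) if lab == c] for c in range(k)]
    # Weighted total squared distance to the assigned center; smaller is tighter.
    cost = sum(w * (v - centers[lab]) ** 2
               for v, w, lab in zip(scores, counts, labels))
    print(k, round(cost, 1), [g for g in groups if g])
```

The cost generally shrinks as k grows, so the smallest cost alone does not pick a winner; a k after which the cost stops dropping sharply is a common rule of thumb.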
We could tell you what functions to use, but if you search using keywords like python (or another language) followed by k-means or kmeans, you can find out what to install and use. In Python, you would need NumPy and probably SciPy, as well as scikit-learn, whose KMeans class lives in sklearn.cluster. Note that you can fine-tune the algorithm in multiple ways, or run it several times, as the results can depend on the initial guesses. You may also want to make graphics showing the clusters, even though the data is 1-D.
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On Behalf Of Marc Lucke
Sent: Saturday, December 15, 2018 7:55 PM
To: python-list at python.org
Subject: clusters of numbers
I have a hobby project that sorts my email automatically for me & I want to improve it. There's data science and statistical info that I'm missing, & I always enjoy reading about the pythonic way to do things too.
I have a list of percentage scores:
& I'd like to know whether, & how, the numbers are clustered. In an extreme & illustrative example, 1..10 would have zero clusters;
1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1,2 & 7);
17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 & 83). In my set, when I scan it, I intuitively figure there are lots of numbers close to 0 & a lot close to 20 (or thereabouts).
I saw info about k-clusters but I'm not sure if I'm going down the right path. I'm interested in k-clusters & will teach myself, but my priority is working out this problem.
Do you know the name of the algorithm I'm trying to use? If so, are there python libraries like numpy that I can leverage? I imagine that I could iterate from 0 to 100% using that as an artificial mean, discard values that are over a standard deviation away, and count the number of scores for that mean; then at the end of that I could set a threshold above which the artificial mean would be kept. Something like this (no attempt at correct syntax):
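kept = []  # artificial means worth keeping
for i in range(101):  # candidate mean, 0..100%
    count = 0
    for j in scores:  # scores is the list of percentages
        if abs(j - i) <= deviation:  # ignore anything further away
            count += 1
    if count > threshold:
        kept.append(i)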
That algorithm is entirely untested, but I think it could work; I just don't want to reinvent the wheel. Any ideas kindly appreciated.