How to determine a suitable clustering algorithm
I want to build a customized clustering workflow for my datasets, because I don't want to try every algorithm and all of its hyperparameters by hand. My idea is to define default ranges for the important hyperparameters (e.g. the number of clusters in K-means), iterate over some candidate clustering algorithms such as K-means, DBSCAN, AP (affinity propagation), etc., and then pick the most suitable one for my data. I'm not sure whether this is feasible. Does GridSearchCV work for this, or is there another way to determine it? Thanks.
GridSearchCV is meant for tuning the hyperparameters of a model over some ranges of configurations and parameter values, as the documentation explains (it also has some examples): https://scikit-learn.org/stable/modules/grid_search.html The (e.g. 10-fold) cross-validation serves as a measure of accuracy (how accurately the different folds attain the value of the statistic) and of generalization (whether the accuracy remains similar between folds); at least that is what I was taught at university.

A greater problem is deciding which parameters, or which parameter ranges, to search. Some float-valued parameters have ranges that are more commonly used, while others rarely work. Additionally, some kernels and similar options are more generally robust, while others become computationally very expensive when combined with certain other parameters (for example, in MLPClassifier some activation functions and hidden_layer_sizes settings can jointly increase computation cost without necessarily increasing accuracy).

The best approach I've found so far is to start with a few of the most commonly used / major parameters and try to get them to produce results that are as accurate as possible within affordable computation time, and only after that consider adding more parameters. However, I have not found much information on how the parameters of different methods are ordered in terms of "significance". One could assume that parameters listed earlier are more major than later ones, but some parameters also clearly interact, so they have cross-effects on accuracy. Probably the best you can do is start experimenting and write down any general patterns you notice.

There's also https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.... for designing "pipelines", a sort of "Design of Experiments" for sklearn algorithms.
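One caveat worth making concrete: GridSearchCV expects a supervised scorer, and clustering has no labels, so for the original question a plain loop with an internal validity metric such as the silhouette coefficient is often simpler. A minimal sketch (my own illustration on synthetic make_blobs data, not code from this thread); the candidate list and parameter ranges are arbitrary placeholders:

```python
# Compare candidate clustering algorithms/configurations by fitting each one
# and scoring the resulting labels with the silhouette coefficient.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

candidates = [KMeans(n_clusters=k, n_init=10, random_state=0) for k in (2, 3, 4, 5)]
candidates += [DBSCAN(eps=e) for e in (0.5, 1.0, 2.0)]

best = None
for est in candidates:
    labels = est.fit_predict(X)
    if len(set(labels)) < 2:           # silhouette needs at least 2 clusters
        continue
    score = silhouette_score(X, labels)
    if best is None or score > best[0]:
        best = (score, est)

print(best[1])  # the candidate with the highest silhouette score
```

The same loop structure extends to any estimator with a fit_predict method; the silhouette score is only one of several internal metrics (Calinski-Harabasz and Davies-Bouldin are also in sklearn.metrics), and they can disagree.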
Also found this: https://towardsdatascience.com/design-your-engineering-experiment-plan-with-... but I have not tried it, nor do I know whether it's necessary.

BR, Matti
Maybe the suitable way is trial and error? My concern is that my dataset is very large, and I can't try numbers of clusters from 1 to N when I have N samples; that costs too much time. Maybe I should pick the initial number of clusters based on execution time, and then decide at each step whether to increase or decrease it? Thanks.
For determining what one can afford computationally, see e.g.:
https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-l...
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_es...
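A low-tech alternative to those links, sketched here as my own suggestion rather than anything from scikit-learn itself: time the fit on growing subsamples and extrapolate before committing to the full dataset. The array sizes and cluster count below are arbitrary stand-ins:

```python
# Time KMeans on increasing subsample sizes to gauge how fitting scales,
# so the cost of the full fit can be estimated in advance.
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))      # stand-in for a large dataset

times = []
for n in (1000, 2000, 4000, 8000):
    t0 = time.perf_counter()
    KMeans(n_clusters=8, n_init=10, random_state=0).fit(X[:n])
    times.append(time.perf_counter() - t0)
    print(n, round(times[-1], 3))
# If the timings grow roughly linearly in n, extrapolate to the full size.
```

This only gauges scaling in the number of samples; changing the number of clusters, features, or n_init shifts the curve, so measure with the configuration you actually intend to run.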
Also, remember that some algorithms may exhibit "sweet spots" with respect to computation time versus gained accuracy. So you might want to keep measuring "explained variance" while you add complexity to your models, and then plot model complexity against explained variance. E.g. for MLPClassifier you'd plot the number of hidden layers against explained variance to find the point where adding layers starts to yield diminishing gains.
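For clustering specifically, the closest analogue of this complexity-vs-gain plot is the classic "elbow" on KMeans inertia (within-cluster sum of squares). A minimal sketch on synthetic data with four blobs; the range of k is an arbitrary choice:

```python
# Track KMeans inertia as the number of clusters grows and look at the
# marginal improvement per added cluster; the "elbow" is where it flattens.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

inertias = {}
for k in range(1, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

for k in range(2, 9):
    print(k, round(inertias[k - 1] - inertias[k], 1))  # gain from one more cluster
# With four well-separated blobs, the gain typically collapses after k=4.
```

Replacing the printed differences with a matplotlib plot of k against inertia gives the usual elbow chart; the same loop works with silhouette_score instead of inertia if you prefer a metric that penalizes overly large k.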
My comments are at the end, as some people do not like top posts.
This is a question, not a suggestion. The poster says they have so much data that searching larger numbers of clusters for a "sweet spot" may take too much time. Is there any value in taking a much smaller random sample of the data, one that remains large enough, and trying that on a reasonable range of cluster counts? The results would not be definitive, but they might supply a clue as to what range to try again with the full data.

As mentioned, the run time may not keep going up if the data is constant and only the number of clusters varies. I am not sure which clustering algorithms you want to use, but for something like K-means on reasonable data, the number of clusters that gives meaningful results is usually much smaller than the number of items in the data. The algorithms often terminate when successive iterations show little change, and that is likely a tunable parameter; so asking for N+1 clusters may even terminate sooner than asking for N, if that number of clusters more closely matches the variation in the data.

And again, if you are using a K-means variant, it may be better to apply some human intervention: check whether a particular level of clustering fits a model you can construct that explains what each cluster has in common. If you overfit, the number of clusters can effectively become the number of unique items in your data, which probably serves no meaningful purpose.

Again, just a question. Some algorithms deal with large data better than others.

Avi
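The subsampling idea above can be sketched in a few lines. This is my own illustration of the suggestion, not code from the thread; the sample size, the range of k, and the synthetic make_blobs data are all arbitrary assumptions:

```python
# Choose k on a small random sample using the silhouette score, then fit
# only the chosen k on the full dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=50000, centers=5, random_state=0)

rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=2000, replace=False)]

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
    scores[k] = silhouette_score(sample, labels)

best_k = max(scores, key=scores.get)   # k with the highest sample silhouette
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
```

As Avi notes, the sample-based choice is a clue rather than a verdict: it narrows the range to re-check on the full data. For genuinely huge datasets, sklearn's MiniBatchKMeans is a drop-in alternative to KMeans in the final fit.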
participants (3)
- Avi Gross
- lampahome
- Matti Viljamaa