From t3kcit at gmail.com Tue Nov 1 10:05:54 2016
From: t3kcit at gmail.com (Andy)
Date: Tue, 1 Nov 2016 10:05:54 -0400
Subject: [scikit-learn] creating a custom scoring function for
cross-validation of classification
In-Reply-To:
References:
Message-ID:
Hi.
If you want to pass a custom scorer, you need to pass the scorer, not a
string with the scorer name.
Andy
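[A minimal sketch of Andy's point, added for the archive: `scoring` also accepts any callable of the form `(estimator, X, y) -> float`, which sidesteps `make_scorer` entirely. The `top_n_scorer` below is a hypothetical top-N accuracy written for illustration, not part of scikit-learn, and `LogisticRegression` on iris stands in for the original SGD setup.]

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Any callable(estimator, X, y) -> float works as the `scoring` argument.
def top_n_scorer(estimator, X, y, n=2):
    proba = estimator.predict_proba(X)
    top_n = np.argsort(proba, axis=1)[:, -n:]   # columns of the n largest probabilities
    labels = estimator.classes_[top_n]          # map column indices to class labels
    return float(np.mean([y[i] in labels[i] for i in range(len(y))]))

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500)

# Pass the callable itself -- scoring='top_n_scorer' would raise ValueError.
scores = cross_val_score(clf, X, y, cv=3, scoring=top_n_scorer)
```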
On 10/31/2016 04:28 PM, Sumeet Sandhu wrote:
> Hi,
>
> I've been staring at various doc pages for a while to create a custom
> scorer that uses predict_proba output of a multi-class SGDClassifier :
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
> http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer
>
> I got the impression I could customize the "scoring" parameter in
> cross_val_score directly, but that didn't work.
> Then I tried customizing the "score_func" parameter in make_scorer,
> but that didn't work either. Both errors are ValueErrors:
>
> Traceback (most recent call last):
> File "", line 3, in
> accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs,
> trainLabelVecs, cv=10, scoring = 'topNscorer'))
> File
> "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/cross_validation.py",
> line 1425, in cross_val_score
> scorer = check_scoring(estimator, scoring=scoring)
> File
> "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py",
> line 238, in check_scoring
> return get_scorer(scoring)
> File
> "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py",
> line 197, in get_scorer
> % (scoring, sorted(SCORERS.keys())))
> ValueError: 'topNscorer' is not a valid scoring value. Valid options
> are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1',
> 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss',
> 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error',
> 'precision', 'precision_macro', 'precision_micro',
> 'precision_samples', 'precision_weighted', 'r2', 'recall',
> 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted',
> 'roc_auc']
>
> At a high level, I want to find out if the true label was found in the
> top N multi-class labels coming out of an SGD classifier. Built-in
> scores like "accuracy" only look at N=1.
>
> Here is the code using make_scorer :
> LRclassifier = SGDClassifier(loss='log')
> topNscorer = make_scorer(topNscoring, greater_is_better=True,
> needs_proba=True)
> accuracyN = mean(cross_val_score(LRclassifier, Data, Labels,
> scoring = 'topNscorer'))
>
> Here is the code for the custom scoring function :
> def topNscoring(y, yp):
>     ## Inputs: y = true label per sample; yp = predict_proba
>     ## probabilities of all labels per sample
>     N = 5
>     foundN = []
>     for ii in xrange(0, shape(yp)[0]):
>         indN = [w[0] for w in sorted(enumerate(list(yp[ii, :])),
>                                      key=lambda w: w[1], reverse=True)[0:N]]
>         if y[ii] in indN: foundN.append(1)
>         else: foundN.append(0)
>     return mean(foundN)
>
> Any help will be greatly appreciated.
>
> best regards,
> Sumeet
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From m.waseem.ahmad at gmail.com Tue Nov 1 12:50:36 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Tue, 1 Nov 2016 16:50:36 +0000
Subject: [scikit-learn] SVM number of support vectors
Message-ID:
Hello All,
I am trying to replicate the figure below, and wanted to confirm that the
number of support vectors can be obtained from the *support_vectors_*
attribute in scikit-learn?
[image: Inline image 1]
Regards
Waseem
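[For reference, assuming a standard SVC: `support_vectors_` holds the support vectors themselves, so the count is its first dimension; the per-class counts live in `n_support_`. A quick sketch on made-up data:]

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
svc = SVC(kernel='rbf', C=1.0).fit(X, y)

n_sv_total = svc.support_vectors_.shape[0]  # total number of support vectors
n_sv_per_class = svc.n_support_             # array of counts, one per class
```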
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 24829 bytes
Desc: not available
URL:
From sumeet.k.sandhu at gmail.com Tue Nov 1 12:52:35 2016
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Tue, 1 Nov 2016 09:52:35 -0700
Subject: [scikit-learn] creating a custom scoring function for
cross-validation of classification
In-Reply-To:
References:
Message-ID:
ahha - thanks Andy !
that works...
On Tue, Nov 1, 2016 at 7:05 AM, Andy wrote:
> Hi.
> If you want to pass a custom scorer, you need to pass the scorer, not a
> string with the scorer name.
> Andy
From t3kcit at gmail.com Wed Nov 2 12:10:40 2016
From: t3kcit at gmail.com (Andy)
Date: Wed, 2 Nov 2016 12:10:40 -0400
Subject: [scikit-learn] Fwd: libmf bindings
In-Reply-To:
References:
Message-ID: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com>
-------- Forwarded Message --------
Subject: libmf bindings
Date: Wed, 2 Nov 2016 11:38:00 -0400
From: sam royston
To: scikit-learn-owner at python.org
Hi,
Thanks for all your hard work on this useful tool! I'm hoping to
contribute bindings to Chih-Jen Lin's libmf:
https://www.csie.ntu.edu.tw/~cjlin/libmf/. It looks like you guys
have functionality for NMF, but used only in the decomposition /
dimensionality-reduction setting (and obviously only with non-negative
values). I'd like to add functionality in the form of Python wrappers for
libmf, much like you have for Chih-Jen Lin's other libraries, libsvm and
liblinear.
libmf is very efficient and offers great functionality for missing-data
imputation, recommender systems, and more.
I have already written bindings using ctypes, but I see that you use
Cython for libsvm and liblinear; is it necessary that I switch to
that interface?
Let me know what you think of a contribution like this.
Thanks,
Sam
From drraph at gmail.com Wed Nov 2 12:25:46 2016
From: drraph at gmail.com (Raphael C)
Date: Wed, 2 Nov 2016 16:25:46 +0000
Subject: [scikit-learn] Fwd: libmf bindings
In-Reply-To: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com>
References:
<6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com>
Message-ID:
(I am not a scikit learn dev.)
This is a great idea and I for one look forward to using it.
My understanding is that libmf optimises only over the observed values
(that is, the explicitly given values in a sparse matrix), as is typically
needed for recommender systems, whereas the scikit-learn NMF code assumes
that any non-specified value in a sparse matrix is zero. It is worth
bearing that in mind in any comparison that is carried out.
Raphael
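[The distinction above can be sketched in a few lines of NumPy: gradient steps on a low-rank factorization that only touch the observed cells. Treating zeros as "missing" is an assumption of this toy example, not of libmf's actual sparse input format, and the matrix, rank, and learning rate are all invented for illustration.]

```python
import numpy as np

# Toy ratings matrix; zeros stand in for unobserved entries.
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [0., 2., 4.]])
mask = R > 0                                   # optimise over observed cells only

k, lr, reg = 2, 0.05, 0.01
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], k))
Q = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(2000):
    E = (R - P @ Q.T) * mask                   # error is zero on missing cells
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)

# P @ Q.T now approximates R on the observed cells; the unobserved cells
# hold predictions instead of being pushed towards zero as plain NMF would.
```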
From gael.varoquaux at normalesup.org Wed Nov 2 12:32:15 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 2 Nov 2016 17:32:15 +0100
Subject: [scikit-learn] Fwd: libmf bindings
In-Reply-To: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com>
References:
<6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com>
Message-ID: <20161102163215.GF3067723@phare.normalesup.org>
Given that we'd love to get rid of our libsvm/liblinear bindings, I would
be more in favor of improving our matrix factorization code rather than
including this code.
That said, +1 for missing data imputation with matrix factorization, once
we're done with the current PRs on missing data.
Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
From jalopcar at gmail.com Thu Nov 3 11:16:39 2016
From: jalopcar at gmail.com (Jaime Lopez Carvajal)
Date: Thu, 3 Nov 2016 10:16:39 -0500
Subject: [scikit-learn] hierarchical clustering
Message-ID:
Hi there,
I am trying to do image classification using hierarchical clustering.
So, I have my data, and apply these steps:
from scipy.cluster.hierarchy import dendrogram, linkage
data1 = np.array(data)
Z = linkage(data, 'ward')
dendrogram(Z, truncate_mode='lastp', p=12, show_leaf_counts=False,
leaf_rotation=90., leaf_font_size=12., show_contracted=True)
plt.show()
So, I can see the dendrogram with 12 clusters as I want, but I don't know
how to use this to classify the image.
Also, I understand that the function cluster.hierarchy.cut_tree(Z, n_clusters)
cuts the tree at that number of clusters, but again I don't know how to
proceed from there. I would like to have something like: cluster =
predict(Z, instance)
Any advice or direction would be really appreciated.
Thanks in advance, Jaime
--
*Jaime Lopez Carvajal*
From jni.soma at gmail.com Thu Nov 3 18:00:27 2016
From: jni.soma at gmail.com (Juan Nunez-Iglesias)
Date: Fri, 4 Nov 2016 09:00:27 +1100
Subject: [scikit-learn] hierarchical clustering
In-Reply-To:
References:
Message-ID:
Hi Jaime,
From *Elegant SciPy*:
"""
The *fcluster* function takes a linkage matrix, as returned by linkage, and
a threshold, and returns cluster identities. It's difficult to know
a-priori what the threshold should be, but we can obtain the appropriate
threshold for a fixed number of clusters by checking the distances in the
linkage matrix.
from scipy.cluster.hierarchy import fcluster
n_clusters = 3
threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2
clusters = fcluster(Z, threshold_distance, 'distance')
"""
As an aside, I imagine this question is better placed in the SciPy mailing
list than scikit-learn (which has its own hierarchical clustering API).
Juan.
From jalopcar at gmail.com Thu Nov 3 18:12:55 2016
From: jalopcar at gmail.com (Jaime Lopez Carvajal)
Date: Thu, 3 Nov 2016 17:12:55 -0500
Subject: [scikit-learn] hierarchical clustering
In-Reply-To:
References:
Message-ID:
Hi Juan,
The fcluster function was what I needed. I can now proceed from here to
classify images.
Thank you very much,
Jaime
>
--
*Jaime Lopez Carvajal*
From rth.yurchak at gmail.com Fri Nov 4 05:28:13 2016
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Fri, 4 Nov 2016 10:28:13 +0100
Subject: [scikit-learn] hierarchical clustering
In-Reply-To:
References:
Message-ID: <581C54AD.8040803@gmail.com>
Hi Jaime,
Alternatively, in scikit-learn, I think you could use
hac = AgglomerativeClustering(n_clusters, linkage="ward")
hac.fit(data)
clusters = hac.labels_
There is an example of how to plot a dendrogram from this in
https://github.com/scikit-learn/scikit-learn/pull/3464
AgglomerativeClustering internally calls scikit-learn's version of
cut_tree. I would be curious to know whether this is equivalent to
scipy's fcluster.
Roman
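[Roman's question is easy to probe empirically. On well-separated blobs the two routes agree up to label permutation; the data below is made up for a quick check, not a proof of equivalence in general.]

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0.0, 3.0, 6.0)])

# SciPy route: cut the ward tree at the 3-cluster threshold.
Z = linkage(data, 'ward')
n_clusters = 3
threshold = (Z[-n_clusters, 2] + Z[-n_clusters + 1, 2]) / 2
scipy_labels = fcluster(Z, threshold, 'distance')

# scikit-learn route: same ward criterion via AgglomerativeClustering.
sk_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(data).labels_

# Compare the partitions irrespective of how each library numbers clusters.
def partition(labels):
    return {frozenset(np.flatnonzero(labels == v)) for v in np.unique(labels)}
```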
From gael.varoquaux at normalesup.org Fri Nov 4 05:36:49 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 4 Nov 2016 10:36:49 +0100
Subject: [scikit-learn] hierarchical clustering
In-Reply-To: <581C54AD.8040803@gmail.com>
References:
<581C54AD.8040803@gmail.com>
Message-ID: <20161104093649.GA137008@phare.normalesup.org>
> AgglomerativeClustering internally calls scikit learn's version of
> cut_tree. I would be curious to know whether this is equivalent to
> scipy's fcluster.
It differs in that it enables adding connectivity constraints.
From m.marcinmichal at gmail.com Fri Nov 4 06:45:39 2016
From: m.marcinmichal at gmail.com (Marcin Mirończuk)
Date: Fri, 4 Nov 2016 11:45:39 +0100
Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf
Message-ID:
Hi,
In our experiments we use Multinomial Naive Bayes (MNB). Traditional
MNB assumes TF (term-frequency) weights for the words. The documentation,
http://scikit-learn.org/stable/modules/naive_bayes.html, describes
Multinomial Naive Bayes as "... where the data are typically represented
as word vector counts, although tf-idf vectors are also known to work well
in practice". The "word vector counts" are TF, which is well known. Our
question concerns the "tf-idf vectors": in this case, was the approach of
J. D. M. Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text
Classifiers", implemented? The documentation does not cite any source for
this.
Best,
--
Marcin M.
From jalopcar at gmail.com Fri Nov 4 09:15:37 2016
From: jalopcar at gmail.com (Jaime Lopez Carvajal)
Date: Fri, 4 Nov 2016 08:15:37 -0500
Subject: [scikit-learn] hierarchical clustering
In-Reply-To: <581C54AD.8040803@gmail.com>
References:
<581C54AD.8040803@gmail.com>
Message-ID:
Hi Roman,
I will check that function too.
Thanks for the help.
Have a good day, Jaime
--
*Jaime Lopez Carvajal*
From t3kcit at gmail.com Fri Nov 4 10:43:36 2016
From: t3kcit at gmail.com (Andy)
Date: Fri, 4 Nov 2016 09:43:36 -0500
Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf
In-Reply-To:
References:
Message-ID: <68852b61-b1be-7e76-31e9-b5d8caac9b9f@gmail.com>
On 11/04/2016 05:45 AM, Marcin Mirończuk wrote:
No, I think that paper implements something slightly different. The
documentation says that you can apply the TfidfVectorizer instead of
CountVectorizer and it can still work.
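[In other words, nothing more elaborate than swapping the vectorizer is meant. A minimal sketch; the toy corpus and spam/ham labels are invented for illustration:]

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting agenda attached",
        "win free money today", "project meeting notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy labels)

# MultinomialNB only requires non-negative features, so tf-idf values
# drop in for raw counts without changing the estimator at all.
counts_nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
tfidf_nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
```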
From brookm291 at gmail.com Fri Nov 4 16:43:59 2016
From: brookm291 at gmail.com (KevNo)
Date: Sat, 05 Nov 2016 05:43:59 +0900
Subject: [scikit-learn] Recurrent Decision Tree
Message-ID: <581CF30F.9040802@gmail.com>
Just wondering if recurrent decision trees have been investigated
by scikit-learn previously.
My main interest is in path-dependent (time-series) problems, where
recurrence is often necessary to model the path-dependent state.
In other words, a wrong prediction will affect the subsequent predictions.
Here is a research paper on recurrent decision trees,
from Walt Disney Research (!):
https://goo.gl/APGpvM
Any thoughts are welcome.
Thanks
Brookm
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From Dale.T.Smith at macys.com Mon Nov 7 08:10:03 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Mon, 7 Nov 2016 13:10:03 +0000
Subject: [scikit-learn] Recurrent Decision Tree
In-Reply-To: <581CF30F.9040802@gmail.com>
References: <581CF30F.9040802@gmail.com>
Message-ID:
Searching the mailing list would be the best way to find out this information.
It may be in the contrib packages on GitHub - have you checked?
__________________________________________________________________________________________________________________________________________
Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of KevNo
Sent: Friday, November 4, 2016 4:44 PM
To: scikit-learn at python.org
Subject: [scikit-learn] Recurrent Decision Tree
Just wondering whether Recurrent Decision Trees have been investigated
by scikit-learn previously.
My main interest is in path-dependent (time series) problems, where
recurrence is often necessary to model the path-dependent state.
In other words, a wrong prediction will affect the subsequent predictions.
Here is a research paper on Recurrent Decision Trees,
from Walt Disney Research (!)
https://goo.gl/APGpvM
Any thoughts are welcome.
Thanks
Brookm
From ragvrv at gmail.com Mon Nov 7 09:51:11 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 7 Nov 2016 15:51:11 +0100
Subject: [scikit-learn] Recurrent Decision Tree
In-Reply-To:
References: <581CF30F.9040802@gmail.com>
Message-ID:
Hi,
The reference paper seems pretty new, with very few citations. Check our FAQ
on the inclusion criteria for new algorithms -
http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
On Mon, Nov 7, 2016 at 2:10 PM, Dale T Smith wrote:
> Searching the mailing list would be the best way to find out this
> information.
>
>
>
> It may be in the contrib packages on GitHub - have you checked?
--
Raghav RV
https://github.com/raghavrv
From brookm291 at gmail.com Mon Nov 7 12:17:56 2016
From: brookm291 at gmail.com (KevNo)
Date: Tue, 08 Nov 2016 02:17:56 +0900
Subject: [scikit-learn] Recurrent Decision Tree
In-Reply-To:
References:
Message-ID: <5820B744.9080800@gmail.com>
This has nothing to do with scikit-learn's inclusion criteria....
This is about the scientific/mathematical view: a Recurrent Decision Tree
is a specific kind of tree by nature
(you cannot apply the standard algorithms to it).
I suppose very few people have experience with recurrence in
decision trees...
Raghav R V wrote:
> The reference paper seems pretty new with very few citations. Check our FAQ
> on inclusion criteria -
> http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
From jmschreiber91 at gmail.com Mon Nov 7 13:08:51 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Mon, 7 Nov 2016 10:08:51 -0800
Subject: [scikit-learn] Recurrent Decision Tree
In-Reply-To: <5820B744.9080800@gmail.com>
References:
<5820B744.9080800@gmail.com>
Message-ID:
It hasn't been investigated by the sklearn team to my knowledge. As Dale
said, there may be an independent implementation out there but not
officially related to sklearn.
On Mon, Nov 7, 2016 at 9:17 AM, KevNo wrote:
> This has nothing to do with scikit-learn's inclusion criteria....
>
> This is about the scientific/mathematical view: a Recurrent Decision Tree
> is a specific kind of tree by nature (you cannot apply the standard
> algorithms to it).
>
> I suppose very few people have experience with recurrence in decision
> trees...
From alessio.quaglino at usi.ch Tue Nov 8 10:10:07 2016
From: alessio.quaglino at usi.ch (Quaglino Alessio)
Date: Tue, 8 Nov 2016 15:10:07 +0000
Subject: [scikit-learn] GPR intervals and MCMC
Message-ID: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch>
Hello,
I am using scikit-learn 0.18 for GP regression. I really like it and all works great, but I have doubts concerning the confidence intervals computed by predict(X, return_std=True):
- Are they true confidence intervals (i.e., of the mean / latent function), or are they in fact prediction intervals? I tried computing the prediction intervals using sample_y(X) and I get the same answer as that returned by predict(X, return_std=True).
- My understanding is therefore that scikit-learn is not fully Bayesian, i.e. it does not compute probability distributions for the parameters, but rather the values that maximize the likelihood?
- If I want the confidence interval, is my best option to use an external MCMC sampler such as PyMC?
Thank you in advance!
Regards,
-------------------------------------------------
Dr. Alessio Quaglino
Postdoctoral Researcher
Institute of Computational Science
Università della Svizzera italiana
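The comparison described in the question can be reproduced in a few lines (a minimal sketch on a toy 1-D dataset, not the actual code in question). Both predict(X, return_std=True) and the spread of sample_y(X) draws come from the same posterior over the latent function, under point-estimate (maximum-marginal-likelihood) hyperparameters, so their standard deviations should agree up to Monte Carlo error:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(20)

# alpha models the observation-noise variance; the RBF hyperparameters
# are point estimates found by maximizing the log marginal likelihood.
gpr = GaussianProcessRegressor(kernel=RBF(), alpha=0.1 ** 2).fit(X, y)

X_test = np.linspace(0, 5, 10)[:, None]
mean, std = gpr.predict(X_test, return_std=True)

# Empirical std over many posterior draws of the latent function f
draws = gpr.sample_y(X_test, n_samples=5000, random_state=1)
emp_std = draws.std(axis=1)

# The two agree: both describe the posterior of the latent function f
# under fixed hyperparameters, not a distribution over hyperparameters.
print(np.max(np.abs(std - emp_std)))
```

To turn this into a prediction interval for new noisy observations, the noise variance (alpha) would have to be added to std**2 by hand; for uncertainty over the hyperparameters themselves an external sampler is indeed needed.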
From vaggi.federico at gmail.com Tue Nov 8 10:19:35 2016
From: vaggi.federico at gmail.com (federico vaggi)
Date: Tue, 08 Nov 2016 15:19:35 +0000
Subject: [scikit-learn] GPR intervals and MCMC
In-Reply-To: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch>
References: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch>
Message-ID:
Hi,
if you want the full posterior distribution over the values of the
hyperparameters, there is a good example of how to do that with George +
emcee, another GP package for Python:
http://dan.iel.fm/george/current/user/hyper/
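Staying within scikit-learn, the same idea can be sketched with a tiny hand-rolled random-walk Metropolis sampler (an illustrative sketch with an assumed flat box prior, not the George + emcee example above): GaussianProcessRegressor.log_marginal_likelihood(theta) evaluates the likelihood in log-hyperparameter space, which is all a sampler needs:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(30)

gpr = GaussianProcessRegressor(kernel=RBF(), alpha=0.1 ** 2).fit(X, y)

# theta lives in log-space (here a single log length-scale);
# the prior below is an assumed flat prior on a wide box.
def log_post(theta):
    if np.any(np.abs(theta) > 10):
        return -np.inf
    return gpr.log_marginal_likelihood(theta)

# Tiny random-walk Metropolis over the kernel hyperparameters,
# started at the maximum-marginal-likelihood point estimate.
theta = gpr.kernel_.theta.copy()
lp = log_post(theta)
samples = []
for _ in range(2000):
    prop = theta + 0.1 * rng.randn(theta.shape[0])
    lp_prop = log_post(prop)
    if np.log(rng.rand()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta.copy())
samples = np.asarray(samples)

# Posterior mean of the length scale (exp undoes the log parametrization)
print(np.exp(samples[:, 0]).mean())
```

This is only a toy sampler (no burn-in, no convergence diagnostics); for real use a dedicated package such as emcee or PyMC is the better choice.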
On Tue, 8 Nov 2016 at 16:10 Quaglino Alessio
wrote:
> Hello,
>
> I am using scikit-learn 0.18 for doing GP regressions. I really like it
> and all works great, but I am having doubts concerning the confidence
> intervals computed by predict(X,return_std=True):
>
> - Are they true confidence intervals (i.e. of the mean / latent function)
> or they are in fact prediction intervals? I tried computing the prediction
> intervals using sample_y(X) and I get the same answer as that returned by
> predict(X,return_std=True).
>
> - My understanding is therefore that scikit-learn is not fully Bayesian,
> i.e. it does not compute probability distributions for the parameters, but
> rather the values that maximize the likelihood?
>
> - If I want the confidence interval, is my best option to use an external
> MCMC optimizer such as PyMC?
>
> Thank you in advance!
>
> Regards,
> -------------------------------------------------
> Dr. Alessio Quaglino
> Postdoctoral Researcher
> Institute of Computational Science
> Università della Svizzera italiana
>
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From michael.eickenberg at gmail.com Tue Nov 8 10:24:01 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Tue, 8 Nov 2016 16:24:01 +0100
Subject: [scikit-learn] GPR intervals and MCMC
In-Reply-To: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch>
References: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch>
Message-ID:
Dear Alessio,
if it helps, the implementation quite strictly follows what is described in
GPML: http://www.gaussianprocess.org/gpml/chapters/
https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/gaussian_process/gpr.py#L23
Hyperparameter optimization is done by gradient descent.
Michael
On Tue, Nov 8, 2016 at 4:10 PM, Quaglino Alessio
wrote:
> Hello,
>
> I am using scikit-learn 0.18 for doing GP regressions. I really like it
> and all works great, but I am having doubts concerning the confidence
> intervals computed by predict(X,return_std=True):
>
> - Are they true confidence intervals (i.e. of the mean / latent function)
> or they are in fact prediction intervals? I tried computing the prediction
> intervals using sample_y(X) and I get the same answer as that returned by
> predict(X,return_std=True).
>
> - My understanding is therefore that scikit-learn is not fully Bayesian,
> i.e. it does not compute probability distributions for the parameters, but
> rather the values that maximize the likelihood?
>
> - If I want the confidence interval, is my best option to use an external
> MCMC optimizer such as PyMC?
>
> Thank you in advance!
>
> Regards,
> -------------------------------------------------
> Dr. Alessio Quaglino
> Postdoctoral Researcher
> Institute of Computational Science
> Università della Svizzera italiana
>
>
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From a.suchaneck at gmail.com Fri Nov 11 05:23:12 2016
From: a.suchaneck at gmail.com (Anton Suchaneck)
Date: Fri, 11 Nov 2016 11:23:12 +0100
Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function
- Classifier Interface?
Message-ID:
Hi!
I tried writing a ThresholdClassifier that wraps any classifier with
predict_proba() and, based on a cost function, adjusts the threshold for
predict(). This helps with imbalanced data.
My current cost function assigns +cost_factor for a true positive and -1 for
a false positive.
It seems to run, but I'm not sure I got the API for a classifier right.
Can you tell me whether this is how the functions should be implemented to
play together with other parts of sklearn?
In particular, the parameter handling for base.clone, both in the class
__init__ and in set_params(), seemed weird.
Here is the code. The class ThresholdClassifier wraps a clf, a RandomForest
in this case.
Anton
from sklearn.base import BaseEstimator, ClassifierMixin
from functools import partial

import numpy as np


def find_threshold_cost_factor(clf, X, y, cost_factor):
    y_pred = clf.predict_proba(X)
    top_score = 0
    top_threshold = None
    cur_score = 0
    # Walk through the samples by decreasing predicted probability and
    # keep the threshold with the best cumulative cost.
    for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True):  # FIXME: assumes 2 classes
        if y_el == 0:
            cur_score -= 1
        if y_el == 1:
            cur_score += cost_factor
        if cur_score > top_score:
            top_score = cur_score
            top_threshold = y_pred_el
    return top_threshold, top_score


class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, clf, find_threshold, **params):
        self.clf = clf
        self.find_threshold = find_threshold
        self.threshold = None
        self.set_params(**params)

    def score(self, X, y, sample_weight=None):
        _threshold, score = self.find_threshold(self.clf, X, y)
        return score

    def fit(self, X, y):
        self.clf.fit(X, y)
        self.threshold, _score = self.find_threshold(self.clf, X, y)
        self.classes_ = self.clf.classes_
        return self

    def predict(self, X):
        y_score = self.clf.predict_proba(X)
        return np.array(y_score[:, 1] >= self.threshold)  # FIXME: assumes 2 classes

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

    def set_params(self, **params):
        for param_name in ["clf", "find_threshold", "threshold"]:
            if param_name in params:
                setattr(self, param_name, params[param_name])
                del params[param_name]
        self.clf.set_params(**params)
        return self

    def get_params(self, deep=True):
        params = {"clf": self.clf, "find_threshold": self.find_threshold,
                  "threshold": self.threshold}
        params.update(self.clf.get_params(deep))
        return params


if __name__ == '__main__':
    import random

    from sklearn.grid_search import RandomizedSearchCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.cross_validation import train_test_split
    from sklearn.metrics import confusion_matrix

    np.random.seed(111)
    random.seed(111)

    X, y = make_classification(1000,
                               n_features=20,
                               n_informative=4,
                               n_redundant=0,
                               n_repeated=0,
                               n_clusters_per_class=4,
                               # class_sep=0.5,
                               weights=[0.90])

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.3,
                                                        stratify=y)

    for cost in [10]:
        find_threshold = partial(find_threshold_cost_factor, cost_factor=cost)

        def scorer(clf, X, y):
            return find_threshold(clf, X, y)[1]

        clfs = [RandomizedSearchCV(
            ThresholdClassifier(RandomForestClassifier(), find_threshold),
            {"n_estimators": [100, 200],
             "criterion": ["entropy"],
             "min_samples_leaf": [1, 5],
             "class_weight": ["balanced", None]},
            cv=3,
            scoring=scorer,  # get rid of this by letting the classifier report its cost-based score?
            n_iter=8,
            n_jobs=4)]

        for clf in clfs:
            clf.fit(X_train, y_train)
            clf_best = clf.best_estimator_
            print(clf_best, cost, clf_best.score(X_test, y_test))
            print(confusion_matrix(y_test, clf_best.predict(X_test)))
            # print(find_threshold(clf_best, X_train, y_train))
            # print(clf_best.threshold,
            #       sorted(zip(clf_best.predict_proba(X_train)[:, 1], y_train),
            #              reverse=True)[:20])
From t3kcit at gmail.com Fri Nov 11 13:09:32 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 11 Nov 2016 13:09:32 -0500
Subject: [scikit-learn] Automatic ThresholdClassifier based on
cost-function - Classifier Interface?
In-Reply-To:
References:
Message-ID: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
Hi.
You don't have to implement set_params and get_params if you inherit
from BaseEstimator.
I find it weird that you pass the find_threshold function as a
constructor parameter, but otherwise the API looks OK.
You are not allowed to use **kwargs in __init__, though.
Andy
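A minimal sketch of the convention described above (hypothetical names, not Anton's full class): when __init__ takes only explicit keyword arguments and stores them unchanged on self, BaseEstimator derives get_params and set_params by introspecting the constructor signature, which is what clone() and the search utilities rely on:

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.tree import DecisionTreeClassifier


class ThresholdWrapper(BaseEstimator, ClassifierMixin):
    # No **kwargs: every constructor parameter is an explicit keyword
    # argument, stored on self under the same name and nothing else.
    def __init__(self, clf=None, find_threshold=None):
        self.clf = clf
        self.find_threshold = find_threshold

    def fit(self, X, y):
        self.clf.fit(X, y)
        # Fitted state goes in trailing-underscore attributes set in fit(),
        # not in constructor parameters.
        self.threshold_, _ = self.find_threshold(self.clf, X, y)
        self.classes_ = self.clf.classes_
        return self


# get_params / set_params come for free from BaseEstimator:
w = ThresholdWrapper(clf=DecisionTreeClassifier())
print(sorted(w.get_params(deep=False)))  # the two constructor params
w.set_params(clf__max_depth=3)           # nested params reach the inner clf
print(w.clf.max_depth)
clone(w)                                 # works because __init__ is clean
```

Because the constructor does nothing but store its arguments, clone() can rebuild the estimator from get_params() alone; the threshold then becomes fitted state (threshold_) rather than a constructor parameter.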
On 11/11/2016 05:23 AM, Anton Suchaneck wrote:
> Hi!
>
> I tried writing a ThresholdClassifier that wraps any classifier with
> predict_proba() and, based on a cost function, adjusts the threshold for
> predict(). This helps with imbalanced data.
> My current cost function assigns +cost_factor for a true positive and -1
> for a false positive.
> It seems to run, but I'm not sure I got the API for a classifier right.
>
> Can you tell me whether this is how the functions should be implemented
> to play together with other parts of sklearn?
>
> In particular, the parameter handling for base.clone, both in the class
> __init__ and in set_params(), seemed weird.
>
> Here is the code. The class ThresholdClassifier wraps a clf, a
> RandomForest in this case.
>
> Anton
>
> from sklearn.base import BaseEstimator, ClassifierMixin
> from functools import partial
>
> def find_threshold_cost_factor(clf, X, y, cost_factor):
> y_pred = clf.predict_proba(X)
>
> top_score = 0
> top_threshold = None
> cur_score=0
> for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True):
> # FIXME: assumes 2 classes
> if y_el == 0:
> cur_score -= 1
> if y_el == 1:
> cur_score += cost_factor
> if cur_score > top_score:
> top_score = cur_score
> top_threshold = y_pred_el
> return top_threshold, top_score
>
>
> class ThresholdClassifier(BaseEstimator, ClassifierMixin):
> def __init__(self, clf, find_threshold, **params):
> self.clf = clf
> self.find_threshold = find_threshold
> self.threshold = None
> self.set_params(**params)
>
> def score(self, X, y, sample_weight=None):
> _threshold, score = self.find_threshold(self.clf, X, y)
> return score
>
> def fit(self, X, y):
> self.clf.fit(X, y)
> self.threshold, _score=self.find_threshold(self.clf, X, y)
> self.classes_ = self.clf.classes_
>
> def predict(self, X):
> y_score=self.clf.predict_proba(X)
> return np.array(y_score[:,1] >= self.threshold)  # FIXME: assumes 2 classes
>
> def predict_proba(self, X):
> return self.clf.predict_proba(X)
>
> def set_params(self, **params):
> for param_name in ["clf", "find_threshold", "threshold"]:
> if param_name in params:
> setattr(self, param_name, params[param_name])
> del params[param_name]
> self.clf.set_params(**params)
> return self
>
> def get_params(self, deep=True):
> params={"clf":self.clf, "find_threshold": self.find_threshold,
> "threshold":self.threshold}
> params.update(self.clf.get_params(deep))
> return params
>
>
> if __name__ == '__main__':
> import numpy as np
> import random
> from sklearn.grid_search import RandomizedSearchCV
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.datasets import make_classification
> from sklearn.cross_validation import train_test_split
> from sklearn.metrics import make_scorer, classification_report,
> confusion_matrix
>
> np.random.seed(111)
> random.seed(111)
>
> X, y = make_classification(1000,
> n_features=20,
> n_informative=4,
> n_redundant=0,
> n_repeated=0,
> n_clusters_per_class=4,
> # class_sep=0.5,
> weights=[0.90]
> )
>
> X_train, X_test, y_train, y_test = train_test_split(X, y,
> test_size=0.3, stratify=y)
>
> for cost in [10]:
> find_threshold = partial(find_threshold_cost_factor, cost_factor=cost)
>
> def scorer(clf, X, y):
> return find_threshold(clf, X, y)[1]
>
> clfs = [RandomizedSearchCV(
> ThresholdClassifier(RandomForestClassifier(), find_threshold),
> {"n_estimators": [100, 200],
> "criterion": ["entropy"],
> "min_samples_leaf": [1, 5],
> "class_weight": ["balanced", None],
> },
> cv=3,
> scoring=scorer,  # Get rid of this by letting the classifier report its cost-based score?
> n_iter=8,
> n_jobs=4),
> ]
>
> for clf in clfs:
> clf.fit(X_train, y_train)
> clf_best = clf.best_estimator_
> print(clf_best, cost, clf_best.score(X_test, y_test))
> print(confusion_matrix(y_test, clf_best.predict(X_test)))
> #print(find_threshold(clf_best, X_train, y_train))
> #print(clf_best.threshold,
> #      sorted(zip(clf_best.predict_proba(X_train)[:,1], y_train),
> #             reverse=True)[:20])
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From a.suchaneck at gmail.com Sat Nov 12 04:17:29 2016
From: a.suchaneck at gmail.com (Anton)
Date: Sat, 12 Nov 2016 10:17:29 +0100
Subject: [scikit-learn] Automatic ThresholdClassifier based on
cost-function - Classifier Interface?
In-Reply-To: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
References:
<89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
Message-ID: <1478942249.8133.0@smtp.gmail.com>
Hi Andy!
Thank you for your feedback!
You say I shouldn't use __init__(**params), and it makes total sense
and would make my code much simpler. However,
sklearn 0.18, base.clone, line 70: new_object =
klass(**new_object_params)
(called from RandomizedSearchCV)
screws you over since it passes the parameters to __init__(). I
expected the usage of set_params() here, but I'm getting my gridsearch
parameters passed to __init__().
Is this intended?
Note that I'm just wrapping a clf, so I have to pass the parameters
through to self.clf, right? No one can know that I'm storing it in
self.clf.
Therefore set_params() needs to be implemented and cannot be inherited?!
My meta-classifier will find the optimal threshold upon .fit(). This
procedure depends on how to interpret what is optimal, and this is what
find_threshold_cost_function is for.
One last question: Is self.classes_ a necessary part of the API (I
realize I forgot the underscore), and am I missing any other API detail
I need to add for a binary classifier?
Regards,
Anton
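Anton's observation is easy to reproduce: clone() rebuilds an estimator roughly as type(est)(**est.get_params(deep=False)), so parameters do go through __init__ rather than set_params(). A small sketch, using LogisticRegression purely as an arbitrary example estimator:

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

est = LogisticRegression(C=10.0)
# clone() internally does something like:
#   LogisticRegression(**est.get_params(deep=False))
copy = clone(est)

assert copy is not est                 # a fresh, unfitted instance
assert copy.get_params()["C"] == 10.0  # constructor params carried over
```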
On Fri, 11 Nov 2016, at 7:09, Andreas Mueller wrote:
> Hi.
> You don't have to implement set_params and get_params if you inherit
> from BaseEstimator.
> I find it weird that you pass find_threshold_cost_function as a
> constructor parameter but otherwise the API looks ok.
> You are not allowed to use **kwargs in __init__, though.
>
> Andy
>
> On 11/11/2016 05:23 AM, Anton Suchaneck wrote:
>> Hi!
>>
>> I tried writing a ThresholdClassifier, that wraps any classifier
>> with predict_proba() and based on a cost function adjusts the
>> threshold for predict(). This helps for imbalanced data.
>> My current cost function assigns cost +cost for a true positive and
>> -1 for a false positive.
>> It seems to run, but I'm not sure if I got the API for a classifier
>> right.
>>
>> Can you tell me whether this is how the functions should be
>> implemented to play together with other parts of sklearn?
>>
>> Especially parameter settings for base.clone both in klass.__init__
>> and .set_params() seemed weird.
>>
>> Here is the code. The class ThresholdClassifier wraps a clf.
>> RandomForest in this case.
>>
>> Anton
>>
From t3kcit at gmail.com Sun Nov 13 17:37:17 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Sun, 13 Nov 2016 17:37:17 -0500
Subject: [scikit-learn] Automatic ThresholdClassifier based on
cost-function - Classifier Interface?
In-Reply-To: <1478942249.8133.0@smtp.gmail.com>
References:
<89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
<1478942249.8133.0@smtp.gmail.com>
Message-ID:
On 11/12/2016 04:17 AM, Anton wrote:
> screws you over since it passes the parameters to __init__(). I
> expected the usage of set_params() here, but I'm getting my gridsearch
> parameters passed to __init__().
> Is this intended?
>
I don't know what you mean by "screws you over". You just have to
explicitly list all parameters.
> Note that I'm just wrapping a clf, so that I have to pass through the
> parameters to self.clf, right? No-one can know that I'm storing it in
> self.clf.
> Therefore set_params needs to be implemented and cannot be inherited?!
right, if you don't want to use ``clf__params``
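The ``clf__<param>`` routing Andy mentions works as soon as the wrapped estimator is itself a constructor parameter; a minimal sketch (Wrapper is a made-up class for illustration, not Anton's):

```python
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier

class Wrapper(BaseEstimator):
    def __init__(self, clf=None):
        self.clf = clf

w = Wrapper(clf=RandomForestClassifier(n_estimators=10))
# BaseEstimator.set_params splits the key on '__' and routes the
# remainder to the nested estimator stored in self.clf:
w.set_params(clf__n_estimators=50)
assert w.clf.n_estimators == 50
```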
From t3kcit at gmail.com Sun Nov 13 18:21:05 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Sun, 13 Nov 2016 18:21:05 -0500
Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!
Message-ID:
Hey all.
I just published the 0.18.1 wheels and source tarball to pypi.
The 0.18.1 release is a bugfix release, resolving some issues introduced
in 0.18 and also some earlier issues.
In particular there were some important fixes relating to the new
model_selection module.
You can find the whole changelog (which I just realized does not contain
all the fixes) here:
http://scikit-learn.org/stable/whats_new.html#version-0-18-1
Best,
Andy
From joel.nothman at gmail.com Sun Nov 13 18:42:54 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 Nov 2016 10:42:54 +1100
Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!
In-Reply-To:
References:
Message-ID:
Thanks, Andy.
As Andy said, this upgrade is strongly recommended. Due to a long-term bug
in Numpy (and insufficient testing on our part), the new
model_selection.GridSearchCV etc. could not be pickled. There were also
issues with the use of iterators for cross-validation splitters. But there
are a lot of other valuable fixes in there too.
Please everyone, tell us there are no more bugs! :P
On 14 November 2016 at 10:21, Andreas Mueller wrote:
> Hey all.
> I just published the 0.18.1 wheels and source tarball to pypi.
> The 0.18.1 release is a bugfix release, resolving some issues introduced
> in 0.18 and also some earlier issues.
> In particular there were some important fixes relating to the new
> model_selection module.
>
> You can find the whole changelog (which I just realized does not contain
> all the fixes) here:
> http://scikit-learn.org/stable/whats_new.html#version-0-18-1
>
> Best,
> Andy
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From t3kcit at gmail.com Sun Nov 13 20:51:01 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Sun, 13 Nov 2016 17:51:01 -0800
Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!
In-Reply-To:
References:
Message-ID:
Yeah it would be great if someone could update the whatsnew with the
complete list of fixed issues. Unfortunately I'm a bit overloaded right now.
Sent from phone. Please excuse spelling and brevity.
On Nov 13, 2016 18:44, "Joel Nothman" wrote:
> Thanks, Andy.
>
> As Andy said, this upgrade is strongly recommended. Due to a long-term bug
> in Numpy (and insufficient testing on our part), the new
> model_selection.GridSearchCV etc could not be pickled. There were also
> issues with the use of iterators for cross-validation splitters. But there
> are a lot of other valuable fixes in there too.
>
> Please everyone, tell us there are no more bugs! :P
>
From a.suchaneck at gmail.com Mon Nov 14 01:29:12 2016
From: a.suchaneck at gmail.com (Anton)
Date: Mon, 14 Nov 2016 07:29:12 +0100
Subject: [scikit-learn] Automatic ThresholdClassifier based on
cost-function - Classifier Interface?
In-Reply-To:
References:
<89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
<1478942249.8133.0@smtp.gmail.com>
Message-ID: <1479104952.9218.0@smtp.gmail.com>
>
>> screws you over since it passes the parameters to __init__(). I
>> expected the usage of set_params() here, but I'm getting my
>> gridsearch parameters passed to __init__().
>> Is this intended?
>>
> I don't know what you mean by "screws you over". You just have to
> explicitly list all parameters.
There is a hidden assumption that, next to set_params(), some methods may
alternatively use __init__() to set parameters. That's why I had to
jump through multiple hoops to get a meta-classifier which
transparently shadows all variables. Usually you would expect all
parts to stick to the convention of using only one way to set
parameters (set_params()).
From tevang3 at gmail.com Mon Nov 14 06:14:06 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 14 Nov 2016 12:14:06 +0100
Subject: [scikit-learn] suggested classification algorithm
Message-ID:
Greetings,
I want to design a program that can deal with classification problems of
the same type, where the number of positive observations is small but the
number of negative ones is much larger. In numbers: the number of positive
observations would usually range between 2 and 20, and the number of
negative ones could be at least 30 times larger. The number of features
could be between 2 and 20 too, but that could be reduced using feature
selection and elimination algorithms. I've read in the documentation that
some algorithms like the SVM are still effective when the number of
dimensions is greater than the number of samples, but I am not sure if
they are suitable for my case. Moreover, according to this figure, Nearest
Neighbors is the best and second is the RBF SVM:
http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
However, I assume that Nearest Neighbors would not be effective in my
case, where the number of positive observations is very low. For these
reasons I would like to know your expert opinion on which classification
algorithm I should try first.
thanks in advance
Thomas
--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic
email: tevang at pharm.uoa.gr
tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
From joel.nothman at gmail.com Mon Nov 14 06:20:22 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 Nov 2016 22:20:22 +1100
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To:
References:
Message-ID:
http://contrib.scikit-learn.org/imbalanced-learn/ might be of interest to
you.
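Besides imbalanced-learn, a plain scikit-learn baseline for this kind of imbalance is class_weight='balanced', which upweights errors on the rare class instead of resampling. A sketch on a synthetic dataset shaped roughly like the one described (about 1 positive per 30 negatives); the exact numbers are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# ~3% positives, mimicking a roughly 1:30 class ratio
X, y = make_classification(n_samples=310, n_features=10,
                           weights=[0.97], random_state=0)

# class_weight='balanced' scales the per-class error penalty inversely
# to class frequency, so the few positives are not drowned out
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)
print(int((y == 1).sum()), "positives out of", len(y))
```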
From t3kcit at gmail.com Mon Nov 14 08:39:47 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 14 Nov 2016 08:39:47 -0500
Subject: [scikit-learn] Automatic ThresholdClassifier based on
cost-function - Classifier Interface?
In-Reply-To: <1479104952.9218.0@smtp.gmail.com>
References:
<89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com>
<1478942249.8133.0@smtp.gmail.com>
<1479104952.9218.0@smtp.gmail.com>
Message-ID: <61e3dba3-f315-452d-9280-7f1f0cf06cb1@gmail.com>
On 11/14/2016 01:29 AM, Anton wrote:
>>
>>> screws you over since it passes the parameters to __init__(). I
>>> expected the usage of set_params() here, but I'm getting my
>>> gridsearch parameters passed to __init__().
>>> Is this intended?
>>>
>> I don't know what you mean by "screws you over". You just have to
>> explicitly list all parameters.
>
> There is a hidden assumption that, next to set_params(), some methods may
> alternatively use __init__() to set parameters. That's why I had to
> jump through multiple hoops to get a meta-classifier which
> transparently shadows all variables. Usually you would expect all
> parts to stick to the convention of using only one way to set
> parameters (set_params()).
>
Why would you expect that? Given the way clone works, basically no part
of scikit-learn does that.
Have you read
http://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator
?
From gael.varoquaux at normalesup.org Mon Nov 14 12:29:16 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 14 Nov 2016 18:29:16 +0100
Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!
In-Reply-To:
References:
Message-ID: <20161114172916.GJ1918706@phare.normalesup.org>
Thank you so much to Andy and the others who made this .1 release possible.
It brings huge value in ensuring quality.
Gaël
On Sun, Nov 13, 2016 at 06:21:05PM -0500, Andreas Mueller wrote:
> Hey all.
> I just published the 0.18.1 wheels and source tarball to pypi.
> The 0.18.1 release is a bugfix release, resolving some issues introduced in
> 0.18 and also some earlier issues.
> In particular there were some important fixes relating to the new model_selection
> module.
> You can find the whole changelog (which I just realized does not contain all
> the fixes) here:
> http://scikit-learn.org/stable/whats_new.html#version-0-18-1
> Best,
> Andy
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
From ragvrv at gmail.com Tue Nov 15 10:30:53 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Tue, 15 Nov 2016 16:30:53 +0100
Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!
In-Reply-To: <20161114172916.GJ1918706@phare.normalesup.org>
References:
<20161114172916.GJ1918706@phare.normalesup.org>
Message-ID:
Hurray :D
Thanks heaps Andy, Joel and the whole team!
On Mon, Nov 14, 2016 at 6:29 PM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:
> Thank you so much Andy and the others that made this .1 release possible.
> It brings huge value in ensuring quality.
>
> Gaël
>
--
Raghav RV
https://github.com/raghavrv
From avn at mccme.ru Wed Nov 16 05:58:26 2016
From: avn at mccme.ru (avn at mccme.ru)
Date: Wed, 16 Nov 2016 13:58:26 +0300
Subject: [scikit-learn] Including figures from scikit-learn documentation in
scientific publications
Message-ID:
Hello,
I'm writing a paper meant for submission to TPAMI and would like to
include that wonderful clustering algorithm comparison figure found
in the scikit-learn documentation
(http://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png).
So, my question is: can I include the figure PNG file in my paper
directly (with proper reference, of course) or should I only provide a
reference to this figure?
With best regards,
-- Valery
From gael.varoquaux at normalesup.org Wed Nov 16 06:08:32 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 Nov 2016 12:08:32 +0100
Subject: [scikit-learn] Including figures from scikit-learn
documentation in scientific publications
In-Reply-To:
References:
Message-ID: <20161116110832.GD3227973@phare.normalesup.org>
Grabbing the PNG and including a reference is perfectly fine.
I think that the right way would be to cite the paper and the URL of the
page where the figure is.
From avn at mccme.ru Wed Nov 16 06:34:17 2016
From: avn at mccme.ru (avn at mccme.ru)
Date: Wed, 16 Nov 2016 14:34:17 +0300
Subject: [scikit-learn] Including figures from scikit-learn
documentation in scientific publications
In-Reply-To: <20161116110832.GD3227973@phare.normalesup.org>
References:
<20161116110832.GD3227973@phare.normalesup.org>
Message-ID:
Ok, I'll provide a reference to
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
(the paper is cited anyway, since scikit-learn is used in my work).
I hope this URL is not going to change in future releases of
scikit-learn.
Thanks for the answer, Gael!
Gael Varoquaux wrote on 2016-11-16 14:08:
> Grabbing the PNG and including a reference is perfectly fine.
>
> I think that the right way would be to cite the paper and the URL of
> the
> page where the figure is.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From nfliu at uw.edu Wed Nov 16 12:32:19 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Wed, 16 Nov 2016 09:32:19 -0800
Subject: [scikit-learn] Including figures from scikit-learn
documentation in scientific publications
In-Reply-To:
References:
<20161116110832.GD3227973@phare.normalesup.org>
Message-ID:
It might be worthwhile to put a reference to
http://scikit-learn.org/0.18/auto_examples/cluster/plot_cluster_comparison.html
instead,
in case the figure changes in future versions.
Nelson
On Wed, Nov 16, 2016 at 3:34 AM, wrote:
> Ok, I'll provide a reference to http://scikit-learn.org/stable
> /auto_examples/cluster/plot_cluster_comparison.html (the paper is anyway
> cited since scikit-learn is used in my work).
> Hope that this URL is not going to change in future releases of
> scikit-learn.
>
> Thanks for the answer, Gael!
>
> Gael Varoquaux wrote on 2016-11-16 14:08:
>
> Grabbing the PNG and including a reference is perfectly fine.
>>
>> I think that the right way would be to cite the paper and the URL of the
>> page where the figure is.
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From avn at mccme.ru Wed Nov 16 14:14:52 2016
From: avn at mccme.ru (avn at mccme.ru)
Date: Wed, 16 Nov 2016 22:14:52 +0300
Subject: [scikit-learn] Including figures from scikit-learn
documentation in scientific publications
In-Reply-To:
References:
<20161116110832.GD3227973@phare.normalesup.org>
Message-ID:
Yes, that seems to be a more appropriate URL.
Nelson Liu wrote on 2016-11-16 20:32:
> It might be worthwhile to put a reference to
> http://scikit-learn.org/0.18/auto_examples/cluster/plot_cluster_comparison.html
> instead, in case the figure changes in future versions.
>
> Nelson
>
From fernando.wittmann at gmail.com Wed Nov 16 15:10:48 2016
From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann)
Date: Wed, 16 Nov 2016 18:10:48 -0200
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To:
References:
Message-ID:
Tree-based algorithms (like Random Forest) usually work well for
imbalanced datasets. You can also take a look at the SMOTE technique (
http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to
over-sample the positive observations.
--
Fernando Marcos Wittmann
MS Student - Energy Systems Dept.
School of Electrical and Computer Engineering, FEEC
University of Campinas, UNICAMP, Brazil
+55 (19) 987-211302
From Dale.T.Smith at macys.com Wed Nov 16 15:54:21 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Wed, 16 Nov 2016 20:54:21 +0000
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To:
References:
Message-ID:
Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don't think it's necessary to repeat any of this on the mailing list.
__________________________________________________________________________________________________________________________________________
Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann
Sent: Wednesday, November 16, 2016 3:11 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] suggested classification algorithm
Tree-based algorithms (like Random Forest) usually work well on imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to over-sample the positive observations.
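A minimal sketch of the class-weighting route for a tree ensemble, using scikit-learn's RandomForestClassifier. All data here is synthetic and illustrative; the class ratio (roughly 20 positives vs. 600 negatives) loosely mirrors the numbers in the question, not any real dataset:

```python
# Sketch: a tree ensemble with class weighting on an imbalanced problem.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_pos = rng.normal(loc=1.0, size=(20, 10))   # few positive observations
X_neg = rng.normal(loc=0.0, size=(600, 10))  # many negative observations
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 600)

# class_weight='balanced' reweights classes inversely to their frequency,
# so the rare positive class is not ignored during training.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X)  # columns ordered as in clf.classes_
```

Over-sampling with SMOTE, as mentioned above, is implemented in the separate imbalanced-learn package (imblearn.over_sampling.SMOTE) rather than in scikit-learn itself.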
On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis wrote:
Greetings,
I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative observations is much larger. In concrete numbers, the number of positive observations would typically range between 2 and 20, and the number of negative observations could be at least 30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure whether they are suitable for my case. Moreover, according to this figure, Nearest Neighbors performs best and RBF SVM second:
http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
However, I assume that Nearest Neighbors would not be effective in my case, where the number of positive observations is very low. For these reasons I would like to know your expert opinion on which classification algorithm I should try first.
thanks in advance
Thomas
--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic
email: tevang at pharm.uoa.gr
tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
--
Fernando Marcos Wittmann
MS Student - Energy Systems Dept.
School of Electrical and Computer Engineering, FEEC
University of Campinas, UNICAMP, Brazil
+55 (19) 987-211302
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From se.raschka at gmail.com Wed Nov 16 16:20:17 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Wed, 16 Nov 2016 16:20:17 -0500
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To:
References:
Message-ID: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com>
Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example: http://www.svds.com/learning-imbalanced-classes/
> On Nov 16, 2016, at 3:54 PM, Dale T Smith wrote:
>
> Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don't think it's necessary to repeat any of this on the mailing list.
>
>
> __________________________________________________________________________________________________________________________________________
> Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
>
> From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann
> Sent: Wednesday, November 16, 2016 3:11 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] suggested classification algorithm
>
> Tree-based algorithms (like Random Forest) usually work well on imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to over-sample the positive observations.
>
> On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis wrote:
> Greetings,
>
> I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative observations is much larger. In concrete numbers, the number of positive observations would typically range between 2 and 20, and the number of negative observations could be at least 30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure whether they are suitable for my case. Moreover, according to this figure, Nearest Neighbors performs best and RBF SVM second:
>
> http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
>
> However, I assume that Nearest Neighbors would not be effective in my case, where the number of positive observations is very low. For these reasons I would like to know your expert opinion on which classification algorithm I should try first.
>
> thanks in advance
> Thomas
>
>
> --
> ======================================================================
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: tevang at pharm.uoa.gr
> tevang3 at gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>
> --
>
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept.
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From tevang3 at gmail.com Thu Nov 17 09:00:33 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 17 Nov 2016 15:00:33 +0100
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com>
References:
<0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com>
Message-ID:
Guys, thank you all for your hints! Practical experience is irreplaceable;
that's why I posted this query here. I could spend all week reading the
mailing list archives and the respective internet resources and still not
find the key information I could get from someone here.
I did PCA on my training set (this one has 24 positive and 1278 negative
observations) and projected the 19 features onto the first 2 PCs, which
explain 87.6% of the variance in the data. Does this plot help to decide
which classification algorithms and/or over- or under-sampling would be
more suitable?
https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png
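For reference, such a projection can be produced with scikit-learn's PCA roughly as sketched below. Random data stands in for the real training set here (the 87.6% figure above comes from the actual data, not this sketch), with the same nominal shape of 24 + 1278 = 1302 observations and 19 features:

```python
# Sketch: project a feature matrix onto the first two principal
# components and check how much variance they explain.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.normal(size=(1302, 19))  # placeholder for the real training set

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # each row projected onto the first 2 PCs
explained = pca.explained_variance_ratio_.sum()
```

X_2d can then be scatter-plotted with the class labels as colors to get a figure like the one linked above.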
Thanks for your advice,
Thomas
On 16 November 2016 at 22:20, Sebastian Raschka
wrote:
> Yeah, there are many useful resources and implementations scattered around
> the web. However, a good, brief overview of the general ideas and concepts
> would be this one, for example:
> http://www.svds.com/learning-imbalanced-classes/
>
>
> > On Nov 16, 2016, at 3:54 PM, Dale T Smith
> wrote:
> >
> > Unbalanced class classification has been a topic here in past years, and
> there are posts if you search the archives. There are also plenty of
> resources available to help you, from actual code on Stackoverflow, to
> papers that address various ideas. I don't think it's necessary to repeat
> any of this on the mailing list.
> >
> >
> > __________________________________________________________________________________________________________________________________________
> > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
> >
> > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
> macys.com at python.org] On Behalf Of Fernando Marcos Wittmann
> > Sent: Wednesday, November 16, 2016 3:11 PM
> > To: Scikit-learn user and developer mailing list
> > Subject: Re: [scikit-learn] suggested classification algorithm
> >
> > Tree-based algorithms (like Random Forest) usually work well on
> imbalanced datasets. You can also take a look at the SMOTE technique
> (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to
> over-sample the positive observations.
> >
> > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis
> wrote:
> > Greetings,
> >
> > I want to design a program that can deal with classification problems of
> the same type, where the number of positive observations is small but the
> number of negative observations is much larger. In concrete numbers, the
> number of positive observations would typically range between 2 and 20,
> and the number of negative observations could be at least 30 times larger.
> The number of features could be between 2 and 20 too, but that could be
> reduced using feature selection and elimination algorithms. I've read in
> the documentation that some algorithms like the SVM are still effective
> when the number of dimensions is greater than the number of samples, but I
> am not sure whether they are suitable for my case. Moreover, according to
> this figure, Nearest Neighbors performs best and RBF SVM second:
> >
> > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
> >
> > However, I assume that Nearest Neighbors would not be effective in my
> case, where the number of positive observations is very low. For these
> reasons I would like to know your expert opinion on which classification
> algorithm I should try first.
> >
> > thanks in advance
> > Thomas
> >
> >
> > --
> > ======================================================================
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> >
> > email: tevang at pharm.uoa.gr
> > tevang3 at gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >
> > --
> >
> > Fernando Marcos Wittmann
> > MS Student - Energy Systems Dept.
> > School of Electrical and Computer Engineering, FEEC
> > University of Campinas, UNICAMP, Brazil
> > +55 (19) 987-211302
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic
email: tevang at pharm.uoa.gr
tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PCA_of_features.png
Type: image/png
Size: 106770 bytes
Desc: not available
URL:
From Dale.T.Smith at macys.com Thu Nov 17 09:10:38 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Thu, 17 Nov 2016 14:10:38 +0000
Subject: [scikit-learn] suggested classification algorithm
In-Reply-To:
References:
<0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com>
Message-ID: