From yrohinkumar at gmail.com Tue Aug 1 09:15:56 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Tue, 1 Aug 2017 18:45:56 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
Since you seem to be from an Astrophysics/Cosmology background (I am assuming
you are jakevdp - the creator of astroML - if you are, I am lucky!), I can
explain my application scenario. I am trying to calculate the anisotropic
two-point correlation function, something like what is done in rp_pi_tpcf
or s_mu_tpcf,
using pairs (DD, DR, RR) calculated from BallTree.two_point_correlation.
In halotools (
http://halotools.readthedocs.io/en/latest/function_usage/mock_observables_functions.html)
it is implemented using rectangular grids. I could calculate the 2pcf with
custom metrics using one variable with BallTree, as done in astroML. I
intend to find the anisotropic counterpart.
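As a rough sketch of the two-tree idea under discussion (the metric functions below are hypothetical placeholders standing in for the actual transverse/parallel separations, not the exact rp_pi_tpcf definitions):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical decomposition: treat the z-axis as the line of sight
# and split each pair separation into a transverse (x-y) part and a
# parallel (z) part. Placeholder metrics, for illustration only.
def perp_metric(a, b):
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def para_metric(a, b):
    return abs(a[2] - b[2])

rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # toy 3-D positions

# One BallTree per metric, since a tree supports only one metric.
tree_perp = BallTree(X, metric=perp_metric)
tree_para = BallTree(X, metric=para_metric)

r_perp = np.linspace(0.05, 0.5, 5)
r_para = np.linspace(0.05, 0.5, 5)

# Cumulative pair counts along each decomposition. Note these are
# marginal counts; combining them into a joint (r_perp, r_para)
# histogram still needs an outer loop or a brute-force pass, which
# is the expensive part.
counts_perp = tree_perp.two_point_correlation(X, r_perp)
counts_para = tree_para.two_point_correlation(X, r_para)
```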
Thanks & Regards,
Rohin
Y.Rohin Kumar,
+919818092877.
On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar wrote:
> Dear Jake,
>
> Thanks for your response. I meant to group/count pairs in boxes (using two
> arrays simultaneously-hence needing 2 metrics) instead of one distance
> array as the binning parameter. I don't know if the algorithm supports such
> a thing. For now, I am proceeding with your suggestion of two ball trees at
> huge computational cost. I hope I am able to frame my question properly.
>
> Thanks & Regards,
> Rohin.
>
>
>
> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu> wrote:
>
>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
>> wrote:
>>
>>> *update*
>>>
>>> Maybe it doesn't have to be done at the tree creation level. It could
>>> be done using loops and creating two different ball trees. Something like
>>>
>>> tree1=BallTree(X,metric='metric1') #for x-z plane
>>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>>
>>> And then calculate correlation functions in a loop to get tpcf(X,r1,r2)
>>> using tree1.two_point_correlation(X,r1) and
>>> tree2.two_point_correlation(X,r2)
>>>
>>
>> Hi Rohin,
>> It's not exactly clear to me what you wish the tree to do with the two
>> different metrics, but in any case the ball tree only supports one metric
>> at a time. If you can construct your desired result from two ball trees
>> each with its own metric, then that's probably the best way to proceed,
>> Jake
>>
>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From yrohinkumar at gmail.com Tue Aug 1 07:48:23 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Tue, 1 Aug 2017 17:18:23 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
Dear Jake,
Thanks for your response. I meant to group/count pairs in boxes (using two
arrays simultaneously - hence needing 2 metrics) instead of one distance
array as the binning parameter. I don't know if the algorithm supports such
a thing. For now, I am proceeding with your suggestion of two ball trees, at
huge computational cost. I hope I have framed my question properly.
Thanks & Regards,
Rohin.
On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas wrote:
> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
> wrote:
>
>> *update*
>>
>> May be it doesn't have to be done at the tree creation level. It could be
>> using loops and creating two different balltrees. Something like
>>
>> tree1=BallTree(X,metric='metric1') #for x-z plane
>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>
>> And then calculate correlation functions in a loop to get tpcf(X,r1,r2)
>> using tree1.two_point_correlation(X,r1) and tree2.two_point_correlation(
>> X,r2)
>>
>
> Hi Rohin,
> It's not exactly clear to me what you wish the tree to do with the two
> different metrics, but in any case the ball tree only supports one metric
> at a time. If you can construct your desired result from two ball trees
> each with its own metric, then that's probably the best way to proceed,
> Jake
>
>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From Jeremiah.Johnson at unh.edu Tue Aug 1 12:03:01 2017
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Tue, 1 Aug 2017 16:03:01 +0000
Subject: [scikit-learn] question about class_weights in LogisticRegression
Message-ID:
Hello all,
I'm looking for confirmation on an implementation detail that is somewhere in liblinear, but I haven't found documentation for yet. When the class_weight='balanced' parameter is set in LogisticRegression, then the regularization parameter for an observation from class I is class_weight[I] * C, where C is the usual regularization parameter - is this correct?
Thanks,
Jeremiah
From stuart at stuartreynolds.net Tue Aug 1 12:19:54 2017
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Tue, 1 Aug 2017 09:19:54 -0700
Subject: [scikit-learn] question about class_weights in
LogisticRegression
In-Reply-To:
References:
Message-ID:
I hope not. And not according to the docs...
https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/linear_model/logistic.py#L947
class_weight : dict or 'balanced', optional
Weights associated with classes in the form ``{class_label: weight}``.
If not given, all classes are supposed to have weight one.
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``.
Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
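A quick numeric check of the docstring formula quoted above (the toy labels here are made up):

```python
import numpy as np

# Imbalanced toy labels: 8 samples of class 0, 2 of class 1.
y = np.array([0] * 8 + [1] * 2)

n_samples = len(y)             # 10
n_classes = len(np.unique(y))  # 2

# The 'balanced' heuristic from the docstring.
weights = n_samples / (n_classes * np.bincount(y))
# weights is [0.625, 2.5] -- the rarer class gets the larger weight.
```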
On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah
wrote:
> Hello all,
>
> I'm looking for confirmation on an implementation detail that is somewhere
> in liblinear, but I haven't found documentation for yet. When the
> class_weights='balanced' parameter is set in LogisticRegression, then the
> regularisation parameter for an observation from class I is class_weight[I]
> * C, where C is the usual regularization parameter - is this correct?
>
> Thanks,
> Jeremiah
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From Jeremiah.Johnson at unh.edu Tue Aug 1 12:30:22 2017
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Tue, 1 Aug 2017 16:30:22 +0000
Subject: [scikit-learn] question about class_weights in
LogisticRegression
In-Reply-To:
References:
Message-ID:
Right, I know how the class_weight calculation is performed. But then
those class weights are utilized during the model fit process in some way
in liblinear, and that's what I am interested in. libSVM does
class_weight[I] * C (https://www.csie.ntu.edu.tw/~cjlin/libsvm/); is the
implementation in liblinear the same?
Best,
Jeremiah
On 8/1/17, 12:19 PM, "scikit-learn on behalf of Stuart Reynolds"
wrote:
>I hope not. And not according to the docs...
>https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/linear_model/logistic.py#L947
>
>class_weight : dict or 'balanced', optional
>Weights associated with classes in the form ``{class_label: weight}``.
>If not given, all classes are supposed to have weight one.
>The "balanced" mode uses the values of y to automatically adjust
>weights inversely proportional to class frequencies in the input data
>as ``n_samples / (n_classes * np.bincount(y))``.
>Note that these weights will be multiplied with sample_weight (passed
>through the fit method) if sample_weight is specified.
>
>On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah
> wrote:
>> Hello all,
>>
>> I'm looking for confirmation on an implementation detail that is somewhere
>> in liblinear, but I haven't found documentation for yet. When the
>> class_weights='balanced' parameter is set in LogisticRegression, then the
>> regularisation parameter for an observation from class I is class_weight[I]
>> * C, where C is the usual regularization parameter - is this correct?
>>
>> Thanks,
>> Jeremiah
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>>
>>https://mail.python.org/mailman/listinfo/scikit-learn
>>
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
From jakevdp at cs.washington.edu Tue Aug 1 13:25:52 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Tue, 1 Aug 2017 10:25:52 -0700
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
Hi Rohin,
Ah, I see. I don't think a BallTree is the right data structure for an
anisotropic N-point query, because it fundamentally assumes spherical
symmetry of the metric. You may be able to do something like this with a
specialized KD-tree, but scikit-learn doesn't support this, and I don't
imagine that it ever will given the very specialized nature of the
application.
I'm certain someone has written efficient code for this operation in the
astronomy community, but I don't know of any good Python package to
recommend for this; I'd suggest googling for keywords and seeing where
that gets you.
Thanks,
Jake
Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute
On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar wrote:
> Since you seem to be from Astrophysics/Cosmology background (I am assuming
> you are jakevdp - the creator of astroML - if you are - I am lucky!), I can
> explain my application scenario. I am trying to calculate the anisotropic
> two-point correlation function something like done in rp_pi_tpcf
>
> or s_mu_tpcf
>
> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation
>
> In halotools (http://halotools.readthedocs.io/en/latest/function_usage/
> mock_observables_functions.html) it is implemented using rectangular
> grids. I could calculate 2pcf with custom metrics using one variable with
> BallTree as done in astroML. I intend to find the anisotropic counterpart.
>
> Thanks & Regards,
> Rohin
>
> Y.Rohin Kumar,
> +919818092877.
>
> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar wrote:
>
>> Dear Jake,
>>
>> Thanks for your response. I meant to group/count pairs in boxes (using
>> two arrays simultaneously-hence needing 2 metrics) instead of one distance
>> array as the binning parameter. I don't know if the algorithm supports such
>> a thing. For now, I am proceeding with your suggestion of two ball trees at
>> huge computational cost. I hope I am able to frame my question properly.
>>
>> Thanks & Regards,
>> Rohin.
>>
>>
>>
>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas <
>> jakevdp at cs.washington.edu> wrote:
>>
>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
>>> wrote:
>>>
>>>> *update*
>>>>
>>>> May be it doesn't have to be done at the tree creation level. It could
>>>> be using loops and creating two different balltrees. Something like
>>>>
>>>> tree1=BallTree(X,metric='metric1') #for x-z plane
>>>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>>>
>>>> And then calculate correlation functions in a loop to get tpcf(X,r1,r2)
>>>> using tree1.two_point_correlation(X,r1) and
>>>> tree2.two_point_correlation(X,r2)
>>>>
>>>
>>> Hi Rohin,
>>> It's not exactly clear to me what you wish the tree to do with the two
>>> different metrics, but in any case the ball tree only supports one metric
>>> at a time. If you can construct your desired result from two ball trees
>>> each with its own metric, then that's probably the best way to proceed,
>>> Jake
>>>
>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From yrohinkumar at gmail.com Tue Aug 1 13:50:58 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Tue, 1 Aug 2017 23:20:58 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
Dear Jake,
Thank you for your prompt reply. I started with KD-tree but, after realising
it doesn't support custom metrics (I don't know the reason for this - it
would be a nice feature), I shifted to BallTree and was looking for a
2-metric-based categorisation. After looking around, the best I could find
were brute-force methods written in Python (I had my own version too) or
better-optimised ones in C or FORTRAN. The closest was halotools, which
again works with the Euclidean metric. For now, I will try to get my work
done with 2 different BallTrees iteratively in bins. If I find a better
option, I will try to post an update.
Regards,
Rohin.
On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas wrote:
> Hi Rohin,
> Ah, I see. I don't think a BallTree is the right data structure for an
> anisotropic N-point query, because it fundamentally assumes spherical
> symmetry of the metric. You may be able to do something like this with a
> specialized KD-tree, but scikit-learn doesn't support this, and I don't
> imagine that it ever will given the very specialized nature of the
> application.
>
> I'm certain someone has written efficient code for this operation in the
> astronomy community, but I don't know of any good Python package to
> recommend for this; I'd suggest googling for keywords and seeing where
> that gets you.
>
> Thanks,
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar wrote:
>
>> Since you seem to be from Astrophysics/Cosmology background (I am
>> assuming you are jakevdp - the creator of astroML - if you are - I am
>> lucky!), I can explain my application scenario. I am trying to calculate
>> the anisotropic two-point correlation function something like done in
>> rp_pi_tpcf
>>
>> or s_mu_tpcf
>>
>> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation
>>
>> In halotools (http://halotools.readthedocs.io/en/latest/function_usage/mo
>> ck_observables_functions.html) it is implemented using rectangular
>> grids. I could calculate 2pcf with custom metrics using one variable with
>> BallTree as done in astroML. I intend to find the anisotropic counterpart.
>>
>> Thanks & Regards,
>> Rohin
>>
>>
>> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar
>> wrote:
>>
>>> Dear Jake,
>>>
>>> Thanks for your response. I meant to group/count pairs in boxes (using
>>> two arrays simultaneously-hence needing 2 metrics) instead of one distance
>>> array as the binning parameter. I don't know if the algorithm supports such
>>> a thing. For now, I am proceeding with your suggestion of two ball trees at
>>> huge computational cost. I hope I am able to frame my question properly.
>>>
>>> Thanks & Regards,
>>> Rohin.
>>>
>>>
>>>
>>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas <
>>> jakevdp at cs.washington.edu> wrote:
>>>
>>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
>>>> wrote:
>>>>
>>>>> *update*
>>>>>
>>>>> May be it doesn't have to be done at the tree creation level. It could
>>>>> be using loops and creating two different balltrees. Something like
>>>>>
>>>>> tree1=BallTree(X,metric='metric1') #for x-z plane
>>>>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>>>>
>>>>> And then calculate correlation functions in a loop to get
>>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and
>>>>> tree2.two_point_correlation(X,r2)
>>>>>
>>>>
>>>> Hi Rohin,
>>>> It's not exactly clear to me what you wish the tree to do with the two
>>>> different metrics, but in any case the ball tree only supports one metric
>>>> at a time. If you can construct your desired result from two ball trees
>>>> each with its own metric, then that's probably the best way to proceed,
>>>> Jake
>>>>
>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From jakevdp at cs.washington.edu Tue Aug 1 13:59:21 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Tue, 1 Aug 2017 10:59:21 -0700
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
On Tue, Aug 1, 2017 at 10:50 AM, Rohin Kumar wrote:
> I started with KD-tree but after realising it doesn't support custom
> metrics (I don't know the reason for this - would be nice feature)
>
The scikit-learn KD-tree doesn't support custom metrics because it utilizes
relatively strong assumptions about the form of the metric when
constructing the tree. The Ball Tree makes fewer assumptions, which is why
it can support arbitrary metrics. It would in principle be possible to
create a KD Tree that supports custom *axis-aligned* metrics, but again I
think that would be too specialized for inclusion in scikit-learn.
One project you might check out is cykdtree:
https://pypi.python.org/pypi/cykdtree
I'm not certain whether it supports the queries you need, but I would bet
the team behind it would be willing to work toward these sorts of
specialized queries if they don't already exist.
Jake
> I shifted to BallTree and was looking for a 2 metric based categorisation.
> After looking around, the best I could find at most were brute-force
> methods written in python (had my own version too) or better optimised ones
> in C or FORTRAN. The closest one was halotools which again works with
> euclidean metric. For now, I will try to get my work done with 2 different
> BallTrees iteratively in bins. If I find a better option will try to post
> an update.
>
> Regards,
> Rohin.
>
>
> On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu> wrote:
>
>> Hi Rohin,
>> Ah, I see. I don't think a BallTree is the right data structure for an
>> anisotropic N-point query, because it fundamentally assumes spherical
>> symmetry of the metric. You may be able to do something like this with a
>> specialized KD-tree, but scikit-learn doesn't support this, and I don't
>> imagine that it ever will given the very specialized nature of the
>> application.
>>
>> I'm certain someone has written efficient code for this operation in the
>> astronomy community, but I don't know of any good Python package to
>> recommend for this; I'd suggest googling for keywords and seeing where
>> that gets you.
>>
>> Thanks,
>> Jake
>>
>> Jake VanderPlas
>> Senior Data Science Fellow
>> Director of Open Software
>> University of Washington eScience Institute
>>
>> On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar
>> wrote:
>>
>>> Since you seem to be from Astrophysics/Cosmology background (I am
>>> assuming you are jakevdp - the creator of astroML - if you are - I am
>>> lucky!), I can explain my application scenario. I am trying to calculate
>>> the anisotropic two-point correlation function something like done in
>>> rp_pi_tpcf
>>>
>>> or s_mu_tpcf
>>>
>>> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation
>>>
>>> In halotools (http://halotools.readthedocs.
>>> io/en/latest/function_usage/mock_observables_functions.html) it is
>>> implemented using rectangular grids. I could calculate 2pcf with custom
>>> metrics using one variable with BallTree as done in astroML. I intend to
>>> find the anisotropic counterpart.
>>>
>>> Thanks & Regards,
>>> Rohin
>>>
>>>
>>> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar
>>> wrote:
>>>
>>>> Dear Jake,
>>>>
>>>> Thanks for your response. I meant to group/count pairs in boxes (using
>>>> two arrays simultaneously-hence needing 2 metrics) instead of one distance
>>>> array as the binning parameter. I don't know if the algorithm supports such
>>>> a thing. For now, I am proceeding with your suggestion of two ball trees at
>>>> huge computational cost. I hope I am able to frame my question properly.
>>>>
>>>> Thanks & Regards,
>>>> Rohin.
>>>>
>>>>
>>>>
>>>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas <
>>>> jakevdp at cs.washington.edu> wrote:
>>>>
>>>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
>>>>> wrote:
>>>>>
>>>>>> *update*
>>>>>>
>>>>>> May be it doesn't have to be done at the tree creation level. It
>>>>>> could be using loops and creating two different balltrees. Something like
>>>>>>
>>>>>> tree1=BallTree(X,metric='metric1') #for x-z plane
>>>>>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>>>>>
>>>>>> And then calculate correlation functions in a loop to get
>>>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and
>>>>>> tree2.two_point_correlation(X,r2)
>>>>>>
>>>>>
>>>>> Hi Rohin,
>>>>> It's not exactly clear to me what you wish the tree to do with the two
>>>>> different metrics, but in any case the ball tree only supports one metric
>>>>> at a time. If you can construct your desired result from two ball trees
>>>>> each with its own metric, then that's probably the best way to proceed,
>>>>> Jake
>>>>>
>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> scikit-learn mailing list
>>>>>> scikit-learn at python.org
>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From sambarnett95 at gmail.com Wed Aug 2 08:38:50 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Wed, 2 Aug 2017 13:38:50 +0100
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline
with a custom transformer
Message-ID:
Dear all,
I have created a 2-step pipeline with a custom transformer followed by a
simple SVC classifier, and I wish to run a grid-search over it. I am able
to successfully create the transformer and the pipeline, and each of these
elements works fine. However, when I try to use the fit() method on my
GridSearchCV object, I get the following error:
57 # during fit.
58 if X.shape != self.input_shape_:
---> 59 raise ValueError('Shape of input is different from what
was seen '
60 'in `fit`')
61
ValueError: Shape of input is different from what was seen in `fit`
For a full breakdown of the problem, I have written a Jupyter notebook
showing exactly how the error occurs (this also contains all .py files
necessary to run the notebook). Can anybody see how to work through this?
Many thanks,
Sam Barnett
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sequential Kernel Test.zip
Type: application/zip
Size: 6759 bytes
Desc: not available
URL:
From viewsonic234 at aim.com Wed Aug 2 11:36:24 2017
From: viewsonic234 at aim.com (Chris Carrion)
Date: Wed, 2 Aug 2017 11:36:24 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
Message-ID: <3xMy9f2YqXzFqm1@mail.python.org>
Hi,
I'm working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch k-means from sklearn.cluster in the environment they provided, but I'm getting a deprecation warning. After reaching out to Quantopian support, they claim it's something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not.
Curious,
Chris
From t3kcit at gmail.com Wed Aug 2 12:05:17 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 2 Aug 2017 12:05:17 -0400
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline
with a custom transformer
In-Reply-To:
References:
Message-ID:
Hi Sam.
GridSearchCV will do cross-validation, which requires "transforming" the
test data.
The shape of the test data will be different from the shape of the
training data.
You need to be able to compute the kernel between the training
data and new test data.
A more hacky solution would be to compute the full kernel matrix in
advance and pass that to GridSearchCV.
You probably don't need it here, but you should also check out what the
_pairwise attribute does in cross-validation,
because that is likely to come up when playing with kernels.
Hth,
Andy
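The precomputed-kernel route Andy mentions might look roughly like this (toy data and a plain linear kernel standing in for whatever the custom transformer computes):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(40, 3)        # toy features
y = rng.randint(0, 2, 40)  # toy binary labels

# Compute the full n x n kernel matrix once, up front.
K = np.dot(X, X.T)

# With kernel='precomputed', scikit-learn's CV machinery slices both
# rows and columns of K consistently for each train/test split (via
# the pairwise handling Andy refers to), so the train/test shape
# mismatch from the custom transformer never arises.
grid = GridSearchCV(SVC(kernel='precomputed'),
                    param_grid={'C': [0.1, 1, 10]}, cv=3)
grid.fit(K, y)
print(grid.best_params_)
```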
On 08/02/2017 08:38 AM, Sam Barnett wrote:
> Dear all,
>
> I have created a 2-step pipeline with a custom transformer followed by
> a simple SVC classifier, and I wish to run a grid-search over it. I am
> able to successfully create the transformer and the pipeline, and each
> of these elements work fine. However, when I try to use the fit()
> method on my GridSearchCV object, I get the following error:
>
> 57 # during fit.
> 58 if X.shape != self.input_shape_:
> ---> 59 raise ValueError('Shape of input is different from
> what was seen '
> 60 'in `fit`')
> 61
>
> ValueError: Shape of input is different from what was seen in `fit`
>
> For a full breakdown of the problem, I have written a Jupyter notebook
> showing exactly how the error occurs (this also contains all .py files
> necessary to run the notebook). Can anybody see how to work through this?
>
> Many thanks,
> Sam Barnett
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed Aug 2 12:05:44 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 2 Aug 2017 12:05:44 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To: <3xMy9f2YqXzFqm1@mail.python.org>
References: <3xMy9f2YqXzFqm1@mail.python.org>
Message-ID:
Hi Chris.
What is the warning?
Andy
On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
>
> Hi,
>
> I'm working in an environment provided by Quantopian, an
> algorithmic-traders hub for research. I imported the minibatch kmeans
> from sklearn.clusters in the environment they provided, but I'm
> getting a deprecation warning. After reaching out to Quantopian
> support, they claim it's something with the way sklearn is coded, and
> nothing can be done on their end. I was wondering whether this was
> true or not.
>
> Curious,
>
> Chris
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From viewsonic234 at aim.com Wed Aug 2 12:10:30 2017
From: viewsonic234 at aim.com (Chris Carrion)
Date: Wed, 2 Aug 2017 12:10:30 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To:
References: <3xMy9f2YqXzFqm1@mail.python.org>
Message-ID: <3xMypN0FjFzFqw2@mail.python.org>
Hi Andy,
WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead
That's all I'm given
From: Andreas Mueller
Sent: Wednesday, August 2, 2017 12:09 PM
To: Chris Carrion via scikit-learn
Subject: Re: [scikit-learn] minibatchkmeans deprecation warning?
Hi Chris.
What is the warning?
Andy
On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
Hi,
I'm working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch kmeans from sklearn.clusters in the environment they provided, but I'm getting a deprecation warning. After reaching out to Quantopian support, they claim it's something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not.
Curious,
Chris
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed Aug 2 12:32:03 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 2 Aug 2017 12:32:03 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To: <3xMypN0FjFzFqw2@mail.python.org>
References: <3xMy9f2YqXzFqm1@mail.python.org>
<3xMypN0FjFzFqw2@mail.python.org>
Message-ID: <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
Ah.
That's actually a deprecation warning coming from numpy, and I think
it'll be removed in 0.19 (if not already in 0.18.1).
It's really nothing to worry about, though.
Andy
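If the warning is noisy in a hosted environment where the library versions can't be changed, it can be filtered locally; this is plain Python `warnings` machinery, not anything sklearn-specific:

```python
import warnings

# Ignore DeprecationWarning for this process. The filter could be
# scoped more narrowly, e.g. module=r"sklearn\..*", to silence only
# warnings raised from sklearn code.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Demonstration: with the filter in place, a DeprecationWarning is
# swallowed and nothing is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.warn("this function is deprecated", DeprecationWarning)

print(len(caught))  # 0
```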
On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote:
>
> Hi Andy,
>
> WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This
> function is deprecated. Please call randint(0, 179 + 1) instead
>
> That's all I'm given
>
> *From: *Andreas Mueller
> *Sent: *Wednesday, August 2, 2017 12:09 PM
> *To: *Chris Carrion via scikit-learn
> *Subject: *Re: [scikit-learn] minibatchkmeans deprecation warning?
>
> Hi Chris.
>
> What is the warning?
>
> Andy
>
> On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
>
> Hi,
>
> I'm working in an environment provided by Quantopian, a research
> hub for algorithmic traders. I imported MiniBatchKMeans from
> sklearn.cluster in the environment they provided, but I'm getting
> a deprecation warning. After reaching out to Quantopian support,
> they claim it's something with the way sklearn is coded, and
> nothing can be done on their end. I was wondering whether this
> was true or not.
>
> Curious,
>
> Chris
>
>
>
>
> _______________________________________________
>
> scikit-learn mailing list
>
> scikit-learn at python.org
>
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From viewsonic234 at aim.com Wed Aug 2 12:38:47 2017
From: viewsonic234 at aim.com (Chris Carrion)
Date: Wed, 2 Aug 2017 12:38:47 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To: <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
References: <3xMy9f2YqXzFqm1@mail.python.org>
<3xMypN0FjFzFqw2@mail.python.org>
<66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
Message-ID: <3xMzR06sl2zFqwt@mail.python.org>
That's great to hear, thanks!
Chris
From: Andreas Mueller
Sent: Wednesday, August 2, 2017 12:34 PM
To: Chris Carrion via scikit-learn
Subject: Re: [scikit-learn] minibatchkmeans deprecation warning?
Ah.
That's actually a deprecation warning coming from numpy, and I think it'll be removed in 0.19 (if not already in 0.18.1).
It's really nothing to worry about, though.
Andy
On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote:
Hi Andy,
WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead
That's all I'm given.
From: Andreas Mueller
Sent: Wednesday, August 2, 2017 12:09 PM
To: Chris Carrion via scikit-learn
Subject: Re: [scikit-learn] minibatchkmeans deprecation warning?
Hi Chris.
What is the warning?
Andy
On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
Hi,
I'm working in an environment provided by Quantopian, a research hub for algorithmic traders. I imported MiniBatchKMeans from sklearn.cluster in the environment they provided, but I'm getting a deprecation warning. After reaching out to Quantopian support, they claim it's something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not.
Curious,
Chris
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From viewsonic234 at aim.com Wed Aug 2 12:48:06 2017
From: viewsonic234 at aim.com (Chris Carrion)
Date: Wed, 2 Aug 2017 12:48:06 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To: <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
References: <3xMy9f2YqXzFqm1@mail.python.org>
<3xMypN0FjFzFqw2@mail.python.org>
<66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
Message-ID: <3xMzdn0TcWzFqVr@mail.python.org>
Before I forget, is there an ETA for .19, or an average time between upgrades?
From: Andreas Mueller
Sent: Wednesday, August 2, 2017 12:34 PM
To: Chris Carrion via scikit-learn
Subject: Re: [scikit-learn] minibatchkmeans deprecation warning?
Ah.
That's actually a deprecation warning coming from numpy, and I think it'll be removed in 0.19 (if not already in 0.18.1).
It's really nothing to worry about, though.
Andy
On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote:
Hi Andy,
WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead
That's all I'm given.
From: Andreas Mueller
Sent: Wednesday, August 2, 2017 12:09 PM
To: Chris Carrion via scikit-learn
Subject: Re: [scikit-learn] minibatchkmeans deprecation warning?
Hi Chris.
What is the warning?
Andy
On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
Hi,
I'm working in an environment provided by Quantopian, a research hub for algorithmic traders. I imported MiniBatchKMeans from sklearn.cluster in the environment they provided, but I'm getting a deprecation warning. After reaching out to Quantopian support, they claim it's something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not.
Curious,
Chris
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From t3kcit at gmail.com Wed Aug 2 14:36:02 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 2 Aug 2017 14:36:02 -0400
Subject: [scikit-learn] minibatchkmeans deprecation warning?
In-Reply-To: <3xMzdn0TcWzFqVr@mail.python.org>
References: <3xMy9f2YqXzFqm1@mail.python.org>
<3xMypN0FjFzFqw2@mail.python.org>
<66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com>
<3xMzdn0TcWzFqVr@mail.python.org>
Message-ID: <31bc0362-f3b4-94af-b240-0a1d4bb9e7e0@gmail.com>
The docs say 3 months, I think, though it's been more like 8.
0.19 will come out in August.
On 08/02/2017 12:48 PM, Chris Carrion via scikit-learn wrote:
>
> Before I forget, is there an ETA for .19, or an average time between
> upgrades?
>
> *From: *Andreas Mueller
> *Sent: *Wednesday, August 2, 2017 12:34 PM
> *To: *Chris Carrion via scikit-learn
> *Subject: *Re: [scikit-learn] minibatchkmeans deprecation warning?
>
> Ah.
> That's actually a deprecation warning coming from numpy, and I think
> it'll be removed in 0.19 (if not already in 0.18.1).
> It's really nothing to worry about, though.
>
> Andy
>
> On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote:
>
> Hi Andy,
>
> WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This
> function is deprecated. Please call randint(0, 179 + 1) instead
>
> That's all I'm given.
>
> *From: *Andreas Mueller
> *Sent: *Wednesday, August 2, 2017 12:09 PM
> *To: *Chris Carrion via scikit-learn
> *Subject: *Re: [scikit-learn] minibatchkmeans deprecation warning?
>
> Hi Chris.
>
> What is the warning?
>
> Andy
>
> On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote:
>
> Hi,
>
> I'm working in an environment provided by Quantopian, a research
> hub for algorithmic traders. I imported MiniBatchKMeans from
> sklearn.cluster in the environment they provided, but I'm getting
> a deprecation warning. After reaching out to Quantopian support,
> they claim it's something with the way sklearn is coded, and
> nothing can be done on their end. I was wondering whether this
> was true or not.
>
> Curious,
>
> Chris
>
>
>
>
>
> _______________________________________________
>
> scikit-learn mailing list
>
> scikit-learn at python.org
>
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>
> _______________________________________________
>
> scikit-learn mailing list
>
> scikit-learn at python.org
>
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sambarnett95 at gmail.com Wed Aug 2 15:08:07 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Wed, 2 Aug 2017 20:08:07 +0100
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline
with a custom transformer
In-Reply-To:
References:
Message-ID:
Hi Andy,
The purpose of the transformer is to take an ordinary kernel (in this case
I have taken 'rbf' as a default) and return a 'sequentialised' kernel using
a few extra parameters. Hence, the transformer takes an ordinary
data-target pair X, y as its input, and the fit_transform(X, y) method will
output the Gram matrix for X that is associated with this sequentialised
kernel. In the pipeline, this Gram matrix is passed into an SVC classifier
with the kernel parameter set to 'precomputed'.
Therefore, I do not think your hacky solution would be possible. However, I
am still unsure how to implement your first solution: won't the Gram matrix
from the transformer contain all the necessary kernel values? Could you
elaborate further?
Best,
Sam
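For reference, Andy's "hacky" route of precomputing the full kernel can be sketched as below. This is an assumption-laden illustration: rbf_kernel stands in for the sequentialised kernel (which isn't shown in this thread), and it works because cross-validation slices both rows and columns of the square Gram matrix when SVC uses kernel='precomputed':

```python
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Precompute the full n x n Gram matrix once; rbf_kernel is a stand-in
# for the custom sequentialised kernel discussed in this thread.
K = rbf_kernel(X, gamma=0.1)

# With kernel='precomputed', cross-validation knows to index K by rows
# *and* columns when splitting into train/test folds.
grid = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(K, y)
print(grid.best_params_)
```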
On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller wrote:
> Hi Sam.
> GridSearchCV will do cross-validation, which requires "transforming" the
> test data.
> The shape of the test-data will be different from the shape of the
> training data.
> You need to have the ability to compute the kernel between the training
> data and new test data.
>
> A more hacky solution would be to compute the full kernel matrix in
> advance and pass that to GridSearchCV.
>
> You probably don't need it here, but you should also check out what the
> _pairwise attribute does in cross-validation,
> because that is likely to come up when playing with kernels.
>
> Hth,
> Andy
>
>
> On 08/02/2017 08:38 AM, Sam Barnett wrote:
>
> Dear all,
>
> I have created a 2-step pipeline with a custom transformer followed by a
> simple SVC classifier, and I wish to run a grid-search over it. I am able
> to successfully create the transformer and the pipeline, and each of these
> elements work fine. However, when I try to use the fit() method on my
> GridSearchCV object, I get the following error:
>
> 57 # during fit.
> 58 if X.shape != self.input_shape_:
> ---> 59 raise ValueError('Shape of input is different from
> what was seen '
> 60 'in `fit`')
> 61
>
> ValueError: Shape of input is different from what was seen in `fit`
>
> For a full breakdown of the problem, I have written a Jupyter notebook
> showing exactly how the error occurs (this also contains all .py files
> necessary to run the notebook). Can anybody see how to work through this?
>
> Many thanks,
> Sam Barnett
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From pybokeh at gmail.com Wed Aug 2 22:01:36 2017
From: pybokeh at gmail.com (pybokeh)
Date: Wed, 2 Aug 2017 22:01:36 -0400
Subject: [scikit-learn] Help With Text Classification
Message-ID:
Hello,
I am studying this example from scikit-learn's site:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The problem that I need to solve is very similar to this example, except I
have one
additional feature column (part #) that is categorical of type string. My
label or target
values consist of just 2 values: 0 or 1.
With that additional feature column, I am transforming it with a
LabelEncoder and
then I am encoding it with the OneHotEncoder.
Then I am concatenating that one-hot encoded column (part #) to the
text/document
feature column (complaint), which I had applied the CountVectorizer and
TfidfTransformer transformations.
Then I chose the MultinomialNB model to fit my concatenated training data
with.
The problem I run into is when I invoke the prediction, I get a dimension
mis-match error.
Here's my jupyter notebook gist:
http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311
I would gladly appreciate it if someone can guide me where I went wrong.
Thanks!
- Daniel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From joel.nothman at gmail.com Wed Aug 2 22:38:34 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 3 Aug 2017 12:38:34 +1000
Subject: [scikit-learn] Help With Text Classification
In-Reply-To:
References:
Message-ID:
Use a Pipeline to help avoid this kind of issue (and others). You might
also want to do something like
http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
On 3 August 2017 at 12:01, pybokeh wrote:
> Hello,
> I am studying this example from scikit-learn's site:
> http://scikit-learn.org/stable/tutorial/text_analytics/
> working_with_text_data.html
>
> The problem that I need to solve is very similar to this example, except I
> have one
> additional feature column (part #) that is categorical of type string. My
> label or target
> values consist of just 2 values: 0 or 1.
>
> With that additional feature column, I am transforming it with a
> LabelEncoder and
> then I am encoding it with the OneHotEncoder.
>
> Then I am concatenating that one-hot encoded column (part #) to the
> text/document
> feature column (complaint), which I had applied the CountVectorizer and
> TfidfTransformer transformations.
>
> Then I chose the MultinomialNB model to fit my concatenated training data
> with.
>
> The problem I run into is when I invoke the prediction, I get a dimension
> mis-match error.
>
> Here's my jupyter notebook gist:
> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85
> ef86ba41424b311
>
> I would gladly appreciate it if someone can guide me where I went wrong.
> Thanks!
>
> - Daniel
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From pybokeh at gmail.com Wed Aug 2 23:12:36 2017
From: pybokeh at gmail.com (pybokeh)
Date: Wed, 2 Aug 2017 23:12:36 -0400
Subject: [scikit-learn] Help With Text Classification
In-Reply-To:
References:
Message-ID:
Thanks Joel for recommending FeatureUnion. I did run across that. But for
just 2 features, I thought that might be overkill. I am aware of Pipeline
which the scikit-learn example explains very well, which I was going to
utilize once I finalize my script. I did not want to abstract away too
much early on since I am in the beginning stages of learning machine
learning and scikit-learn.
- Daniel
On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman
wrote:
> Use a Pipeline to help avoid this kind of issue (and others). You might
> also want to do something like http://scikit-learn.org/
> stable/auto_examples/hetero_feature_union.html
>
> On 3 August 2017 at 12:01, pybokeh wrote:
>
>> Hello,
>> I am studying this example from scikit-learn's site:
>> http://scikit-learn.org/stable/tutorial/text_analytics/worki
>> ng_with_text_data.html
>>
>> The problem that I need to solve is very similar to this example, except
>> I have one
>> additional feature column (part #) that is categorical of type string.
>> My label or target
>> values consist of just 2 values: 0 or 1.
>>
>> With that additional feature column, I am transforming it with a
>> LabelEncoder and
>> then I am encoding it with the OneHotEncoder.
>>
>> Then I am concatenating that one-hot encoded column (part #) to the
>> text/document
>> feature column (complaint), which I had applied the CountVectorizer and
>> TfidfTransformer transformations.
>>
>> Then I chose the MultinomialNB model to fit my concatenated training data
>> with.
>>
>> The problem I run into is when I invoke the prediction, I get a dimension
>> mis-match error.
>>
>> Here's my jupyter notebook gist:
>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85
>> ef86ba41424b311
>>
>> I would gladly appreciate it if someone can guide me where I went wrong.
>> Thanks!
>>
>> - Daniel
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From yrohinkumar at gmail.com Wed Aug 2 23:42:58 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Thu, 3 Aug 2017 09:12:58 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To:
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID:
Dear Jake,
Thank you for your inputs. I had a look at cykdtree; the core implementation
of the algorithm is in C/C++, and modifying it is currently beyond my skill.
I will try to contact their team to see if they entertain special requests. I
should be able to fork and modify the sklearn algorithm in Cython once my
current project is complete. For now I am going ahead with the brute-force
method, and this thread may be considered closed. Thanks once again!
Regards,
Rohin.
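For readers finding this thread later, the two-tree workaround discussed below looks roughly like this. The metrics are illustrative callables for line-of-sight and transverse separations, not anything from astroML; note that |z1 - z2| is only a pseudometric (the triangle inequality still holds), so verify it behaves correctly for your query before trusting the counts:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Illustrative axis-aligned separations for an anisotropic 2pcf:
# pi = separation along the line of sight (z axis)
# rp = separation in the transverse (x-y) plane
def pi_metric(a, b):
    return abs(a[2] - b[2])

def rp_metric(a, b):
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

X = np.random.RandomState(0).rand(100, 3)  # toy 3-D positions

# One BallTree per metric, as suggested in the thread; Python callables
# are slow, so this is only workable for small samples.
tree_pi = BallTree(X, metric=pi_metric)
tree_rp = BallTree(X, metric=rp_metric)

r = np.linspace(0.05, 0.5, 5)
counts_pi = tree_pi.two_point_correlation(X, r)
counts_rp = tree_rp.two_point_correlation(X, r)
print(counts_pi.shape, counts_rp.shape)
```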
On Tue, Aug 1, 2017 at 11:29 PM, Jacob Vanderplas wrote:
> On Tue, Aug 1, 2017 at 10:50 AM, Rohin Kumar
> wrote:
>
>> I started with KD-tree but after realising it doesn't support custom
>> metrics (I don't know the reason for this; it would be a nice feature).
>>
>
> The scikit-learn KD-tree doesn't support custom metrics because it
> utilizes relatively strong assumptions about the form of the metric when
> constructing the tree. The Ball Tree makes fewer assumptions, which is why
> it can support arbitrary metrics. It would in principle be possible to
> create a KD Tree that supports custom *axis-aligned* metrics, but again I
> think that would be too specialized for inclusion in scikit-learn.
>
> One project you might check out is cykdtree: https://pypi.python.
> org/pypi/cykdtree
> I'm not certain whether it supports the queries you need, but I would bet
> the team behind that would be willing to work toward these sorts of
> specialized queries if they don't already exist.
>
> Jake
>
>
>
>
>> I shifted to BallTree and was looking for a 2 metric based
>> categorisation. After looking around, the best I could find at most were
>> brute-force methods written in python (had my own version too) or better
>> optimised ones in C or FORTRAN. The closest one was halotools which again
>> works with euclidean metric. For now, I will try to get my work done with 2
>> different BallTrees iteratively in bins. If I find a better option will try
>> to post an update.
>>
>> Regards,
>> Rohin.
>>
>>
>> On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas <
>> jakevdp at cs.washington.edu> wrote:
>>
>>> Hi Rohin,
>>> Ah, I see. I don't think a BallTree is the right data structure for an
>>> anisotropic N-point query, because it fundamentally assumes spherical
>>> symmetry of the metric. You may be able to do something like this with a
>>> specialized KD-tree, but scikit-learn doesn't support this, and I don't
>>> imagine that it ever will given the very specialized nature of the
>>> application.
>>>
>>> I'm certain someone has written efficient code for this operation in the
>>> astronomy community, but I don't know of any good Python package to
>>> recommend for this; I'd suggest googling for keywords and seeing where
>>> that gets you.
>>>
>>> Thanks,
>>> Jake
>>>
>>> Jake VanderPlas
>>> Senior Data Science Fellow
>>> Director of Open Software
>>> University of Washington eScience Institute
>>>
>>> On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar
>>> wrote:
>>>
>>>> Since you seem to be from Astrophysics/Cosmology background (I am
>>>> assuming you are jakevdp - the creator of astroML - if you are - I am
>>>> lucky!), I can explain my application scenario. I am trying to calculate
>>>> the anisotropic two-point correlation function something like done in
>>>> rp_pi_tpcf
>>>>
>>>> or s_mu_tpcf
>>>>
>>>> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation
>>>>
>>>> In halotools (http://halotools.readthedocs.
>>>> io/en/latest/function_usage/mock_observables_functions.html) it is
>>>> implemented using rectangular grids. I could calculate 2pcf with custom
>>>> metrics using one variable with BallTree as done in astroML. I intend to
>>>> find the anisotropic counterpart.
>>>>
>>>> Thanks & Regards,
>>>> Rohin
>>>>
>>>>
>>>> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar
>>>> wrote:
>>>>
>>>>> Dear Jake,
>>>>>
>>>>> Thanks for your response. I meant to group/count pairs in boxes (using
>>>>> two arrays simultaneously-hence needing 2 metrics) instead of one distance
>>>>> array as the binning parameter. I don't know if the algorithm supports such
>>>>> a thing. For now, I am proceeding with your suggestion of two ball trees at
>>>>> huge computational cost. I hope I am able to frame my question properly.
>>>>>
>>>>> Thanks & Regards,
>>>>> Rohin.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas <
>>>>> jakevdp at cs.washington.edu> wrote:
>>>>>
>>>>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar
>>>>>> wrote:
>>>>>>
>>>>>>> *update*
>>>>>>>
>>>>>>> Maybe it doesn't have to be done at the tree creation level. It
>>>>>>> could be using loops and creating two different balltrees. Something like
>>>>>>>
>>>>>>> tree1=BallTree(X,metric='metric1') #for x-z plane
>>>>>>> tree2=BallTree(X,metric='metric2') #for y-z plane
>>>>>>>
>>>>>>> And then calculate correlation functions in a loop to get
>>>>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and
>>>>>>> tree2.two_point_correlation(X,r2)
>>>>>>>
>>>>>>
>>>>>> Hi Rohin,
>>>>>> It's not exactly clear to me what you wish the tree to do with the
>>>>>> two different metrics, but in any case the ball tree only supports one
>>>>>> metric at a time. If you can construct your desired result from two ball
>>>>>> trees each with its own metric, then that's probably the best way to
>>>>>> proceed,
>>>>>> Jake
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> scikit-learn mailing list
>>>>>>> scikit-learn at python.org
>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> scikit-learn mailing list
>>>>>> scikit-learn at python.org
>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From joel.nothman at gmail.com Thu Aug 3 00:54:18 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 3 Aug 2017 14:54:18 +1000
Subject: [scikit-learn] Help With Text Classification
In-Reply-To:
References:
Message-ID:
One of the key advantages of Pipeline is that it makes sure that equivalent
processing happens at training and prediction time (assuming you do not
write your own transformers that break their contract). This is what
appears to have broken in your current attempts.
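To make Joel's point concrete, here is a minimal sketch in the style of the tutorial (toy documents and made-up labels): the vocabulary fitted during fit() is reused at predict() time, so the feature space always stays aligned with what the classifier saw:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One object handles fit and predict, so the same fitted vocabulary
# and idf weights are applied to new documents automatically.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

docs = ["engine rattles badly", "door squeaks", "engine stalls", "door stuck"]
pipe.fit(docs, [1, 0, 1, 0])
pred = pipe.predict(["engine noise"])  # no dimension mismatch
print(pred)
```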
On 3 August 2017 at 13:12, pybokeh wrote:
> Thanks Joel for recommending FeatureUnion. I did run across that. But
> for just 2 features, I thought that might be overkill. I am aware of
> Pipeline which the scikit-learn example explains very well, which I was
> going to utilize once I finalize my script. I did not want to abstract
> away too much early on since I am in the beginning stages of learning
> machine learning and scikit-learn.
>
> - Daniel
>
> On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman
> wrote:
>
>> Use a Pipeline to help avoid this kind of issue (and others). You might
>> also want to do something like http://scikit-learn.org/stable
>> /auto_examples/hetero_feature_union.html
>>
>> On 3 August 2017 at 12:01, pybokeh wrote:
>>
>>> Hello,
>>> I am studying this example from scikit-learn's site:
>>> http://scikit-learn.org/stable/tutorial/text_analytics/worki
>>> ng_with_text_data.html
>>>
>>> The problem that I need to solve is very similar to this example, except
>>> I have one
>>> additional feature column (part #) that is categorical of type string.
>>> My label or target
>>> values consist of just 2 values: 0 or 1.
>>>
>>> With that additional feature column, I am transforming it with a
>>> LabelEncoder and
>>> then I am encoding it with the OneHotEncoder.
>>>
>>> Then I am concatenating that one-hot encoded column (part #) to the
>>> text/document
>>> feature column (complaint), which I had applied the CountVectorizer and
>>> TfidfTransformer transformations.
>>>
>>> Then I chose the MultinomialNB model to fit my concatenated training
>>> data with.
>>>
>>> The problem I run into is when I invoke the prediction, I get a
>>> dimension mis-match error.
>>>
>>> Here's my jupyter notebook gist:
>>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85
>>> ef86ba41424b311
>>>
>>> I would gladly appreciate it if someone can guide me where I went
>>> wrong. Thanks!
>>>
>>> - Daniel
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From abhishekraj10 at yahoo.com Thu Aug 3 06:15:50 2017
From: abhishekraj10 at yahoo.com (Abhishek Raj)
Date: Thu, 3 Aug 2017 15:45:50 +0530
Subject: [scikit-learn] OneClassSvm | Different results on different runs
Message-ID:
Hi,
I am using a one-class SVM for developing an anomaly detection model. I
observed that different runs of training on the same data set output
different accuracies. One run takes the accuracy as high as 98% and another
run on the same data brings it down to 93%. Googling a little bit I found
out that this is happening because of the random_state
parameter
but I am not clear of the details.
Can anyone expand on how exactly the parameter affects my training, and
how I can figure out the best value to get the model with the best accuracy?
Thanks,
Abhishek
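One quick way to see where randomness enters is to fit twice on identical data and compare; a minimal sketch (nu and gamma values are arbitrary, data is synthetic). With fixed data and parameters the fit is deterministic, so run-to-run variation usually comes from a different train/test split or shuffle elsewhere in the workflow:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic data: fit the same model twice on the exact same inputs.
rng = np.random.RandomState(42)
X = rng.randn(200, 2)

pred1 = OneClassSVM(nu=0.1, gamma=0.5).fit(X).predict(X)
pred2 = OneClassSVM(nu=0.1, gamma=0.5).fit(X).predict(X)

# If these agree, the variation you see comes from the data pipeline
# (splits, shuffling), not from OneClassSVM itself.
print((pred1 == pred2).all())
```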
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jaquesgrobler at gmail.com Thu Aug 3 06:39:44 2017
From: jaquesgrobler at gmail.com (Jaques Grobler)
Date: Thu, 3 Aug 2017 12:39:44 +0200
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References:
Message-ID:
Hi,
The random_state parameter is the seed of the pseudo-random number
generator used when shuffling the data for probability estimation.
A seed can be provided to control the shuffling for reproducible behavior.
Also, from the SVM docs
The underlying LinearSVC
> implementation
> uses a random number generator to select features when fitting the model.
> It is thus not uncommon to have slightly different results for the same
> input data. If that happens, try with a smaller *tol* parameter.
Hope that helps
2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn <
scikit-learn at python.org>:
> Hi,
>
> I am using a one-class SVM for developing an anomaly detection model. I
> observed that different runs of training on the same data set output
> different accuracies. One run takes the accuracy as high as 98% and another
> run on the same data brings it down to 93%. Googling a little bit I found
> out that this is happening because of the random_state
> parameter
> but I am not clear of the details.
>
> Can anyone expand on how exactly the parameter affects my training,
> and how I can figure out the best value to get the model with the best accuracy?
>
> Thanks,
> Abhishek
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From albertthomas88 at gmail.com Thu Aug 3 07:26:17 2017
From: albertthomas88 at gmail.com (Albert Thomas)
Date: Thu, 03 Aug 2017 11:26:17 +0000
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References:
Message-ID:
Hi Abhishek,
Could you provide a small code snippet? I don't think the random_state
parameter should influence the result of the OneClassSVM as there is no
probability estimation for this estimator.
Albert
On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler
wrote:
> Hi,
>
> The random_state parameter is used to generate a pseudo random number that
> is used when shuffling your data for probability estimation
>
> The seed of the pseudo random number generator to use when shuffling the
> data for probability estimation.
> A seed can be provided to control the shuffling for reproducible behavior.
>
> Also, from the SVM docs
>
>
> The underlying LinearSVC
>> implementation
>> uses a random number generator to select features when fitting the model.
>> It is thus not uncommon to have slightly different results for the same
>> input data. If that happens, try with a smaller *tol* parameter.
>
>
> Hope that helps
>
> 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn <
> scikit-learn at python.org>:
>
>> Hi,
>>
>> I am using a one-class SVM for developing an anomaly detection model. I
>> observed that different runs of training on the same data set output
>> different accuracies. One run takes the accuracy as high as 98% and another
>> run on the same data brings it down to 93%. Googling a little bit I found
>> out that this is happening because of the random_state
>> parameter
>> but I am not clear of the details.
>>
>> Can anyone expand on how exactly the parameter affects my training,
>> and how I can figure out the best value to get the model with the best accuracy?
>>
>> Thanks,
>> Abhishek
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From goix.nicolas at gmail.com Thu Aug 3 07:54:37 2017
From: goix.nicolas at gmail.com (Nicolas Goix)
Date: Thu, 3 Aug 2017 13:54:37 +0200
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References:
Message-ID:
@albertcthomas isn't there some randomness in SMO which could influence the
result if the tolerance parameter is too large?
On Aug 3, 2017 1:28 PM, "Albert Thomas" wrote:
> Hi Abhishek,
>
> Could you provide a small code snippet? I don't think the random_state
> parameter should influence the result of the OneClassSVM as there is no
> probability estimation for this estimator.
>
> Albert
>
> On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler
> wrote:
>
>> Hi,
>>
>> The random_state parameter is used to generate a pseudo random number
>> that is used when shuffling your data for probability estimation
>>
>> The seed of the pseudo random number generator to use when shuffling the
>> data for probability estimation.
>> A seed can be provided to control the shuffling for reproducible behavior.
>>
>> Also, from the SVM docs
>>
>>
>> The underlying LinearSVC
>>>
>>> implementation uses a random number generator to select features when
>>> fitting the model. It is thus not uncommon to have slightly different
>>> results for the same input data. If that happens, try with a smaller
>>> *tol* parameter.
>>
>>
>> Hope that helps
>>
>> 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn <
>> scikit-learn at python.org>:
>>
>>> Hi,
>>>
>>> I am using a one-class SVM for developing an anomaly detection model. I
>>> observed that different runs of training on the same data set output
>>> different accuracies. One run takes the accuracy as high as 98% and another
>>> run on the same data brings it down to 93%. Googling a little bit I found
>>> out that this is happening because of the random_state
>>> parameter
>>> but I am not clear of the details.
>>>
>>> Can anyone expand on how exactly the parameter affects my training
>>> and how I can figure out the best value to get the model with the best accuracy?
>>>
>>> Thanks,
>>> Abhishek
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From albertthomas88 at gmail.com Thu Aug 3 09:17:38 2017
From: albertthomas88 at gmail.com (Albert Thomas)
Date: Thu, 03 Aug 2017 13:17:38 +0000
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References:
Message-ID:
Yes, in fact, changing the random_state might have an influence on the
result. The docstring of the random_state parameter for the OneClassSVM
seems incorrect though...
Albert
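To make the point of this thread concrete, here is a minimal sketch (toy data; the nu and tol values are assumed for illustration, not taken from the original poster's setup): with a much tighter tol than the default 1e-3, refitting a OneClassSVM on reshuffled copies of the same data should give essentially the same predictions, so large run-to-run swings usually point at a loose tolerance rather than a meaningful random_state effect.

```python
# Toy illustration (assumed nu/tol values): with a tight tolerance,
# refitting on reshuffled copies of the same data gives essentially
# the same decision, so large run-to-run swings usually mean tol is
# too loose rather than a meaningful random_state effect.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(200, 5)

preds = []
for seed in (0, 1, 2):
    order = np.random.RandomState(seed).permutation(len(X))
    clf = OneClassSVM(nu=0.1, tol=1e-8)  # default tol is 1e-3
    clf.fit(X[order])
    preds.append(clf.predict(X))

# Fraction of points on which two of the fits disagree (typically ~0).
disagree = np.mean(preds[0] != preds[1])
```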
On Thu, Aug 3, 2017 at 1:55 PM Nicolas Goix wrote:
> @albertcthomas isn't there some randomness in SMO which could influence
> the result if the tolerance parameter is too large?
>
> On Aug 3, 2017 1:28 PM, "Albert Thomas" wrote:
>
>> Hi Abhishek,
>>
>> Could you provide a small code snippet? I don't think the random_state
>> parameter should influence the result of the OneClassSVM as there is no
>> probability estimation for this estimator.
>>
>> Albert
>>
>> On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler
>> wrote:
>>
>>> Hi,
>>>
>>> The random_state parameter is used to generate a pseudo random number
>>> that is used when shuffling your data for probability estimation
>>>
>>> The seed of the pseudo random number generator to use when shuffling the
>>> data for probability estimation.
>>> A seed can be provided to control the shuffling for reproducible
>>> behavior.
>>>
>>> Also, from the SVM docs
>>>
>>>
>>> The underlying LinearSVC
>>>> implementation
>>>> uses a random number generator to select features when fitting the model.
>>>> It is thus not uncommon to have slightly different results for the same
>>>> input data. If that happens, try with a smaller *tol* parameter.
>>>
>>>
>>> Hope that helps
>>>
>>> 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn <
>>> scikit-learn at python.org>:
>>>
>>>> Hi,
>>>>
>>>> I am using a one-class SVM for developing an anomaly detection model. I
>>>> observed that different runs of training on the same data set output
>>>> different accuracies. One run takes the accuracy as high as 98% and another
>>>> run on the same data brings it down to 93%. Googling a little bit I found
>>>> out that this is happening because of the random_state
>>>> parameter
>>>> but I am not clear of the details.
>>>>
>>>> Can anyone expand on how exactly the parameter affects my training
>>>> and how I can figure out the best value to get the model with the best accuracy?
>>>>
>>>> Thanks,
>>>> Abhishek
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From m.waseem.ahmad at gmail.com Thu Aug 3 10:37:02 2017
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Thu, 3 Aug 2017 15:37:02 +0100
Subject: [scikit-learn] Extra trees tuning parameters
Message-ID:
Hi All,
I was wondering if you could please tell me what "nmin, the minimum
sample size for splitting a node" (referred to by Geurts et al., 2006) is in
the scikit-learn API for Extra-Trees? Is it min_samples_split in sklearn?
Regards
Waseem
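For illustration, a minimal sketch (toy regression data; parameter values are arbitrary) of how n_min from Geurts et al. (2006) maps onto min_samples_split in scikit-learn's Extra-Trees: a node is split only if it contains at least that many samples.

```python
# Sketch: min_samples_split plays the role of n_min from
# Geurts et al. (2006) -- a node is split only if it holds at
# least this many samples. (min_samples_leaf, by contrast,
# constrains the minimum size of a leaf.)
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

est = ExtraTreesRegressor(n_estimators=10, min_samples_split=5,
                          random_state=0)
est.fit(X, y)
```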
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From tom.duprelatour at orange.fr Thu Aug 3 11:18:17 2017
From: tom.duprelatour at orange.fr (Tom DLT)
Date: Thu, 3 Aug 2017 17:18:17 +0200
Subject: [scikit-learn] question about class_weights in
LogisticRegression
In-Reply-To:
References:
Message-ID:
The class weights and sample weights are used in the same way: as a factor
specific to each sample in the loss function.
In LogisticRegression, this is equivalent to incorporating the factor into a
regularization parameter C specific to each sample.
Tom
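A small sketch of this equivalence (toy data; the weight values are arbitrary): passing class_weight should give the same fit as passing the corresponding per-sample weights to fit(), since class weights are expanded into per-sample factors internally.

```python
# Sketch: class_weight acts as a per-sample factor in the loss, so a
# class_weight dict and the equivalent sample_weight array should
# yield the same fitted coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
w = {0: 1.0, 1: 3.0}   # arbitrary example weights

clf_cw = LogisticRegression(class_weight=w).fit(X, y)

# The same factor expressed explicitly per sample:
sw = np.where(y == 1, w[1], w[0])
clf_sw = LogisticRegression().fit(X, y, sample_weight=sw)

# The two objectives are identical, so the solutions agree.
coef_match = np.allclose(clf_cw.coef_, clf_sw.coef_, atol=1e-6)
```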
2017-08-01 18:30 GMT+02:00 Johnson, Jeremiah :
> Right, I know how the class_weight calculation is performed. But then
> those class weights are utilized during the model fit process in some way
> in liblinear, and that's what I am interested in. libSVM does
> class_weight[I] * C (https://www.csie.ntu.edu.tw/~cjlin/libsvm/); is the
> implementation in liblinear the same?
>
> Best,
> Jeremiah
>
>
>
> On 8/1/17, 12:19 PM, "scikit-learn on behalf of Stuart Reynolds"
> stuart at stuartreynolds.net> wrote:
>
> >I hope not. And not according to the docs...
> >https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/linear_model/logistic.py#L947
> >
> >class_weight : dict or 'balanced', optional
> >Weights associated with classes in the form ``{class_label: weight}``.
> >If not given, all classes are supposed to have weight one.
> >The "balanced" mode uses the values of y to automatically adjust
> >weights inversely proportional to class frequencies in the input data
> >as ``n_samples / (n_classes * np.bincount(y))``.
> >Note that these weights will be multiplied with sample_weight (passed
> >through the fit method) if sample_weight is specified.
> >
> >On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah
> > wrote:
> >> Hello all,
> >>
> >> I'm looking for confirmation on an implementation detail that is
> >> somewhere in liblinear, but I haven't found documentation for yet. When
> >> the class_weight='balanced' parameter is set in LogisticRegression, then
> >> the regularisation parameter for an observation from class I is
> >> class_weight[I] * C, where C is the usual regularization parameter; is
> >> this correct?
> >>
> >> Thanks,
> >> Jeremiah
> >>
> >>
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >>
> >>https://mail.python.org/mailman/listinfo/scikit-learn
> >>
> >_______________________________________________
> >scikit-learn mailing list
> >scikit-learn at python.org
> >https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From t3kcit at gmail.com Thu Aug 3 12:12:12 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 3 Aug 2017 12:12:12 -0400
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References:
Message-ID:
On 08/03/2017 09:17 AM, Albert Thomas wrote:
> Yes, in fact, changing the random_state might have an influence on the
> result. The docstring of the random_state parameter for the
> OneClassSVM seems incorrect though...
PR or issue welcome.
From t3kcit at gmail.com Thu Aug 3 13:35:46 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 3 Aug 2017 13:35:46 -0400
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline
with a custom transformer
In-Reply-To:
References:
Message-ID:
Hi Sam.
You need to put these into a reachable namespace (possibly as private
functions) so that they can be pickled.
Please stay on the sklearn mailing list, I might not have time to reply.
Andy
On 08/03/2017 01:24 PM, Sam Barnett wrote:
> Hi Andy,
>
> I've since tried a different solution: instead of a pipeline, I've
> simply created a classifier that is for the most part like svm.SVC,
> though it takes a few extra inputs for the sequentialisation step.
> I've used a Python function that can compute the Gram matrix between
> two datasets of any shape to pass into SVC(), though I'm now having
> trouble with pickling on the check_estimator test. It appears that
> SeqSVC.fit() doesn't like to have methods defined within it. Can you
> see how to pass this test? (the .ipynb file shows the error).
>
> Best,
> Sam
>
> On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett > wrote:
>
> You're right: it does fail without GridSearchCV when I change the
> size of seq_test. I will look at the transform tomorrow to see if
> I can work this out. Thank you for your help so far!
>
> On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller > wrote:
>
> Change the size of seq_test in your notebook and you'll see
> the failure without GridSearchCV.
> I haven't looked at your code in detail, but transform is
> supposed to work on arbitrary new data with the same number of
> features.
> Your code requires the test data to have the same shape as the
> training data.
> Cross-validation will lead to training data and test data
> having different sizes. But I feel like something is already
> wrong if your
> test data size depends on your training data size.
>
>
>
> On 08/02/2017 03:08 PM, Sam Barnett wrote:
>> Hi Andy,
>>
>> The purpose of the transformer is to take an ordinary kernel
>> (in this case I have taken 'rbf' as a default) and return a
>> 'sequentialised' kernel using a few extra parameters. Hence,
>> the transformer takes an ordinary data-target pair X, y as
>> its input, and the fit_transform(X, y) method will output the
>> Gram matrix for X that is associated with this sequentialised
>> kernel. In the pipeline, this Gram matrix is passed into an
>> SVC classifier with the kernel parameter set to 'precomputed'.
>>
>> Therefore, I do not think your hacky solution would be
>> possible. However, I am still unsure how to implement your
>> first solution: won't the Gram matrix from the transformer
>> contain all the necessary kernel values? Could you elaborate
>> further?
>>
>>
>> Best,
>> Sam
>>
>> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller
>> > wrote:
>>
>> Hi Sam.
>> GridSearchCV will do cross-validation, which requires it to
>> "transform" the test data.
>> The shape of the test-data will be different from the
>> shape of the training data.
>> You need to have the ability to compute the kernel
>> between the training data and new test data.
>>
>> A more hacky solution would be to compute the full kernel
>> matrix in advance and pass that to GridSearchCV.
>>
>> You probably don't need it here, but you should also
>> check out what the _pairwise attribute does in
>> cross-validation,
>> because that is likely to come up when playing with kernels.
>>
>> Hth,
>> Andy
>>
>>
>> On 08/02/2017 08:38 AM, Sam Barnett wrote:
>>> Dear all,
>>>
>>> I have created a 2-step pipeline with a custom
>>> transformer followed by a simple SVC classifier, and I
>>> wish to run a grid-search over it. I am able to
>>> successfully create the transformer and the pipeline,
>>> and each of these elements work fine. However, when I
>>> try to use the fit() method on my GridSearchCV object, I
>>> get the following error:
>>>
>>> 57 # during fit.
>>> 58 if X.shape != self.input_shape_:
>>> ---> 59 raise ValueError('Shape of input is
>>> different from what was seen '
>>> 60 'in `fit`')
>>> 61
>>>
>>> ValueError: Shape of input is different from what was
>>> seen in `fit`
>>>
>>> For a full breakdown of the problem, I have written a
>>> Jupyter notebook showing exactly how the error occurs
>>> (this also contains all .py files necessary to run the
>>> notebook). Can anybody see how to work through this?
>>>
>>> Many thanks,
>>> Sam Barnett
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
>
From pybokeh at gmail.com Thu Aug 3 17:48:26 2017
From: pybokeh at gmail.com (pybokeh)
Date: Thu, 3 Aug 2017 17:48:26 -0400
Subject: [scikit-learn] Help With Text Classification
In-Reply-To:
References:
Message-ID:
I found my problem. When I one-hot encoded my test part #, it resulted in
being a 1x1 matrix, when I need it to be a 1x153. This happened because I
used the default setting ('auto') for n_values, when I needed to set it to
153. Now when I horizontally stacked it to my other feature matrix, the
resulting total # of columns now correctly comes to 1294, instead of
1142. Looking back now, not sure if using Pipeline or using FeatureUnion
would have helped in this case or prevented this since this error occurred
on the prediction side, not on training or modeling side.
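A sketch of the fix described above (the category count 153 comes from the thread; note that the n_values parameter from the scikit-learn of 2017 has since been replaced by `categories`, which is what this sketch uses): fixing the category space up front means even a single test sample encodes to the full width.

```python
# Sketch: pinning the encoder's category space so that one test
# sample encodes to the full 153-column width, instead of the width
# inferred from whatever data it happens to see. (Old scikit-learn:
# n_values=153; current scikit-learn: the `categories` parameter.)
import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_parts = 153  # total number of distinct label-encoded part numbers

enc = OneHotEncoder(categories=[np.arange(n_parts)])
enc.fit(np.arange(n_parts).reshape(-1, 1))

one_test_part = np.array([[7]])      # a single label-encoded part #
encoded = enc.transform(one_test_part)  # width 153, not 1
```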
On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman
wrote:
> Use a Pipeline to help avoid this kind of issue (and others). You might
> also want to do something like http://scikit-learn.org/
> stable/auto_examples/hetero_feature_union.html
>
> On 3 August 2017 at 12:01, pybokeh wrote:
>
>> Hello,
>> I am studying this example from scikit-learn's site:
>> http://scikit-learn.org/stable/tutorial/text_analytics/worki
>> ng_with_text_data.html
>>
>> The problem that I need to solve is very similar to this example, except
>> I have one
>> additional feature column (part #) that is categorical of type string.
>> My label or target
>> values consist of just 2 values: 0 or 1.
>>
>> With that additional feature column, I am transforming it with a
>> LabelEncoder and
>> then I am encoding it with the OneHotEncoder.
>>
>> Then I am concatenating that one-hot encoded column (part #) to the
>> text/document
>> feature column (complaint), which I had applied the CountVectorizer and
>> TfidfTransformer transformations.
>>
>> Then I chose the MultinomialNB model to fit my concatenated training data
>> with.
>>
>> The problem I run into is when I invoke the prediction, I get a dimension
>> mismatch error.
>>
>> Here's my jupyter notebook gist:
>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85
>> ef86ba41424b311
>>
>> I would gladly appreciate it if someone can guide me where I went wrong.
>> Thanks!
>>
>> - Daniel
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From joel.nothman at gmail.com Thu Aug 3 18:29:10 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 4 Aug 2017 08:29:10 +1000
Subject: [scikit-learn] Help With Text Classification
In-Reply-To:
References:
Message-ID:
Pipeline helps at prediction time too.
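A minimal sketch of the kind of Pipeline suggested in this thread (the documents and labels are invented for illustration): because the vectorizer fitted on the training text is reused at predict time, the train and test feature dimensions cannot drift apart.

```python
# Sketch: a text-classification Pipeline. The vocabulary is fixed
# when fit() runs, and the same fitted vectorizer is applied at
# predict time, so no dimension mismatch can occur.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

docs = ["valve is leaking", "seal is broken",
        "valve looks fine", "seal looks fine"]
labels = [1, 1, 0, 0]
pipe.fit(docs, labels)

pred = pipe.predict(["the valve is leaking badly"])
```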
On 4 Aug 2017 7:49 am, "pybokeh" wrote:
> I found my problem. When I one-hot encoded my test part #, it resulted in
> being a 1x1 matrix, when I need it to be a 1x153. This happened because I
> used the default setting ('auto') for n_values, when I needed to set it to
> 153. Now when I horizontally stacked it to my other feature matrix, the
> resulting total # of columns now correctly comes to 1294, instead of
> 1142. Looking back now, not sure if using Pipeline or using FeatureUnion
> would have helped in this case or prevented this since this error occurred
> on the prediction side, not on training or modeling side.
>
> On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman
> wrote:
>
>> Use a Pipeline to help avoid this kind of issue (and others). You might
>> also want to do something like http://scikit-learn.org/stable
>> /auto_examples/hetero_feature_union.html
>>
>> On 3 August 2017 at 12:01, pybokeh wrote:
>>
>>> Hello,
>>> I am studying this example from scikit-learn's site:
>>> http://scikit-learn.org/stable/tutorial/text_analytics/worki
>>> ng_with_text_data.html
>>>
>>> The problem that I need to solve is very similar to this example, except
>>> I have one
>>> additional feature column (part #) that is categorical of type string.
>>> My label or target
>>> values consist of just 2 values: 0 or 1.
>>>
>>> With that additional feature column, I am transforming it with a
>>> LabelEncoder and
>>> then I am encoding it with the OneHotEncoder.
>>>
>>> Then I am concatenating that one-hot encoded column (part #) to the
>>> text/document
>>> feature column (complaint), which I had applied the CountVectorizer and
>>> TfidfTransformer transformations.
>>>
>>> Then I chose the MultinomialNB model to fit my concatenated training
>>> data with.
>>>
>>> The problem I run into is when I invoke the prediction, I get a
>>> dimension mismatch error.
>>>
>>> Here's my jupyter notebook gist:
>>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85
>>> ef86ba41424b311
>>>
>>> I would gladly appreciate it if someone can guide me where I went
>>> wrong. Thanks!
>>>
>>> - Daniel
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From sambarnett95 at gmail.com Fri Aug 4 06:29:50 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Fri, 4 Aug 2017 11:29:50 +0100
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline
with a custom transformer
Message-ID:
Hi Andy,
I have since been able to resolve the pickling issue, though I am now
getting a check_estimator failure saying that a raised error message does
not include the expected string 'fit'. In general, I am trying to use the fit() method of
my classifier to instantiate a separate SVC() classifier with a custom
kernel, fit THAT to the data, then return this instance as the fitted
version of the new classifier. Is this possible in theory? If so, what is
the best way to implement it?
As before, the requisite code and a .ipynb file is attached.
Best,
Sam
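The pattern Sam describes is possible; here is a hypothetical sketch (the class name SeqSVC comes from the thread, but the toy kernel and everything else is invented to show only the structure): fit() builds an inner SVC with a callable, module-level (hence picklable) kernel, fits it, and returns self rather than the inner estimator.

```python
# Hypothetical sketch: a classifier whose fit() instantiates and fits
# an inner SVC with a custom callable kernel. The kernel must live at
# module level (not inside fit) so the estimator can be pickled.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.svm import SVC

def toy_kernel(X, Y):
    # Stand-in for the custom "sequentialised" kernel.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists)

class SeqSVC(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0):
        self.C = C

    def fit(self, X, y):
        self.svc_ = SVC(C=self.C, kernel=toy_kernel)
        self.svc_.fit(X, y)
        return self          # return self, not the inner estimator

    def predict(self, X):
        return self.svc_.predict(X)

X = np.random.RandomState(0).randn(40, 3)
y = (X[:, 0] > 0).astype(int)
pred = SeqSVC().fit(X, y).predict(X)
```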
On Thu, Aug 3, 2017 at 6:35 PM, Andreas Mueller wrote:
> Hi Sam.
> You need to put these into a reachable namespace (possibly as private
> functions) so that they can be pickled.
> Please stay on the sklearn mailing list, I might not have time to reply.
>
> Andy
>
>
> On 08/03/2017 01:24 PM, Sam Barnett wrote:
>
> Hi Andy,
>
> I've since tried a different solution: instead of a pipeline, I've simply
> created a classifier that is for the most part like svm.SVC, though it
> takes a few extra inputs for the sequentialisation step. I've used a Python
> function that can compute the Gram matrix between two datasets of any shape
> to pass into SVC(), though I'm now having trouble with pickling on the
> check_estimator test. It appears that SeqSVC.fit() doesn't like to have
> methods defined within it. Can you see how to pass this test? (the .ipynb
> file shows the error).
>
> Best,
> Sam
>
> On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett
> wrote:
>
>> You're right: it does fail without GridSearchCV when I change the size of
>> seq_test. I will look at the transform tomorrow to see if I can work this
>> out. Thank you for your help so far!
>>
>> On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller wrote:
>>
>>> Change the size of seq_test in your notebook and you'll see the failure
>>> without GridSearchCV.
>>> I haven't looked at your code in detail, but transform is supposed to
>>> work on arbitrary new data with the same number of features.
>>> Your code requires the test data to have the same shape as the training
>>> data.
>>> Cross-validation will lead to training data and test data having
>>> different sizes. But I feel like something is already wrong if your
>>> test data size depends on your training data size.
>>>
>>>
>>>
>>> On 08/02/2017 03:08 PM, Sam Barnett wrote:
>>>
>>> Hi Andy,
>>>
>>> The purpose of the transformer is to take an ordinary kernel (in this
>>> case I have taken 'rbf' as a default) and return a 'sequentialised' kernel
>>> using a few extra parameters. Hence, the transformer takes an ordinary
>>> data-target pair X, y as its input, and the fit_transform(X, y) method will
>>> output the Gram matrix for X that is associated with this sequentialised
>>> kernel. In the pipeline, this Gram matrix is passed into an SVC classifier
>>> with the kernel parameter set to 'precomputed'.
>>>
>>> Therefore, I do not think your hacky solution would be possible.
>>> However, I am still unsure how to implement your first solution: won't the
>>> Gram matrix from the transformer contain all the necessary kernel values?
>>> Could you elaborate further?
>>>
>>>
>>> Best,
>>> Sam
>>>
>>> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller
>>> wrote:
>>>
>>>> Hi Sam.
>>>> GridSearchCV will do cross-validation, which requires it to "transform"
>>>> the test data.
>>>> The shape of the test-data will be different from the shape of the
>>>> training data.
>>>> You need to have the ability to compute the kernel between the training
>>>> data and new test data.
>>>>
>>>> A more hacky solution would be to compute the full kernel matrix in
>>>> advance and pass that to GridSearchCV.
>>>>
>>>> You probably don't need it here, but you should also check out what the
>>>> _pairwise attribute does in cross-validation,
>>>> because that is likely to come up when playing with kernels.
>>>>
>>>> Hth,
>>>> Andy
>>>>
>>>>
>>>> On 08/02/2017 08:38 AM, Sam Barnett wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I have created a 2-step pipeline with a custom transformer followed by
>>>> a simple SVC classifier, and I wish to run a grid-search over it. I am able
>>>> to successfully create the transformer and the pipeline, and each of these
>>>> elements work fine. However, when I try to use the fit() method on my
>>>> GridSearchCV object, I get the following error:
>>>>
>>>> 57 # during fit.
>>>> 58 if X.shape != self.input_shape_:
>>>> ---> 59 raise ValueError('Shape of input is different from
>>>> what was seen '
>>>> 60 'in `fit`')
>>>> 61
>>>>
>>>> ValueError: Shape of input is different from what was seen in `fit`
>>>>
>>>> For a full breakdown of the problem, I have written a Jupyter notebook
>>>> showing exactly how the error occurs (this also contains all .py files
>>>> necessary to run the notebook). Can anybody see how to work through this?
>>>>
>>>> Many thanks,
>>>> Sam Barnett
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqsvc.py
Type: text/x-python-script
Size: 3051 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sequential Kernel SVC GridSearchCV Test.ipynb
Type: application/octet-stream
Size: 7678 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SeqKernelLucy.py
Type: text/x-python-script
Size: 2628 bytes
Desc: not available
URL:
From albertthomas88 at gmail.com Fri Aug 4 08:49:16 2017
From: albertthomas88 at gmail.com (Albert Thomas)
Date: Fri, 04 Aug 2017 12:49:16 +0000
Subject: [scikit-learn] OneClassSvm | Different results on different runs
In-Reply-To:
References: