[scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column

Naoya Kanai naopon at gmail.com
Tue Jan 24 02:51:43 EST 2017


You need to write your own function to compute a vector assigning a weight
to each sample in X, then pass that as the sample_weight parameter to
RandomForestClassifier.fit()
<http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit>.
If you also set class_weight in the model constructor, class_weight and
sample_weight are multiplied together for each sample.
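
For example, here is a minimal sketch (the data is synthetic, and the
`from_x` indicator marking which column each row's label came from is
hypothetical; adapt the weight rule to your own dataset):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    y = rng.randint(0, 2, size=200)
    # Hypothetical flag: True where the row's label came from column 'x'.
    from_x = rng.rand(200) < 0.5

    # Rows labelled via 'x' get 10 times the weight of rows labelled via 'y'.
    sample_weight = np.where(from_x, 10.0, 1.0)

    clf = RandomForestClassifier(random_state=0, class_weight={0: 1, 1: 10})
    clf.fit(X, y, sample_weight=sample_weight)
    # Effective weight of sample i = class_weight[y[i]] * sample_weight[i]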

On Mon, Jan 23, 2017 at 11:36 PM, Debabrata Ghosh <mailfordebu at gmail.com>
wrote:

> What would be a sample command for achieving this? Sorry, I'm a bit new to
> this area, which is why example commands will help me understand it better.
>
> Thanks again!
>
> On Tue, Jan 24, 2017 at 6:58 AM, Josh Vredevoogd <cleverless at gmail.com>
> wrote:
>
>> If you do not want the weights to be uniform by class, then you need to
>> generate weights for each sample and pass the sample weight vector to the
>> fit method of the classifier.
>>
>> On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Thanks, Josh, for your quick feedback! It's quite helpful indeed.
>>>
>>> Following on from it, I have another burning question. In my sample
>>> dataset, I have 2 label columns (let's say x and y).
>>>
>>> My objective is to give the labels within column 'x' 10 times more
>>> weight compared to the labels within column 'y'.
>>>
>>> My question is that the parameter class_weight={0: 1, 1: 10} works within
>>> a single column, i.e., within a single column I have assigned 10 times the
>>> weight to the positive labels.
>>>
>>> But my objective is to give 10 times the weight to the positive labels
>>> within column 'x' compared to the positive labels within column 'y'.
>>>
>>> May I please get feedback from you on how to achieve this?
>>> Thanks for your help in advance!
>>>
>>> On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd <cleverless at gmail.com>
>>> wrote:
>>>
>>>> If you undersample, taking only 10% of the negative class, the
>>>> classifier will see different combinations of attributes and produce a
>>>> different fit to explain those distributions. In the worst case, imagine
>>>> you are classifying birds and, through sampling, you eliminate all `red`
>>>> examples. Your classifier will now likely not understand that red objects
>>>> can be birds. That's an overly simple example, but given a classifier
>>>> capable of exploring and explaining feature combinations, less obvious
>>>> versions of this are bound to happen.
>>>>
>>>> The extrapolation only works in the other direction: if you manually
>>>> duplicate samples by the sampling factor, you should get the exact same fit
>>>> as if you increased the class weight.
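>>>>
>>>> You can check this with a quick sketch (bootstrap is disabled here so
>>>> the two fits are directly comparable; with bootstrapping the resampling
>>>> itself differs, so the match is only approximate):
>>>>
>>>>     import numpy as np
>>>>     from sklearn.datasets import make_classification
>>>>     from sklearn.ensemble import RandomForestClassifier
>>>>
>>>>     X, y = make_classification(n_samples=200, random_state=0)
>>>>     pos = y == 1
>>>>
>>>>     # Fit 1: each class-1 sample weighted 10x via class_weight.
>>>>     a = RandomForestClassifier(
>>>>         bootstrap=False, random_state=0, class_weight={0: 1, 1: 10},
>>>>     ).fit(X, y)
>>>>
>>>>     # Fit 2: each class-1 sample physically duplicated to appear 10x.
>>>>     X2 = np.vstack([X, np.repeat(X[pos], 9, axis=0)])
>>>>     y2 = np.concatenate([y, np.repeat(y[pos], 9)])
>>>>     b = RandomForestClassifier(bootstrap=False, random_state=0).fit(X2, y2)
>>>>
>>>>     print((a.predict(X) == b.predict(X)).all())  # expect True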
>>>>
>>>> Hope that helps,
>>>> Josh
>>>>
>>>>
>>>> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh <mailfordebu at gmail.com
>>>> > wrote:
>>>>
>>>>> Thanks, Josh!
>>>>>
>>>>> I have used the parameter class_weight={0: 1, 1: 10} and the model
>>>>> code has run successfully. However, just to gain further clarity on its
>>>>> concept, I have another question for you. I did the following 2 tests:
>>>>>
>>>>> 1. In my dataset, I have 1 million negative-class records and 10,000
>>>>> positive-class records. First I ran my model code without supplying any
>>>>> class_weight parameter, and it gave me certain True Positive and False
>>>>> Positive results.
>>>>>
>>>>> 2. In the second test, I had the same 1 million negative-class records
>>>>> but reduced the positive-class records to 1,000. This time, I supplied the
>>>>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False
>>>>> Positive results.
>>>>>
>>>>> My question is, when I multiply the results obtained from my second
>>>>> test by a factor of 10, they don't match the results obtained from my
>>>>> first test. In other words, say I get 8 true positives against a threshold
>>>>> in the second test, while the true positives from the first test against
>>>>> the same threshold are 260. I see similar behaviour for the false
>>>>> positive results: if I multiply the results obtained in the second test
>>>>> by 10, I don't come close to the results obtained from the first test.
>>>>>
>>>>> Is my expectation correct? Is my way of executing the test (i.e.,
>>>>> reducing the positive-class records by a factor of 10 and then feeding a
>>>>> class weight 10 times that of the negative class) and comparing the
>>>>> results with a model run without any class_weight parameter correct?
>>>>>
>>>>> Please let me know at your convenience, as this will help me greatly
>>>>> in understanding the concept further.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <cleverless at gmail.com
>>>>> > wrote:
>>>>>
>>>>>> The class_weight parameter doesn't behave the way you're expecting.
>>>>>>
>>>>>> The value in class_weight is the weight applied to each sample in
>>>>>> that class - in your example, each class-zero sample has weight 0.001 and
>>>>>> each class-one sample has weight 0.999, so each class-one sample carries
>>>>>> 999 times the weight of a class-zero sample.
>>>>>>
>>>>>> If you would like each class-one sample to have ten times the weight,
>>>>>> you would set `class_weight={0: 1, 1: 10}` or, equivalently,
>>>>>> `class_weight={0: 0.1, 1: 1}`.
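>>>>>>
>>>>>> For instance, a quick sketch on synthetic data (only the overall scale
>>>>>> of the weights differs between the two settings, so the fits should
>>>>>> match):
>>>>>>
>>>>>>     from sklearn.datasets import make_classification
>>>>>>     from sklearn.ensemble import RandomForestClassifier
>>>>>>
>>>>>>     X, y = make_classification(n_samples=200, random_state=0)
>>>>>>     a = RandomForestClassifier(class_weight={0: 1, 1: 10},
>>>>>>                                random_state=0).fit(X, y)
>>>>>>     b = RandomForestClassifier(class_weight={0: 0.1, 1: 1},
>>>>>>                                random_state=0).fit(X, y)
>>>>>>     print((a.predict(X) == b.predict(X)).all())  # expect True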
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh <
>>>>>> mailfordebu at gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>              Greetings!
>>>>>>>
>>>>>>>               I have a very basic question regarding the usage of
>>>>>>> the class_weight parameter in scikit-learn's Random Forest Classifier.
>>>>>>>
>>>>>>>               I have a fairly unbalanced sample: my positive class to
>>>>>>> negative class ratio is 1:100. In other words, I have a million records
>>>>>>> in the negative class and 10,000 records in the positive class. I have
>>>>>>> successfully trained the random forest classifier model using the above
>>>>>>> record set.
>>>>>>>
>>>>>>>               Further, for a different problem, I want to test the
>>>>>>> class_weight parameter. So I am setting class_weight={0: 0.001, 1: 0.999}
>>>>>>> and I have tried running my model on the same dataset as mentioned in the
>>>>>>> above paragraph, but with the positive-class records reduced to 1,000
>>>>>>> [because now each positive-class record is given approximately 10 times
>>>>>>> more weight than a negative-class record]. However, the model results are
>>>>>>> very different between the 2 runs (with and without class_weight), and I
>>>>>>> expected similar results.
>>>>>>>
>>>>>>>                 Would you please be able to let me know where I am
>>>>>>> going wrong? I know it's something silly, but I just want to improve my
>>>>>>> understanding.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>