[scikit-learn] class_weight: How to assign a higher weight to values in a specific column as opposed to values in another column

Debabrata Ghosh mailfordebu at gmail.com
Tue Jan 24 02:36:57 EST 2017


What would be a sample command for achieving this? Sorry, I'm a bit new to
this area, so example commands help me understand best.

Thanks again!

On Tue, Jan 24, 2017 at 6:58 AM, Josh Vredevoogd <cleverless at gmail.com>
wrote:

> If you do not want the weights to be uniform by class, then you need to
> generate weights for each sample and pass the sample weight vector to the
> fit method of the classifier.
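>
> For instance, a minimal sketch of that (the data, classifier choice, and
> weighting rule here are hypothetical, not from this thread; the weight
> vector can be built from any per-sample rule, e.g. from one of your label
> columns):
>
>     import numpy as np
>     from sklearn.ensemble import RandomForestClassifier
>
>     # toy data: 1000 samples, roughly 10% positive
>     rng = np.random.RandomState(0)
>     X = rng.rand(1000, 5)
>     y = (rng.rand(1000) < 0.1).astype(int)
>
>     # one weight per sample; here each positive sample counts 10x as
>     # much as a negative one
>     sample_weight = np.where(y == 1, 10.0, 1.0)
>
>     clf = RandomForestClassifier(n_estimators=100, random_state=0)
>     clf.fit(X, y, sample_weight=sample_weight)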
>
> On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh <mailfordebu at gmail.com>
> wrote:
>
>> Thanks, Josh, for your quick feedback! It's quite helpful indeed.
>>
>> Following on from that, I have another burning question. In my sample
>> dataset, I have 2 label columns (let's say x and y).
>>
>> My objective is to give the labels within column 'x' 10 times more weight
>> than the labels within column 'y'.
>>
>> My understanding is that the parameter class_weight={0: 1, 1: 10} works
>> within a single column, i.e., within a single column it assigns 10 times
>> the weight to the positive labels.
>>
>> But my objective is to give the positive labels within column 'x' 10 times
>> the weight of the positive labels within column 'y'.
>>
>> Could you please give me feedback on how to achieve this? Thanks in
>> advance for your help!
>>
>> On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd <cleverless at gmail.com>
>> wrote:
>>
>>> If you undersample, taking only 10% of the negative class, the
>>> classifier will see different combinations of attributes and produce a
>>> different fit to explain those distributions. In the worst case, imagine
>>> you are classifying birds and through sampling you eliminate all `red`
>>> examples. Your classifier will now likely not understand that red objects
>>> can be birds. That's an overly simple example, but given a classifier
>>> capable of exploring and explaining feature combinations, less obvious
>>> versions of this are bound to happen.
>>>
>>> The extrapolation only works in the other direction: if you manually
>>> duplicate samples by the sampling factor, you should get the exact same fit
>>> as if you increased the class weight.
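>>>
>>> As a rough check of that equivalence, here is a hypothetical sketch with
>>> toy data, using a single deterministic tree so that a random forest's
>>> bootstrapping does not blur the comparison:
>>>
>>>     import numpy as np
>>>     from sklearn.tree import DecisionTreeClassifier
>>>
>>>     rng = np.random.RandomState(0)
>>>     X = rng.rand(500, 4)
>>>     y = (rng.rand(500) < 0.1).astype(int)
>>>
>>>     # fit A: give each positive sample 10x weight via class_weight
>>>     fit_a = DecisionTreeClassifier(random_state=0,
>>>                                    class_weight={0: 1, 1: 10}).fit(X, y)
>>>
>>>     # fit B: physically duplicate each positive sample 10 times instead
>>>     reps = np.where(y == 1, 10, 1)
>>>     fit_b = DecisionTreeClassifier(random_state=0).fit(
>>>         np.repeat(X, reps, axis=0), np.repeat(y, reps))
>>>
>>>     # the two trees should make the same predictions
>>>     print((fit_a.predict(X) == fit_b.predict(X)).all())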
>>>
>>> Hope that helps,
>>> Josh
>>>
>>>
>>> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh <mailfordebu at gmail.com>
>>> wrote:
>>>
>>>> Thanks, Josh!
>>>>
>>>> I have used the parameter class_weight={0: 1, 1: 10} and the model code
>>>> has run successfully. However, just to get further clarity on the
>>>> concept, I have another question for you. I ran the following 2 tests:
>>>>
>>>> 1. My dataset has 1 million negative-class records and 10,000
>>>> positive-class records. First I ran my model code without supplying any
>>>> class_weight parameter, and it gave me certain True Positive and False
>>>> Positive results.
>>>>
>>>> 2. In the second test, I kept the same 1 million negative-class records
>>>> but reduced the positive-class records to 1,000. This time, I supplied
>>>> the parameter class_weight={0: 1, 1: 10} and again recorded my True
>>>> Positive and False Positive results.
>>>>
>>>> My question is: when I multiply the results from my second test by a
>>>> factor of 10, they don't match the results from my first test. In other
>>>> words, say the true positive count against a threshold from the second
>>>> test is 8, while the true positive count from the first test against the
>>>> same threshold is 260. I see the same pattern for the false positive
>>>> results: if I multiply the results from the second test by 10, I don't
>>>> come close to the results from the first test.
>>>>
>>>> Is my expectation correct? Is my way of executing the test (i.e.,
>>>> reducing the positive-class records by a factor of 10 and then giving
>>>> the positive class 10 times the weight of the negative class) and
>>>> comparing the results with a model run without any class_weight
>>>> parameter correct?
>>>>
>>>> Please let me know at your convenience, as this will help me greatly in
>>>> understanding the concept further.
>>>>
>>>> Thanks in advance!
>>>>
>>>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <cleverless at gmail.com>
>>>> wrote:
>>>>
>>>>> The class_weight parameter doesn't behave the way you're expecting.
>>>>>
>>>>> The value in class_weight is the weight applied to each sample in that
>>>>> class - in your example, each class zero sample has weight 0.001 and
>>>>> each class one sample has weight 0.999, so each class one sample
>>>>> carries 999 times the weight of a class zero sample.
>>>>>
>>>>> If you would like each class one sample to have ten times the weight,
>>>>> you would set `class_weight={0: 1, 1: 10}` or, equivalently,
>>>>> `class_weight={0: 0.1, 1: 1}`.
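>>>>>
>>>>> In code, that would look like this (a minimal sketch; the estimator
>>>>> follows the original question, and only the ratio between the two
>>>>> weights matters):
>>>>>
>>>>>     from sklearn.ensemble import RandomForestClassifier
>>>>>
>>>>>     # each class one sample carries 10x the weight of a class zero one
>>>>>     clf = RandomForestClassifier(class_weight={0: 1, 1: 10})
>>>>>
>>>>>     # equivalent, since class weights only matter up to a common factor
>>>>>     clf = RandomForestClassifier(class_weight={0: 0.1, 1: 1})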
>>>>>
>>>>>
>>>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh <
>>>>> mailfordebu at gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>              Greetings!
>>>>>>
>>>>>>               I have a very basic question regarding the usage of the
>>>>>> class_weight parameter of scikit-learn's RandomForestClassifier.
>>>>>>
>>>>>>               I have a fairly unbalanced sample: my positive class to
>>>>>> negative class ratio is 1:100. In other words, I have a million records
>>>>>> corresponding to the negative class and 10,000 records corresponding to
>>>>>> the positive class. I have successfully trained the random forest
>>>>>> classifier model on the above record set.
>>>>>>
>>>>>>               Further, for a different problem, I want to test the
>>>>>> class_weight parameter. So, I am setting
>>>>>> class_weight={0: 0.001, 1: 0.999} and I have tried running my model on
>>>>>> the same dataset as mentioned in the above paragraph, but with the
>>>>>> positive class records reduced to 1,000 [because now each positive
>>>>>> class sample is given approximately 10 times more weight than a
>>>>>> negative class sample]. However, the model results are very different
>>>>>> between the 2 runs (with and without class_weight), and I expected
>>>>>> similar results.
>>>>>>
>>>>>>                 Would you please be able to let me know where I am
>>>>>> going wrong? I know it's something silly, but I just want to improve my
>>>>>> understanding.
>>>>>>
>>>>>> Thanks!
>>>>>>