[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library

Debabrata Ghosh mailfordebu at gmail.com
Tue Dec 27 00:26:22 EST 2016


Hi Joel,

                Thanks for your quick feedback. I certainly understand what
you mean; please allow me to explain once more through the sequence of steps
I followed (a short code sketch of these steps follows the list):



   1. I considered a dataset of 600 K (0.6 million) records for training my
   model with scikit-learn's RandomForestClassifier.

   2. I split the 600 K records into a 480 K training set and a 120 K test
   set (an 80:20 split).

   3. I trained scikit-learn's RandomForestClassifier on the 480 K (80%)
   training sample.

   4. I then ran prediction (the predict_proba method) on the 120 K test
   sample.

   5. The prediction gave a True Positive Rate (TPR) of 10-12% at probability
   thresholds above 0.5.

   6. I saved the fitted Random Forest model to a pickle file using joblib's
   dump method.

   7. I reloaded the model in a different Python process from that pickle
   file using joblib's load method, and then ran predict_proba on the entire
   set of my original 600 K records (this is what I call scoring).

   8. Now, when I score the reloaded model (its predict_proba method) on the
   entire set of original data (600 K), I get a True Positive Rate of around
   80%.

   9. I did some further analysis and found that during training, the model
   predicted only 10-12% of the 120 K test records beyond a probability
   threshold of 0.5. When I score the model on the entire set of 600 K
   records, it appears the model is remembering its past training data and
   accordingly returns an 80% True Positive Rate.

   10. When I score the model (predict_proba) on a dataset completely
   disjoint from the one used for training (i.e., no overlap between training
   and scoring data), it gives me the expected True Positive Rate (in the
   range of 10-12%).
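
For concreteness, here is a minimal sketch of steps 1-7 above. The synthetic
make_classification data, the file name "rf_model.pkl" and the forest
parameters are only placeholders standing in for my actual 600 K records and
settings (sklearn.externals.joblib works the same way as the standalone
joblib package here):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 600 K records
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# Step 2: 80:20 split into training and test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: fit the Random Forest on the training sample only
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Steps 4-5: class-1 probabilities on the held-out test sample
test_proba = clf.predict_proba(X_test)[:, 1]

# Step 6: persist the fitted model with joblib
joblib.dump(clf, "rf_model.pkl")

# Step 7: in a separate Python process, reload and score the full data set
clf_loaded = joblib.load("rf_model.pkl")
all_proba = clf_loaded.predict_proba(X)[:, 1]  # note: includes training rows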

          *Here lies my question once again:* Should I be using two
completely disjoint input datasets for training and for scoring the model?
If the input datasets for scoring and training overlap, I get inflated
results. Is that a fair conclusion?
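
To illustrate the comparison I am describing, here is a small sketch that
reuses the placeholder objects from the code above, taking TPR to mean
TP / (TP + FN) at a 0.5 probability threshold:

import numpy as np

def true_positive_rate(y_true, proba, threshold=0.5):
    # Fraction of actual positives that are predicted positive
    y_pred = (proba > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / float(tp + fn)

# Held-out estimate: only the 20% test rows the model never saw
tpr_heldout = true_positive_rate(y_test, clf.predict_proba(X_test)[:, 1])

# Overlapping estimate: all rows, 80% of which the forest was trained on
tpr_overlap = true_positive_rate(y, clf.predict_proba(X)[:, 1])

print(tpr_heldout, tpr_overlap)  # the second number is optimistically inflated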

          Another question: is there an alternative model-scoring library to
joblib (the one I am using)?
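
For example, would using the standard-library pickle module instead of
joblib make any difference? A sketch of what I mean, reusing the fitted clf
and the test data from the code above:

import pickle

with open("rf_model_pickle.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("rf_model_pickle.pkl", "rb") as f:
    clf_restored = pickle.load(f)

# Scoring is always the estimator's own predict_proba, however the model
# was persisted
scores = clf_restored.predict_proba(X_test)[:, 1]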


         Thanks once again in advance for your feedback!


Cheers,


Debu

On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <joel.nothman at gmail.com>
wrote:

> Hi Debu,
>
> Your post is terminologically confusing, so I'm not sure I've understood
> your problem. Where is the "different sample" used for scoring coming from?
> Is it possible it is more related to the training data than the test sample?
>
> Joel
>
> On 27 December 2016 at 05:28, Debabrata Ghosh <mailfordebu at gmail.com>
> wrote:
>
>> Dear All,
>>
>>                                 Greetings!
>>
>>                                 I need some urgent guidance and help
>> from you all in model scoring. What I mean by model scoring involves the
>> following steps:
>>
>>
>>
>>    1. I have trained a Random Forest Classifier model using scikit-learn
>>    (the RandomForestClassifier class)
>>    2. Then I have generated the True Positive and False Positive
>>    predictions on my test data set using the predict_proba method (I have
>>    split my data into training and test samples in an 80:20 ratio)
>>    3. Finally, I have dumped the model into a pkl file.
>>    4. Next in another instance, I have loaded the .pkl file
>>    5. I have invoked the loaded model's predict_proba method to predict the
>>    True Positives and False Positives on a different sample. I am terming
>>    this step scoring, wherein I am predicting without retraining the model
>>
>>                 My question is: when I generate the True Positive Rate on
>> the test data set (as part of the model training approach), the rate I get
>> is 10-12%. But when I do the scoring (using the steps mentioned above), my
>> True Positive Rate shoots up to 80%. Although I am happy to get a very high
>> TPR, my question is whether getting such a high TPR during the scoring
>> phase is an expected outcome. In other words, is achieving a high TPR
>> through joblib an expected outcome vis-à-vis the TPR on the training / test
>> data set?
>>
>>                 Your views on the above will be really helpful, as I am
>> very confused about whether to score the model using joblib. Otherwise, is
>> there any alternative to joblib that can help me do scoring without
>> retraining the model? Please let me know at your earliest convenience, as
>> I am a bit pressed for time.
>>
>>
>>
>> Thanks for your help in advance!
>>
>>
>>
>> Cheers,
>>
>> Debu
>>