[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
Andrew Howe
ahowe42 at gmail.com
Tue Dec 27 02:18:42 EST 2016
Hi Debu
"Should I be using 2 different input datasets (completely exclusive /
disjoint) for training and scoring the models ?" Yes - this is the reason
for partitioning the data into training / testing sets. However, I can't
imagine that it's the cause of your odd results. What is the total
classification result in both training & testing (not just TPs)?
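For example, something like this (a minimal sketch with synthetic data, not your
actual pipeline) would show the full picture on both partitions, not only the
true positives:

```python
# Minimal sketch: report full classification results on BOTH the
# training and the test partitions, not only the true positives.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("TRAIN\n", classification_report(y_train, clf.predict(X_train)))
print("TEST\n", classification_report(y_test, clf.predict(X_test)))
```

Comparing the two reports side by side usually makes any train/test discrepancy obvious.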
Andrew
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
On Tue, Dec 27, 2016 at 8:26 AM, Debabrata Ghosh <mailfordebu at gmail.com>
wrote:
> Hi Joel,
>
> Thanks for your quick feedback – I certainly understand
> what you mean. Please allow me to explain one more time through the
> sequence of steps corresponding to the approach I followed:
>
> 1. I considered a dataset containing 600 K (0.6 million) records for
> training my model using scikit-learn's Random Forest Classifier library
>
> 2. I did a training and test sample split on the 600 K – forming a 480 K
> training dataset and a 120 K test dataset (80:20 split)
>
> 3. I trained scikit-learn's Random Forest Classifier model on the 480
> K (80% split) training sample
>
> 4. Then I ran prediction (predict_proba method of scikit-learn's RF
> library) on the 120 K test sample
>
> 5. I got a prediction result with a True Positive Rate (TPR) of 10-12%
> at probability thresholds above 0.5
>
> 6. I saved the above Random Forest Classifier model using scikit-learn's
> joblib library (dump method) in the form of a pickle file
>
> 7. I reloaded the model in a different Python instance from the pickle
> file mentioned above and did my scoring, i.e., used the joblib load
> method and then ran prediction (predict_proba method) on the
> entire set of my original 600 K records
>
> 8. Now when I run (score) my model via predict_proba
> on the entire set of original data (600 K), I am getting a True Positive
> Rate of around 80%.
>
> 9. I did some further analysis and figured out that during the
> training process, when the model was predicting on the test sample of 120 K,
> it could only predict 10-12% of the 120 K data beyond a probability threshold of
> 0.5. When I now try to score my model on the entire set of 600 K
> records, it appears that the model is remembering some of its past
> behavior and data and accordingly giving an 80% True Positive Rate
>
> 10. When I tried to score the model via predict_proba on a
> completely disjoint dataset from the one used for training (i.e., no
> overlap between training and scoring data), it gave me the right
> True Positive Rate (in the range of 10 – 12%)
>
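For illustration, the effect in steps 8–10 above can be reproduced with a small
synthetic sketch (made-up data, not my actual pipeline): the forest nearly
memorises its training rows, so any evaluation set that overlaps the training
data inflates the apparent rate.

```python
# Sketch with synthetic data: a random forest almost memorises its
# training rows, so scoring on a set that CONTAINS those rows looks
# far better than scoring on genuinely unseen (disjoint) data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise so the honest (disjoint) score is clearly lower
X, y = make_classification(n_samples=10000, flip_y=0.3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

print("disjoint test data:", clf.score(X_te, y_te))          # honest estimate
print("full data (80% seen in training):", clf.score(X, y))  # inflated
```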
> *Here lies my question once again:* Should I be using 2
> different input datasets (completely exclusive / disjoint) for training and
> scoring the models? If the input datasets for scoring and training
> overlap, then I get incorrect results. Is that a fair assumption?
>
> Another question – is there an alternative model scoring library
> (apart from joblib, the one I am using)?
>
>
> Thanks once again for your feedback in advance !
>
>
> Cheers,
>
>
> Debu
>
> On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Hi Debu,
>>
>> Your post is terminologically confusing, so I'm not sure I've understood
>> your problem. Where is the "different sample" used for scoring coming from?
>> Is it possible it is more related to the training data than the test sample?
>>
>> Joel
>>
>> On 27 December 2016 at 05:28, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Dear All,
>>>
>>> Greetings!
>>>
>>> I need some urgent guidance and help
>>> from you all in model scoring. What I mean by model scoring is around the
>>> following steps:
>>>
>>>
>>>
>>> 1. I have trained a Random Forest Classifier model using scikit-learn
>>> (the RandomForestClassifier class)
>>> 2. Then I have generated the True Positive and False Positive
>>> predictions on my test data set using the predict_proba method (I have split
>>> my data into training and test samples in an 80:20 ratio)
>>> 3. Finally, I have dumped the model into a pkl file.
>>> 4. Next in another instance, I have loaded the .pkl file
>>> 5. I have called the predict_proba method of the reloaded model to predict
>>> the True Positives and False Positives on a different sample. I am terming
>>> this step scoring, where I am predicting without retraining the model
>>>
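As a sketch, steps 3–5 above look roughly like this (synthetic data, arbitrary
file name; with recent versions joblib is installed as its own package, while
older scikit-learn releases also exposed it as sklearn.externals.joblib):

```python
# Minimal sketch of steps 3-5: persist the fitted model, reload it
# in another session, and call predict_proba without retraining.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, path)             # step 3: dump the model to a pickle file

clf2 = joblib.load(path)           # step 4: reload (e.g. in another process)
proba = clf2.predict_proba(X[:5])  # step 5: score without retraining
print(proba.shape)                 # (5, 2): one probability column per class
```

Note that joblib only serialises the estimator; predict_proba is a method of
the reloaded model itself, so the predictions are identical to those the
original in-memory model would give.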
>>> My question is: when I generate the True Positive Rate
>>> on the test data set (as part of the model training approach), the rate I
>>> am getting is 10 – 12%. But when I do the scoring (using the steps
>>> mentioned above), my True Positive Rate shoots up to 80%.
>>> Although I am happy to get a very high TPR, my question is whether
>>> getting such a high TPR during the scoring phase is an expected outcome. In
>>> other words, is achieving a high TPR through joblib an acceptable
>>> outcome vis-à-vis the TPR on the training / test data set?
>>>
>>> Your views on the above will be really helpful, as I am
>>> very confused about whether to consider scoring the model using joblib.
>>> Otherwise, is there any other alternative to joblib which can help me do
>>> scoring without retraining the model? Please let me know at your
>>> earliest convenience, as I am a bit pressed for time.
>>>
>>>
>>>
>>> Thanks for your help in advance!
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>