<div dir="ltr">

<p class="MsoNormal">Hi Joel,</p>

<p class="MsoNormal"><span>                </span>Thanks
for your quick feedback. I understand what you mean, so please allow me to
explain once more, as a sequence of steps matching the approach
I followed:</p><p class="MsoNormal"><br></p>

<ol start="1" style="margin-top:0in" type="1"><li class="MsoNormal">I started with a dataset
     of 600 K (0.6 million) records for training my model with
     scikit-learn’s RandomForestClassifier</li></ol>

<p style="margin-left:0.25in" class="MsoNormal"> </p>

<ol start="2" style="margin-top:0in" type="1"><li class="MsoNormal">I split the 600 K records
     into a 480 K training dataset and a 120 K test dataset
     (80:20 split)</li></ol>

<p style="margin-left:0.25in" class="MsoNormal">  <br></p>

<ol start="3" style="margin-top:0in" type="1"><li class="MsoNormal">I trained scikit-learn’s
     RandomForestClassifier on the 480 K (80% split) training sample</li></ol>

<p style="margin-left:0.25in" class="MsoNormal"> </p>

<ol start="4" style="margin-top:0in" type="1"><li class="MsoNormal">Then I ran prediction
     (the predict_proba method of scikit-learn’s RandomForestClassifier) on the
     120 K test sample</li></ol>

<p style="margin-left:0.25in" class="MsoNormal">  <br></p>

<ol start="5" style="margin-top:0in" type="1"><li class="MsoNormal">This gave a True Positive
     Rate (TPR) of 10-12% at probability thresholds above
     0.5</li></ol>

<p style="margin-left:0.25in" class="MsoNormal"> </p>

<ol start="6" style="margin-top:0in" type="1"><li class="MsoNormal">I saved the trained
     RandomForestClassifier model to a pickle file using the dump method of
     the joblib library bundled with scikit-learn</li></ol>

<p style="margin-left:0.25in" class="MsoNormal">  <br></p>

<ol start="7" style="margin-top:0in" type="1"><li class="MsoNormal">I reloaded the model from
     that pickle file in a different Python instance and did my
     scoring, i.e., I used joblib’s load method and then ran prediction
     (the predict_proba method) on the entire set of my original 600 K records</li></ol>

<p style="margin-left:0.25in" class="MsoNormal"> </p>

<ol start="8" style="margin-top:0in" type="1"><li class="MsoNormal">When I score the reloaded
     model with predict_proba on the entire set of
     original data (600 K), I get a True Positive Rate of around 80%.</li></ol>

<p class="gmail-MsoListParagraph"> </p><ol start="9" style="margin-top:0in" type="1"><li class="MsoNormal">I did some further analysis and found that
     during the training process, when the model was predicting on the test
     sample of 120 K, it could only predict 10-12% of the 120 K data beyond a
     probability threshold of 0.5. Now that I am scoring the model on
     the entire set of 600 K records, it appears that the model is remembering
     the data it was trained on and is therefore producing an 80% True Positive
     Rate</li></ol>

<p style="margin-left:0.25in" class="MsoNormal"> </p>

<ol start="10" style="margin-top:0in" type="1"><li class="MsoNormal">When I scored the reloaded
     model with predict_proba on a dataset completely disjoint from the
     one used for training (i.e., no overlap between training and scoring data),
     it gave me the expected True Positive Rate (in the range of
     10-12%)</li></ol>
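<p class="MsoNormal">The steps above can be sketched roughly as follows. This is only a minimal sketch and not my actual code: the synthetic data from make_classification, the scaled-down sample size, the helper true_positive_rate and the file name rf_model.pkl are all assumptions for illustration. It also imports the standalone joblib package; older scikit-learn releases exposed the same functions as sklearn.externals.joblib.</p>

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def true_positive_rate(y_true, proba, threshold=0.5):
    """Fraction of actual positives whose predicted probability exceeds the threshold."""
    positives = np.asarray(y_true) == 1
    predicted_positive = np.asarray(proba) > threshold
    return float((predicted_positive & positives).sum() / positives.sum())


# Step 1: synthetic stand-in for the 600 K records (assumption: scaled down,
# imbalanced classes, some label noise).
X, y = make_classification(n_samples=6000, n_features=20, n_informative=3,
                           weights=[0.9, 0.1], flip_y=0.1, random_state=0)

# Step 2: 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 3: train on the 80% sample only.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Steps 4-5: TPR on the held-out 20% sample.
tpr_test = true_positive_rate(y_test, clf.predict_proba(X_test)[:, 1])

# Steps 6-7: dump the model, then load it back (here in the same process).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model.pkl")  # hypothetical file name
    joblib.dump(clf, model_path)
    loaded = joblib.load(model_path)

# Step 8: TPR on the full dataset, 80% of which the model was trained on.
tpr_full = true_positive_rate(y, loaded.predict_proba(X)[:, 1])

# Step 10: on data not used for training, the reloaded model behaves like the
# original one.
same_after_reload = np.allclose(clf.predict_proba(X_test),
                                loaded.predict_proba(X_test))

print(f"TPR on held-out test sample: {tpr_test:.2f}")
print(f"TPR on full dataset:         {tpr_full:.2f}")
print(f"Reloaded model matches:      {same_after_reload}")
```

<p class="MsoNormal">In a sketch like this, the TPR on the full dataset comes out higher than the TPR on the held-out sample, because most of the full dataset was seen during training, while the reloaded model returns the same probabilities as the original one.</p>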

<p class="MsoNormal"> <span>         </span><u><b>So my
question, once again:</b></u> Should I be using two different input datasets (completely
exclusive / disjoint) for training and scoring the model? If the input
datasets for scoring and training overlap, I get incorrect results. Is
that a fair assumption?</p>

<p class="MsoNormal"><span>          </span>Another question:
is there an alternative model scoring library (apart from joblib, the one I am
using)?</p>
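<p class="MsoNormal">To make this second question concrete, here is a minimal sketch comparing joblib with the standard-library pickle module. Both only save and restore the fitted estimator; the scoring itself is done by the estimator's own predict_proba. The synthetic data and the variable names are assumptions for illustration, and the standalone joblib package is imported (older scikit-learn releases exposed it as sklearn.externals.joblib).</p>

```python
import io
import pickle

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset and fitted model (assumption: stand-ins for the real ones).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Alternative 1: round-trip through the standard-library pickle module.
restored_pickle = pickle.loads(pickle.dumps(clf))

# Alternative 2: round-trip through joblib, using an in-memory buffer
# instead of a file on disk.
buffer = io.BytesIO()
joblib.dump(clf, buffer)
buffer.seek(0)
restored_joblib = joblib.load(buffer)

# Both restored models should reproduce the original probabilities.
reference = clf.predict_proba(X)
pickle_matches = np.allclose(reference, restored_pickle.predict_proba(X))
joblib_matches = np.allclose(reference, restored_joblib.predict_proba(X))

print(f"pickle round-trip matches: {pickle_matches}")
print(f"joblib round-trip matches: {joblib_matches}")
```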

<p class="MsoNormal"><br></p><p class="MsoNormal">         Thanks in advance once again for your feedback!</p><p class="MsoNormal"><br></p><p class="MsoNormal">Cheers,</p><p class="MsoNormal"><br></p><p class="MsoNormal">Debu<br></p>

</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <span dir="ltr"><<a href="mailto:joel.nothman@gmail.com" target="_blank">joel.nothman@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Debu,<div><br></div><div>Your post is terminologically confusing, so I'm not sure I've understood your problem. Where is the "different sample" used for scoring coming from? Is it possible it is more related to the training data than the test sample?</div><div><br></div><div>Joel</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On 27 December 2016 at 05:28, Debabrata Ghosh <span dir="ltr"><<a href="mailto:mailfordebu@gmail.com" target="_blank">mailfordebu@gmail.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">

<p class="MsoNormal">Dear All,</p>

<p class="MsoNormal"><span>                              <wbr>  </span>Greetings!</p>

<p class="MsoNormal"><span>                              <wbr>  </span>I
urgently need some guidance and help from you all with model scoring. By
model scoring I mean the following steps:</p><p class="MsoNormal"><br></p>

<ol start="1" style="margin-top:0in" type="1"><li class="MsoNormal">I have trained a Random
     Forest Classifier model using scikit-learn (RandomForestClassifier)</li><li class="MsoNormal">Then I generated the
     True Positive and False Positive predictions on my test dataset using
     the predict_proba method (I split my data into training and test
     samples in an 80:20 ratio)</li><li class="MsoNormal">Finally, I dumped the
     model into a .pkl file.</li><li class="MsoNormal">Next, in another instance,
     I loaded the .pkl file</li><li class="MsoNormal">I called the predict_proba
     method of the model loaded through joblib to predict the True Positives and
     False Positives on a different sample. I am terming this step scoring,
     where I am predicting without
     retraining the model</li></ol>

<p style="margin-left:0.5in" class="MsoNormal"><span>                </span>My
question is this: when I compute the True Positive Rate on the test dataset (as
part of the model training approach), the rate I get is 10-12%. But
when I do the scoring (using the steps mentioned above), my True Positive Rate
shoots up to 80%. Although I am happy to get a very high TPR, my
question is whether getting such a high TPR during the scoring phase is an
expected outcome. In other words, is a high TPR obtained through joblib
an acceptable outcome compared with the TPR on the training / test dataset?</p>

<p style="margin-left:0.5in" class="MsoNormal"><span>                </span>Your
views on the above will be really helpful, as I am very unsure whether to
score the model using joblib. Otherwise, is there any
alternative to joblib that can help me do scoring without retraining the model?
Please let me know at your earliest convenience, as I am a bit pressed for time.<br></p>

<p style="margin-left:0.5in" class="MsoNormal"> </p>

<p style="margin-left:0.5in" class="MsoNormal">Thanks for your help in advance!</p>

<p style="margin-left:0.5in" class="MsoNormal"> </p>

<p style="margin-left:0.5in" class="MsoNormal">Cheers,</p>

<p style="margin-left:0.5in" class="MsoNormal">Debu</p>

</div>

<br></div></div>______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>

<br></blockquote></div><br></div>


<br></blockquote></div><br></div>