[scikit-learn] scikit-learn Digest, Vol 14, Issue 6
Mamun Rashid
mamunbabu2001 at gmail.com
Fri May 19 06:05:12 EDT 2017
Hi J.B. and the list,
Please accept my apologies for the much-delayed response. I was ill for the last few days and did not check my email.
Thanks for your detailed response.
> What does "performing better" mean in this case? How are you defining performance? A particular metric such as MCC, PPV, or NPV?
I was looking at precision and recall. I have a large class imbalance [the positive class is much smaller than the negative class].
So I am testing the performance of various classifiers with an increasing negative set size (each time I randomly select a larger negative set).
SVC seems to perform better in precision-recall space (the SVC precision-recall curve lies above the RFC curve).
I am dealing with the following two issues:
1. I have a major class imbalance.
2. Some of my positive observations are tightly packed within clusters of negative observations [as seen in 2-dimensional PCA and t-SNE plots].
My aim is to obtain a very clean set of positive predictions; as a trade-off, I am happy to sacrifice some of the positive observations.
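
A minimal sketch of the "clean positives at the cost of recall" trade-off described above: read the precision-recall curve and pick the lowest decision threshold that reaches a target precision. The synthetic data, the 0.9 target, and the variable names are assumptions for illustration only, not the data or settings from this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data (about 5% positives), a stand-in for the real set.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.5, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
scores = clf.decision_function(X_test)  # signed distance to the boundary

precision, recall, thresholds = precision_recall_curve(y_test, scores)

target_precision = 0.9  # hypothetical requirement for a "clean" positive set
# precision/recall have one more entry than thresholds; drop the final point.
ok = np.flatnonzero(precision[:-1] >= target_precision)
if ok.size:
    # Thresholds are increasing and recall is non-increasing, so the first
    # qualifying index keeps the most positives at the target precision.
    i = ok[0]
    chosen = thresholds[i]
    print(f"threshold={chosen:.3f} precision={precision[i]:.2f} recall={recall[i]:.2f}")
else:
    chosen = None
    print("No threshold reaches the target precision on this data.")
```

In practice the threshold would be chosen on a validation fold rather than on the data used for the final evaluation, otherwise the reported precision is optimistic.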
> Also, how is the cross-validation being done - is the data shuffled before the train/test groups are created? Is the exact same split of training and test data per fold used for both SVC and RF?
I am currently testing it. Thanks for the suggestion.
> Normalizing your continuous values seems quite fine, but consider these aspects:
> --Does it make sense in the domain of your problem to Z-normalize the integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make more sense? This is similar to the previous consideration.
Yes, Z-score normalisation does not make much sense for the integer-valued features. Thanks for pointing it out. I am currently testing the alternative.
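
A sketch of treating the three feature types differently: binary columns untouched, count columns centred on the median, continuous columns Z-scored. It assumes a scikit-learn version that provides `ColumnTransformer`; the column layout and the use of `RobustScaler` (which subtracts the median and also divides by the interquartile range) are illustrative choices, not something prescribed in this thread.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Hypothetical feature matrix: 8 binary, 1 count, 1 continuous column.
X = np.hstack([
    rng.integers(0, 2, size=(n, 8)),      # type A: binary
    rng.poisson(3.0, size=(n, 1)),        # type B: counts
    rng.normal(50.0, 10.0, size=(n, 1)),  # type C: continuous
]).astype(float)

binary_cols = list(range(0, 8))
count_cols = [8]
continuous_cols = [9]

# Output columns follow the order of the transformers listed here.
preprocess = ColumnTransformer([
    ("binary", "passthrough", binary_cols),             # left as is
    ("counts", RobustScaler(), count_cols),             # median / IQR
    ("continuous", StandardScaler(), continuous_cols),  # mean / sd
])

X_scaled = preprocess.fit_transform(X)
print(X_scaled[:3].round(2))
```

Placing `preprocess` as the first step of a pipeline in front of the SVC would keep the scaling statistics restricted to the training fold during cross-validation.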
> --What happens to SVC if you don't normalise?
SVC performs quite badly.
> --What happens to RF if you do normalise?
This is interesting. My understanding was that decision-tree-based algorithms do not require normalised data. I took your suggestion and tested an RFC with and without normalised data.
The results [confusion matrix at the 0.5 operating point] seem to be identical, which felt odd to me. I have only tested on a small data set so far and am currently running it on different data sets to see whether this persists. Would you have expected this?
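
Tree splits depend only on the ordering of values within each feature, so a monotonic rescaling such as Z-scoring would be expected to leave a random forest essentially unchanged when the random seed is fixed. The sketch below checks this on synthetic data; the data set and settings are assumptions for illustration, and tiny floating-point differences could in principle change an occasional prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, a stand-in for the real feature matrix.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)

# Same seed for both forests so bootstrap samples and feature draws match.
rf_raw = RandomForestClassifier(n_estimators=100, random_state=0)
rf_raw.fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=0)
rf_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

print(confusion_matrix(y_test, pred_raw))
print(confusion_matrix(y_test, pred_scaled))
agreement = np.mean(pred_raw == pred_scaled)
print("fraction of identical predictions:", agreement)
```

If the two forests were fit with different seeds, small differences would appear, but those would come from the randomness of the forest rather than from the scaling.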
> If you can provide a good amount of concrete data to present along with your "problem", this community is excellent at providing intelligent, helpful responses.
Thanks a lot for your suggestion. I will try to create some example data sets and results from the current analysis and post them as soon as possible.
Thanks in advance for your help.
Regards,
Mamun
> Today's Topics:
>
> 1. SVC data normalisation (Mamun Rashid)
>
> Message: 2
> Date: Mon, 8 May 2017 21:48:28 +0900
> From: "Brown J.B." <jbbrown at kuhp.kyoto-u.ac.jp>
>
> Dear Mamun,
>
>> *A.* 80% features are binary [ 0 or 1 ]
>> *B.* 10% are integer values representing counts / occurrences.
>> *C.* 10% are continuous values between different ranges.
>>
>> My prior understanding was that decision tree based algorithms work better
>> on mixed data types. In this particular case I am noticing
>> SVC is performing much better than Random forest.
>>
>
> What does "performing better" mean in this case?
> How are you defining performance?
> A particular metric such as MCC, PPV, or NPV?
>
> Also, how is the cross-validation being done - is the data shuffled before
> the train/test groups are created?
> Is the exact same split of training and test data per fold used for both
> SVC and RF?
>
>
>> I Z-score normalise the data before I send it to the support vector
>> classifier.
>> - Binary features ( type *A* ) are left as is.
>> - Integer and continuous features are Z-score normalised [ ( feat -
>> mean(feat) ) / sd(feat) ].
>>
>
> Normalizing your continuous values seems quite fine, but consider these
> aspects:
> --Does it make sense in the domain of your problem to Z-normalize the
> integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make
> more sense? This is similar to the previous consideration.
> --What happens to SVC if you don't normalize?
> --What happens to RF if you do normalize?
>
> While my various comments above are all geared toward empirical aspects and
> not toward theoretical aspects, picking some of them to explore is likely
> to help you gain practical insight on your situation/inquiry.
> I'm sure you already know this, but while machine learning may have some
> "practical guidelines for best practices", they are guidelines and not hard
> rules.
> So, again, I would recommend doing some more empirical tests and
> re-evaluating your situation once you have new data in hand.
>
> If you can provide a good amount of concrete data to present along with
> your "problem", this community is excellent at providing intelligent,
> helpful responses.
>
> Hope this helps.
>
> J.B.
>
> Date: Mon, 8 May 2017 10:45:26 +0100
> From: Mamun Rashid <mamunbabu2001 at gmail.com>
> Subject: [scikit-learn] SVC data normalisation
>
>
> Hi All,
> I am testing two classifiers [ 1. Random forest 2. SVC with radial basis kernel ] on a data set via 5-fold cross-validation.
>
> The feature matrix contains :
>
> A. 80% features are binary [ 0 or 1 ]
> B. 10% are integer values representing counts / occurrences.
> C. 10% are continuous values between different ranges.
>
> My prior understanding was that decision tree based algorithms work better on mixed data types. In this particular case I am noticing
> SVC is performing much better than Random forest.
>
> I Z-score normalise the data before I send it to the support vector classifier.
> - Binary features ( type A ) are left as is.
> - Integer and continuous features are Z-score normalised [ ( feat - mean(feat) ) / sd(feat) ].
>
> I was wondering if anyone can tell me whether this normalisation approach is correct for an SVC run.
>
> Thanks in advance for your help.
>
> Regards,
> Mamun