[scikit-learn] SVC data normalisation
jbbrown at kuhp.kyoto-u.ac.jp
Mon May 8 08:48:28 EDT 2017
> *A.* 80% features are binary [ 0 or 1 ]
> *B.* 10% are integer values representing counts / occurrences.
> *C.* 10% are continuous values between different ranges.
> My prior understanding was that decision tree based algorithms work better
> on mixed data types. In this particular case I am noticing
> SVC is performing much better than Random forest.
What does "performing better" mean in this case?
How are you defining performance?
A particular metric such as MCC, PPV, or NPV?
Also, how is the cross-validation being done - is the data shuffled before
the train/test groups are created?
Is the exact same split of training and test data per fold used for both
SVC and RF?
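
To make the questions above concrete, here is a minimal sketch of one way to
put SVC and RF on equal footing: a single shuffled, seeded splitter shared by
both models, scored with MCC, PPV, and NPV. The synthetic data from
make_classification, the seeds, and the hyperparameters are all assumptions
for illustration; substitute your own X and y. It also assumes a scikit-learn
version that provides cross_validate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# One splitter with shuffling and a fixed seed, reused for both models,
# so every fold has the exact same train/test indices for SVC and RF.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scoring = {
    "mcc": make_scorer(matthews_corrcoef),
    "ppv": make_scorer(precision_score),               # precision of class 1
    "npv": make_scorer(precision_score, pos_label=0),  # precision of class 0
}

models = {
    "SVC": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

summary = {}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Mean of each metric over the five folds.
    summary[name] = {m: res["test_" + m].mean() for m in scoring}
    print(name, summary[name])
```

Reporting the per-fold scores (not only the mean) would also show whether the
gap between the two models is larger than the fold-to-fold variation.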
> I Z-score normalise the data before I send it to support vector
> - Binary features (type *A*) are left as is.
> - Integer and continuous features are Z-score normalised [ ( feat -
> mean(feat) ) / sd(feat) ].
Normalizing your continuous values seems quite fine, but consider these
questions:
--Does it make sense in the domain of your problem to Z-normalize the
integral (integer-valued) descriptors/features?
--For the integral values, would centering on the median value make
more sense? This is similar to the previous consideration.
--What happens to SVC if you don't normalize?
--What happens to RF if you do normalize?
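
The four questions above can be explored in one small experiment. The sketch
below is only an illustration under stated assumptions: the toy data (8 binary
columns, 1 count column, 1 continuous column), the label rule, and the seeds
are all made up, and it assumes a scikit-learn version that ships
ColumnTransformer (0.20 or later). It runs SVC and RF with no scaling, with
Z-scoring of the non-binary columns, and with median-centered counts plus
Z-scored continuous values.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 300
X_bin = rng.randint(0, 2, size=(n, 8)).astype(float)  # type A: binary
X_cnt = rng.poisson(3.0, size=(n, 1)).astype(float)   # type B: counts
X_con = rng.normal(50.0, 20.0, size=(n, 1))           # type C: continuous
X = np.hstack([X_bin, X_cnt, X_con])
# Hypothetical label rule, only so the toy problem is learnable.
y = ((X_bin[:, 0] + 0.3 * X_cnt[:, 0] + 0.02 * X_con[:, 0]) > 2.4).astype(int)

cnt_cols, con_cols = [8], [9]  # column indices of the non-binary features

preprocessors = {
    "none": None,
    # Z-score counts and continuous columns; binary columns pass through.
    "z-score": ColumnTransformer(
        [("z", StandardScaler(), cnt_cols + con_cols)],
        remainder="passthrough",
    ),
    # Subtract the median from counts (no rescaling); Z-score continuous.
    "median+z": ColumnTransformer(
        [
            ("med", RobustScaler(with_centering=True, with_scaling=False), cnt_cols),
            ("z", StandardScaler(), con_cols),
        ],
        remainder="passthrough",
    ),
}

models = {
    "SVC": lambda: SVC(),
    "RF": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same shuffled, seeded folds for every model/preprocessing combination.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for m_name, make_model in models.items():
    for p_name, prep in preprocessors.items():
        est = make_model() if prep is None else make_pipeline(prep, make_model())
        scores = cross_val_score(est, X, y, cv=cv, scoring="accuracy")
        results[(m_name, p_name)] = scores.mean()
        print(m_name, p_name, round(scores.mean(), 3))
```

Putting the scaler inside the pipeline means it is refit on the training part
of each fold, so no test-fold statistics leak into the normalization.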
While my various comments above are all geared toward empirical aspects and
not toward theoretical aspects, picking some of them to explore is likely
to help you gain practical insight on your situation/inquiry.
I'm sure you already know this, but while machine learning may have some
"practical guidelines for best practices", they are guidelines and not hard
rules.
So, again, I would recommend doing some more empirical tests and
re-evaluating your situation once you have new data in hand.
If you can provide a good amount of concrete data to present along with
your "problem", this community is excellent at providing intelligent,
Hope this helps.