[scikit-learn] Supervised anomaly detection in time series
Dale T Smith
Dale.T.Smith at macys.com
Fri Aug 5 10:09:11 EDT 2016
To analyze classifiers on unbalanced data, use
from sklearn.metrics import classification_report
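For instance, on a toy unbalanced problem (the labels below are made up for illustration), the report shows per-class precision, recall, and F1, which a single accuracy number hides:

```python
from sklearn.metrics import classification_report

# Hypothetical unbalanced labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(classification_report(y_true, y_pred))
```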
__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Pedro Pazzini
Sent: Friday, August 5, 2016 9:33 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series
Just to add a few things to the discussion:
1. For unbalanced problems, as far as I know, one of the best scores to evaluate a classifier is the Area Under the ROC curve: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html. For that you will have to use clf.predict_proba(X_test) instead of clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith said is a promising choice.
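For instance (a sketch on synthetic data; the dataset and parameter values are illustrative only):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy unbalanced data for illustration (~5% positives).
rng = np.random.RandomState(0)
X = rng.randn(400, 5)
y = (rng.rand(400) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score with the positive-class probability, not the hard labels.
scores = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
```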
2. Normalizing each time series before comparing them is usually recommended. Z-score normalization is one of the most widely used methods [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf].
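For example, a z-score helper (a hypothetical one-liner, not a scikit-learn function):

```python
import numpy as np

def zscore(ts):
    """Z-normalize a 1-D series to zero mean and unit standard deviation."""
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

x = zscore([1.0, 2.0, 3.0, 4.0])
```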
3. There are some interesting dissimilarity measures for comparing time series, such as DTW (Dynamic Time Warping) and CID (Complexity-Invariant Distance) [Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. There are also approaches that compare time series in the frequency domain, such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].
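As an illustration, a naive DTW distance can be written directly from its recurrence (a textbook sketch with absolute difference as local cost, not an optimized implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Note how warping absorbs a repeated sample: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, while the Euclidean distance between those series is not even defined.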
I hope it helps.
2016-08-05 9:26 GMT-03:00 Dale T Smith <Dale.T.Smith at macys.com>:
I don’t think you should treat this as an outlier detection problem. Why not try it as a classification problem? The dataset is highly unbalanced. Try
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
Use sample_weight to tell the fit method about the class imbalance, but be sure to read up on unbalanced classification and the class_weight parameter of ExtraTreesClassifier. You cannot use accuracy to find the best model, so read up on model validation in the scikit-learn User's Guide. And when you do cross-validation to choose hyperparameters, be sure to pass the sample weights as well.
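A sketch of both options on toy unbalanced labels (compute_sample_weight builds per-sample weights from the class frequencies; the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy unbalanced labels: 15 positives out of 300 (5%).
rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = np.zeros(300, dtype=int)
y[:15] = 1

# 'balanced' gives the rare class proportionally larger weights.
sw = compute_sample_weight("balanced", y)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sw)

# Alternatively, let the estimator compute the weights itself:
clf_cw = ExtraTreesClassifier(n_estimators=100, class_weight="balanced",
                              random_state=0)
clf_cw.fit(X, y)
```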
Time series data needs special handling in cross-validation. You may want to add features such as minutes since midnight, day of week, and weekday/weekend. And make sure your cross-validation folds respect the time-series nature of the problem.
http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
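One way to respect the temporal order is forward-chaining splits, where the training fold always precedes the validation fold in time. A sketch using TimeSeriesSplit (available in scikit-learn 0.18+; older versions need a custom splitter as in the Stack Overflow link above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered samples (toy data).
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every training index comes before every test index.
    assert train_idx.max() < test_idx.min()
```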
From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix
Sent: Thursday, August 4, 2016 9:13 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series
There are different ways of aggregating estimators. One possibility is to take a majority vote; another is to average the decision functions.
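Both aggregation schemes can be sketched in a few lines (majority_vote and average_decision are hypothetical helper names, not scikit-learn API):

```python
import numpy as np

def majority_vote(classifiers, X):
    """Hard-vote the 0/1 predictions of several fitted classifiers."""
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return (votes.mean(axis=0) >= 0.5).astype(int)

def average_decision(estimators, X):
    """Average the decision_function scores of several fitted estimators."""
    return np.mean([est.decision_function(X) for est in estimators], axis=0)
```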
On Aug 4, 2016 8:44 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data?
I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive.
However, I think unseen data would be quite similar to the training data, so if I can correctly learn a classifier for these 5 samples, I hope it will work well on unseen speed bumps.
Thanks,
Amita
On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nicolas at gmail.com> wrote:
You can evaluate your hyper-parameters on a few samples; just don't use accuracy as the performance measure.
For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies does indeed seem very few.
Nicolas
On Aug 4, 2016 7:51 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
SubSample would remove a lot of information from the negative class.
I have more than 500 samples of negative class and just 5 samples of positive class.
Amita
On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nicolas at gmail.com> wrote:
Hi,
Yes, you can use your labeled data to learn your hyper-parameters through CV (you will need to sub-sample your normal class to get similar normal-abnormal proportions).
You can also try supervised classification algorithms on 'not too highly unbalanced' sub-samples.
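Such a sub-sample can be built by keeping all positives and drawing a limited number of negatives (balanced_subsample is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np

def balanced_subsample(X, y, neg_per_pos=3, rng=None):
    """Keep every positive sample and at most neg_per_pos negatives per
    positive, drawn at random without replacement."""
    if rng is None:
        rng = np.random.RandomState(0)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n_neg = min(len(neg), neg_per_pos * len(pos))
    neg_pick = rng.choice(neg, size=n_neg, replace=False)
    idx = np.concatenate([pos, neg_pick])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Training one classifier per random sub-sample and aggregating them (majority vote or averaged decision function) is one way to use all 500+ negatives without letting them swamp the 5 positives.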
Nicolas
On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amisra2 at ucsc.edu> wrote:
Hi,
I am currently exploring the problem of speed bump detection using accelerometer time series data.
I have extracted some features, such as the mean and standard deviation within a time window.
Since the dataset is highly skewed (I have just 5 positive samples against more than 300 negative samples),
I was looking into
- OneClassSVM
- covariance.EllipticEnvelope
- sklearn.ensemble.IsolationForest
but I am not sure how to use them. What I get from the docs is: set aside the positive examples, train using only the negative examples with clf.fit(X_train), and then predict on new data with clf.predict(X_test).
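Put together, that workflow might look like this sketch on toy 2-D data (the data and parameter values are illustrative only; predict returns +1 for inliers and -1 for outliers):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# "Normal" driving windows (toy data): 300 points around the origin.
X_train = rng.randn(300, 2)
# Test set: 10 more normal points plus 5 far-away "speed bump" points.
X_test = np.vstack([rng.randn(10, 2),
                    rng.randn(5, 2) + 6.0])

clf = OneClassSVM(nu=0.05, gamma="scale")
clf.fit(X_train)          # fit on the negative (normal) class only
pred = clf.predict(X_test)  # +1 = inlier, -1 = outlier
```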
I am not sure what the role of the positive examples in my training dataset is then, or how I can use them to improve my classifier so that it predicts better on new samples.
Can we do something like cross-validation to learn the parameters, as in normal binary SVM classification?
Thanks,
Amita
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn