[scikit-learn] Supervised anomaly detection in time series
Pedro Pazzini
pedropazzini at gmail.com
Fri Aug 5 09:32:52 EDT 2016
Just to add a few things to the discussion:
1. For unbalanced problems, as far as I know, one of the best scores to
evaluate a classifier is the area under the ROC curve (ROC AUC):
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
For that you will have to use clf.predict_proba(X_test) instead of
clf.predict(X_test). I also think that using the 'sample_weight' parameter,
as Dale Smith said, is a promising choice. (See the sketch after this list.)
2. Normalizing each time series before comparing them is usually
recommended; z-score normalization is one of the most widely used methods
[Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf]. (See the sketch after this
list.)
3. There are some interesting dissimilarity measures for comparing time
series, such as DTW (Dynamic Time Warping) and CID (Complexity-Invariant
Distance) [Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf].
There are also frequency-domain approaches, such as those based on the FFT
and the DWT [Ref:
http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].
(A naive DTW sketch is included below.)
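A minimal sketch of points 1 and 2 on toy data (all shapes and values here
are placeholders, not from the thread; scored in-sample only to keep it
short):

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.RandomState(0)

    # Toy stand-in: 300 normal windows, 5 anomalous ones.
    X = rng.randn(305, 40)
    y = np.zeros(305, dtype=int)
    y[:5] = 1
    X[:5, 10:20] += 3.0  # inject a bump so there is something to detect

    # Point 2: z-score normalize each series (row) before comparing them.
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Point 1: score with class probabilities, not hard predictions.
    proba = clf.predict_proba(X)[:, 1]  # probability of the positive class
    print(roc_auc_score(y, proba))

And a naive DTW for point 3, just to make the recurrence concrete (quadratic
time; practical implementations add windowing and lower bounds):

    def dtw(a, b):
        # Classic O(len(a) * len(b)) dynamic-programming formulation.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]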
I hope it helps.
2016-08-05 9:26 GMT-03:00 Dale T Smith <Dale.T.Smith at macys.com>:
> I don’t think you should treat this as an outlier detection problem. Why
> not try it as a classification problem? The dataset is highly unbalanced.
> Try
>
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
>
>
>
> Use sample_weight to tell the fit method about the class imbalance, and be
> sure to read up on unbalanced classification and the class_weight
> parameter of ExtraTreesClassifier. You cannot use accuracy to find the
> best model, so read up on model validation in the scikit-learn User's
> Guide. And when you cross-validate to pick the best hyperparameters, be
> sure you pass the sample weights there as well.
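> A rough sketch of the above on toy data (the grid values are arbitrary,
> and whether per-sample fit parameters are sliced per CV fold depends on
> your scikit-learn version, so verify that pass-through):
>
>     import numpy as np
>     from sklearn.ensemble import ExtraTreesClassifier
>     from sklearn.model_selection import GridSearchCV
>
>     rng = np.random.RandomState(0)
>     X = rng.randn(305, 10)
>     y = np.zeros(305, dtype=int)
>     y[:5] = 1
>
>     # class_weight='balanced' reweights classes inversely to frequency;
>     # sample_weight expresses the same idea per sample at fit time.
>     weights = np.where(y == 1, (y == 0).sum() / (y == 1).sum(), 1.0)
>
>     clf = ExtraTreesClassifier(class_weight='balanced', random_state=0)
>     search = GridSearchCV(clf, {'n_estimators': [100, 300]},
>                           scoring='roc_auc', cv=3)
>     search.fit(X, y, sample_weight=weights)  # forwarded to the fold fits
>     print(search.best_params_)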
>
>
>
> Time series data is a bit different when it comes to cross-validation. You
> may want to add features such as minutes since midnight, day of week, and
> a weekday/weekend flag. And make sure your cross-validation folds respect
> the time-series nature of the problem.
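> For instance (TimeSeriesSplit is available in scikit-learn 0.18+; the
> timestamp column here is hypothetical):
>
>     import pandas as pd
>     from sklearn.model_selection import TimeSeriesSplit
>
>     ts = pd.date_range('2016-08-01', periods=305, freq='min')
>     feats = pd.DataFrame({
>         'minutes_since_midnight': ts.hour * 60 + ts.minute,
>         'day_of_week': ts.dayofweek,
>         'is_weekend': (ts.dayofweek >= 5).astype(int),
>     })
>
>     # Every training fold lies strictly before its test fold in time.
>     for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(feats):
>         assert train_idx.max() < test_idx.min()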
>
>
>
> http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
>
>
>
>
>
> __________________________________________________________________________
> Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
> and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
>
>
>
> From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix
> Sent: Thursday, August 4, 2016 9:13 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
> There are different ways of aggregating estimators. One possibility is to
> take a majority vote; another is to average the decision functions.
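> A minimal sketch of both options (`estimators` is a placeholder for a list
> of classifiers already fitted on different subsamples):
>
>     import numpy as np
>
>     def majority_vote(estimators, X):
>         # Each estimator casts a 0/1 vote; predict 1 when most agree.
>         votes = np.stack([est.predict(X) for est in estimators])
>         return (votes.mean(axis=0) > 0.5).astype(int)
>
>     def averaged_decision(estimators, X):
>         # Average continuous scores instead of hard votes.
>         return np.mean([est.decision_function(X) for est in estimators],
>                        axis=0)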
>
>
>
> On Aug 4, 2016 8:44 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
>
> If I train multiple algorithms on different subsamples, then how do I get
> the final classifier that predicts unseen data?
>
> I have very few positive samples since it is speed bump detection and we
> have very few speed bumps in a drive.
> However, I think unseen new data would be quite similar to what I have in
> the training data; hence, if I can correctly learn a classifier from these
> 5, I hope it will work well on unseen speed bumps.
>
> Thanks,
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nicolas at gmail.com>
> wrote:
>
> You can evaluate your hyper-parameters on a few samples. Just don't use
> accuracy as your performance measure.
>
> For supervised classification, training multiple algorithms on small
> balanced subsamples usually works well, but 5 anomalies is indeed very
> few.
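> A sketch of that subsampling scheme (the helper name and shapes are made
> up for illustration):
>
>     import numpy as np
>     from sklearn.ensemble import ExtraTreesClassifier
>
>     def fit_balanced_ensemble(X, y, n_members=10):
>         # One classifier per balanced subsample: all positives plus an
>         # equal-sized random draw of negatives.
>         rng = np.random.RandomState(0)
>         pos = np.where(y == 1)[0]
>         neg = np.where(y == 0)[0]
>         members = []
>         for k in range(n_members):
>             idx = np.concatenate(
>                 [pos, rng.choice(neg, size=len(pos), replace=False)])
>             members.append(
>                 ExtraTreesClassifier(random_state=k).fit(X[idx], y[idx]))
>         return members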
>
> Nicolas
>
>
>
> On Aug 4, 2016 7:51 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
>
> Subsampling would remove a lot of information from the negative class.
>
> I have more than 500 samples of the negative class and just 5 samples of
> the positive class.
>
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nicolas at gmail.com>
> wrote:
>
> Hi,
>
>
>
> Yes, you can use your labeled data to learn your hyper-parameters through
> CV (you will need to sub-sample your normal class to get similar
> normal/abnormal proportions).
>
>
>
> You can also try supervised classification algorithms on 'not too highly
> unbalanced' sub-samples.
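> For example, with a one-class model and a ranking metric instead of
> accuracy (toy data; the grid values are arbitrary):
>
>     import numpy as np
>     from sklearn.metrics import roc_auc_score
>     from sklearn.svm import OneClassSVM
>
>     rng = np.random.RandomState(0)
>     X = rng.randn(305, 10)
>     y = np.zeros(305, dtype=int)
>     y[:5] = 1
>     X[:5, :3] += 3.0
>
>     pos = np.where(y == 1)[0]
>     neg = np.where(y == 0)[0]
>     # Sub-sample the normal class so the proportions are comparable.
>     idx = np.concatenate(
>         [pos, rng.choice(neg, size=10 * len(pos), replace=False)])
>
>     best = None
>     for nu in (0.01, 0.05, 0.1):
>         for gamma in (0.01, 0.1, 1.0):
>             clf = OneClassSVM(nu=nu, gamma=gamma).fit(X[neg])
>             # Higher decision_function means "more normal", so negate it
>             # to score the abnormal (positive) class.
>             scores = -clf.decision_function(X[idx]).ravel()
>             auc = roc_auc_score(y[idx], scores)
>             if best is None or auc > best[0]:
>                 best = (auc, nu, gamma)
>     print(best)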
>
>
>
> Nicolas
>
>
>
> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amisra2 at ucsc.edu> wrote:
>
> Hi,
>
>
>
> I am currently exploring the problem of speed bump detection using
> accelerometer time series data.
>
> I have extracted some features based on mean, standard deviation, etc.,
> within a time window.
>
> Since the dataset is highly skewed (I have just 5 positive samples against
> more than 300 negative ones),
>
> I was looking into
>
> sklearn.svm.OneClassSVM
> sklearn.covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
>
> but I am not sure how to use them.
>
> What I get from the docs:
>
> separate out the positive examples and train using only the negative
> examples:
>
> clf.fit(X_train)
>
> and then predict on the positive examples using
> clf.predict(X_test)
>
>
> I am not sure, then, what the role of the positive examples in my training
> dataset is, or how I can use them to improve my classifier so that it
> predicts better on new samples.
>
> Can we do something like cross-validation to learn the parameters, as in
> normal binary SVM classification?
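> Concretely, what I mean by the workflow above is something like this
> (placeholder data and shapes, just to make my question precise):
>
>     import numpy as np
>     from sklearn.ensemble import IsolationForest
>
>     rng = np.random.RandomState(0)
>     X_neg = rng.randn(300, 10)      # stand-in for the ~300 normal windows
>     X_pos = rng.randn(5, 10) + 3.0  # stand-in for the 5 speed-bump windows
>
>     # Train on normal data only; the positives never enter the fit.
>     clf = IsolationForest(random_state=0).fit(X_neg)
>
>     # One possible role for the positives: check that the model scores
>     # them clearly lower (more anomalous) than normal windows, and use
>     # that gap to compare hyper-parameter settings.
>     print(clf.decision_function(X_neg).mean(),
>           clf.decision_function(X_pos).mean())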
>
>
>
> Thanks,
>
> Amita
>
>
>
> Amita Misra
>
> Graduate Student Researcher
>
> Natural Language and Dialogue Systems Lab
>
> Baskin School of Engineering
>
> University of California Santa Cruz
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>