[scikit-learn] Supervised anomaly detection in time series

Qingkai Kong qingkai.kong at gmail.com
Fri Aug 5 14:05:27 EDT 2016


I also worked on something similar. Instead of using algorithms that deal
with unbalanced data directly, you can also try to create a balanced dataset,
either by oversampling the minority class or by downsampling the majority
class. scikit-learn-contrib already has a project for unbalanced data:
https://github.com/scikit-learn-contrib/imbalanced-learn.
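
For example, a minimal sketch of both options with imbalanced-learn (recent
versions name the method fit_resample; older releases used fit_sample):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Toy data with roughly the 5-vs-300 class ratio discussed below
    X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
    print(Counter(y))

    # Oversample the minority class with replacement...
    X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
    print(Counter(y_over))

    # ...or downsample the majority class instead
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(Counter(y_under))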

Whether you treat it as a classification problem or as an anomaly detection
problem (I would try classification first), you will need to find a good set
of features in the time domain or the frequency domain.
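
As an illustration, here is a minimal sketch of window features in both
domains (the window length and feature choices are placeholders, not
recommendations):

    import numpy as np

    def window_features(window):
        """Simple time- and frequency-domain features for one window of samples."""
        spectrum = np.abs(np.fft.rfft(window))
        return [
            window.mean(),        # time domain: level
            window.std(),         # time domain: spread
            np.ptp(window),       # time domain: peak-to-peak range
            spectrum[1:6].sum(),  # frequency domain: low-frequency energy
            spectrum.argmax(),    # frequency domain: dominant frequency bin
        ]

    # One synthetic window of 128 accelerometer samples
    rng = np.random.default_rng(0)
    print(window_features(rng.normal(size=128)))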

On Fri, Aug 5, 2016 at 7:09 AM, Dale T Smith <Dale.T.Smith at macys.com> wrote:

> To analyze classifiers on unbalanced data, use
>
>
>
> from sklearn.metrics import classification_report
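>
> A minimal usage sketch (the data, split, and classifier here are just
> placeholders):
>
>     from sklearn.datasets import make_classification
>     from sklearn.ensemble import ExtraTreesClassifier
>     from sklearn.metrics import classification_report
>     from sklearn.model_selection import train_test_split
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>     X_train, X_test, y_train, y_test = train_test_split(
>         X, y, stratify=y, random_state=0)
>     clf = ExtraTreesClassifier(random_state=0).fit(X_train, y_train)
>
>     # Per-class precision, recall, and F1 are far more informative than
>     # plain accuracy on unbalanced data
>     print(classification_report(y_test, clf.predict(X_test)))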
>
>
>
>
>
> __________________________________________________________________________________________
> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
> Science and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
>
>
>
> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] *On Behalf Of* Pedro Pazzini
> *Sent:* Friday, August 5, 2016 9:33 AM
>
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
>
> Just to add a few things to the discussion:
>
>    1. For unbalanced problems, as far as I know, one of the best scores
>    for evaluating a classifier is the area under the ROC curve:
>    http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
>    For that you will have to use clf.predict_proba(X_test) instead of
>    clf.predict(X_test) (see the sketch after this list). I think that using
>    the 'sample_weight' parameter as Smith said is a promising choice.
>    2. Normalizing each time series before comparing them is usually
>    recommended; z-score normalization is one of the most widely used
>    [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf].
>    3. There are some interesting dissimilarity measures for comparing time
>    series, such as DTW (Dynamic Time Warping), CID (Complexity-Invariant
>    Distance), and others [Ref:
>    https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. There are
>    also approaches that compare time series in the frequency domain, such
>    as FFT and DWT [Ref:
>    http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].
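>
> A minimal sketch of points 1 and 2 above (the data and classifier are
> placeholders):
>
>     import numpy as np
>     from sklearn.datasets import make_classification
>     from sklearn.linear_model import LogisticRegression
>     from sklearn.metrics import roc_auc_score
>     from sklearn.model_selection import train_test_split
>
>     # Point 2: z-score normalization of a single time series
>     def znorm(ts):
>         return (ts - ts.mean()) / ts.std()
>
>     print(znorm(np.arange(10.0)))
>
>     # Point 1: ROC AUC is computed from scores/probabilities, not from
>     # hard 0/1 predictions
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>     X_train, X_test, y_train, y_test = train_test_split(
>         X, y, stratify=y, random_state=0)
>     clf = LogisticRegression().fit(X_train, y_train)
>     print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))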
>
> I hope it helps.
>
>
>
> 2016-08-05 9:26 GMT-03:00 Dale T Smith <Dale.T.Smith at macys.com>:
>
> I don’t think you should treat this as an outlier detection problem. Why
> not try it as a classification problem? The dataset is highly unbalanced.
> Try
>
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
>
>
>
> Use sample_weight to tell the fit method about the class imbalance. But be
> sure to read up about unbalanced classification and the class_weight
> parameter to ExtraTreesClassifier. You cannot use the accuracy to find the
> best model, so read up on model validation in the sklearn User’s Guide. And
> when you do cross-validation to get the best hyperparameters, be sure you
> pass the sample weights as well.
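>
> A rough sketch of both weighting options (the 'balanced' scheme is just one
> reasonable default):
>
>     from sklearn.datasets import make_classification
>     from sklearn.ensemble import ExtraTreesClassifier
>     from sklearn.utils.class_weight import compute_sample_weight
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>
>     # Option 1: let the estimator reweight classes internally
>     clf = ExtraTreesClassifier(class_weight='balanced', random_state=0).fit(X, y)
>
>     # Option 2: pass explicit per-sample weights to fit instead (these can
>     # also be forwarded when cross-validating)
>     clf2 = ExtraTreesClassifier(random_state=0)
>     clf2.fit(X, y, sample_weight=compute_sample_weight('balanced', y))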
>
>
>
> Time series data needs special care with cross-validation. You may want to
> add features such as minutes since midnight, day of week, and
> weekday/weekend. And make sure your cross-validation folds respect the
> time series nature of the problem.
>
>
>
> http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
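>
> A sketch of both ideas: calendar features, plus folds that never train on
> the future. TimeSeriesSplit is assumed available (it was added to
> scikit-learn in version 0.18):
>
>     import pandas as pd
>     from sklearn.model_selection import TimeSeriesSplit
>
>     # Calendar features derived from a datetime index (toy hourly data)
>     df = pd.DataFrame(index=pd.date_range('2016-08-01', periods=96, freq='H'))
>     df['minutes_since_midnight'] = df.index.hour * 60 + df.index.minute
>     df['day_of_week'] = df.index.dayofweek
>     df['is_weekend'] = (df.index.dayofweek >= 5).astype(int)
>
>     # Each fold trains only on data that precedes its test window
>     for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(df):
>         print(train_idx.max(), '<', test_idx.min())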
>
>
>
>
>
> __________________________________________________________________________________________
> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
> Science and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
>
>
>
> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] *On Behalf Of* Nicolas Goix
> *Sent:* Thursday, August 4, 2016 9:13 PM
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
>
> There are different ways of aggregating estimators. One possibility is to
> take a majority vote; another is to average the decision functions.
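>
> A rough sketch of the averaging variant (the resampling scheme and base
> estimator are just placeholders):
>
>     import numpy as np
>     from sklearn.datasets import make_classification
>     from sklearn.svm import SVC
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>     pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
>
>     # Train several classifiers, each on all positives plus a small random
>     # subsample of negatives
>     rng = np.random.RandomState(0)
>     clfs = []
>     for _ in range(10):
>         idx = np.concatenate([pos, rng.choice(neg, 2 * len(pos), replace=False)])
>         clfs.append(SVC().fit(X[idx], y[idx]))
>
>     # Average the decision functions (a majority vote would aggregate the
>     # hard predictions instead)
>     avg = np.mean([clf.decision_function(X) for clf in clfs], axis=0)
>     y_pred = (avg > 0).astype(int)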
>
>
>
> On Aug 4, 2016 8:44 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
>
> If I train multiple algorithms on different subsamples, then how do I get
> the final classifier that predicts unseen data?
>
> I have very few positive samples since it is speed bump detection and we
> have very few speed bumps in a drive.
> However, I think that unseen new data would be quite similar to what I have
> in the training data, so if I can learn a classifier that handles these 5
> correctly, I hope it will work well for unseen speed bumps.
>
> Thanks,
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nicolas at gmail.com>
> wrote:
>
> You can evaluate your hyper-parameters on a few samples. Just don't use
> accuracy as your performance measure.
>
> For supervised classification, training multiple algorithms on small
> balanced subsamples usually works well, but 5 anomalies does indeed seem
> very few.
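>
> For what it's worth, the imbalanced-learn project mentioned elsewhere in
> this thread packages this subsample-and-aggregate idea; a minimal sketch,
> assuming a version that provides BalancedBaggingClassifier:
>
>     from imblearn.ensemble import BalancedBaggingClassifier
>     from sklearn.datasets import make_classification
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>
>     # Each base estimator is trained on a rebalanced bootstrap subsample
>     clf = BalancedBaggingClassifier(n_estimators=10, random_state=0)
>     clf.fit(X, y)
>     print(clf.predict(X[:5]))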
>
> Nicolas
>
>
>
> On Aug 4, 2016 7:51 PM, "Amita Misra" <amisra2 at ucsc.edu> wrote:
>
> Subsampling would remove a lot of information from the negative class.
>
> I have more than 500 samples of negative class and just 5 samples of
> positive class.
>
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nicolas at gmail.com>
> wrote:
>
> Hi,
>
>
>
> Yes, you can use your labeled data (you will need to sub-sample your normal
> class to get a similar proportion of normal and abnormal samples) to learn
> your hyper-parameters through CV.
>
>
>
> You can also try to use supervised classification algorithms on 'not too
> highly unbalanced' sub-samples.
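>
> A rough sketch of the first idea, scoring one-class models against the
> labels to pick hyper-parameters (the grid here is just an example):
>
>     from sklearn.datasets import make_classification
>     from sklearn.metrics import roc_auc_score
>     from sklearn.svm import OneClassSVM
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>     X_normal = X[y == 0]
>
>     # Fit on normal data only, then score against the labels. A higher
>     # decision_function means "more normal", so negate it to get an
>     # anomaly score for the AUC.
>     for gamma in [0.01, 0.1, 1.0]:
>         clf = OneClassSVM(gamma=gamma, nu=0.05).fit(X_normal)
>         print(gamma, roc_auc_score(y, -clf.decision_function(X).ravel()))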
>
>
>
> Nicolas
>
>
>
> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amisra2 at ucsc.edu> wrote:
>
> Hi,
>
>
>
> I am currently exploring the problem of speed bump detection using
> accelerometer time series data.
>
> I have extracted some features based on mean, standard deviation, etc.,
> within a time window.
>
> Since the dataset is highly skewed (I have just 5 positive samples for
> more than 300 negative samples),
>
> I was looking into
>
> OneClassSVM
> covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
>
> but I am not sure how to use them.
>
> What I get from the docs is: separate out the positive examples and train
> using only the negative examples,
>
> clf.fit(X_train)
>
> and then predict on the positive examples using
>
> clf.predict(X_test)
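>
> i.e., something along these lines (a minimal sketch with toy data in place
> of my real features):
>
>     from sklearn.datasets import make_classification
>     from sklearn.ensemble import IsolationForest
>
>     X, y = make_classification(n_samples=305, weights=[0.98], random_state=0)
>     X_train = X[y == 0]  # train on the negative (normal) examples only
>     X_test = X[y == 1]   # the few positive examples
>
>     clf = IsolationForest(random_state=0).fit(X_train)
>     print(clf.predict(X_test))  # -1 = predicted anomaly, +1 = predicted normal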
>
>
> I am not sure what the role of the positive examples in my training dataset
> is then, or how I can use them to improve my classifier so that I can
> predict better on new samples.
>
> Can we do something like cross-validation to learn the parameters, as in
> normal binary SVM classification?
>
>
>
> Thanks,
>
> Amita
>
>
>
> Amita Misra
>
> Graduate Student Researcher
>
> Natural Language and Dialogue Systems Lab
>
> Baskin School of Engineering
>
> University of California Santa Cruz
>
>
>
>
>
>
>
>
> --
>
> Amita Misra
>
> Graduate Student Researcher
>
> Natural Language and Dialogue Systems Lab
>
> Baskin School of Engineering
>
> University of California Santa Cruz
>
>
>
>
>
>
>
>
>
> --
>
> Amita Misra
>
> Graduate Student Researcher
>
> Natural Language and Dialogue Systems Lab
>
> Baskin School of Engineering
>
> University of California Santa Cruz
>
>
>
>
>
>
>
>
> --
>
> Amita Misra
>
> Graduate Student Researcher
>
> Natural Language and Dialogue Systems Lab
>
> Baskin School of Engineering
>
> University of California Santa Cruz
>
>
>
>
>
>


-- 
Qingkai KONG
Ph.D Candidate
Seismological Lab
289 McCone Hall
University of California, Berkeley
http://seismo.berkeley.edu/qingkaikong