[scikit-learn] Anomaly/Outlier detection based on user access for a large application

Mon Mar 20 02:05:59 EDT 2017

Hi All,
    I am trying to find anomalies/outliers in the application logs of a
large KMS. Please find the details below:

*Problem Statement*: Find anomalies/outliers in application access logs
in an unsupervised learning setting. The basic use case is to find any
suspicious activity by a user/group that deviates from the trend that the
algorithm has learned.

*Input Data*: Data would be created from log files that are in the following
format:

"ts, src_ip, decrypt, user_a, group_b, kms_region, key"

Where:

*ts* : time of access in epoch seconds, e.g. 1489840335
*src_ip* : the source IP address of the access
*decrypt* : the action performed; it is one of several possible actions
*user_a*, *group_b* : the user and group that performed the access
*kms_region* : the region in which the key exists
*key* : the key that was accessed
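
As a rough sketch of how one such line could be turned into features, the
snippet below parses a line and derives a few calendar features from *ts*.
The field names in FIELDS, the sample IP and region values, and the choice
to interpret timestamps as UTC are all assumptions made for illustration;
they are not taken from the real logs.

```python
from datetime import datetime, timezone

# Hypothetical field names, following the order of the log format above.
FIELDS = ("ts", "src_ip", "action", "user", "group", "kms_region", "key")


def parse_line(line):
    """Split one comma-separated log line into a dict of named fields."""
    values = [v.strip() for v in line.strip().strip('"').split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["ts"] = int(record["ts"])  # epoch seconds
    return record


def time_features(record):
    """Derive discrete calendar features from the epoch timestamp (UTC assumed)."""
    when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return {
        "hour": when.hour,
        "weekday": when.weekday(),  # Monday=0 ... Sunday=6
        "is_weekend": when.weekday() >= 5,
    }


# Sample line: the IP, region and key values are made up.
sample = "1489840335, 10.0.0.1, decrypt, user_a, group_b, us-east-1, key_1"
rec = parse_line(sample)
feats = time_features(rec)
print(rec["user"], rec["action"], feats)
```

Features such as "weekday" and "is_weekend" are exactly the kind of discrete
variables the question further down is about.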

*Train Set*: This falls under unsupervised learning, so we can't have a
labeled "normal" training set for the model to learn from.

*Example of anomalies*:

   1. User A suddenly accessing from a different IP: xx.yy
   2. Number of accesses going up suddenly for a given (user, key) pair
   3. Increased access on a generally quiet long weekend
   4. Increased access on a Thursday (compared to previous Thursdays)
   5. Unusual sequences of actions for a given user, e.g. read, decrypt,
   delete in quick succession for all keys for a given user
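
To make cases #1, #2 and #4 more concrete, here is a minimal sketch of two
simple baseline checks: flagging a source IP not seen before for a user, and
scoring a count against the history of comparable periods (e.g. previous
Thursdays) using the median and the median absolute deviation (MAD). The
function names, the sample data and the threshold of 3.0 are hypothetical,
and this is only one possible baseline, not a full solution.

```python
from collections import defaultdict
from statistics import median


def new_ip_events(events):
    """Return (user, ip) pairs where a known user shows up from an unseen IP."""
    seen = defaultdict(set)
    flagged = []
    for user, ip in events:
        # Only flag users that already have some history.
        if seen[user] and ip not in seen[user]:
            flagged.append((user, ip))
        seen[user].add(ip)
    return flagged


def robust_score(history, current):
    """Deviation of `current` from the median of `history`, in MAD units."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # Fall back to 1.0 when the history is constant (MAD == 0).
    return (current - med) / (mad or 1.0)


# Case #1: made-up (user, src_ip) events in time order.
events = [("a", "1.1.1.1"), ("a", "1.1.1.1"), ("a", "2.2.2.2"), ("b", "3.3.3.3")]
flagged = new_ip_events(events)

# Cases #2/#4: made-up access counts for one (user, key) pair on past Thursdays.
past_thursdays = [100, 110, 95, 105]
THRESHOLD = 3.0  # hypothetical cut-off, would need tuning on real data
spike = robust_score(past_thursdays, 400)
usual = robust_score(past_thursdays, 104)
print(flagged, spike > THRESHOLD, usual > THRESHOLD)
```

Checks like these do not cover sequences of actions (#5), which would need
some model of the order of events per user.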

------------------------

From our research, we have come up with the list below of algorithms that
are applied to similar problems:

   - ARIMA
   <https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average>
   : This might be good for time-series forecasting, but will it also learn
   to flag anomalies like #3 and #4, sequences of actions (#5), etc.?
   - scikit-learn's Novelty and Outlier Detection
   <http://scikit-learn.org/stable/modules/outlier_detection.html> : Not
   sure if these will address #3, #4 and #5 use cases above.
   - Neural Networks
   - k-nearest neighbors
   - Clustering-Based Anomaly Detection Techniques: k-Means Clustering etc
   - Parametric Techniques
   <https://www.vs.inf.ethz.ch/edu/HS2011/CPS/papers/chandola09_anomaly-detection-survey.pdf>
   (see Section 7): This might work well on continuous variables, but will
   it work on discrete features like is_weekday, etc.? Also, will it cover
   cases like #4 and #5 above?

Most of the research I found was on problems that had continuous features
and did not consider discrete variables like "Holiday_today?" or
successions of events.

Any feedback on the algorithms / techniques that can be used for the above
use cases would be highly appreciated. Thanks.

Regards,
John.