[scikit-learn] Anomaly/Outlier detection based on user access for a large application
John Doe
coderain1 at gmail.com
Mon Mar 20 02:05:59 EDT 2017
Hi All,
I am trying to solve a problem of finding Anomalies/Outliers using
application logs of a large KMS. Please find the details below:
*Problem Statement*: Find anomalies/outliers using application access logs
in an unsupervised learning setting. The basic use case is to find any
suspicious activity by a user/group that deviates from the trend the
algorithm has learned.
*Input Data*: Data would be created from log files that are in the following
format:
"ts, src_ip, decrypt, user_a, group_b, kms_region, key"
Where:
*ts* : time of access as a Unix epoch timestamp (seconds), e.g. 1489840335
*src_ip* : the source IP address of the access
*decrypt* : the action performed; one of several possible actions
*user_a*, *group_b* : the user and group that performed the access
*kms_region* : the region in which the key exists
*key* : the key that was accessed
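To make the record layout concrete, here is a minimal sketch of parsing one such line and deriving calendar features from *ts*. The sample values (IP, region, key name) and the names `AccessEvent`, `parse_line` and `calendar_features` are hypothetical, and the timestamp is assumed to be in seconds and interpreted as UTC.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class AccessEvent(NamedTuple):
    """One access-log record (field names are illustrative)."""
    ts: int
    src_ip: str
    action: str
    user: str
    group: str
    kms_region: str
    key: str


def parse_line(line: str) -> AccessEvent:
    """Split a 7-field comma-separated log line into a typed record."""
    fields = [f.strip() for f in line.strip().strip('"').split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    ts, src_ip, action, user, group, region, key = fields
    return AccessEvent(int(ts), src_ip, action, user, group, region, key)


def calendar_features(event: AccessEvent) -> dict:
    """Derive discrete time features (assumes ts is epoch seconds, UTC)."""
    when = datetime.fromtimestamp(event.ts, tz=timezone.utc)
    return {
        "hour": when.hour,
        "weekday": when.weekday(),        # Monday == 0 ... Sunday == 6
        "is_weekend": when.weekday() >= 5,
    }


# Hypothetical sample line in the format above.
sample = "1489840335, 10.0.0.5, decrypt, user_a, group_b, us-east-1, key_42"
event = parse_line(sample)
features = calendar_features(event)
```

A holiday flag would need an external calendar, which is not covered here.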
*Train Set*: This is an unsupervised learning problem, so we don't
have a "normal" training set from which the model can learn.
*Example of anomalies*:
1. User A suddenly accessing from a different IP: xx.yy
2. Number of accesses going up suddenly for a given (user, key)
pair
3. Increased access on a generally quiet long weekend
4. Increased access on a Thursday (compared to previous Thursdays)
5. Unusual sequences of actions for a given user, e.g. read, decrypt,
delete in quick succession for all keys for a given user
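As a rough illustration of examples #1 and #4 only, the sketch below flags a first-seen IP per user and scores a day's access count against earlier counts for the same weekday using a median/MAD robust z-score. This is a simple baseline I am assuming for illustration, not one of the algorithms listed below; the function names, the sample counts and the 3.5 cut-off are hypothetical.

```python
from statistics import median


def is_new_ip(seen_ips: dict, user: str, ip: str) -> bool:
    """Return True if this user has never been seen from this IP, then record it."""
    known = seen_ips.setdefault(user, set())
    new = ip not in known
    known.add(ip)
    return new


def robust_z(history: list, value: float) -> float:
    """Robust z-score of `value` against `history` using median and MAD."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    # 1.4826 rescales MAD to be comparable to a standard deviation for
    # normally distributed data; fall back to 1.0 if MAD is zero.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return (value - med) / scale


# Hypothetical access counts for the previous five Thursdays.
past_thursdays = [100, 110, 95, 105, 102]
THRESHOLD = 3.5  # assumed cut-off, to be tuned on real data

quiet_day = robust_z(past_thursdays, 104)
busy_day = robust_z(past_thursdays, 400)

seen = {}
first = is_new_ip(seen, "user_a", "10.0.0.5")   # first sighting of this IP
repeat = is_new_ip(seen, "user_a", "10.0.0.5")  # same IP again
```

Here `busy_day` exceeds the threshold while `quiet_day` does not. Sequences of actions (#5) are not handled by this kind of per-count baseline.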
------------------------
From our research, we have come up with the following list of algorithms that are
applied to similar problems:
- ARIMA
<https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average>
: This might be good for time-series forecasting, but will it also learn to
flag anomalies like #3 and #4, or sequences of actions (#5)?
- scikit-learn's Novelty and Outlier Detection
<http://scikit-learn.org/stable/modules/outlier_detection.html> : Not
sure whether these will address use cases #3, #4 and #5 above.
- Neural Networks
- k-nearest neighbors
- Clustering-Based Anomaly Detection Techniques: k-Means Clustering etc
- Parametric Techniques
<https://www.vs.inf.ethz.ch/edu/HS2011/CPS/papers/chandola09_anomaly-detection-survey.pdf>
(See Section 7): This might work well on continuous variables, but will it
work on discrete features such as is_weekday? Also, will it cover cases
like #4 and #5 above?
Most of the research I did was on problems that had continuous features
and did not consider discrete variables like "Holiday_today?", sequences
of events, etc.
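For what it's worth, one way to try mixed continuous and discrete features with scikit-learn is to one-hot encode the categorical columns and feed everything to `IsolationForest`. The sketch below uses synthetic, made-up hourly aggregates (count, hour, weekend flag, dominant action); the data, the column choice and `build_matrix` are assumptions for illustration, and it does not address sequences of actions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder


def build_matrix(counts, hours, is_weekend, actions):
    """Stack numeric columns with a one-hot encoding of the action column."""
    encoder = OneHotEncoder(handle_unknown="ignore")
    action_onehot = encoder.fit_transform(
        np.asarray(actions).reshape(-1, 1)
    ).toarray()
    numeric = np.column_stack([counts, hours, is_weekend])
    return np.hstack([numeric, action_onehot])


rng = np.random.RandomState(0)
n = 200

# Synthetic "usual" behaviour: weekday office hours, read/decrypt only.
counts = rng.poisson(10, size=n)
hours = rng.randint(9, 18, size=n)
is_weekend = np.zeros(n, dtype=int)
actions = rng.choice(["read", "decrypt"], size=n)

# One deliberately odd row: 3 a.m. on a weekend, many deletes.
counts = np.append(counts, 500)
hours = np.append(hours, 3)
is_weekend = np.append(is_weekend, 1)
actions = np.append(actions, "delete")

X = build_matrix(counts, hours, is_weekend, actions)

model = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = model.score_samples(X)          # lower means more anomalous
most_anomalous = int(np.argmin(scores))  # index of the most isolated row
```

On this toy data the injected row (index `n`) gets the lowest score. Whether this carries over to real logs would need checking.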
Any feedback on the algorithm / technique that can be used for the above
use cases would be highly appreciated. Thanks.
Regards,
John.