[scikit-learn] Please unsubscribe

Stratford, Mark A mark_stratford at optum.com
Mon Mar 20 05:32:43 EDT 2017



-----Original Message-----
From: scikit-learn [mailto:scikit-learn-bounces+mark_stratford=optum.com at python.org] On Behalf Of scikit-learn-request at python.org
Sent: Monday, March 20, 2017 6:06 AM
To: scikit-learn at python.org
Subject: scikit-learn Digest, Vol 12, Issue 42

Send scikit-learn mailing list submissions to
	scikit-learn at python.org

To subscribe or unsubscribe via the World Wide Web, visit
	https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
	scikit-learn-request at python.org

You can reach the person managing the list at
	scikit-learn-owner at python.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."


Today's Topics:

   1. recommended feature selection method to train an MLPRegressor
      (Thomas Evangelidis)
   2. Re: recommended feature selection method to train an
      MLPRegressor (Andreas Mueller)
   3. Re: recommended feature selection method to train an
      MLPRegressor (Sebastian Raschka)
   4. Anomaly/Outlier detection based on user access for a large
      application (John Doe)


----------------------------------------------------------------------

Message: 1
Date: Sun, 19 Mar 2017 20:47:36 +0100
From: Thomas Evangelidis <tevang3 at gmail.com>
To: Scikit-learn user and developer mailing list
	<scikit-learn at python.org>
Subject: [scikit-learn] recommended feature selection method to train
	an MLPRegressor
Message-ID:
	<CAACvdx17Ev3jr0ds2bLyJc0RqZkqJH7Rtx=s1ZaodmUvCkcB8Q at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Which of the following methods would you recommend for selecting good features
(<=50) from a set of 534 features in order to train an MLPRegressor? Please take into account that the datasets I use for training are small.

http://scikit-learn.org/stable/modules/feature_selection.html

And please don't tell me to use a neural network that supports dropout or any other algorithm for feature elimination. That is not applicable in my case, because I want to know the best 50 features in order to append them to other types of features that I am confident are important.


cheers
Thomas


-- 

======================================================================

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr

          tevang3 at gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170319/b3e083c7/attachment-0001.html>

------------------------------

Message: 2
Date: Sun, 19 Mar 2017 18:23:07 -0400
From: Andreas Mueller <t3kcit at gmail.com>
To: Scikit-learn user and developer mailing list
	<scikit-learn at python.org>
Subject: Re: [scikit-learn] recommended feature selection method to
	train an MLPRegressor
Message-ID: <6b490067-962e-02fc-5157-9a487fc1aa83 at gmail.com>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"



On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
> Which of the following methods would you recommend to select good 
> features (<=50) from a set of 534 features in order to train a 
> MLPregressor? Please take into account that the datasets I use for 
> training are small.
>
> http://scikit-learn.org/stable/modules/feature_selection.html
>
> And please don't tell me to use a neural network that supports the 
> dropout or any other algorithm for feature elimination. This is not 
> applicable in my case because I want to know the best 50 features in 
> order to append them to other types of feature that I am confident 
> that are important.
>
You can always use forward or backward selection as implemented in mlxtend if you're patient. Since your dataset is small, that might work.
However, it might be tricky to get the MLP to run consistently - though maybe not...
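For illustration, a minimal sketch of that greedy approach - here using scikit-learn's SequentialFeatureSelector as a stand-in for the mlxtend one, with a tiny synthetic dataset and toy hyperparameters in place of the actual 534-feature problem:

```python
# Minimal sketch: greedy forward selection wrapped around an MLP.
# Dataset, sizes, and hyperparameters are toy stand-ins, not a recipe.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=80, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Scale inputs -- MLPs are sensitive to feature scale.
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                                 max_iter=300, random_state=0))

# Greedily add one feature at a time until 3 are selected,
# scoring each candidate subset by cross-validated R^2.
sfs = SequentialFeatureSelector(mlp, n_features_to_select=3,
                                direction="forward", cv=2)
sfs.fit(X, y)
print(sorted(np.flatnonzero(sfs.get_support())))
```

With 534 features and an MLP inside the loop, the same code would be much slower, which is where the patience comes in.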
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170319/79e4cf33/attachment-0001.html>

------------------------------

Message: 3
Date: Sun, 19 Mar 2017 19:32:45 -0400
From: Sebastian Raschka <se.raschka at gmail.com>
To: Scikit-learn user and developer mailing list
	<scikit-learn at python.org>
Subject: Re: [scikit-learn] recommended feature selection method to
	train an MLPRegressor
Message-ID: <F6B32E16-6045-4934-A27D-1407D43DCF2A at gmail.com>
Content-Type: text/plain; charset=utf-8

Hm, that's tricky. I think the other methods listed on http://scikit-learn.org/stable/modules/feature_selection.html could help regarding a computationally cheap solution, but the problem is that they probably wouldn't work that well for an MLP due to their linearity assumption. And an exhaustive search over all subsets would be impractical/impossible: for the 50-feature subsets alone, you already have 73353053308199416032348518540326808282134507009732998441913227684085760 combinations :P. A greedy solution like forward or backward selection would be more feasible, but still very expensive in combination with an MLP. On top of that, you also have to consider that neural networks are generally pretty sensitive to hyperparameter settings. So even if you fix the architecture, you probably still want to check that the learning rate etc. is appropriate for each combination of features (by checking the cost and validation error during training).

PS: I wouldn't dismiss dropout, imho. Especially because your training set is small, it could even be crucial for reducing overfitting. It doesn't remove features from your dataset; it just keeps the network from relying on particular combinations of features always being present during training. Your final network will still process all features, and dropout will effectively cause the network to "use" more of the features in your ~50-feature subset compared to no dropout (because otherwise, it may just learn to rely on a subset of those 50 features).
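Sanity-checking those numbers takes a couple of lines; the greedy fit count below is my own back-of-the-envelope addition, not a figure from the thread:

```python
import math

# Number of distinct 50-feature subsets of 534 features -- far too
# many to enumerate exhaustively.
n_subsets = math.comb(534, 50)
print(len(str(n_subsets)), "digit number")

# Greedy forward selection instead fits one candidate model per
# remaining feature per step: 534 + 533 + ... + 485 fits for 50 steps.
n_greedy_fits = sum(534 - k for k in range(50))
print(n_greedy_fits)  # 25475
```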

> On Mar 19, 2017, at 6:23 PM, Andreas Mueller <t3kcit at gmail.com> wrote:
> 
> 
> 
> On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
>> Which of the following methods would you recommend to select good features (<=50) from a set of 534 features in order to train a MLPregressor? Please take into account that the datasets I use for training are small.
>> 
>> http://scikit-learn.org/stable/modules/feature_selection.html
>> 
>> And please don't tell me to use a neural network that supports the dropout or any other algorithm for feature elimination. This is not applicable in my case because I want to know the best 50 features in order to append them to other types of feature that I am confident that are important.
>> 
> You can always use forward or backward selection as implemented in mlxtend if you're patient. As your dataset is small that might work.
> However, it might be hard tricky to get the MLP to run consistently - though maybe not...
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



------------------------------

Message: 4
Date: Mon, 20 Mar 2017 11:35:59 +0530
From: John Doe <coderain1 at gmail.com>
To: scikit-learn at python.org
Subject: [scikit-learn] Anomaly/Outlier detection based on user access
	for a large application
Message-ID:
	<CAP=qekf0sVRzY+RQjYUtp9Ax2zLKi39H89CNNGbj-gewYXPzMA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi All,
    I am trying to solve a problem of finding anomalies/outliers using the application logs of a large KMS. Please find the details below:

*Problem Statement*: Find anomalies/outliers using application access logs in an unsupervised learning setting. The basic use case is to find any suspicious activity by a user/group that deviates from the trend the algorithm has learned.

*Input Data*: Data would be created from log files that are in the following
format:

"ts, src_ip, decrypt, user_a, group_b, kms_region, key"

Where:

*ts* : time of access in epoch format, e.g. 1489840335
*decrypt* : one of the various possible actions
*user_a*, *group_b* : the user and group that performed the access
*kms_region* : the region in which the key exists
*key* : the key that was accessed
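To make the format concrete, one such line could be parsed as follows (all concrete values below are invented, not from our logs):

```python
# Parse one comma-separated log line in the format described above.
# The field values here are made-up examples.
line = "1489840335, 10.1.2.3, decrypt, user_a, group_b, us-east, key_42"
ts, src_ip, action, user, group, kms_region, key = (
    field.strip() for field in line.split(","))
print(ts, action, user, kms_region)
```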

*Train Set*: This is an unsupervised learning problem, hence we can't have a "normal" training set for the model to learn from.

*Example of anomalies*:

   1. User A suddenly accessing from a different IP: xx.yy
   2. Number of accesses for a given key going up suddenly for a given
   user/key pair
   3. Increased access on a generally quiet long weekend
   4. Increased access on a Thursday (compared to previous Thursdays)
   5. Unusual sequences of actions for a given user, e.g. read, decrypt,
   delete in quick succession across all keys for a given user

------------------------

From our research, we have come up with the below list of algorithms that are
applied to similar problems:

   - ARIMA
   <https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average>
   : This might be good for time-series prediction, but will it also learn to
   flag anomalies like #3, #4, and sequences of actions (#5)?
   - scikit-learn's Novelty and Outlier Detection
   <http://scikit-learn.org/stable/modules/outlier_detection.html> : Not
   sure if these will address use cases #3, #4, and #5 above.
   - Neural Networks
   - k-nearest neighbors
   - Clustering-Based Anomaly Detection Techniques: k-Means Clustering etc
   - Parametric Techniques
   <https://www.vs.inf.ethz.ch/edu/HS2011/CPS/papers/chandola09_anomaly-detection-survey.pdf>
   (See Section 7): This might work well on continuous variables, but will it
   work on discrete features like is_weekday, etc.? Also, will it cover cases
   like #4 and #5 above?
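As a concrete starting point for the scikit-learn option above, here is a toy IsolationForest sketch on invented count features; it only probes volume-style anomalies like #2/#3, not action sequences like #5:

```python
# Toy sketch: IsolationForest on simple engineered features.
# The feature columns and all numbers are invented for illustration:
# [accesses_per_hour, distinct_keys_touched, is_weekend]
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = np.column_stack([rng.poisson(20, 500),      # typical hourly volume
                          rng.poisson(3, 500),       # typical key spread
                          rng.randint(0, 2, 500)])   # weekend flag
spikes = np.array([[200, 40, 1],                     # weekend burst (~#3)
                   [150, 35, 0]])                    # weekday burst (~#2)
X = np.vstack([normal, spikes]).astype(float)

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = outlier, 1 = inlier
print(labels[-2:])
```

Sequences of actions (#5) would need different features, e.g. n-gram counts of consecutive actions per user, before anything like this applies.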

Most of the research I found was on problems that had continuous features and did not consider discrete variables like "Holiday_today?" or successions of events, etc.

Any feedback on the algorithms/techniques that could be used for the above use cases would be highly appreciated. Thanks.

Regards,
John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170320/c3224199/attachment.html>

------------------------------

Subject: Digest Footer

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn


------------------------------

End of scikit-learn Digest, Vol 12, Issue 42
********************************************




