[scikit-learn] partial_fit implementation for IsolationForest

donkey-hotei at cryptolab.net
Fri Jul 1 03:48:55 EDT 2016


Hi Olivier,

Thanks for your response.

> What you describe is quite different from what sklearn models
> typically do with partial_fit. partial_fit is more about out-of-core /
> streaming fitting rather than true online learning with explicit
> forgetting.
> 
> In particular what you suggest would not accept calling partial_fit
> with very small chunks (e.g. from tens to a hundred samples at a time)
> because that would not be enough to develop deep isolation trees and
> would harm the performance of the resulting isolation forest.

I see. I suppose I should check how the depth of these trees changes
when fitting on small chunks as opposed to large chunks. Either way,
refreshing on at least 1000 samples has proven to work OK here in the
face of concept drift.
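
A minimal sketch of how that depth check might look, assuming synthetic
Gaussian data in place of the real features and the default
max_samples="auto". The helper name mean_tree_depth is hypothetical. As
far as I understand, IsolationForest caps tree depth at roughly
ceil(log2(max_samples)), so a 50-sample chunk cannot grow trees as deep
as a chunk that fills the default 256-sample subsample.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def mean_tree_depth(chunk_size, n_features=5, n_estimators=50, seed=0):
    """Fit a forest on one synthetic chunk and return the mean tree depth."""
    rng = np.random.RandomState(seed)
    # Stand-in data only; the real features would come from the audit logs.
    X = rng.normal(size=(chunk_size, n_features))
    forest = IsolationForest(n_estimators=n_estimators, random_state=seed)
    forest.fit(X)
    # Each fitted estimator is a tree whose structure is exposed as tree_.
    return float(np.mean([est.tree_.max_depth for est in forest.estimators_]))


small_depth = mean_tree_depth(chunk_size=50)
large_depth = mean_tree_depth(chunk_size=1000)
print(f"50-sample chunk: {small_depth:.2f}")
print(f"1000-sample chunk: {large_depth:.2f}")
```

Running the same comparison on real chunks would show whether small
refreshes actually produce shallower trees in practice.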

> If the problem is true online learning (tracking a stream of training
> data with expected shifts in its distribution) I think it's better to
> devise a dedicated API that does not try to mimic the scikit-learn API
> (for this specific part). There will typically have to be an
> additional hyperparameter to control how much the model should
> remember about old samples.

OK, I've been using a parameter called 'n_more_estimators' that decides
how many trees are dropped and added. Maybe it is not the best way.
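
A rough sketch of the drop/add idea using only the public API, not the
actual patch. It assumes one small IsolationForest per chunk, kept in a
deque so the oldest trees fall off first. The class name SlidingForest
and the max_estimators cap are hypothetical; n_more_estimators and
online_fit are the names from this thread. Averaging score_samples
across sub-forests is an approximation of scoring one pooled forest.

```python
from collections import deque

import numpy as np
from sklearn.ensemble import IsolationForest


class SlidingForest:
    """Hypothetical helper, not part of scikit-learn."""

    def __init__(self, n_more_estimators=25, max_estimators=100, seed=0):
        self.n_more_estimators = n_more_estimators  # trees added per chunk
        self.max_estimators = max_estimators        # assumed cap on total trees
        self.rng = np.random.RandomState(seed)
        self.forests = deque()                      # oldest sub-forest on the left

    def online_fit(self, X):
        """Grow new trees on the latest chunk, then drop the oldest ones."""
        forest = IsolationForest(
            n_estimators=self.n_more_estimators,
            random_state=int(self.rng.randint(2**31 - 1)),
        )
        self.forests.append(forest.fit(X))
        while sum(f.n_estimators for f in self.forests) > self.max_estimators:
            self.forests.popleft()
        return self

    def score_samples(self, X):
        """Average the sub-forest scores; lower means more anomalous."""
        return np.mean([f.score_samples(X) for f in self.forests], axis=0)


rng = np.random.RandomState(42)
model = SlidingForest(n_more_estimators=25, max_estimators=50)
for _ in range(3):
    model.online_fit(rng.normal(size=(1000, 4)))  # synthetic chunks

probe = np.array([[0.0] * 4, [10.0] * 4])  # a typical point and an extreme one
scores = model.score_samples(probe)
print(len(model.forests), scores)
```

Here the ratio of n_more_estimators to max_estimators plays the role of
the "how much to remember" hyperparameter: a smaller cap forgets old
chunks faster.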

> If the problem is more about out-of-core, then partial_fit is suitable
> but the trees should grow and get reorganized progressively (as
> pointed by others in previous comments).

Maybe a name like "online_fit" would be more appropriate? It would be
nice to know what exactly is meant by "reorganized". So far I've
merely been dropping the oldest trees.

> BTW,  I would be curious to know more about the kind of anomaly
> detection problem where you found IsolationForests to work well.

The problem is intrusion detection at the application layer. Features
are parsed from HTTP audit logs.

ty

