[scikit-learn] partial_fit implementation for IsolationForest

donkey-hotei at cryptolab.net
Fri Jul 1 03:48:55 EDT 2016

Hi Olivier,

Thanks for your response.

> What you describe is quite different from what sklearn models
> typically do with partial_fit. partial_fit is more about out-of-core /
> streaming fitting rather than true online learning with explicit
> forgetting.
> In particular what you suggest would not accept calling partial_fit
> with very small chunks (e.g. from tens to a hundred samples at a time)
> because that would not be enough to develop deep isolation trees and
> would harm the performance of the resulting isolation forest.

I see. I suppose I should check how the depth of these trees
changes when fitting on small chunks as opposed to large chunks.
Either way, refreshing on at least 1000 samples has proven to work OK
here in the face of concept drift.
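
A rough sketch of that depth check, using only the public IsolationForest API. The helper name mean_tree_depth, the synthetic Gaussian data and the chunk sizes (50 vs. 1000) are assumptions for illustration, not taken from the real setup:

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def mean_tree_depth(X, n_estimators=100, seed=0):
    """Fit a fresh IsolationForest on X and return the mean depth of its trees."""
    forest = IsolationForest(n_estimators=n_estimators, random_state=seed).fit(X)
    # Each fitted estimator is a tree whose depth is exposed as tree_.max_depth.
    return float(np.mean([est.tree_.max_depth for est in forest.estimators_]))


rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))  # synthetic stand-in for the real features

small_depth = mean_tree_depth(X[:50])  # one small chunk
large_depth = mean_tree_depth(X)       # one large chunk
print("mean depth, 50 samples:  ", small_depth)
print("mean depth, 1000 samples:", large_depth)
```

With the default max_samples="auto", each tree sees at most 256 samples and its depth is capped at ceil(log2(max_samples)), so trees grown on a 50-sample chunk cannot be deeper than 6 levels, versus 8 for chunks of 256 samples or more.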

> If the problem is true online learning (tracking a stream of training
> data with expected shifts in its distribution) I think it's better to
> devise a dedicated API that does not try to mimic the scikit-learn API
> (for this specific part). There will typically have to be an
> additional hyperparameter to control how much the model should
> remember about old samples.

OK, I've been using a parameter called 'n_more_estimators' that decides
how many trees are dropped/added. Maybe it is not the best way.
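
Roughly, the drop-oldest/add-new idea can be sketched as below. This is not the actual patch: SlidingForest, max_trees and the chunk loop are made-up names for illustration, and instead of editing estimators_ in place it keeps a fixed-size window of small sub-forests built from the public API and averages their scores, which only approximates a single forest whose oldest trees are replaced.

```python
from collections import deque

import numpy as np
from sklearn.ensemble import IsolationForest


class SlidingForest:
    """Hypothetical helper: keeps only the most recent batches of trees."""

    def __init__(self, max_trees=100, n_more_estimators=25, seed=0):
        self.n_more_estimators = n_more_estimators
        # A full deque silently discards its oldest sub-forest on append.
        self.forests = deque(maxlen=max_trees // n_more_estimators)
        self.rng = np.random.RandomState(seed)

    def online_fit(self, X):
        """Grow n_more_estimators new trees on X; the oldest batch falls out."""
        forest = IsolationForest(
            n_estimators=self.n_more_estimators,
            random_state=int(self.rng.randint(2**31 - 1)),
        )
        self.forests.append(forest.fit(X))
        return self

    def score_samples(self, X):
        """Average the sub-forest scores (lower means more anomalous)."""
        return np.mean([f.score_samples(X) for f in self.forests], axis=0)


rng = np.random.RandomState(42)
model = SlidingForest(max_trees=100, n_more_estimators=25)
for shift in range(6):  # six chunks of 1000 samples with a drifting mean
    model.online_fit(rng.normal(loc=shift, size=(1000, 3)))

n_trees = sum(len(f.estimators_) for f in model.forests)
inlier = model.score_samples(np.full((1, 3), 5.0))    # centre of the latest chunk
outlier = model.score_samples(np.full((1, 3), 50.0))  # far from every chunk
print(n_trees, inlier[0], outlier[0])
```

Here the window size (max_trees // n_more_estimators sub-forests) plays the role of the "how much to remember" hyperparameter: a smaller window forgets old chunks faster.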

> If the problem is more about out-of-core, then partial_fit is suitable
> but the trees should grow and get reorganized progressively (as
> pointed by others in previous comments).

Maybe a name like "online_fit" would be more appropriate? It would be
nice to know what exactly is meant by "reorganized"; so far I've been
merely dropping the oldest trees.

> BTW,  I would be curious to know more about the kind of anomaly
> detection problem where you found IsolationForests to work well.

The problem is intrusion detection at the application layer; the features
are parsed from HTTP audit logs.

