[scikit-learn] partial_fit implementation for IsolationForest
Nicolas Goix
goix.nicolas at gmail.com
Thu Jun 9 12:58:52 EDT 2016
Hi Isaak
There is a good review of methods for online random forests here:
https://arxiv.org/pdf/1302.4853.pdf
In fact, it turns out that keeping a "window" of trees is not
the best approach. Usually the trees have to be grown at the same time
as the data arrive, see
http://lrs.icg.tugraz.at/pubs/saffari_olcv_09.pdf
Adapting the ensembles API to online learning seems like hard work. But
you can open a PR to discuss it.
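
For discussion, here is a minimal sketch of the window heuristic quoted
below, written against the public estimator interface only.
SlidingWindowEnsemble, make_estimator and window are hypothetical names,
not scikit-learn API. It assumes each sub-model exposes fit(X) and
score_samples(X), as IsolationForest does, and it averages the scores of
the sub-models rather than pooling tree depths, so it only approximates
the "average depth of the forest" idea. It also does not grow trees as
data arrive, so the objection above still applies.

```python
from collections import deque


class SlidingWindowEnsemble:
    """Hypothetical helper: keeps only the `window` most recent sub-models.

    `make_estimator` is any zero-argument callable returning an object with
    `fit(X)` and `score_samples(X)`, e.g. (assuming scikit-learn is
    installed) `lambda: IsolationForest(n_estimators=10)`.
    """

    def __init__(self, make_estimator, window=5):
        self.make_estimator = make_estimator
        # deque(maxlen=...) discards the oldest entry once the window is full.
        self.models_ = deque(maxlen=window)

    def partial_fit(self, X):
        # Fit a fresh sub-model on the incoming batch only.
        model = self.make_estimator()
        model.fit(X)
        self.models_.append(model)
        return self

    def score_samples(self, X):
        if not self.models_:
            raise ValueError("call partial_fit at least once before scoring")
        per_model = [m.score_samples(X) for m in self.models_]
        # Average the sub-model scores sample by sample.
        return [sum(col) / len(col) for col in zip(*per_model)]
```

With scikit-learn installed, make_estimator could be something like
lambda: IsolationForest(n_estimators=10); each partial_fit call then adds
ten trees trained on the newest batch and, once the window is full,
discards the ten oldest.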
Nicolas
On 9 Jun 2016 9:06 am, <donkey-hotei at cryptolab.net> wrote:
> hi nicolas,
> excuse me, didn't mean to drop this thread for so long.
>
>> There is a paper from the same authors as iforest but for streaming
>> data: http://ijcai.org/Proceedings/11/Papers/254.pdf
>>
>> For now it is not cited enough (24 citations) to satisfy the sklearn
>> requirements. While waiting for more citations, this could be a nice
>> addition to sklearn-contrib.
>>
>
> agreed, I started on a weak implementation of hstree but it is not
> scikit-learn compatible,
> let's see what happens...
> it would be nice to see some guidance here, maybe a new splitter will have
> to be added?
>
>> Otherwise, we could imagine extending iforest to streaming data by
>> building new trees when data come in (and removing the oldest ones),
>> with prediction still being based on the average depth of the forest.
>> I'm not sure this heuristic could be merged into scikit-learn, since it
>> is not based on well-cited papers. At the same time, it is a natural
>> and simple extension of iforest to streaming data...
>>
>> Any opinion on it?
>>
>
> It is, as I thought, a simple extension. My first naive approach was to
> use the 'warm_start' attribute of the BaseBagging parent class to
> preserve older estimators and then, in the 'partial_fit' method, have a
> loop which pops off some number n of estimators before calling the
> original 'fit' method again on incoming data, adding new estimators to
> the ensemble.
> We run into the problem of concept drift. Is this the way you'd implement
> this? If not, how would you approach it?
>
> thanks so much for reading,
> isaak