hello scikit-learn devs,

After following the work on IsolationForest and testing it on a real-world problem here, we've found this model to be very promising for anomaly detection. However, at present, IsolationForest only fits data in batch, even though it may be well suited to incremental on-line learning, since one could subsample recent history and drop older estimators progressively.

I'd like to contribute this feature, but being new to ML and scikit-learn I'm curious how I should start making a quick & dirty version to see how this might work. Are there other good examples where one could see the difference between .fit and .partial_fit in other models?

thanks, isaak y.
Hi Isaac,

You may have a look at MiniBatchKMeans and MiniBatchDictionaryLearning, which both propose this API. At each call, you fit a single mini-batch to the estimator using partial_fit and update the inner attributes accordingly. During the first partial_fit, you should take care of the various memory allocations the estimator needs.

Please feel free to create a pull request whenever you think your code is ready for review. Good luck!

On 26 May 2016 13:14, <donkey-hotei@cryptolab.net> wrote:
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
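The pattern Arthur describes (allocate state on the first partial_fit call, then fold each mini-batch into the inner attributes) can be sketched with a toy estimator. `RunningMeanEstimator` and its attributes are made up for illustration; they are not part of scikit-learn:

```python
import numpy as np

class RunningMeanEstimator:
    """Toy estimator illustrating the partial_fit pattern: allocate state on
    the first call, then update it incrementally on each mini-batch."""

    def partial_fit(self, X):
        X = np.asarray(X, dtype=float)
        # First partial_fit call: allocate the inner attributes.
        if not hasattr(self, "mean_"):
            self.mean_ = np.zeros(X.shape[1])
            self.n_samples_seen_ = 0
        # Subsequent calls: fold the new mini-batch into the running state.
        n_new = X.shape[0]
        total = self.n_samples_seen_ + n_new
        self.mean_ = (self.mean_ * self.n_samples_seen_ + X.sum(axis=0)) / total
        self.n_samples_seen_ = total
        return self
```

After any number of calls, `mean_` equals the mean of all samples seen so far, which is the invariant a real partial_fit implementation would maintain for its own statistics.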
Hello Isaak,

There is a paper from the same authors as iForest, but for streaming data: http://ijcai.org/Proceedings/11/Papers/254.pdf

For now it is not cited enough (24 citations) to satisfy the scikit-learn inclusion requirements. While waiting for more citations, this could be a nice addition to scikit-learn-contrib.

Otherwise, we could imagine extending iForest to streaming data by building new trees as data arrive (and removing the oldest ones), with prediction still based on the average depth of the forest. I'm not sure this heuristic could be merged into scikit-learn, since it is not based on well-cited papers. At the same time, it is a natural and simple extension of iForest to streaming data... Any opinion on it?

Nicolas

On 26 May 2016 at 13:32 GMT+02:00, Arthur Mensch <arthur.mensch@inria.fr> wrote:
I think your idea is an excellent candidate for scikit-learn-contrib: https://github.com/scikit-learn-contrib/scikit-learn-contrib

Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith@macys.com

From: Nicolas Goix
Sent: Thursday, May 26, 2016 8:51 AM
Subject: Re: [scikit-learn] partial_fit implementation for IsolationForest
How about Mondrian forests? ;)
Hi Nicolas, excuse me, I didn't mean to drop this thread for so long.
> There is a paper from the same authors as iforest but for streaming data: http://ijcai.org/Proceedings/11/Papers/254.pdf
>
> For now it is not cited enough (24) to satisfy the sklearn requirements. Waiting for more citations, this could be a nice addition to sklearn-contrib.
Agreed. I started on a rough implementation of HS-Trees, but it is not scikit-learn compatible; let's see what happens... It would be nice to get some guidance here: maybe a new splitter will have to be added?
> Otherwise, we could imagine extending iforest to streaming data by building new trees when data come (and removing the oldest ones), prediction still being based on the average depth of the forest. I'm not sure this heuristic could be merged on scikit-learn, since it is not based on well-cited papers. In the same time, it is a natural and simple extension of iforest to streaming data...
>
> Any opinion on it?
It is, as I thought, a simple extension. My first naive approach was to use the 'warm_start' attribute of the BaseBagging parent class to preserve older estimators; then, in the 'partial_fit' method, a loop pops off some n number of the oldest estimators before calling the original 'fit' method again on the incoming data, adding new estimators to the ensemble. We still run into the problem of concept drift, though. Is this the way you'd implement this? If not, how would you approach it? Thanks so much for reading, isaak
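A minimal sketch of that warm_start idea, assuming a scikit-learn version where IsolationForest exposes warm_start (0.21+); the helper name `partial_fit_window` and the way it pops estimators are illustrative, not an existing API:

```python
from sklearn.ensemble import IsolationForest

def partial_fit_window(forest, X, n_more_estimators):
    """Drop the oldest trees, then grow replacements on the new batch."""
    if not hasattr(forest, "estimators_"):
        # First call: plain fit, with warm_start enabled for later updates.
        forest.set_params(warm_start=True, n_estimators=n_more_estimators)
        return forest.fit(X)
    # Pop the oldest trees (and their feature subsets) from the ensemble.
    del forest.estimators_[:n_more_estimators]
    del forest.estimators_features_[:n_more_estimators]
    # warm_start grows trees until len(estimators_) == n_estimators, so this
    # fits exactly n_more_estimators new trees on the incoming data only.
    forest.n_estimators = len(forest.estimators_) + n_more_estimators
    return forest.fit(X)
```

This keeps the ensemble size bounded while new trees see only recent data, which is exactly where the concept-drift question comes in: old trees still vote with full weight until they are dropped.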
Hi Isaak,

There is a good review of methods for online random forests here: https://arxiv.org/pdf/1302.4853.pdf

In fact, it turns out that keeping a "window" of trees is not the best approach. Usually the trees have to be grown at the same time as the data arrive; see http://lrs.icg.tugraz.at/pubs/saffari_olcv_09.pdf

Adapting the ensemble API to online learning looks like hard work, but you can open a PR to discuss it.

Nicolas

On 9 Jun 2016 9:06 am, <donkey-hotei@cryptolab.net> wrote:
Nicolas,
> There is a good review on methods to do online random forests here: https://arxiv.org/pdf/1302.4853.pdf
>
> In fact, it turns out that the method of having a "window" of trees is not the best way to do. Usually the trees have to be grown in the same time data arrive, see http://lrs.icg.tugraz.at/pubs/saffari_olcv_09.pdf
>
> Adapting ensembles API to online learning seems hard work. But you can open a PR to discuss it.
Thanks a lot for the papers and info. I'll open a PR at some point and see what happens. ty, isaak
> However, at present, IsolationForest only fits data in batch even while it may be well suited to incremental on-line learning since one could subsample recent history and older estimators can be dropped progressively.
What you describe is quite different from what sklearn models typically do with partial_fit: partial_fit is more about out-of-core / streaming fitting than true online learning with explicit forgetting. In particular, what you suggest would not support calling partial_fit with very small chunks (e.g. tens to a hundred samples at a time), because that would not be enough to develop deep isolation trees and would harm the performance of the resulting isolation forest.

If the problem is true online learning (tracking a stream of training data with expected shifts in its distribution), I think it's better to devise a dedicated API that does not try to mimic the scikit-learn API (for this specific part). There will typically have to be an additional hyperparameter to control how much the model should remember about old samples.

If the problem is more about out-of-core learning, then partial_fit is suitable, but the trees should grow and get reorganized progressively (as pointed out by others in previous comments).

BTW, I would be curious to know more about the kind of anomaly detection problem where you found IsolationForests to work well.

-- Olivier
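One possible reading of Olivier's suggestion of a dedicated online API with an explicit memory hyperparameter: keep one forest per recent window and forget old ones. The class `OnlineAnomalyDetector` and its `update`/`memory` names are hypothetical, just a sketch built on top of IsolationForest:

```python
import collections
import numpy as np
from sklearn.ensemble import IsolationForest

class OnlineAnomalyDetector:
    """Hypothetical online API (deliberately not mimicking scikit-learn's
    partial_fit): one forest per data window, with explicit forgetting.
    `memory` is the hyperparameter controlling how much is remembered."""

    def __init__(self, memory=5, **forest_params):
        self.memory = memory
        self.forest_params = forest_params
        # deque(maxlen=...) drops the oldest forest automatically.
        self.forests_ = collections.deque(maxlen=memory)

    def update(self, X):
        # Fit a fresh forest on the new window; old windows age out.
        self.forests_.append(IsolationForest(**self.forest_params).fit(X))
        return self

    def score_samples(self, X):
        # Average anomaly scores over the remembered windows.
        return np.mean([f.score_samples(X) for f in self.forests_], axis=0)
```

Each window must still be large enough to grow deep trees, which matches Olivier's point about very small chunks; `memory` then gives direct, explicit control over forgetting instead of hiding it behind partial_fit semantics.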
Hi Olivier, thanks for your response.
> What you describe is quite different from what sklearn models typically do with partial_fit. partial_fit is more about out-of-core / streaming fitting rather than true online learning with explicit forgetting.
>
> In particular what you suggest would not accept calling partial_fit with very small chunks (e.g. from tens to a hundred samples at a time) because that would not be enough to develop deep isolation trees and would harm the performance of the resulting isolation forest.
I see; I suppose I should check how the depth of these trees changes when fitting on small chunks as opposed to large ones. Either way, refreshing on at least 1000 samples has proven to work OK here in the face of concept drift.
> If the problem is true online learning (tracking a stream of training data with expected shifts in its distribution) I think it's better to devise a dedicated API that does not try to mimic the scikit-learn API (for this specific part). There will typically have to be an additional hyperparameter to control how much the model should remember about old samples.
OK, I've been using a parameter called 'n_more_estimators' that decides how many trees are dropped/added; maybe it is not the best way.
> If the problem is more about out-of-core, then partial_fit is suitable but the trees should grow and get reorganized progressively (as pointed by others in previous comments).
Maybe a name like "online_fit" would be more appropriate? It would be nice to know what exactly is meant by "reorganized"; so far I've merely been dropping the oldest trees.
> BTW, I would be curious to know more about the kind of anomaly detection problem where you found IsolationForests to work well.
The problem is intrusion detection at the application layer; features are parsed from HTTP audit logs. ty
Participants (6):
- Andreas Mueller
- Arthur Mensch
- Dale T Smith
- donkey-hotei@cryptolab.net
- Nicolas Goix
- Olivier Grisel