[scikit-learn] Optimization algorithms in scikit-learn

Touqir Sajed touqir at ualberta.ca
Tue Sep 4 14:45:09 EDT 2018


Hi Andreas,

Is there a particular reason why there is no general-purpose optimization
module? Most optimizers (at least the first-order methods) are general
purpose, since you only need to feed them the gradient. In some special
cases you probably need a problem-specific formulation for better
performance. The advantage of SVRG is that you don't need to store the
per-sample gradients, which requires storage of order
number_of_weights*number_of_samples and is the main problem with SAG and
SAGA. Thus, for most neural network models (and even many non-NN models),
using SAG and SAGA is infeasible on personal computers.
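To make the memory point concrete, here is a rough NumPy sketch of one SVRG
epoch (the gradient oracle `grad`, the step size, and the function name are
just placeholders of mine, not anything from scikit-learn or the paper). It
only keeps the current weights, the snapshot weights, and one full-gradient
snapshot, so memory stays on the order of number_of_weights:

import numpy as np

def svrg_epoch(w, X, y, grad, lr=0.01, rng=None):
    # One SVRG outer iteration (rough sketch, not scikit-learn API).
    # grad(w, X, y) must return the gradient of the loss at w on the given
    # (mini)batch. Memory use is O(len(w)): only the snapshot weights and
    # the full-gradient snapshot are kept, never a per-sample gradient
    # table as in SAG/SAGA.
    rng = np.random.default_rng() if rng is None else rng
    n_samples = X.shape[0]
    w_snap = w.copy()                # snapshot ("anchor") weights
    mu = grad(w_snap, X, y)          # full gradient at the snapshot
    for _ in range(n_samples):       # inner stochastic loop
        i = rng.integers(n_samples)
        Xi, yi = X[i:i + 1], y[i:i + 1]
        # variance-reduced gradient: unbiased, and its variance shrinks
        # as w approaches the snapshot w_snap
        g = grad(w, Xi, yi) - grad(w_snap, Xi, yi) + mu
        w = w - lr * g
    return w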

SVRG is not popular in the deep learning community, but it should be noted
that SVRG differs from Adam in that it does not adapt the step size. Just to
clarify, SVRG can be faster than Adam because it reduces the variance of the
stochastic gradient, achieving a convergence rate similar to full-batch
methods while staying as computationally cheap per iteration as SGD/Adam.
One can also combine the two methods to obtain an even faster algorithm.
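As a rough illustration of combining the two (again just a sketch of mine,
not an algorithm from the SVRG paper or from scikit-learn), one could feed
the SVRG variance-reduced gradient estimate into an Adam-style adaptive
update instead of a plain SGD step:

import numpy as np

def svrg_adam_epoch(w, X, y, grad, lr=0.001, beta1=0.9, beta2=0.999,
                    eps=1e-8, rng=None):
    # Illustrative combination only: SVRG variance-reduced gradients
    # driving Adam-style adaptive steps. grad(w, X, y) is the same
    # placeholder gradient oracle as in the sketch above.
    rng = np.random.default_rng() if rng is None else rng
    n_samples = X.shape[0]
    w_snap = w.copy()
    mu = grad(w_snap, X, y)                  # full gradient at the snapshot
    m = np.zeros_like(w)                     # Adam first moment
    v = np.zeros_like(w)                     # Adam second moment
    for t in range(1, n_samples + 1):
        i = rng.integers(n_samples)
        Xi, yi = X[i:i + 1], y[i:i + 1]
        g = grad(w, Xi, yi) - grad(w_snap, Xi, yi) + mu   # SVRG estimate
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)       # Adam step
    return w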

Cheers,
Touqir

On Tue, Sep 4, 2018 at 11:46 AM Andreas Mueller <t3kcit at gmail.com> wrote:

> Hi Touqir.
> We don't usually implement general-purpose optimizers in
> scikit-learn, in particular because different optimizers usually
> apply to different kinds of problems.
> For linear models we have SAG and SAGA; for neural nets we have Adam.
> I don't think the authors claim to be faster than SAG, so I'm not sure
> what the motivation would be for using their method.
>
> Best,
> Andy
>
>
> On 09/04/2018 12:55 PM, Touqir Sajed wrote:
>
> Hi,
>
> I have been looking for stochastic optimization algorithms in scikit-learn
> that are faster than SGD, and so far I have come across Adam and momentum.
> Are there other methods implemented in scikit-learn? In particular, the
> variance-reduction methods such as SVRG (
> https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf)?
> These variance-reduction methods are the current state of the art in terms
> of convergence speed while keeping the per-iteration runtime linear in the
> number of features. If they are not implemented yet, I think it would be
> really great to implement them (I am happy to do so), since working on
> large datasets (where L-BFGS may not be practical) is the norm nowadays,
> and the improvements are definitely worth it.
>
> Cheers,
> Touqir
>
> --
> Computing Science Master's student at University of Alberta, Canada,
> specializing in Machine Learning. Website :
> https://ca.linkedin.com/in/touqir-sajed-6a95b1126
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


-- 
Computing Science Master's student at University of Alberta, Canada,
specializing in Machine Learning. Website:
https://ca.linkedin.com/in/touqir-sajed-6a95b1126

