[SciPy-Dev] [gpaw-users] wrapper for Scalapack
Anne Archibald
peridot.faceted at gmail.com
Fri Oct 6 05:37:30 EDT 2017
On Thu, Oct 5, 2017 at 8:36 PM Ralf Gommers <ralf.gommers at gmail.com> wrote:
> On Fri, Oct 6, 2017 at 7:05 AM, <josef.pktd at gmail.com> wrote:
>
>>
>> But does distributed computing stay out of scope for SciPy after 1.0?
>> As a long term plan towards 2.0?
>>
>
> Such changes are worth discussing once in a while; it usually
> sharpens the focus :)
>
> My first thoughts:
> - traditional stuff like MPI, BLACS, ScaLAPACK will likely always remain
> out of scope
> - we can consider new dependencies, but only if they do not make it harder
> to install SciPy
> - a few more likely changes would be to start allowing/supporting pandas
> data frames as inputs, broader use of simple (optional) parallelization
> with joblib or threading, and using dask under the hood.
>
There seems to be a profusion of tools for parallelization, so choosing
just one as the basis for scipy's parallelism could be really
frustrating for users who have a reason to need a different one.
The exception, I would say, is the concurrent.futures interface. It is
part of python 3's standard library, and it allows a limited but
manageable and useful amount of parallelization. It is also an
interface that other tools can and do implement. For example, emcee is
capable of taking advantage of parallelization, but that
parallelization happens entirely in one place: a map is done to compute
log-probabilities for a list of candidates. emcee is agnostic about how
this map works. By default it uses python's built-in map, but emcee
provides an "MPIPool" object that supplies a parallel map over MPI;
python's ThreadPoolExecutor and ProcessPoolExecutor also provide such a
parallel map; and (for example) dask's distributed scheduler provides
an Executor interface that allows the same map to run across a
collection of dask workers.
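To make that concrete, here is roughly what the pattern looks like (a
minimal sketch, not emcee's actual code; the function names are made
up):

    from concurrent.futures import ProcessPoolExecutor

    def log_prob(theta):
        # Stand-in for an expensive per-candidate computation.
        return -0.5 * sum(t * t for t in theta)

    def evaluate_candidates(candidates, pool=None):
        # Anything with a map(func, iterable) method works here: the
        # built-in map, multiprocessing.Pool, emcee's MPIPool, a
        # concurrent.futures Executor, or dask's distributed executor.
        mapper = map if pool is None else pool.map
        return list(mapper(log_prob, candidates))

    if __name__ == "__main__":
        candidates = [(0.1, 0.2), (1.0, -1.0), (2.0, 0.5)]
        print(evaluate_candidates(candidates))  # serial, built-in map
        with ProcessPoolExecutor() as pool:     # parallel, same call
            print(evaluate_candidates(candidates, pool=pool))

The point is that the routine never needs to know which kind of pool it
was handed.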
So: I think it would be good if scipy could incorporate the use of
Executors to achieve parallelism where the underlying algorithms allow
it. From the user's point of view, this just means one or two extra
optional arguments, in particular a "pool" argument from which futures
are generated. In turn, it might make sense to implement a few new
algorithms that can use parallelism effectively. The global optimizers
spring to mind as candidates, but in fact any local optimizer that
needs a gradient and has to compute it numerically can probably benefit
from evaluating the derivative in parallel.
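For the gradient case, a sketch of what I have in mind (the names, the
forward-difference step, and the eps default are just for illustration,
not an existing scipy API):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def numerical_gradient(f, x, pool=None, eps=1e-8):
        # Forward differences: one evaluation at x plus one per
        # coordinate, all dispatched through the pool's map at once.
        x = np.asarray(x, dtype=float)
        points = [x] + [x + eps * e for e in np.eye(len(x))]
        mapper = map if pool is None else pool.map
        f0, *rest = list(mapper(f, points))
        return np.array([(fi - f0) / eps for fi in rest])

    def rosenbrock(x):
        return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                            + (1.0 - x[:-1]) ** 2))

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            print(numerical_gradient(rosenbrock,
                                     np.array([1.2, 1.0, 0.8]),
                                     pool=pool))

An optimizer that accepted such a pool would get the extra function
evaluations per step essentially for free on a multi-core machine.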
This sort of opportunistic parallelization is no substitute for
dedicated distributed-computing tools like ScaLAPACK or PaGMO, but it
is a way for scipy to allow easy parallelization where possible.
Anne