[SciPy-Dev] [gpaw-users] wrapper for Scalapack
Ralf Gommers
ralf.gommers at gmail.com
Sun Oct 29 06:18:42 EDT 2017
On Fri, Oct 6, 2017 at 10:37 PM, Anne Archibald <peridot.faceted at gmail.com>
wrote:
> On Thu, Oct 5, 2017 at 8:36 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>> On Fri, Oct 6, 2017 at 7:05 AM, <josef.pktd at gmail.com> wrote:
>>
>>>
>>> But does distributed computing stay out of scope for SciPy after 1.0?
>>> As a long term plan towards 2.0?
>>>
>>
>> Such changes are worth discussing once in a while; it usually sharpens
>> the focus. :)
>>
>> My first thoughts:
>> - traditional stuff like MPI, BLACS, ScaLAPACK will likely always remain
>> out of scope
>> - we can consider new dependencies, but only if they do not make it
>> harder to install SciPy
>> - a few more likely changes would be to start allowing/supporting pandas
>> data frames as inputs, broader use of simple (optional) parallelization
>> with joblib or threading, and using dask under the hood.
>>
>
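(To make the joblib point in that last bullet concrete: what I have in mind
is the usual Parallel/delayed pattern, a minimal sketch and nothing
scipy-specific:)

    from math import sqrt
    from joblib import Parallel, delayed

    # Run sqrt over the inputs with 4 worker processes; joblib takes care
    # of the worker pool and of pickling the arguments behind the scenes.
    results = Parallel(n_jobs=4)(delayed(sqrt)(i ** 2) for i in range(10))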
Interesting thoughts, thanks Anne. This thread is a bit stale by now, but I
still wanted to record my thoughts on the topic.
> There seems to be a profusion of tools for parallelization, so choosing
> just one to use as a basis for scipy's parallelization could be really
> frustrating for users who have a reason to need a different one.
>
You're thinking here about the relatively small fraction of power users
who would care (compared to the users who just want trivial
n_jobs=<number> parallelization), and my first thought is that addressing
that use case comes with costs that may not be worth the effort.
> The exception, I would say, is the concurrent.futures interface.
>
In terms of user friendliness, I'd say concurrent.futures is pretty poor.
This:

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        some_function(..., pool=executor.map)

is much worse than:

    some_function(..., n_jobs=4)
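(For completeness, here is what those two calls look like in a runnable toy
example; `some_function` and `slow_square` are stand-ins I made up, not real
scipy routines:)

    from concurrent import futures

    def slow_square(x):
        return x * x

    def some_function(data, pool=map):
        # ``pool`` is any map-like callable: the built-in map, an
        # Executor.map, emcee's MPIPool.map, and so on.
        return list(pool(slow_square, data))

    # Executor style: flexible, but verbose for the common case.
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        out = some_function(range(8), pool=executor.map)

    # The n_jobs style would instead look like
    #   some_function(range(8), n_jobs=4)
    # with the pool created internally.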
> This is part of python (3), and it allows a limited but manageable and
> useful amount of parallelization. It is also an interface other tools can
> and do implement. For example, emcee is capable of taking advantage of
> parallelization, but that parallelization happens entirely in one place: a
> map is done to compute log-probabilities for a list of candidates. emcee is
> agnostic about how this map works; by default it can use python's built-in
> map, but emcee provides an "MPIPool" object that supplies a parallel map
> that uses MPI, python's ThreadPoolExecutor and ProcessPoolExecutor also
> provide such a parallel map, and (for example) dask provides an Executor
> interface that allows such a map across a collection of dask instances.
>
> So: I think it would be good if scipy could incorporate the use of
> Executors to achieve parallelism where that's available from the underlying
> algorithms. From the user's point of view, this just means one or two more
> optional arguments, in particular a "pool" argument from which futures are
> generated.
>
It'd have to be two I think, like in emcee which has `threads=1, pool=None`.
I'd say `threads` (or `n_jobs` as in scikit-learn and spatial.cKDTree) is
the must-have if we go for more parallel support and `pool` is the power
user one that would be a tradeoff. It would enable users of MPI, dask,
etc., but on the other hand it makes the API more verbose, is more work to
support, and a lot harder to test. IIRC joblib also does a lot of work to
make ndarrays (and other objects?) pickleable to make parallelization work
for a wider range of functions than plain multiprocessing. So I'm undecided
on whether a `pool` keyword would make sense in scipy.
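(For concreteness, supporting both keywords inside a routine could look
roughly like the sketch below; `n_jobs` and `pool` are hypothetical
parameter names here, not existing scipy API:)

    from concurrent import futures

    def _evaluate(func, points, n_jobs=1, pool=None):
        # Power-user path: a user-supplied map-like callable
        # (an MPI pool, a dask client's map, an Executor.map, ...),
        # emcee-style.
        if pool is not None:
            return list(pool(func, points))
        # Simple path: create a process pool internally; note that func
        # and its arguments must then be picklable.
        if n_jobs > 1:
            with futures.ProcessPoolExecutor(max_workers=n_jobs) as ex:
                return list(ex.map(func, points))
        # Default: plain serial evaluation.
        return [func(p) for p in points]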
> In turn, it might make sense to implement a few new algorithms that can use
> parallelism effectively. The global optimizers spring to mind as candidates
> for this process, but in fact any local optimizer that needs a gradient but
> has to compute it numerically can probably benefit from computing the
> derivative in parallel.
>
Clustering functions would be another good candidate.
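(On the quoted point about computing numerical gradients in parallel: a
forward-difference gradient is an embarrassingly parallel loop over
coordinates, so it fits the same map/pool interface. A minimal sketch, with
`num_grad` being a made-up helper rather than anything in scipy:)

    import numpy as np

    def num_grad(f, x, eps=1e-8, pool=map):
        # Each perturbed point is independent, so the function
        # evaluations can be farmed out through any map-like ``pool``.
        x = np.asarray(x, dtype=float)
        f0 = f(x)
        steps = [x + eps * e for e in np.eye(x.size)]
        fvals = list(pool(f, steps))
        return np.array([(fi - f0) / eps for fi in fvals])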
> This sort of opportunistic parallelization is no substitute for something
> like Scalapack or PaGMO, dedicated distributed computing algorithms, but it
> is a way for scipy to allow easy parallelization where possible.
>
Agreed
Ralf