[scikit-learn] Dimension Reduction - MDS

Brown J.B. jbbrown at kuhp.kyoto-u.ac.jp
Thu Oct 11 10:30:37 EDT 2018


Hi Guillaume,

The good news is that your script works as-is on smaller datasets, and
hopefully implements the logic for your task correctly.

In addition to Alex's comment about data size and MDS tractability, I would
also point out a philosophical issue -- why consider MDS for such a large
dataset?
At least in two dimensions, once MDS gets beyond 1000 samples or so, the
resulting sample coordinates and their visualization can become highly
dispersed (e.g., resembling a 2D uniform distribution) and may not be
interpretable.
One can move to three-dimensional MDS, but even then a few thousand
samples is probably the limit of graphical interpretability.
It obviously depends on the relationships in your data.

Also, as you continue your work, keep in mind that the per-sample
dimensionality (the number of entries in a single sample's descriptor
vector) is not the primary determinant of MDS's memory consumption.
In any case you must compute (either inline or as a precomputation) the
distance matrix between each pair of samples, and that matrix stays in
memory during coordinate generation (as far as I know).
So 10 chemical descriptors (since I noticed you mentioning Dragon) or 1000
descriptors will still result in the same memory requirement for the
distance matrix, and scaling to hundreds of thousands of samples will
eat all of the compute node's RAM.
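
As a quick back-of-the-envelope check (the 200k figure below is the dataset
scale you mention later, and 8 bytes per entry assumes float64, which is
what a dense NumPy distance matrix would typically use):

```python
# Rough memory cost of a dense pairwise-distance matrix.
# 200,000 samples is the dataset scale mentioned in this thread;
# 8 bytes per entry assumes float64.
n_samples = 200_000
bytes_per_entry = 8
matrix_bytes = n_samples ** 2 * bytes_per_entry
print(f"{matrix_bytes / 1024**3:.0f} GiB")  # far beyond a 64 GB workstation
```

Note that the per-sample dimensionality never appears in this calculation,
which is exactly why adding or removing descriptors does not change the
distance-matrix footprint.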

Since you have 200k samples, you could potentially do some type of repeated
partial clustering (e.g., on random subsamples of the data) to find a
reasonable number of clusters per repetition, use those results to estimate
a cluster count for a global clustering, and then select a limited number
of samples per cluster to project into a coordinate space by MDS.
Alternatively, a diversity selection (by vector distance or, in your case,
by differing compound scaffolds) may be a quick way to obtain a subset and
visualize distance relationships.
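
A minimal sketch of that cluster-then-project idea might look like the
following (the subsample size of 5,000 and the cluster count of 50 are
purely illustrative choices, not recommendations, and the random array
stands in for real descriptor vectors):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 10))  # stand-in for real descriptor vectors

# Cluster a random subsample instead of the full dataset.
idx = rng.choice(len(X), size=5_000, replace=False)
km = MiniBatchKMeans(n_clusters=50, random_state=0).fit(X[idx])

# Project only the 50 cluster centers -- a 50x50 distance matrix is
# trivial for MDS, unlike the full 200k x 200k one.
coords = MDS(n_components=2, random_state=0).fit_transform(km.cluster_centers_)
print(coords.shape)  # (50, 2)
```

In practice you would pick representatives from the real samples (e.g., the
point nearest each center) rather than the centers themselves, so that the
projected points correspond to actual compounds.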

Hope this helps.

Sincerely,
J.B. Brown

On Thu, Oct 11, 2018 at 20:14, Alexandre Gramfort <alexandre.gramfort at inria.fr> wrote:

> hi Guillaume,
>
> you cannot use our MDS solver at this scale. Even if you fit it in RAM,
> it will be slow.
>
> I would play with https://github.com/lmcinnes/umap unless you really
> want a classic MDS.
>
> Alex
>
> On Thu, Oct 11, 2018 at 10:31 AM Guillaume Favelier
> <Guillaume.Favelier at lip6.fr> wrote:
> >
> > Hello J.B,
> >
> > Thank you for your quick reply.
> >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> > > code employing MDS work?
> > > As you increase the number of samples, does the script continue to
> > > work?
> > So I tried the same script while increasing the number of samples (100,
> > 1000 and 10000), and it indeed works without swapping on my workstation.
> >
> > > That is roughly 4,900,000,000 entries, plus overhead for a data structure.
> > I thought that even 4.9 billion double-precision entries (about 39 GB)
> > could be processed with 64G of RAM. Is there something to configure to
> > allow this computation?
> >
> > The typical datasets I use can have around 200-300k rows with a few
> > columns (usually up to 3).
> >
> > Best regards,
> >
> > Guillaume
> >
> > Quoting "Brown J.B. via scikit-learn" <scikit-learn at python.org>:
> >
> > > Hello Guillaume,
> > >
> > > You are computing a distance matrix of shape 70000x70000 to generate
> > > MDS coordinates.
> > > That is roughly 4,900,000,000 entries, plus overhead for a data structure.
> > >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> > > code employing MDS work?
> > > As you increase the number of samples, does the script continue to
> > > work?
> > >
> > > Hope this helps you get started.
> > > J.B.
> > >
> > > On Tue, Oct 9, 2018 at 18:22, Guillaume Favelier <Guillaume.Favelier at lip6.fr> wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> I'm trying to use a dimension reduction algorithm [1] on my dataset
> > >> [2] in a Python script [3], but for some reason Python consumes a lot
> > >> of main memory and even swaps on my configuration [4], so instead of
> > >> the expected result I get a memory error.
> > >>
> > >> I have the impression that this behaviour is not intended, so could
> > >> you please help me figure out what I did wrong or missed?
> > >>
> > >> [1]: MDS -
> > >> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
> > >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z)
> > >> [3]: dragon.py - 10 lines
> > >> [4]: dragon_swap.png - htop on my workstation
> > >>
> > >> TAR archive:
> > >> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn
> > >>
> > >> Best regards,
> > >>
> > >> Guillaume Favelier
> > >>
> > >> _______________________________________________
> > >> scikit-learn mailing list
> > >> scikit-learn at python.org
> > >> https://mail.python.org/mailman/listinfo/scikit-learn
> > >>
> >
> >
> >
>

