[Numpy-discussion] performance matrix multiplication vs. matlab

Mon Jun 8 07:07:38 EDT 2009

2009/6/8 Gael Varoquaux <gael.varoquaux at normalesup.org>:
> On Mon, Jun 08, 2009 at 08:58:29AM +0200, Matthieu Brucher wrote:
>> Given the number of PCs, I think you may just be measuring noise.
>> As said in several manifold reduction publications (such as the ones by
>> Torbjorn Vik who published on robust PCA for medical imaging), you
>> cannot expect to have more than 4 or 5 meaningful PCs, due to the
>> dimensionality curse. If you want 50 PCs, you have to have at least...
>> 10^50 samples, which is quite a lot, let's say it this way.
>> According to the literature, a usual manifold can be described by 4
>> or 5 variables. If you have more, you may be violating
>> your hypothesis, here the linearity of your data (and as it is medical
>> imaging, you know from the beginning that this hypothesis is wrong).
>> So if you really want to find something meaningful and/or physical,
>> you should use a real dimensionality reduction, preferably a
>> non-linear one.
>
> I am not sure I am following you: I have time-varying signals. I am not
> taking a shot of the same process over and over again. My intuition tells
> me that I have more than 5 meaningful patterns.

How many samples do you have? 10000? a million? a billion? The problem
with 50 PCs is that your search space is mostly empty, "thanks" to the
curse of dimensionality. This means that you should *not* try to read
meaning into the 10th and following PCs.
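
To make the emptiness argument concrete, here is a minimal sketch in plain
Python. It assumes each axis is cut into 10 bins, an illustrative choice that
reproduces the 10^50 figure quoted above; the function names are hypothetical
and do not come from any library.

```python
def n_cells(n_dims, bins_per_axis=10):
    """Number of grid cells when each of n_dims axes is cut into bins.

    bins_per_axis=10 is an assumption made for illustration only.
    """
    return bins_per_axis ** n_dims


def max_occupied_fraction(n_samples, n_dims, bins_per_axis=10):
    """Upper bound on the fraction of cells that can hold at least one sample.

    Each sample falls into exactly one cell, so at most n_samples cells
    are occupied, whatever the distribution of the data.
    """
    return min(1.0, n_samples / n_cells(n_dims, bins_per_axis))


if __name__ == "__main__":
    # One million samples: a dense grid in 5 dimensions, almost empty in 50.
    for n_dims in (2, 5, 10, 50):
        print(n_dims, max_occupied_fraction(10**6, n_dims))
```

With a million samples the bound is 1.0 for 5 dimensions, but for 50
dimensions nearly every cell is guaranteed to be empty.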

> Anyhow, I do some more analysis behind that (ICA actually), and I do find
> more than 5 patterns of interest that are not noise.

ICA suffers from the same problems as PCA. And I'm not even talking
about the linearity hypothesis that is never respected.

> So maybe I should be using some non-linear dimensionality reduction, but
> what I am doing works, and I can write a generative model of it. Most
> importantly, it is actually quite computationally simple.

Thanks to linearity ;)
The problem is that you will have a lot of confounds this way (your 50
PCs can in fact be the effect of 5 variables that are nonlinear).
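
As an illustration of this kind of confound, the sketch below builds 5
features that are all nonlinear functions of one hidden variable t. The
choice of sin(t), sin(2t), ..., sin(5t) sampled on a regular grid is a
hypothetical example, picked because the resulting covariance matrix is
diagonal, so its eigenvalues (the PC variances) can be read off the diagonal
without an eigensolver. It is plain Python, not the analysis discussed in
this thread.

```python
import math


def covariance_matrix(rows):
    """Population covariance matrix of a list of equal-length samples."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]


# One hidden variable t, sampled regularly over a full period.
n_samples = 1000
n_features = 5
ts = [2 * math.pi * i / n_samples for i in range(n_samples)]

# Five observed features, each a nonlinear function of the same t.
data = [[math.sin((k + 1) * t) for k in range(n_features)] for t in ts]

cov = covariance_matrix(data)

# The features are mutually uncorrelated, so cov is (numerically) diagonal
# and the diagonal entries are the variances of the principal components.
variances = [cov[k][k] for k in range(n_features)]
explained_ratio = [v / sum(variances) for v in variances]

if __name__ == "__main__":
    print(explained_ratio)
```

Each component carries the same share of the variance, so a linear method
reports 5 equally important components although a single variable generated
the data.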

> However, if you can point me to methods that you believe are better (and
> tell me why you believe so), I am all ears.

My thesis was on nonlinear dimensionality reduction (this is why I
believe so, especially in the medical imaging field), but it always
needs some adaptation. It depends on what you want to do, the time you
can use to process data, ... Suffice it to say we started with PCA some
years ago and we were switching to nonlinear reduction because of the
emptiness of the search space and because of the nonlinearity of the
brain space (no idea what my former lab is doing now, but it is used
for DTI at least).
You should check some books on it, and you surely have to read
something about the curse of dimensionality (at least if you want to
get published, as people know about this issue in the medical field),
even if you do not use nonlinear techniques.

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher