[Numpy-discussion] performance matrix multiplication vs. matlab

Matthieu Brucher matthieu.brucher at gmail.com
Mon Jun 8 02:58:29 EDT 2009

2009/6/8 Gael Varoquaux <gael.varoquaux at normalesup.org>:
> On Mon, Jun 08, 2009 at 12:29:08AM -0400, David Warde-Farley wrote:
>> On 7-Jun-09, at 6:12 AM, Gael Varoquaux wrote:
>> > Well, I do bootstrapping of PCAs, that is SVDs. I can tell you, it
>> > makes a big difference, especially since I have 8 cores.
>> Just curious Gael: how many PC's are you retaining? Have you tried
>> iterative methods (i.e. the EM algorithm for PCA)?
> I am using the heuristic described in
> http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4562996
> We have very noisy and long time series. My experience is that most
> model-based heuristics for choosing the number of PCs retained give us
> way too much on this problem (they simply keep diverging if I add noise
> at the end of the time series). The algorithm we use gives us ~50
> interesting PCs (each composed of 50 000 dimensions). That happens to be
> quite right based on our experience with the signal. However, being
> fairly new to statistics, I am not aware of the EM algorithm that you
> mention. I'd be interested in a reference, to see if I can use that
> algorithm. The PCA bootstrap is time-consuming.
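
On the EM algorithm for PCA mentioned above: the following is only a
minimal sketch of the iteration as I understand it (the formulation
usually attributed to Sam Roweis), not anyone's production code. The
function name em_pca, the (d, n) data layout, the fixed iteration count
and the random initialisation are all assumptions made for illustration.

```python
import numpy as np

def em_pca(Y, k, n_iter=200, seed=0):
    """Sketch of EM for PCA (hypothetical helper, not a library function).

    Y : (d, n) array, one sample per column.
    k : number of components to retain.
    Returns an orthonormal (d, k) basis ordered by decreasing variance,
    and the singular values of the centered data projected on that basis.
    """
    rng = np.random.default_rng(seed)
    Y = Y - Y.mean(axis=1, keepdims=True)   # center each dimension
    d, n = Y.shape
    C = rng.standard_normal((d, k))         # random initial basis (assumption)
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: new basis C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # C only spans the principal subspace: orthonormalize it, then
    # rotate to the principal axes with a small k x n SVD.
    Q, _ = np.linalg.qr(C)
    U, s, _ = np.linalg.svd(Q.T @ Y, full_matrices=False)
    return Q @ U, s
```

Each iteration costs on the order of d*n*k operations and never forms a
d x d covariance matrix, which is the appeal when d is around 50 000 and
only a few dozen components are wanted.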


Given the number of PCs, I think you may just be measuring noise.
As noted in several publications on manifold reduction (such as those
by Torbjorn Vik, who published on robust PCA for medical imaging), you
cannot expect to have more than 4 or 5 meaningful PCs, because of the
curse of dimensionality. If you want 50 PCs, you need at least...
10^50 samples, which is quite a lot, to put it mildly.
According to the literature, a typical manifold can be described by 4
or 5 variables. If you find more, you are probably violating one of
your hypotheses, here the linearity of your data (and since this is
medical imaging, you know from the start that this hypothesis is wrong).
So if you really want to find something meaningful and/or physical,
you should use a real dimensionality reduction method, preferably a
non-linear one.
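
To make the 10^50 figure concrete, here is a rough sketch of the usual
grid argument behind it. The choice of 10 sample points per axis is my
assumption for illustration, and grid_samples is a hypothetical helper,
not something from a library.

```python
def grid_samples(points_per_axis, n_dims):
    """Number of samples needed to cover a regular grid with
    points_per_axis points along each of n_dims axes."""
    return points_per_axis ** n_dims

# Sample count grows exponentially with the number of retained PCs.
for n_dims in (1, 2, 5, 50):
    print(n_dims, grid_samples(10, n_dims))
```

With 5 dimensions this is already 100 000 samples; with 50 it is 10^50.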

Just my 2 cents ;)

Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
