[Numpy-discussion] very large matrices.

val val at vtek.com
Mon May 14 21:16:06 EDT 2007


Dave,
    I may be totally wrong, but I have an intuitive feeling that
your problem could be reformulated with a focus on separating a "basic"
physical (vs. mathematical) 'core' from the terms which depend on a
reasonable "small parameter".  In other words, my point is
to build a simplified model problem from the physically important
components, so that the *core* problem involves vectors of only
100-500 components (that is, to build the "physical" principal
components first 'manually', based on common/physical sense, guesses,
and heuristics).
    To me, vectors and matrices of 5-10K components or more are
a sign of a suboptimal problem formulation and of a need to reformulate
it in "more physical" terms - *if* that means something significant in your
concrete situation.  For instance, noise reduction can be done at the
problem-formulation level first, by introducing "regularizing" terms such
as artificial friction, viscosity, coupling, etc., which tend to have
a dramatic regularizing effect, stabilizing the problem and thus
making it easier to interpret in physical (vs. purely mathematical) terms.
    The Monte Carlo technique is also an option for this type of
problem.  Of course, more physical details would be helpful for
understanding your problem better.
    good luck,
val
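
As a purely illustrative sketch of the "regularizing term" idea above - adding
a small artificial damping (here just a multiple of the identity, Tikhonov-style)
to stabilize an operator before decomposing it; the operator, its size and eps
are all made up:

    import numpy as np

    # Made-up symmetric "operator" standing in for the reduced physical problem.
    n = 500
    A = np.random.randn(n, n)
    A = 0.5 * (A + A.T)              # symmetrize

    eps = 1e-3                       # strength of the artificial friction/viscosity term
    A_reg = A + eps * np.eye(n)      # Tikhonov-style regularization

    w, v = np.linalg.eigh(A_reg)     # eigendecomposition of the stabilized operator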

----- Original Message ----- 
From: "Dave P. Novakovic" <davidnovakovic at gmail.com>
To: "Discussion of Numerical Python" <numpy-discussion at scipy.org>
Sent: Sunday, May 13, 2007 2:46 AM
Subject: Re: [Numpy-discussion] very large matrices.


> They are very large numbers indeed. Thanks for giving me a wake-up call.
> Currently my data is represented as vectors in a vectorset, a typical
> sparse representation.
>
> I reduced the problem significantly by removing lots of noise. I'm
> basically recording traces of a term's occurrence throughout a corpus
> and doing an analysis of the eigenvectors.
>
> I reduced my matrix to 4863 x 4863 by filtering the original corpus.
> Now when I attempt the svd, I'm hitting a MemoryError in the svd routine.
> Is there a hard upper limit on the size of a matrix for these
> calculations?
>
>  File "/usr/lib/python2.4/site-packages/numpy/linalg/linalg.py", line
> 575, in svd
>    vt = zeros((n, nvt), t)
> MemoryError
>
> Cheers
>
> Dave
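
As far as I know, numpy.linalg.svd has no hard upper limit of its own; the
practical limit is the memory needed for the input plus the dense U, s and Vt
factors, LAPACK workspace and temporary copies (and, on a 32-bit Python, the
~2 GB address space can run out well before physical RAM does). A rough,
purely illustrative estimate for the 4863 x 4863 case:

    # Rough memory footprint of a full dense SVD of an n x n float64 matrix:
    # the input A plus the factors U (n x n), s (n,) and Vt (n x n), before
    # LAPACK workspace and temporary copies are counted.
    n = 4863
    bytes_needed = 8 * (3 * n * n + n)
    print(bytes_needed / 2.0**30)    # ~0.53 GiB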
>
>
> On 5/13/07, Anne Archibald <peridot.faceted at gmail.com> wrote:
>> On 12/05/07, Dave P. Novakovic <davidnovakovic at gmail.com> wrote:
>>
>> > Core 2 Duo with 4 GB RAM.
>> >
>> > I've heard about iterative svd functions. I actually need a complete
>> > svd, with all eigenvalues (not LSI). I'm actually more interested in
>> > the individual eigenvectors.
>> >
>> > As an example, a single row could probably have about 3000 non-zero
>> > elements.
>>
>> I think you need to think hard about whether your problem can be done
>> in another way.
>>
>> First of all, the singular values (as returned from the svd) are not
>> eigenvalues - eigenvalue decomposition is a much harder problem,
>> numerically.
>>
>> Second, your full non-sparse matrix will be 8*75000*75000 bytes, or
>> about 42 gibibytes. Put another way, the representation of your data
>> alone is ten times the size of the RAM on the machine you're using.
>>
>> Third, your matrix has 225 000 000 nonzero entries; assuming a perfect
>> sparse representation with no extra bytes (at least two bytes per
>> entry is typical, usually more), that's 1.7 GiB.
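
(As a quick, purely illustrative check of the arithmetic above:)

    # Back-of-the-envelope sizes for a 75000 x 75000 float64 matrix,
    # dense vs. an idealized sparse representation with ~3000 non-zeros per row.
    n = 75000
    nnz = n * 3000
    print(8.0 * n * n / 2**30)       # dense: ~41.9 GiB
    print(8.0 * nnz / 2**30)         # sparse, values only: ~1.68 GiB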
>>
>> Recall that basically any matrix operation is at least O(N^3), so you
>> can expect order 10^14 floating-point operations to be required. This
>> is actually the *least* significant constraint; pushing stuff into and
>> out of disk caches will be taking most of your time.
>>
>> Even if you can represent your matrix sparsely (using only a couple of
>> gibibytes), you've said you want the full set of eigenvectors, which
>> is not likely to be a sparse matrix - so your result is back up to 42
>> GiB. And you should expect an eigenvalue algorithm, if it even
>> survives massive roundoff problems, to require something like that
>> much working space; thus your problem probably has a working size of
>> something like 84 GiB.
>>
>> SVD is a little easier, if that's what you want, but the full solution
>> is twice as large, though if you discard entries corresponding to
>> small values it might be quite reasonable. You'll still need some
>> fairly specialized code, though. Which form are you looking for?
>>
>> Solving your problem in a reasonable amount of time, as described and
>> on the hardware you specify, is going to require some very specialized
>> algorithms; you could try looking for an out-of-core eigenvalue
>> package, but I'd first look to see if there's any way you can simplify
>> your problem - getting just one eigenvector, maybe.
>>
>> Anne
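
For what it's worth, a minimal sketch of the kind of truncated, sparse approach
Anne describes (keeping only the largest singular values), using the ARPACK-based
scipy.sparse.linalg.svds that appears in later SciPy releases; the matrix here is
random and scaled down so the example actually runs:

    import scipy.sparse as sp
    from scipy.sparse.linalg import svds   # ARPACK-based truncated SVD

    # Stand-in for the real term matrix: random, sparse, much smaller than 75000 x 75000.
    A = sp.rand(5000, 5000, density=0.04, format='csr')

    # Compute only the k largest singular triplets instead of the full decomposition;
    # U is (5000, k), s is (k,) and Vt is (k, 5000), so the result stays small.
    k = 10
    U, s, Vt = svds(A, k=k)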