Large Two Dimensional Array
Denis McMahon
denismfmcmahon at gmail.com
Wed Jan 29 11:32:19 EST 2014
On Tue, 28 Jan 2014 21:25:54 -0800, Ayushi Dalmia wrote:
> Hello,
>
> I am trying to implement IBM Model 1. In that I need to create a matrix
> of 50000*50000 with double values. Currently I am using dict of dict but
> it is unable to support such high dimensions and hence gives memory
> error. Any help in this regard will be useful. I understand that I
> cannot store the matrix in the RAM but what is the most efficient way to
> do this?
This looks to me like a table with columns:
word1 (varchar 20) | word2 (varchar 20) | connection (double)
might be your best solution, but it's going to be a huge table (50,000 x
50,000 = 2.5 billion rows).
The primary key is going to be the combination of all 3 columns (or
possibly the combination of word1 and word2) and you want indexes on
word1 and word2, which will slow down populating the table, but speed up
searching it, and I assume that searching is going to be a much more
frequent operation than populating.
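As a sketch of that schema using Python's sqlite3 module (the table and
column names here are just illustrative, and any SQL database would do):

```python
import sqlite3

# In-memory database for illustration; use a filename for persistence.
conn = sqlite3.connect(":memory:")

# word1/word2 identify a cell of the matrix; SQLite treats VARCHAR(20)
# as advisory, but it documents the intended width.
conn.execute("""
    CREATE TABLE connection (
        word1 VARCHAR(20) NOT NULL,
        word2 VARCHAR(20) NOT NULL,
        connection DOUBLE NOT NULL,
        PRIMARY KEY (word1, word2)
    )
""")

# The primary key provides an index on (word1, word2), which also covers
# lookups by word1 alone; add a separate index for lookups by word2.
conn.execute("CREATE INDEX idx_word2 ON connection (word2)")

# Populate a few rows, then look one up.
rows = [("house", "maison", 0.83), ("house", "foyer", 0.12)]
conn.executemany("INSERT INTO connection VALUES (?, ?, ?)", rows)
conn.commit()

(value,) = conn.execute(
    "SELECT connection FROM connection WHERE word1=? AND word2=?",
    ("house", "maison")).fetchone()
print(value)  # 0.83
```

Only cells you actually store take up space, so if the matrix is sparse
you never pay for the full 2.5 billion entries.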
Also, creating a database has the additional advantage that next time you
want to use the program for a conversion between two languages that
you've previously built the data for, the data already exists in the
database, so you don't need to build it again.
I imagine you would have either one table for each language pair, or one
table for each conversion (treating a->b and b->a as two separate
conversions).
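With one table per conversion, a typical query is "most likely
translation of this word". A sketch, assuming a hypothetical en_fr table
for English -> French:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-conversion table: English -> French.
conn.execute("""
    CREATE TABLE en_fr (
        word1 VARCHAR(20) NOT NULL,
        word2 VARCHAR(20) NOT NULL,
        connection DOUBLE NOT NULL,
        PRIMARY KEY (word1, word2)
    )
""")
conn.executemany("INSERT INTO en_fr VALUES (?, ?, ?)",
                 [("dog", "chien", 0.91), ("dog", "canard", 0.02)])

# Best translation of "dog": the highest connection value wins.
word2, score = conn.execute(
    "SELECT word2, connection FROM en_fr "
    "WHERE word1=? ORDER BY connection DESC LIMIT 1",
    ("dog",)).fetchone()
print(word2, score)  # chien 0.91
```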
I'm also guessing that varchar 20 is long enough to hold any of your
50,000 words in either language; otherwise that value might need
adjusting.
--
Denis McMahon, denismfmcmahon at gmail.com