Cannot generate very large very sparse random matrix
Hi,

I've just noticed that it is not possible to generate very large, very sparse random matrices with scipy.sparse.random(). For example, scipy.sparse.random(1_000_000, 1_000_000, density=1e-11) should create a sparse matrix with only 10 non-zero values, but instead it triggers a MemoryError:

----
MemoryError                               Traceback (most recent call last)
<ipython-input-8-eb81d3aec480> in <module>
----> 1 scipy.sparse.random(1_000_000, 1_000_000, density = 1e-11)

~/miniconda3/envs/lap/lib/python3.8/site-packages/scipy/sparse/construct.py in random(m, n, density, format, dtype, random_state, data_rvs)
    787         data_rvs = partial(random_state.uniform, 0., 1.)
    788
--> 789     ind = random_state.choice(mn, size=k, replace=False)
    790
    791     j = np.floor(ind * 1. / m).astype(tp, copy=False)

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

mtrand.pyx in numpy.random.mtrand.RandomState.permutation()

MemoryError: Unable to allocate 7.28 TiB for an array with shape (1000000000000,) and data type int64
----

Here is the problematic line in the current master branch of SciPy:
https://github.com/scipy/scipy/blob/master/scipy/sparse/construct.py#L806

In short, the issue comes from random_state.choice(..., replace=False), which needs to allocate that enormous array just to pick ten random numbers. I understand the technical difficulty of generating random numbers without replacement, but it is quite counterintuitive that, in order to generate a sparse random matrix, it is first necessary to create an equally large but *dense* vector.

Is there a solution to this problem?

Thanks in advance,

Emanuele

--
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you received this in error, please contact the sender and delete the material.
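The core difficulty described above, drawing k distinct indices from an astronomically large range, does not actually require materializing that range. Below is a minimal rejection-sampling sketch (the helper name is invented for illustration; this is not SciPy's implementation): it draws candidates with replacement and discards duplicates, which is efficient whenever k is much smaller than n.

```python
import numpy as np

def sample_without_replacement(n, k, seed=None):
    """Illustrative helper (not part of NumPy or SciPy): draw k distinct
    integers from range(n) without ever allocating an array of length n.

    Rejection sampling: draw candidates with replacement and discard
    duplicates.  When k << n, collisions are rare, so this usually
    finishes in a single vectorized draw."""
    rng = np.random.default_rng(seed)
    picked = set()
    while len(picked) < k:
        # Draw only as many candidates as are still missing.
        for idx in rng.integers(0, n, size=k - len(picked)):
            picked.add(int(idx))
    return np.fromiter(picked, dtype=np.int64, count=k)

# Ten distinct flat indices for a 1_000_000 x 1_000_000 matrix,
# with no 7.28 TiB allocation.
ind = sample_without_replacement(1_000_000 * 1_000_000, 10)
```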
This is a known issue, see https://github.com/scipy/scipy/issues/9699.

I haven't checked on the status of numpy.random.Generator.choice() in a while, so maybe the issue can be resolved now.

On Wed, Nov 11, 2020 at 6:46 PM Emanuele Olivetti <olivetti@fbk.eu> wrote:
Thank you for your response. Indeed numpy.random.Generator.choice() solves the problem:

----
rng = np.random.default_rng()
rng.choice(1_000_0000_000_000_000, size=10, replace=False)

array([7363643319410659, 1001129358099623, 7384908776761990,
       3610742892883208, 9484192959193500, 6273686405826185,
       1550972534180773, 1845765940909299,  144504113475750,
       7853188631204629])
----

while:

----
np.random.choice(1_000_0000_000_000_000, size=10, replace=False)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-95b556ac15b9> in <module>
----> 1 np.random.choice(1_000_0000_000_000_000, size=10, replace=False)

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

mtrand.pyx in numpy.random.mtrand.RandomState.permutation()

MemoryError: Unable to allocate 71.1 PiB for an array with shape (10000000000000000,) and data type int64
----

According to the latest comment on the GitHub issue you mentioned, "It looks like np.random.Generator should be available from numpy 1.17 on, and the current minimum numpy version is 1.16.5", so this may take a little while.

As a quick fix, but also a meaningful new feature, would it be possible to extend the API of scipy.sparse.random() with a "replace" option (piped through to np.random.choice()) which, if set to True, would give the user the freedom to handle very large, very sparse matrices at the cost of some (rare) collisions? I would gladly accept that trade-off, and it is also my current fix in my local copy of SciPy.

Best,

Emanuele

On Fri, Nov 13, 2020 at 4:23 PM CJ Carey <perimosocordiae@gmail.com> wrote:
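The "rare collisions" trade-off proposed above can be quantified with the standard birthday-bound estimate: when k indices are drawn with replacement from n cells, the expected number of colliding pairs is roughly k*(k-1)/(2*n). A small sketch (the helper function is illustrative, not part of any proposed SciPy API):

```python
import numpy as np

def expected_collisions(n, k):
    # Birthday-bound estimate of the expected number of colliding
    # pairs when k indices are drawn with replacement from n cells.
    return k * (k - 1) / (2 * n)

n = 1_000_000 * 1_000_000   # number of cells in the matrix
k = 10                      # requested non-zeros (density = 1e-11)
print(expected_collisions(n, k))   # about 4.5e-11

# Drawing with replacement and deduplicating afterwards therefore
# almost always yields exactly k distinct indices:
rng = np.random.default_rng()
ind = np.unique(rng.integers(0, n, size=k))
```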
Sorry for such a late response to this thread, but I wanted to point out another workaround that should help users with numpy 1.17+. You can pass a `random_state` parameter to scipy.sparse.random, which will accept a new-style Generator object.

So if you amend your example to:

----
scipy.sparse.random(1_000_000, 1_000_000, density=1e-11, random_state=np.random.default_rng())
----

then you'll get the fast behavior.

On Fri, Nov 13, 2020 at 6:29 PM Emanuele Olivetti <olivetti@fbk.eu> wrote:
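The fast path that this workaround enables can be exercised with NumPy alone: sample flat cell indices without replacement via a Generator (which, as shown earlier in the thread, does not allocate the full range), then map them to row and column coordinates. The row-major divmod mapping below is one simple convention and may differ from SciPy's internal one.

```python
import numpy as np

m = n = 1_000_000
k = 10  # number of non-zeros implied by density = 1e-11

rng = np.random.default_rng()
# Generator.choice samples without replacement in memory proportional
# to k; the legacy RandomState.choice would try to permute all m*n cells.
ind = rng.choice(m * n, size=k, replace=False)

# Map flat cell indices to (row, col) pairs.  This row-major divmod
# is one simple convention; SciPy's internal mapping may differ.
row, col = np.divmod(ind, n)
```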
Thanks for the tip!

Emanuele

On Mon, Jan 18, 2021 at 9:36 PM CJ Carey <perimosocordiae@gmail.com> wrote:
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@python.org
https://mail.python.org/mailman/listinfo/scipy-dev
Participants (2): CJ Carey, Emanuele Olivetti