Multi-threaded data loading
Is it possible to use loadtxt in a multithreaded way? Basically, I want to process a very large CSV file (100+ million records): instead of loading a thousand elements into a buffer, processing them, then loading another thousand elements and processing those, and so on, I was wondering if there is a technique where I can use multiple processors to do this faster. TIA
Can you hold the entire file in memory as a single array, with room to spare? If so, you could use multiprocessing to load a bunch of smaller arrays, then join them all together.
It won't be super fast, because serializing a numpy array is somewhat slow when using multiprocessing. That said, it's still faster than disk transfers.
I'm sure some numpy expert will come on here, though, and give you a much better idea.
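A minimal sketch of that chunked, multi-process approach (all names here -- "big.csv", CHUNK_LINES, parse_chunk -- are made up for illustration): each worker parses one block of lines with numpy.loadtxt and the parent concatenates the pieces. Note that the lines still get pickled on their way to the workers, which is exactly the serialization cost mentioned above.

import numpy as np
from itertools import islice
from multiprocessing import Pool

FILENAME = "big.csv"    # hypothetical input file
CHUNK_LINES = 100000    # lines handed to each worker task

def parse_chunk(lines):
    # loadtxt accepts any iterable of strings; ndmin=2 keeps the
    # shape consistent even if a chunk holds a single row
    return np.loadtxt(lines, delimiter=",", ndmin=2)

def read_chunks(path, n):
    # yield successive lists of at most n lines from the file
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))
            if not chunk:
                return
            yield chunk

if __name__ == "__main__":
    with Pool() as pool:
        pieces = pool.map(parse_chunk, read_chunks(FILENAME, CHUNK_LINES))
    data = np.concatenate(pieces)
    print(data.shape)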
Do you know about the GIL (global interpreter lock) in Python? It means that Python isn't doing "real" multithreading: only if one thread is, e.g., doing some slow or blocking I/O work can another thread keep working, e.g., doing CPU-heavy numpy stuff. But you would not get 2-CPU numpy code, except for some C-implemented "long-running" operations; these should be programmed in a way that releases the GIL so that the other CPU can go on doing its Python code.
HTH, Sebastian Haase
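As a tiny illustration of the point (not from the thread): two threads running CPU-bound pure-Python code take roughly as long as doing the same work serially, because only the thread holding the GIL executes Python bytecode.

import threading
import time

def busy(n=5000000):
    # pure-Python loop: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

start = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# roughly the same wall time as calling busy() twice in a row
print("two threads: %.2fs" % (time.time() - start))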
Who are you quoting, Sebastian?
Multiprocessing is a Python package that spawns multiple Python processes, effectively side-stepping the GIL, and provides easy mechanisms for IPC. Hence the need for serialization....
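A rough, hypothetical measurement of that serialization cost: shipping an array between processes means pickling it, which is noticeably slower than an in-process copy. The array size below is arbitrary.

import pickle
import time
import numpy as np

arr = np.random.rand(10000000)  # about 80 MB of float64

t0 = time.time()
blob = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
t1 = time.time()
copy = arr.copy()
t2 = time.time()
print("pickle: %.3fs  copy: %.3fs" % (t1 - t0, t2 - t1))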
I was replying to the OP's email.
Regarding your comment: can separate processes not access the same memory space via shared memory? I think there was a discussion about this not too long ago on this list.
-S.
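One way to sketch the shared-memory idea Sebastian raises, using multiprocessing.Array (an assumption on my part, not necessarily the mechanism from the earlier list discussion): children write into disjoint slices of a common buffer, so no array data has to be pickled.

import numpy as np
from multiprocessing import Array, Process

def fill(shared, start, stop):
    # wrap the shared buffer as a numpy array -- no copy is made
    view = np.frombuffer(shared.get_obj())
    view[start:stop] = np.arange(start, stop)

if __name__ == "__main__":
    n = 1000000
    shared = Array("d", n)  # n doubles in shared memory, plus a lock
    p1 = Process(target=fill, args=(shared, 0, n // 2))
    p2 = Process(target=fill, args=(shared, n // 2, n))
    p1.start(); p2.start()
    p1.join(); p2.join()
    result = np.frombuffer(shared.get_obj())
    print(result[0], result[-1])  # 0.0 999999.0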
I'm relatively certain it's possible, but then you have to deal with locks, semaphores, synchronization, etc...
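A small sketch of that synchronization concern, under the same multiprocessing.Array assumption as above: a read-modify-write on shared state must be guarded by a lock, or concurrent updates get lost.

import numpy as np
from multiprocessing import Array, Process

def accumulate(shared, k):
    view = np.frombuffer(shared.get_obj())
    for _ in range(k):
        # the lock serializes the read-modify-write cycle;
        # without it, two processes can read the same stale value
        with shared.get_lock():
            view[0] = view[0] + 1

if __name__ == "__main__":
    shared = Array("d", 1)
    procs = [Process(target=accumulate, args=(shared, 10000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared[0])  # 40000.0 with the lock; typically less without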
participants (3)
- Chris Colbert
- Mag Gam
- Sebastian Haase