Multi-threaded data loading
Is it possible to use loadtxt in a multithreaded way? Basically, I want to process a very large CSV file (100+ million records): instead of loading a thousand elements into a buffer, processing them, then loading another thousand elements and processing those, and so on, I was wondering if there is a technique where I can use multiple processors to do this faster. TIA
Can you hold the entire file in memory as a single array, with room to spare? If so, you could use multiprocessing to load a bunch of smaller arrays, then join them all together.
It won't be super fast, because serializing a numpy array is somewhat slow when using multiprocessing. That said, it's still faster than disk transfers.
I'm sure some numpy expert will come on here, though, and give you a much better idea.
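A minimal sketch of that chunked, multi-process approach (all names here -- "big.csv", CHUNK_LINES, parse_chunk -- are made up for illustration): each worker parses one block of lines with numpy.loadtxt and the parent concatenates the pieces. Note that the lines still get pickled on their way to the workers, which is exactly the serialization cost mentioned above.

import numpy as np
from itertools import islice
from multiprocessing import Pool

FILENAME = "big.csv"    # hypothetical input file
CHUNK_LINES = 100000    # lines handed to each worker task

def parse_chunk(lines):
    # loadtxt accepts any iterable of strings; ndmin=2 keeps the
    # shape consistent even if a chunk holds a single row
    return np.loadtxt(lines, delimiter=",", ndmin=2)

def read_chunks(path, n):
    # yield successive lists of at most n lines from the file
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))
            if not chunk:
                return
            yield chunk

if __name__ == "__main__":
    with Pool() as pool:
        pieces = pool.map(parse_chunk, read_chunks(FILENAME, CHUNK_LINES))
    data = np.concatenate(pieces)
    print(data.shape)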
Do you know about the GIL (global interpreter lock) in Python? It means that Python isn't doing "real" multithreading: only if one thread is, e.g., doing some slow or blocking I/O work can another thread keep working, e.g., doing CPU-heavy numpy stuff. But you would not get 2-CPU numpy code, except for some C-implemented "long-running" operations; these should be programmed in a way that releases the GIL so that the other CPU can go on doing its Python code.
HTH, Sebastian Haase
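As a tiny illustration of the point (not from the thread): two threads running CPU-bound pure-Python code take roughly as long as doing the same work serially, because only the thread holding the GIL executes Python bytecode.

import threading
import time

def busy(n=5000000):
    # pure-Python loop: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

start = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# roughly the same wall time as calling busy() twice in a row
print("two threads: %.2fs" % (time.time() - start))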
Who are you quoting, Sebastian?
Multiprocessing is a Python package that spawns multiple Python processes, effectively side-stepping the GIL, and provides easy mechanisms for IPC. Hence the need for serialization....
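A rough, hypothetical measurement of that serialization cost: shipping an array between processes means pickling it, which is noticeably slower than an in-process copy. The array size below is arbitrary.

import pickle
import time
import numpy as np

arr = np.random.rand(10000000)  # about 80 MB of float64

t0 = time.time()
blob = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
t1 = time.time()
copy = arr.copy()
t2 = time.time()
print("pickle: %.3fs  copy: %.3fs" % (t1 - t0, t2 - t1))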
I was replying to the OP's email.
Regarding your comment: can separate processes not access the same memory space via shared memory? I think there was a discussion about this not too long ago on this list.
-S.
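One way to sketch the shared-memory idea Sebastian raises, using multiprocessing.Array (an assumption on my part, not necessarily the mechanism from the earlier list discussion): children write into disjoint slices of a common buffer, so no array data has to be pickled.

import numpy as np
from multiprocessing import Array, Process

def fill(shared, start, stop):
    # wrap the shared buffer as a numpy array -- no copy is made
    view = np.frombuffer(shared.get_obj())
    view[start:stop] = np.arange(start, stop)

if __name__ == "__main__":
    n = 1000000
    shared = Array("d", n)  # n doubles in shared memory, plus a lock
    p1 = Process(target=fill, args=(shared, 0, n // 2))
    p2 = Process(target=fill, args=(shared, n // 2, n))
    p1.start(); p2.start()
    p1.join(); p2.join()
    result = np.frombuffer(shared.get_obj())
    print(result[0], result[-1])  # 0.0 999999.0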
I'm relatively certain it's possible, but then you have to deal with locks, semaphores, synchronization, etc...
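A small sketch of that synchronization concern, under the same multiprocessing.Array assumption as above: a read-modify-write on shared state must be guarded by a lock, or concurrent updates get lost.

import numpy as np
from multiprocessing import Array, Process

def accumulate(shared, k):
    view = np.frombuffer(shared.get_obj())
    for _ in range(k):
        # the lock serializes the read-modify-write cycle;
        # without it, two processes can read the same stale value
        with shared.get_lock():
            view[0] = view[0] + 1

if __name__ == "__main__":
    shared = Array("d", 1)
    procs = [Process(target=accumulate, args=(shared, 10000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared[0])  # 40000.0 with the lock; typically less without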
participants (3)
- Chris Colbert
- Mag Gam
- Sebastian Haase