[Tutor] multiprocessing question

Cameron Simpson cs at zip.com.au
Mon Nov 24 02:20:14 CET 2014


On 23Nov2014 22:30, Albert-Jan Roskam <fomcl at yahoo.com.dmarc.invalid> wrote:
>I created some code to get records from a potentially giant .csv file. This implements a __getitem__ method that gets records from a memory-mapped csv file. In order for this to work, I need to build a lookup table that maps line numbers to line starts/ends. This works, BUT building the lookup table could be time-consuming (and it freezes up the app). The (somewhat pruned) code is here: http://pastebin.com/0x6JKbfh. Now I would like to build the lookup table in a separate process. I used multiprocessing. In the crude example below, it appears to be doing what I have in mind. Is this the way to do it? I have never used multiprocessing/threading before, apart from playing around. One specific question: __getitem__ is supposed to throw an IndexError when needed. But how do I know when I should do this if I don't yet know the total number of records? Is there a very cheap way of getting this number?

First up, multiprocessing is not what you want. You want threading for this.

The reason is that your row index is an in-memory data structure. If you build 
it in a subprocess (mp.Process), the index lives in that other process's memory 
and is not accessible from your main process.

Use a Thread. Your code will be very similar.
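As a minimal sketch of what I mean (the file name and the bare-bones structure 
here are made up; your real create_lookup is more elaborate): the worker thread 
appends to the same list the main thread sees, because both run in one process.

```python
import threading

# Hypothetical sample file, just so the sketch is self-contained.
with open("sample.csv", "wb") as f:
    f.write(b"a,b,c\n1,2,3\n4,5,6\n")

lookup = []   # shared with the worker thread: same process, same memory

def build_index(path):
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            lookup.append(offset)
            offset += len(line)   # bytes, because the file is opened binary

worker = threading.Thread(target=build_index, args=("sample.csv",))
worker.start()
worker.join()     # joined here only for demonstration; real code keeps going
print(lookup)     # [0, 6, 12]
```

Had this used multiprocessing.Process instead, the subprocess would have 
appended to its own copy of `lookup` and the parent's list would stay empty.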

Next: your code is short enough to include inline instead of forcing people 
to go to pastebin; in particular, if I were reading your email offline (as I 
might do on a train) I could not consult your code. Including it in the 
message is normally preferable.

Your approach of setting self.lookup_done to False and then later to True 
answers your question about "__getitem__ is supposed to throw an IndexError 
when needed. But how do I know when I should do this if I don't yet know the 
total number of records?" Make __getitem__ _block_ until self.lookup_done is 
True. At that point you should know how many records there are.

Regarding blocking, you want a Condition object or a Lock (a Lock is simpler, 
and Condition is more general). Using a Lock, you would create the Lock and 
.acquire it. In create_lookup(), release() the Lock at the end. In __getitem__ 
(or any other function dependent on completion of create_lookup), .acquire() 
and then .release() the Lock. That will cause it to block until the index scan 
is finished.
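Putting that together, here is a sketch of the Lock arrangement (class and 
file names are invented; your __getitem__ would return a parsed record rather 
than a raw offset):

```python
import threading

class IndexedCsv:
    """Sketch: readers block until the background index scan finishes."""

    def __init__(self, path):
        self.path = path
        self.lookup = []
        self._scan_lock = threading.Lock()
        self._scan_lock.acquire()          # held until create_lookup completes
        worker = threading.Thread(target=self.create_lookup)
        worker.daemon = True
        worker.start()

    def create_lookup(self):
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                self.lookup.append(offset)
                offset += len(line)
        self._scan_lock.release()          # open the gate for readers

    def __getitem__(self, n):
        # Block until the scan completes, then release immediately so
        # other readers can proceed too.
        with self._scan_lock:
            pass
        if not 0 <= n < len(self.lookup):
            raise IndexError(n)
        return self.lookup[n]

    def __len__(self):
        with self._scan_lock:
            pass
        return len(self.lookup)

# Tiny demonstration file so the sketch runs end to end.
with open("demo.csv", "wb") as f:
    f.write(b"x,y\n1,2\n")

idx = IndexedCsv("demo.csv")
print(idx[1])    # byte offset of the second line: 4
```

Once the scan has finished, the acquire/release pair in __getitem__ costs 
almost nothing, and len(self.lookup) is then the true record count, so the 
IndexError question answers itself.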

A remark about the create_lookup() function on pastebin: you go:

  record_start += len(line)

This presumes that a single text character on a line consumes a single byte of 
memory or file disc space. However, your data file is utf-8 encoded, and some 
characters occupy more than one byte of storage. This means that your 
record_start values will not be useful because they are character counts, not 
byte counts, and you need byte counts to offset into a file if you are doing 
random access.

Instead, note the value of unicode_csv_data.tell() before reading each line 
(you will need to modify your CSV reader somewhat to do this, and maybe return 
both the offset and line text). That is a byte offset to be used later.
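For example, something along these lines (the file name is made up; I read the 
file in binary and decode each line myself so that tell() reports true byte 
positions, and I use readline() rather than iterating, since read-ahead 
buffering during iteration can make tell() unreliable):

```python
# Self-contained demo file with multi-byte utf-8 characters.
with open("utf8.csv", "wb") as f:
    f.write("na\u00efve,text\nr\u00e9sum\u00e9,data\n".encode("utf-8"))

offsets = []
with open("utf8.csv", "rb") as f:
    while True:
        offset = f.tell()           # byte position BEFORE reading the line
        raw = f.readline()
        if not raw:
            break
        offsets.append(offset)
        line = raw.decode("utf-8")  # hand the decoded text to the csv layer

print(offsets)   # [0, 12] - note 12, not 11: the accented char is 2 bytes
```

A later seek(offsets[n]) followed by readline() then lands exactly on record 
n, which is what your random access needs.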

Cheers,
Cameron Simpson <cs at zip.com.au>

George, discussing a patent and prior art:
"Look, this  publication has a date, the patent has a priority date,
can't you just compare them?"
Paul Sutcliffe:
"Not unless you're a lawyer."
