[Tutor] multiprocessing question

Cameron Simpson cs at zip.com.au
Mon Nov 24 23:16:20 CET 2014


On 24Nov2014 12:56, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
> > From: Cameron Simpson <cs at zip.com.au>
>> On 23Nov2014 22:30, Albert-Jan Roskam <fomcl at yahoo.com.dmarc.invalid>
>> wrote:
>>> I created some code to get records from a potentially giant .csv file. This
>>> implements a __getitem__ method that gets records from a memory-mapped csv file.
>>> In order for this to work, I need to build a lookup table that maps line numbers
>>> to line starts/ends. This works, BUT building the lookup table could be
>>> time-consuming (and it freezes up the app). [...]
>>
>> First up, multiprocessing is not what you want. You want threading for this.
>>
>> The reason is that your row indexing builds an in-memory index. If you do this in a
>> subprocess (mp.Process) then the in-memory index is in a different process, and
>> not accessible.
>
>Hi Cameron, Thanks for helping me. I read this page before I decided to go for multiprocessing: http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python. I never *really* understood why cPython (with the GIL) could have threading anyway. I am confused: I thought the idea of multiprocessing.Manager was to share information.

Regarding the GIL, it will prevent the raw Python interpreter from using more 
than one CPU: no two Python opcodes run concurrently. However, any calls to C 
libraries or the OS which may block release the GIL (broadly speaking). So 
while the OS is off reading data from a hard drive or opening a network 
connection or something, the Python interpreter is free to run opcodes for 
other Python threads. It is timesharing at the Python opcode level. And if the 
OS or a C library is off doing work with the GIL released then you get true 
multithreading.

Most real code is not compute bound at the Python level, most of the time.  
Whenever you block for I/O or delegate work to a library or another process, 
your current Python Thread is stalled, allowing other Threads to run.
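
To make that concrete, here is a tiny self-contained sketch (not from your code; time.sleep() merely stands in for a blocking I/O call) showing two threads overlapping while the GIL is released:

  import threading
  import time

  def fake_io(results, i):
      # time.sleep() stands in for a blocking call (disk read, network fetch);
      # it releases the GIL, so the other thread runs while this one waits.
      time.sleep(1)
      results[i] = "done"

  results = {}
  threads = [threading.Thread(target=fake_io, args=(results, i)) for i in range(2)]
  start = time.time()
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  print("elapsed: %.1fs" % (time.time() - start))   # roughly 1s, not 2s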

For myself, I use threads when algorithms naturally fall into parallel 
expression or for situations like yours where some lengthy process must run but 
I want the main body of code to commence work before it finishes. As it 
happens, one of my common use cases for the latter is reading a CSV file :-)

Anywhere you want to do things in parallel, ideally I/O bound, a Thread is a 
reasonable thing to consider. It lets you write the separate task in a nice 
linear fashion.

With a Thread (coding errors aside) you know where you stand: the data 
structures it works on are the very same ones used by the main program. (Of 
course, therein lie the hazards as well.)

With multiprocessing the subprocess works on distinct data sets and (from my 
reading) any shared data is managed by proxy objects that communicate between 
the processes. That gets you data isolation for the subprocess, but also higher 
latency in data access between the processes and of course the task of 
arranging those proxy objects.
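
To make the proxy idea concrete, here is a minimal sketch using multiprocessing.Manager (the names are made up for illustration, not taken from your code); every access to the shared dict is a round trip to the manager process, which is where the extra latency comes from:

  import multiprocessing as mp

  def build_index(shared):
      # Runs in a separate process; the update travels back through the proxy.
      shared["rows_indexed"] = 1000

  if __name__ == "__main__":
      manager = mp.Manager()
      shared = manager.dict()              # a proxy object, not a plain dict
      p = mp.Process(target=build_index, args=(shared,))
      p.start()
      p.join()
      print(shared["rows_indexed"])        # prints 1000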

For your task I would go with a Thread.
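
As a rough sketch of that shape (this is not your pastebin code; the class and attribute names are invented for illustration): the constructor starts the indexing Thread and returns immediately, and the thread updates data shared directly with the main program:

  import threading

  class CsvIndex(object):
      def __init__(self, path):
          self.path = path
          self.offsets = []            # byte offset of the start of each row
          self.rows_indexed = 0
          self.lookup_done = False
          scanner = threading.Thread(target=self._create_lookup)
          scanner.daemon = True        # don't keep the program alive for it
          scanner.start()              # __init__ returns at once

      def _create_lookup(self):
          # Runs in the background while the GUI shows the first rows.
          with open(self.path, "rb") as f:
              while True:
                  pos = f.tell()       # byte offset: usable with seek() later
                  line = f.readline()
                  if not line:
                      break
                  self.offsets.append(pos)
                  self.rows_indexed += 1
          self.lookup_done = True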

>> when needed. But how do I know when I should do this if I don't yet know the
>> total number of records?" Make __getitem__ _block_ until self.lookup_done is
>> True. At that point you should know how many records there are.
>>
>> Regarding blocking, you want a Condition object or a Lock (a Lock is simpler,
>> and Condition is more general). Using a Lock, you would create the Lock and
>> .acquire it. In create_lookup(), release() the Lock at the end. In __getitem__
>> (or any other function dependent on completion of create_lookup), .acquire()
>> and then .release() the Lock. That will cause it to block until the index scan
>> is finished.
>
> So __getitem__ cannot be called while it is being created? But wouldn't that defeat the purpose? My PyQt program around it initially shows the first 25 records. On many occasions that's all that's needed.

That depends on the CSV and how you're using it. If __getitem__ is just "give 
me row number N", then all it really needs to do is check against the current 
count of rows read.  Keep such a counter, updated by the scanning/indexing 
thread. If the requested row number is less than the counter, fetch it and 
return it.  Otherwise block/wait until the counter becomes big enough. (Or 
throw some exception if the calling code can cope with the notion of "data not 
ready yet".)

If you want __getitem__ to block, you will need to arrange a way to do that.  
Stupid programs busy wait:

  while counter < index_value:
    pass

Horrendous; it causes the CPU to max out _and_ gets in the way of other work, 
slowing everything down. The simple approach is a poll:

  while counter < index_value:
    sleep(0.1)

This polls 10 times a second. Tuning the sleep time is a subjective call: too 
frequent will consume resources, too infrequent will make __getitem__ too slow 
to respond when the counter finally catches up.
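
Wrapped into the earlier sketch, the polling version of __getitem__ might look like this (again, the attribute names are illustrative):

  import time

  def __getitem__(self, n):
      # Block by polling: re-check every 0.1s until the indexing thread has
      # passed row n, or has finished the whole file.
      while not self.lookup_done and n >= self.rows_indexed:
          time.sleep(0.1)
      return self._fetch_row(n)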

A more elaborate but truly blocking scheme is to have some kind of request 
queue, where __getitem__ makes (for example) a Condition variable and queues a 
request for "when the counter reaches this number". When the indexer reaches 
that number (or finishes indexing) it wakes up the condition and __getitem__ 
gets on with its task. This requires extra code in your indexer to (a) keep a 
PriorityQueue of requests and (b) to check for the lowest one when it 
increments its record count. When the record count reaches the lowest request, 
wake up every request of that count, and then record the next request (if any) 
as the next "wake up" number. That is a sketch: there are complications, such 
as when a new request comes in lower than the current "lowest" request, and so 
forth.
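
A simpler relative of that scheme skips the per-request PriorityQueue and just wakes every waiter each time the counter moves; a sketch, with the same invented attribute names:

  import threading

  # In __init__:
  #     self.counter_cond = threading.Condition()

  def __getitem__(self, n):
      with self.counter_cond:
          # Sleep until the indexing thread notifies us and our row is ready.
          while not self.lookup_done and n >= self.rows_indexed:
              self.counter_cond.wait()
      return self._fetch_row(n)

  # In _create_lookup(), after indexing each row:
  #     with self.counter_cond:
  #         self.rows_indexed += 1
  #         self.counter_cond.notify_all()
  # ...and notify once more after setting lookup_done = True at the end.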

I'd go with the 0.1s poll loop myself. It is simple and easy and will work. Use 
a better scheme later if needed.

>> A remark about the create_lookup() function on pastebin: you go:
>>
>>   record_start += len(line)
>>
>> This presumes that a single text character on a line consumes a single byte of
>> memory or file disc space. However, your data file is utf-8 encoded, and some
>> characters may be more than one byte of storage. This means that your
>> record_start values will not be useful because they are character counts, not
>> byte counts, and you need byte counts to offset into a file if you are doing
>> random access.
>>
>> Instead, note the value of unicode_csv_data.tell() before reading each line
>> (you will need to modify your CSV reader somewhat to do this, and maybe return
>> both the offset and line text). That is a byte offset to be used later.
>
>THANKS!! How could I not think of this... I initially started with open(), which returns bytestrings. I could convert it to bytes and then take the len()

Converting to bytes relies on that conversion being symmetric and requires you 
to know the conversion required. Simply noting the .tell() value before the 
line is read avoids all that: where am I? Read line. Return line and start 
position. Simple and direct.
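
To close the loop on the invented sketch above (using a plain file here rather than your mmap), fetching a row then becomes a seek to the stored byte offset followed by a decode, with no character counting at all:

  def _fetch_row(self, n):
      # self.offsets[n] was noted with tell() *before* the row was read,
      # so it is a byte position that seek() can jump to directly.
      with open(self.path, "rb") as f:
          f.seek(self.offsets[n])
          line = f.readline()
      return line.decode("utf-8")      # hand this to the csv module to parse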

Cheers,
Cameron Simpson <cs at zip.com.au>

