[Tutor] multiprocessing question

Mon Nov 24 13:56:27 CET 2014

 ----- Original Message -----
 > From: Cameron Simpson <cs at zip.com.au>
 > To: Python Mailing List <tutor at python.org>
 > Cc: 
 > Sent: Monday, November 24, 2014 2:20 AM
 > Subject: Re: [Tutor] multiprocessing question
 > 
> On 23Nov2014 22:30, Albert-Jan Roskam <fomcl at yahoo.com.dmarc.invalid>
> wrote:
>> I created some code to get records from a potentially giant .csv file. This
>> implements a __getitem__ method that gets records from a memory-mapped csv
>> file. In order for this to work, I need to build a lookup table that maps
>> line numbers to line starts/ends. This works, BUT building the lookup table
>> could be time-consuming (and it freezes up the app). The (somewhat pruned)
>> code is here: http://pastebin.com/0x6JKbfh. Now I would like to build the
>> lookup table in a separate process. I used multiprocessing. In the crude
>> example below, it appears to be doing what I have in mind. Is this the way
>> to do it? I have never used multiprocessing/threading before, apart from
>> playing around. One specific question: __getitem__ is supposed to throw an
>> IndexError when needed. But how do I know when I should do this if I don't
>> yet know the total number of records? Is there an uber-cheap way of getting
>> this number?
> 
> First up, multiprocessing is not what you want. You want threading for this.
> 
> The reason is that your row index makes an in-memory index. If you do this in a 
> subprocess (mp.Process) then the in-memory index is in a different process, and 
> not accessible.

Hi Cameron,

Thanks for helping me. I read this page before I decided to go for multiprocessing: http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python. I never *really* understood why CPython (with the GIL) could have threading anyway. I am confused: I thought the idea of multiprocessing.Manager was to share information.

>>> help(mp.Manager)
Help on function Manager in module multiprocessing:

Manager()
    Returns a manager associated with a running server process

    The managers methods such as `Lock()`, `Condition()` and `Queue()`
    can be used to create shared objects.

>>> help(mp.managers.SyncManager)
Help on class SyncManager in module multiprocessing.managers:

class SyncManager(BaseManager)
 |  Subclass of `BaseManager` which supports a number of shared object types.
 |
 |  The types registered are those intended for the synchronization
 |  of threads, plus `dict`, `list` and `Namespace`.
 |
 |  The `multiprocessing.Manager()` function creates started instances of
 |  this class.
.......
>>> help(mp.Manager().dict)
Help on method dict in module multiprocessing.managers:

dict(self, *args, **kwds) method of multiprocessing.managers.SyncManager instance

> Use a Thread. Your code will be very similar.
Ok, I will try that.
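
For readers following the thread, here is a minimal sketch of the thread-based approach suggested above. It assumes Python 3, and the RecordIndex class, its attribute names and the sample data are hypothetical stand-ins, since the pastebin code is not reproduced in this message. The point it tries to show is only that a thread fills a list living in the same process, so the caller can read it directly.

```python
import threading


class RecordIndex:
    """Hypothetical stand-in for the pastebin class (not the original code)."""

    def __init__(self, data):
        self.data = data          # bytes of the whole file (stand-in for the mmap)
        self.lookup = []          # one (start, end) byte-offset pair per record
        self.lookup_done = False
        # The worker runs in this process, so self.lookup is shared memory.
        self.worker = threading.Thread(target=self.create_lookup, daemon=True)
        self.worker.start()

    def create_lookup(self):
        start = 0
        for line in self.data.splitlines(keepends=True):
            end = start + len(line)   # len() of a bytes object is a byte count
            self.lookup.append((start, end))
            start = end
        self.lookup_done = True


index = RecordIndex(b"a,b\n1,2\n3,4\n")
index.worker.join()               # wait here only to make the demo deterministic
print(index.lookup)               # [(0, 4), (4, 8), (8, 12)]
```

With mp.Process the same create_lookup() would fill a copy of self.lookup in the child process, and the parent's list would stay empty unless the data were passed back explicitly.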
> Next: your code is short enough to include inline instead of forcing people
> to go to pastebin; in particular if I were reading your email offline (as I
> might do on a train) then I could not consult your code. Including it in the
> message is preferable, normally.

Sorry about that. I did not want to burden people with too many lines of code. The pastebin code was meant as the problem context.
 
> Your approach of setting self.lookup_done to False and then later to True
> answers your question about "__getitem__ is supposed to throw an IndexError
> when needed. But how do I know when I should do this if I don't yet know the
> total number of records?" Make __getitem__ _block_ until self.lookup_done is
> True. At that point you should know how many records there are.

:-) Nice. I added those lines while editing the mail.

> Regarding blocking, you want a Condition object or a Lock (a Lock is simpler, 
> and Condition is more general). Using a Lock, you would create the Lock and 
> .acquire it. In create_lookup(), release() the Lock at the end. In __getitem__ 
> (or any other function dependent on completion of create_lookup), .acquire() 
> and then .release() the Lock. That will cause it to block until the index scan 
> is finished.

So __getitem__ cannot be called while the lookup table is being created? But wouldn't that defeat the purpose? My PyQt program around it initially shows the first 25 records. On many occasions that's all that's needed.
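
The Lock recipe described above could look roughly like the sketch below. It assumes Python 3. BlockingIndex and lookup_lock are hypothetical names, and the sample data stands in for the memory-mapped file. The Lock is acquired before the worker thread starts, released at the end of create_lookup(), and briefly acquired and released again by anything that depends on the finished index.

```python
import threading


class BlockingIndex:
    """Hypothetical sketch: lookups block until the index scan has finished."""

    def __init__(self, data):
        self.data = data
        self.lookup = []
        self.lookup_lock = threading.Lock()
        self.lookup_lock.acquire()      # held until create_lookup() is done
        threading.Thread(target=self.create_lookup, daemon=True).start()

    def create_lookup(self):
        try:
            start = 0
            for line in self.data.splitlines(keepends=True):
                self.lookup.append((start, start + len(line)))
                start += len(line)
        finally:
            # A plain Lock may be released by a different thread than the
            # one that acquired it (an RLock may not).
            self.lookup_lock.release()

    def __len__(self):
        with self.lookup_lock:          # blocks until the scan is finished
            return len(self.lookup)

    def __getitem__(self, key):
        with self.lookup_lock:          # acquire, then release straight away
            pass
        start, end = self.lookup[key]   # the list raises IndexError when needed
        return self.data[start:end]


index = BlockingIndex(b"a,b\n1,2\n3,4\n")
print(index[1])      # b'1,2\n'
print(len(index))    # 3
```

As written, every lookup waits for the complete scan, which is the concern raised in the reply just above. The Condition object mentioned earlier is the more general tool, and might let __getitem__ wait only until the requested record has been indexed; that variant is not shown here.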
 
> A remark about the create_lookup() function on pastebin: you go:
> 
>   record_start += len(line)
THANKS!! How could I not think of this... I initially started with open(), which returns bytestrings. I could convert it to bytes and then take the len().
> This presumes that a single text character on a line consumes a single byte of 
> memory or file disk space. However, your data file is utf-8 encoded, and some 
> characters may be more than one byte of storage. This means that your 
> record_start values will not be useful because they are character counts, not 
> byte counts, and you need byte counts to offset into a file if you are doing 
> random access.
> 
> Instead, note the value of unicode_csv_data.tell() before reading each line 
> (you will need to modify your CSV reader somewhat to do this, and maybe return 
> both the offset and line text). That is a byte offset to be used later.
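
To illustrate the difference between character counts and byte counts, here is a small sketch assuming Python 3. build_offsets() and the sample text are hypothetical, and io.BytesIO stands in for a file opened in binary mode. It records tell() before each readline(), as suggested above, so every offset is a byte position.

```python
import io


def build_offsets(binary_file):
    """Return (start, end) byte offsets for each line of a binary file object."""
    offsets = []
    while True:
        start = binary_file.tell()      # byte offset before reading the line
        line = binary_file.readline()
        if not line:                    # empty bytes object means end of file
            break
        offsets.append((start, binary_file.tell()))
    return offsets


# The second line contains two characters that take two bytes each in utf-8.
text = "naam,stad\nZo\u00eb,K\u00f6ln\nBob,Oslo\n"
raw = text.encode("utf-8")

offsets = build_offsets(io.BytesIO(raw))
print(len(text), len(raw))              # 28 30
print(offsets)                          # [(0, 10), (10, 21), (21, 30)]

start, end = offsets[2]
print(raw[start:end].decode("utf-8"))   # Bob,Oslo
```

Summing len(line) over decoded text lines would place the third record at 19 instead of 21, which is the kind of drift described above.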
> 
> Cheers,
> Cameron Simpson <cs at zip.com.au>
> 
> George, discussing a patent and prior art:
> "Look, this  publication has a date, the patent has a priority date,
> can't you just compare them?"
> Paul Sutcliffe:
> "Not unless you're a lawyer."