[Tutor] threading for each line in a large file, and doing it right

Alan Gauld alan.gauld at yahoo.co.uk
Wed Apr 25 04:27:08 EDT 2018


On 25/04/18 03:26, Evuraan wrote:

> Please consider this situation :
> Each line in "massive_input.txt" needs to be churned by the
> "time_intensive_stuff" function, so I am trying to background it.

What kind of "churning" is involved?
If it's compute intensive, threading may not be the right
answer (CPython's GIL means threads won't speed up pure
computation), but if it's I/O bound then threading is
probably OK.
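
For comparison, if the work really is I/O bound, a fixed-size
pool of threads usually works better than one thread per line.
A minimal sketch only (untested; do_work and the worker count
of 20 are placeholders for your own function and tuning):

    from concurrent.futures import ThreadPoolExecutor

    def do_work(line):
        # stand-in for your time_intensive_stuff()
        return len(line)

    with open("massive_input.txt") as fobj:
        with ThreadPoolExecutor(max_workers=20) as pool:
            # map() keeps at most 20 threads busy at any one time
            for result in pool.map(do_work, fobj):
                print(result)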

> import threading
> 
> def time_intensive_stuff(arg):
>    # some code, some_conditional
>    return (some_conditional)

What exactly do you mean by some_conditional?
Is it some kind of big decision tree? An if/else network?
Or does it depend on external data
    (from where? a database? the network?)

And you return it - but what exactly is returned?
An expression? A boolean result?

It's not clear what the nature of the task is, but that
makes a big difference to how best to parallelise the work.

> with open("massive_input.txt") as fobj:
>    for i in fobj:
>       thread_thingy = threading.Thread(target=time_intensive_stuff, args=(i,))
>       thread_thingy.start()
> 
> With the above code, it still does not feel like it is
> backgrounding at scale,

Can you say why you feel that way?
What measurements have you done?
What system observations (CPU, memory, network, etc.)?
What did you expect to see, and what did you actually see?

Also consider that processing a huge number of lines
this way will create a huge number of threads (or
subprocesses). Each thread carries an overhead, and
your computer may not have enough resources to
run them all efficiently.

It may be better to batch the lines so that each worker
handles 10, or 50, or 100 lines (whatever makes sense).
Put a loop into your time-intensive function to process
a list of input values and return a list of outputs.

Your outer loop then needs an inner loop to build the
batches. Make the batch size a parameter so that you
can experiment to find the most cost-effective size.
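
Something along these lines (an untested sketch; BATCH_SIZE,
the worker count and the helper names are made up, and
time_intensive_stuff() is your own function, assumed to take
one line and return one result):

    from concurrent.futures import ThreadPoolExecutor
    from itertools import islice

    BATCH_SIZE = 50        # experiment to find a good value

    def process_batch(lines):
        # the loop inside the worker: one result per input line
        return [time_intensive_stuff(line) for line in lines]

    def batches(fobj, size):
        # the inner loop that slices the file into fixed-size chunks
        while True:
            batch = list(islice(fobj, size))
            if not batch:
                break
            yield batch

    with open("massive_input.txt") as fobj:
        with ThreadPoolExecutor(max_workers=8) as pool:
            for results in pool.map(process_batch,
                                    batches(fobj, BATCH_SIZE)):
                for result in results:
                    print(result)    # or collect them somewhere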

> I am sure there is a better pythonic way.

I suspect the issues are not Python-specific but are the
general ones of parallelising large jobs.

> How do I achieve something like this bash snippet below in python:
> 
> time_intensive_stuff_in_bash(){
>    # some code
>   :
> }
> 
> for i in $(< massive_input.file); do
>     time_intensive_stuff_in_bash "$i" & disown
>     :
> done

It's the same idea, except that in bash you start a whole
new process for each line, so the Python equivalent is a
pool of worker processes (the multiprocessing module, or
concurrent.futures) rather than a thread per line.
But did you try this in bash? Was it faster than using
Python? I would expect the same problem of too many
processes to arise in bash too.
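
If the work turns out to be CPU bound, the nearest Python
equivalent to that bash loop is a process pool. A rough
sketch (untested; the worker body and the chunksize of 100
are placeholders - note the worker must be a top-level
function so it can be sent to the child processes):

    from concurrent.futures import ProcessPoolExecutor

    def time_intensive_stuff(line):
        # CPU-heavy work on one line (stand-in body)
        return line.upper()

    if __name__ == "__main__":
        with open("massive_input.txt") as fobj:
            lines = fobj.read().splitlines()
        # one worker process per CPU core by default
        with ProcessPoolExecutor() as pool:
            for result in pool.map(time_intensive_stuff, lines,
                                   chunksize=100):
                print(result)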

HTH
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



