Multi-threading interface idea
Hi folks,

I'd like to get some feedback on a multi-threading interface I've been thinking about and using for the past year or so. I won't bury the lede; see my approach here <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_ex...>.

*Background / problem:*

A couple of years ago, I inherited my company's codebase for getting data into our data warehouse using an ELT approach (extract-and-loads done in Python, transforms done in dbt/SQL). The codebase has dozens of Python scripts that integrate first-party and third-party data from databases, FTPs, and APIs, run on a scheduler (typically daily or hourly). The scripts I inherited were single-threaded procedural scripts: glue code spending most of its time in network I/O. (See example: <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-unthr...>.) This got my company pretty far!

As my team and I added more and more integrations with more and more data, we wanted faster scripts to shorten our dev cycles and reduce our multi-hour nightly jobs to minutes. Because our scripts were network-bound, multi-threading was a good way to accomplish this, so I looked into concurrent.futures (example: <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-concu...>) and asyncio (example: <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-async...>), but I decided against these options because:

1. It wasn't immediately apparent how to adapt my codebase to use these libraries without some fundamental changes to our execution platform, reworking our scripts from the ground up, and/or adding significant multi-threading code to each script.
2. I couldn't wrap my head around the async/await and future constructs particularly quickly, and I was concerned that my team would also struggle with this change.
3. I believe the procedural-style glue code we have is quite easy to comprehend, which I think has a positive impact on scale.

*Solution:*

And so, as mentioned at the top, I designed a different interface to concurrent.futures.ThreadPoolExecutor that we are successfully using for our extract-and-load pattern; see a basic example here <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_ex...>. The design considerations of this interface include:

- The usage is minimally invasive to the original unthreaded approach of the codebase. (As a result, teaching the library to team members has been fairly straightforward despite the multi-threaded paradigm shift.)
- The @parallel.task decorator should encapsulate a homogeneous method accepting different parameters. The contents of the method should be primarily I/O to achieve the concurrency gains of Python multi-threading.
- If no parallel.threads context manager has been entered, the @parallel.task decorator acts as a no-op (and the code runs serially).
- If an environment variable is set to disable the context manager, the @parallel.task decorator likewise acts as a no-op (and the code runs serially).
- There is also an environment variable to change the number of workers provided by parallel.threads (if not hard-coded).

While it's possible to return a value from a @parallel.task method, I encourage my team to use the decorator for start-and-complete work; think of writing "embarrassingly parallel" methods that can be "mapped".

A couple of other things we've implemented include a "thread barrier" for the case where we want one set of tasks to complete before another set begins, and a decorator for factory methods that produces cached thread-local objects (helpful for ensuring thread-safe access to network clients that are not thread-safe).

*Your feedback:*

- I'd love to hear your thoughts on my problem and solution.
- I've done a bit of research on existing libraries in PyPI and PEPs but I don't see anything similar; are you aware of anything?
- What do you suggest I do next? I'm considering publishing it, but could use some tips on what to do here!

Thanks!
Sean McIntyre
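The bullet points above could be sketched roughly as follows. This is a hypothetical reconstruction from the description, not the code in Sean's gist; the environment variable names `PARALLEL_DISABLE` and `PARALLEL_MAX_WORKERS` are invented stand-ins for the variables mentioned.

```python
import concurrent.futures
import functools
import os

# Hypothetical sketch of the described interface (not the gist's actual code).
_executor = None   # the ambient ThreadPoolExecutor, if any
_pending = []      # futures submitted while the context manager is active


class threads:
    """Context manager providing a thread pool to @task-decorated functions."""

    def __init__(self, max_workers=None):
        # Env var overrides only when the worker count is not hard-coded.
        env = os.environ.get("PARALLEL_MAX_WORKERS")
        self.max_workers = max_workers or (int(env) if env else None)

    def __enter__(self):
        global _executor
        if os.environ.get("PARALLEL_DISABLE") != "1":  # kill-switch: stay serial
            _executor = concurrent.futures.ThreadPoolExecutor(self.max_workers)
        return self

    def __exit__(self, *exc_info):
        global _executor
        concurrent.futures.wait(_pending)  # barrier: finish all tasks first
        _pending.clear()
        if _executor is not None:
            _executor.shutdown()
            _executor = None


def task(fn):
    """Submit the call to the ambient pool, or run serially if there is none."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if _executor is None:
            return fn(*args, **kwargs)  # no context manager entered: no-op path
        _pending.append(_executor.submit(fn, *args, **kwargs))
    return wrapper
```

A script would then decorate its I/O-bound function with @task and wrap the calling loop in `with threads(...):`; outside the block, the identical code runs serially.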
This looks like a very nice library to put on PyPI. But it's not an idea for a change to the Python language itself, so this is probably the wrong forum; python-list is closer. ... and if it is a suggestion to change the standard library itself, I'm -1 on the idea.

On Sat, Feb 8, 2020 at 6:11 PM Sean McIntyre <boxysean@gmail.com> wrote:
I'm not sure I get the benefits of this. You can write the same thing even more simply by directly using `ThreadPoolExecutor`. You don't even need `as_completed` as in your `concurrent_future_example`, because you don't want any of the values as they're completed; you only want them after they're all available. In fact, you don't even need to deal with futures. (You don't even need an executor; you could just use a `multiprocessing.[Thread]Pool`, but let's stick with the executor here.)

    def main():
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            succeeded = executor.map(extract_and_load, URLs)
            print(f"Successfully completed {sum(1 for result in succeeded if result)}")

I think this meets your own design considerations better than your version. It's even less invasive: there's not even a decorator; you just use the existing function as-is. Since there is no decorator, it doesn't get in the way of serial code. And so on.

If you need to have environment variables that control the parallelism, you can wrap `ThreadPoolExecutor` trivially. Just write an `__init__` that looks up the environment variables before calling `super()`; you don't need to build a whole different abstraction on top of it.

Of course it's also more flexible: if you do need more complicated concurrency later (e.g., if processing results takes long enough that it's worth handling them as they come in instead of waiting for them all), you have the futures, which can be composed in various ways. But it doesn't in any way force you to use that flexibility if you don't need it.

If you find this useful anyway, because you have a team that doesn't want to learn even the basic use of futures and executors, of course that's fine. And if you put it on PyPI, maybe others will find it useful as well. But I don't think there's any need for it in the stdlib or anything.
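The env-var wrapper described here might look like the following (the variable name `ELT_MAX_WORKERS` is an invented example):

```python
import concurrent.futures
import os


class EnvThreadPoolExecutor(concurrent.futures.ThreadPoolExecutor):
    """Read the worker count from the environment unless it is hard-coded."""

    def __init__(self, max_workers=None, **kwargs):
        env = os.environ.get("ELT_MAX_WORKERS")  # invented variable name
        if max_workers is None and env:
            max_workers = int(env)
        super().__init__(max_workers=max_workers, **kwargs)
```

The `main()` above would then construct `EnvThreadPoolExecutor()` instead of `ThreadPoolExecutor(max_workers=5)`; everything else stays the same.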
2. I couldn't wrap my head around the async/await and future constructs particularly quickly, and I was concerned that my team would also struggle with this change.
3. I believe the procedural style glue code we have is quite easy to comprehend, which I think has a positive impact on scale.
While I can certainly understand the appeal of the simplicity of the ``@parallel.task`` decorator used in the example, I strongly suspect that it will end up becoming increasingly tangled as the needs grow in complexity. I'd bet on something like this being highly convenient in the short term, but very costly in the long term if it eventually becomes unmaintainable and has to be reconstructed from the ground up (which seems rather likely). It would also lose out on much of the useful functionality of futures, such as cancellations, timeouts, and iterating over results in the order of completion rather than the order submitted (with ``cf.as_completed()``), just to name a few.

I can also understand that it takes some time to get used to how futures work, but it's well worth the effort and time to develop a solid fundamental understanding for building scalable back-end systems. Many asynchronous and concurrent frameworks (including in other languages, such as C++, Java, and C#) utilize futures in a similar manner, so the general concepts apply universally for the most part. It's a similar story with async/await syntax (which is present in C# and JS, and upcoming in C++20).

That being said, I think the above syntax could be useful for simple scripts, prototyping, or perhaps for educational purposes. I could see it potentially having some popularity on PyPI for those use cases, but I don't think it has a place in the standard library.

On Sat, Feb 8, 2020 at 6:08 PM Sean McIntyre <boxysean@gmail.com> wrote:
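The futures features mentioned above come for free with a plain executor. For instance, completion-order iteration with an overall timeout might look like this (`slow_square` is an illustrative stand-in for an I/O-bound task):

```python
import concurrent.futures
import time


def slow_square(x):
    """Illustrative task: later inputs sleep less, so they finish sooner."""
    time.sleep(0.01 * (5 - x))
    return x * x


completed = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(slow_square, n): n for n in range(5)}
    # as_completed yields futures in completion order (not submission order)
    # and the whole loop can carry a timeout; each future also supports
    # fut.cancel() and fut.result(timeout=...) individually.
    for fut in concurrent.futures.as_completed(futures, timeout=10):
        completed.append((futures[fut], fut.result()))
```

None of this is available once the futures are hidden behind a fire-and-forget decorator.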
Thanks all for the feedback!

I especially appreciate that a couple of you pointed out that, by avoiding the standard library threading constructs like futures, I could be limiting the use of the other features they provide. (I am experiencing that issue with other wrapped libraries in my codebase, so I can appreciate this feedback.)

One other detail of this module that I've found useful: @parallel.task-decorated methods deeper in the call stack are run on the concurrent.futures.ThreadPoolExecutor provided by the parallel.threads context manager, without carrying around a reference to the ThreadPoolExecutor.

I agree that it doesn't seem appropriate for the standard library, but it sounds like there could be an audience for a PyPI library.

Thanks and best!
Sean

On Mon, Feb 10, 2020 at 4:46 AM Kyle Stanley <aeros167@gmail.com> wrote:
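One common way to get that "no reference passing" behavior is to keep the active executor in ambient context. The sketch below uses a `contextvars.ContextVar`; it is an illustration of the general technique, not necessarily how Sean's module implements it.

```python
import concurrent.futures
import contextvars

# Sketch only: the ambient executor lives in a ContextVar, so helpers deep in
# the call stack can find it without it being passed as a parameter.
_current_executor = contextvars.ContextVar("current_executor", default=None)


def submit_or_run(fn, *args):
    executor = _current_executor.get()
    if executor is None:
        return fn(*args)          # serial fallback when no pool is active
    return executor.submit(fn, *args)


def deep_helper(x):
    # This function never receives an executor argument...
    return submit_or_run(lambda n: n + 1, x)


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    token = _current_executor.set(executor)
    try:
        fut = deep_helper(41)     # ...yet it runs on the ambient pool
    finally:
        _current_executor.reset(token)

print(fut.result())  # prints 42
```

A module-level global or `threading.local` works similarly; the ContextVar variant also behaves sensibly if the same pattern is later mixed with asyncio.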
participants (4)
- Andrew Barnert
- David Mertz
- Kyle Stanley
- Sean McIntyre