API and process questions (sparked by Claudiu Popa on issue 16104)

(1) Should fixes to a docstring go in with a patch, even if they aren't related to the changing functionality? Bytecode compilation has several orthogonal parameters. Most -- but not all -- are documented in the docstring. (Specifically, there is no explanation of the rx parameter, which acts as a filter, and no mention that symbolic links to directories are ignored.) It is best if a commit changes one small thing at a time. On the other hand, Nick recently posted that the minimal overhead of a patch commit is about half an hour. Is that overhead enough to override the one-issue-per-patch guideline?

(2) The patch adds new functionality to use multiple processes in parallel. The normal parameter values are integers indicating how many processes to use. The parameter also needs two special values -- one to indicate "use os.cpu_count", and the other to indicate "don't use multiprocessing at all".

(A) Is there a Best Practice for this situation, with two odd cases?

(B) Claudiu originally copied the API from a similar API for regrtest. What is the barrier for "do it sensibly" vs "stick with precedent elsewhere"? (Apparently regrtest treats any negative number as a request for the cpu_count calculation; I suspect that "-5" is more likely to be an escaping error for "5" than it is to be a real request for auto-calculation that just happened to choose -5 instead of -1.)

(C) How important is it to keep the API consistent between a top-level CLI command and the internal implementation? At the moment, the request for cpu_count is handled in the CLI wrapper, and not available to interactive callers. On the other hand, interactive callers could just call cpu_count themselves...

(D) How important is it to maintain consistency with other uses of the same tool? multiprocessing has its own way of requesting auto-calculation, so someone used to multiprocessing might assume that "None" meant auto-calculate, as opposed to "don't use multiprocessing at all."

-jJ

On Mon Apr 28 2014 at 11:32:58 AM, Jim J. Jewett <jimjjewett@gmail.com> wrote:
(1) Should fixes to a docstring go in with a patch, even if they aren't related to the changing functionality?
It should probably be its own commit.
That's typical for backported patches, not if something is only going into default. That being said, if the change is truly small and you "sneak" it in I don't think anyone is going to make you revert the patch to do it as more atomic commits.
No. In this situation I would consider 0 or -1 for "use os.cpu_count()" and None for "don't use multiprocessing".
Regrtest doesn't count as an API to stick to. =)
It doesn't hurt, but we typically don't promote CLIs for stdlib modules so there isn't much precedent.
That's a judgment call. If there is already no consistency go with the one that makes the most sense. If consistency exists across the stdlib then stay consistent.

Why would the user care if multiprocessing is used behind the scenes? It would be strange for processes=1 to fail if multiprocessing is not available. If you set a default value of 1, then compileall() will work regardless of whether multiprocessing is available. In short:

- processes <= 0: use os.cpu_count()
- processes == 1 (default): just use normal sequential compiling
- processes > 1: use multiprocessing

There's no reason to introduce None. Or am I missing something?
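A minimal sketch of the dispatch Charles-François is proposing (function and helper names here are illustrative, not taken from the patch):

```python
import os


def _compile_one(path):
    # Stand-in for the real per-file byte-compilation step.
    return path.upper()


def compile_paths(paths, processes=1):
    """Illustrative sketch of the proposed `processes` semantics."""
    if processes <= 0:
        processes = os.cpu_count() or 1      # auto-detect worker count
    if processes > 1:
        from multiprocessing import Pool     # only needed for a parallel build
        with Pool(processes) as pool:
            return pool.map(_compile_one, paths)
    return [_compile_one(p) for p in paths]  # sequential default (processes == 1)
```

With the default of 1 this never touches multiprocessing at all, which is the point being made: the common case works whether or not parallelism is available.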

And incidentally, I think that the argument *processes* should be renamed to *workers*, or *jobs* (like in make), and any mention of multiprocessing in the documentation should be removed (if any): multiprocessing is an implementation detail. When I type: make -jN I don't really care that make is using fork() ;-)

On Mon, Apr 28, 2014 at 12:56 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
Why would the user care if multiprocessing is used behind the scenes?
Err ... that was another set of questions that I forgot to ask.

(A) Why bother raising an error if multiprocessing is unavailable? After all, there is a perfectly fine fallback... On the other hand, errors should not pass silently. If a user has explicitly asked for multiprocessing, there should be some notice that it didn't happen. And builds are presumably something that a developer will monitor to respond to the Exception.

(A1) What sort of Error? I'm inclined to raise the original ImportError, but the patch prefers a ValueError.

(B) Assuming the exception, I suppose your question adds a 3rd special case of "Whatever the system suggests, and I don't care whether or not it involves multiprocessing."
It would be strange for processes=1 to fail if multiprocessing is not available.
As Claudiu pointed out, processes=1 should really mean 1 worker process, which is still different from "do everything in the main process". I'm not sure that level of control is really worth the complexity, but I'm not certain it isn't.
processes <= 0: use os.cpu_count()
I could understand doing that for 0 or -1; what is the purpose of doing it for both, let alone for -4? Are we at the point where the parameter should just take positive integers or one of a set of specified string values? -jJ

2014-04-28 18:29 GMT+01:00 Jim J. Jewett <jimjjewett@gmail.com>:
The point I'm making is that he's not asking for multiprocessing, he's asking for parallel build. If you pass 1 (or keep the default value), there's no fallback involved: the code should simply proceed sequentially.
(A1) What sort of Error? I'm inclined to raise the original ImportError, but the patch prefers a ValueError.
NotImplementedError would maybe make sense.
I disagree. If you pass jobs=1 (and not processes=1), then you don't care whether multiprocessing is available or not: you just do everything in your main process. It would be quite wasteful to create a single child process!
Honestly, as long as it accepts 0, I'm happy.

In article <CA+OGgf4weBk1NsXBSqi1g7tDE-=7rfkzd5bZn1MXiHWSggECEQ@mail.gmail.com>, "Jim J. Jewett" <jimjjewett@gmail.com> wrote:
For regrtest, there is an important difference between "do everything in the main process" and "do one test at a time in a subprocess", namely the inadvertent global side-effects that running some tests have on other tests. The latter option is much safer and gives more reproducible test results. For compileall, I don't think there is a big difference between the two cases such that the regrtest semantics need to be followed. -- Ned Deily, nad@acm.org

On Mon, Apr 28, 2014 at 6:32 PM, Jim J. Jewett <jimjjewett@gmail.com> wrote:
It could be much less. As an example, http://bugs.python.org/issue19943 has been closed 9 minutes after the report, it didn't have a patch and the fix had to be applied/grafted/merged on 3 branches. The time passed between the first commit and the push is less than a minute too. Clearly the time increases if you have to run the test suite or if there is a merge conflict/push race, and further decreases if there is a simple typo-fix on default only and a patch already available.
Is that overhead enough to override the one-issue-per-patch guideline?
That said, if the main fix should go only on default and it has a smaller unrelated fix that should also go on default only, then doing it together is generally OK (for some value of "unrelated" -- the fix should still be small and near the code you touched for the main fix). If the unrelated fix needs to be ported on all the branches (or in general in branches where the main fix shouldn't go), it's better to have two separate patches. If you commit the smaller fix together with the bigger one, and then decide to backport it, you will have to do a null merge and it gets slightly more complicated. Best Regards, Ezio Melotti

This issue raised too much bikeshedding. To wrap it up, I'll modify the patch with the following:

- processes renamed to workers
- `workers` defaults to 1
- When `workers` is equal to 0, then `os.cpu_count` will be used
- When `workers` > 1, multiple processes will be used
- When `workers` == 1, run normally (no multiple processes)
- Negative values really should raise a ValueError (as multiprocessing.Pool and soon concurrent.futures.Thread/ProcessPoolExecutor do)
- Will raise NotImplementedError if multiprocessing can't be used (when `workers` equals 0 or > 1)

If anyone agrees with the above, then I'll modify the patch. This will be its last iteration; any other bikeshedding should be addressed by the core dev who'll apply it. Thank you.

On 4/28/2014 4:24 PM, Claudiu Popa wrote:
I presume you mean for this to be the default.
Yay!
Having read the thread, though not much of the issue, these look to me like the best choices for a clean, comprehensible API.
-- Terry Jan Reedy

On Mon, 28 Apr 2014 23:24:16 +0300, Claudiu Popa <pcmanticore@gmail.com> wrote:
- Will raise NotImplementedError if multiprocessing can't be used (when `workers` equals to 0 or > 1)
I think the most common use case for this ability will be "run with the appropriate number of processes for the system I'm on", where 'the appropriate number' is 1 (the main process) if multiprocessing is not available. Otherwise the tool calling compileall would have to figure out how to "catch the error" (how do you do that when invoking a CLI?) and re-run the script using '1' itself. How you spell this I don't really care, but I think the above is the most common use case. --David
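The behaviour David describes could be spelled roughly like this (a sketch of the fallback only; it is not what the patch does -- the patch raises NotImplementedError instead):

```python
def default_workers():
    """Prefer one worker per CPU, degrading to 1 (the main process)
    when multiprocessing is unusable on this platform."""
    try:
        import multiprocessing  # may be absent or broken on some platforms
        import os
        return os.cpu_count() or 1
    except ImportError:
        return 1
```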

On 28 Apr 2014 16:19, "Ezio Melotti" <ezio.melotti@gmail.com> wrote:
On Mon, Apr 28, 2014 at 6:32 PM, Jim J. Jewett <jimjjewett@gmail.com>
wrote:
The 30 minute guesstimate was for doing it right for a patch you didn't write yourself: - final recheck that all the required pieces are there - patch import & merge commands - run the tests for every affected branch - coming up with the NEWS entry & commit message It's generally much, much faster for changes I write myself - another undesirable incentive since it further encourages us to work on our own changes over incorporating patches from others.
What Ezio said :) I definitely do coarser commits for CPython than I do for beaker-project.org, specifically because Gerrit provides decent tools for reviewing and merging a patch series, and we don't currently have anything like that in the CPython workflow. Cheers, Nick.

On Tue, Apr 29, 2014 at 11:11 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
I've just updated the issue with the latest patch. If any core dev could pick it up and commit it, I'll be thankful.

participants (9)
- Brett Cannon
- Charles-François Natali
- Claudiu Popa
- Ezio Melotti
- Jim J. Jewett
- Ned Deily
- Nick Coghlan
- R. David Murray
- Terry Reedy