Re: [Python-Dev] cpython (2.7): Fix comment blocks. Adjust blocksize to a power-of-two for better divmod

On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset:   84116:5accb0ac8bfb
branch:      2.7
parent:      84095:ca8e86711403
user:        Raymond Hettinger <python@rcn.com>
date:        Fri Jun 14 01:06:33 2013 -0400
summary:     Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?

I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.

Regards,

Antoine.

On 14.06.13 11:46, Antoine Pitrou wrote:
On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset: 84116:5accb0ac8bfb
Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?
I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.
I share the doubts of Antoine and I was going to write the same comment. I thought there were good reasons for the previous code. What has changed?

On Fri, Jun 14, 2013 at 6:47 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 14.06.13 11:46, Antoine Pitrou wrote:
On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset: 84116:5accb0ac8bfb
Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?
I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.
I share the doubts of Antoine and I was going to write the same comment. I thought there were good reasons for the previous code. What has changed?
I agree it would be interesting to hear about the reasons for the change. Raymond?

Eli

On 06/17/2013 09:05 PM, Eli Bendersky wrote:
On Fri, Jun 14, 2013 at 6:47 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 14.06.13 11:46, Antoine Pitrou wrote:
On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset: 84116:5accb0ac8bfb
Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?
I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.
I share the doubts of Antoine and I was going to write the same comment. I thought there were good reasons for the previous code. What has changed?
I agree it would be interesting to hear about the reasons for the change. Raymond?
Asking as a learner: are such non-bugfix changes appropriate for the 2.7 line?

--
~Ethan~

Raymond -

Why did you do this in the 2.7 branch? It hasn't been done in 3.3 or default, and it isn't the sort of change we make in a stable release branch without justification. What issue was this for? What problem were you solving?

-gps

On Mon, Jun 17, 2013 at 11:08 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 06/17/2013 09:05 PM, Eli Bendersky wrote:
On Fri, Jun 14, 2013 at 6:47 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 14.06.13 11:46, Antoine Pitrou wrote:
On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset: 84116:5accb0ac8bfb
Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?
I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.
I share the doubts of Antoine and I was going to write the same comment. I thought there were good reasons for the previous code. What has changed?
I agree it would be interesting to hear about the reasons for the change. Raymond?
Asking as a learner: are such non-bugfix changes appropriate for the 2.7 line?
-- ~Ethan~

Many people have raised concerns about this change, so I've now backed it out.

2013/6/18 Gregory P. Smith <greg@krypto.org>:
Raymond -
Why did you do this in the 2.7 branch?
It hasn't been done in 3.3 or default, and it isn't the sort of change we make in a stable release branch without justification. What issue was this for? What problem were you solving?
-gps
On Mon, Jun 17, 2013 at 11:08 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 06/17/2013 09:05 PM, Eli Bendersky wrote:
On Fri, Jun 14, 2013 at 6:47 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 14.06.13 11:46, Antoine Pitrou wrote:
On Fri, 14 Jun 2013 07:06:49 +0200 (CEST) raymond.hettinger <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/5accb0ac8bfb
changeset: 84116:5accb0ac8bfb
Fix comment blocks. Adjust blocksize to a power-of-two for better divmod computations.
Is there any rationale for changing the heuristic from "fits in a whole number of cachelines" to "allows fast division by the blocksize"?
I personally would prefer if such changes were justified a bit more than by a one-liner changeset message without any corresponding open issue.
I share the doubts of Antoine and I was going to write the same comment. I thought there were good reasons for the previous code. What has changed?
I agree it would be interesting to hear about the reasons for the change. Raymond?
Asking as a learner: are such non-bugfix changes appropriate for the 2.7 line?
-- ~Ethan~
--
Regards,
Benjamin

On 6/22/2013 2:17 PM, Benjamin Peterson wrote:
Many people have raised concerns about this change, so I've now backed it out.
I think that change also goes with this change:

http://hg.python.org/cpython/rev/f1dc30a1be72
changeset 84248:f1dc30a1be72 2.7
Arrange structure to match the common access patterns.

--- a/Modules/_collectionsmodule.c
+++ b/Modules/_collectionsmodule.c
@@ -48,8 +48,8 @@

 typedef struct BLOCK {
     struct BLOCK *leftlink;
+    PyObject *data[BLOCKLEN];
     struct BLOCK *rightlink;
-    PyObject *data[BLOCKLEN];
 } block;

 #define MAXFREEBLOCKS 10

That seems like a strange micro-optimization, at best. Based on that structure, it would seem that neither BLOCKLEN being 62 (previous value) nor 64 (the new value) makes much sense. It would seem best that sizeof(block) == 64, so BLOCKLEN should be (64 - 2*sizeof(PyObject *)). Nevertheless, I am skeptical that any tuning of this structure provides any meaningful performance improvement.

--
Scott Dial
scott@scottdial.com

On Sat, Jun 22, 2013 at 1:43 PM, Scott Dial <scott+python-dev@scottdial.com> wrote:
On 6/22/2013 2:17 PM, Benjamin Peterson wrote:
Many people have raised concerns about this change, so I've now backed it out.
I think that change also goes with this change:
http://hg.python.org/cpython/rev/f1dc30a1be72
changeset 84248:f1dc30a1be72 2.7 Arrange structure to match the common access patterns.
--- a/Modules/_collectionsmodule.c
+++ b/Modules/_collectionsmodule.c
@@ -48,8 +48,8 @@

 typedef struct BLOCK {
     struct BLOCK *leftlink;
+    PyObject *data[BLOCKLEN];
     struct BLOCK *rightlink;
-    PyObject *data[BLOCKLEN];
 } block;

 #define MAXFREEBLOCKS 10
, which seems like a strange micro-optimization, at best.
Based on that structure, it would seem that neither BLOCKLEN being 62 (previous value) nor 64 (the new value) makes much sense. It would seem best that sizeof(block) == 64, so BLOCKLEN should be (64 - 2*sizeof(PyObject *)). Nevertheless, I am skeptical that any tuning of this structure provides any meaningful performance improvement.
Actually the data buffer is an array of pointers too, so with the original BLOCKLEN value of 62, sizeof(block) would be 64 times sizeof(PyObject *). In the Py3 version of the source there's even a comment explaining this right before the #define BLOCKLEN.

Raymond, can you explain?

--
--Guido van Rossum (python.org/~guido)
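A minimal sketch of the size arithmetic here, assuming the pre-change layout shown in the diff above; void * stands in for PyObject * so it compiles without the CPython headers, and the struct below is an illustration, not the actual _collectionsmodule.c code:

    #include <stdio.h>

    #define BLOCKLEN 62                    /* the original 2.7 value */

    typedef struct BLOCK {
        struct BLOCK *leftlink;
        struct BLOCK *rightlink;
        void *data[BLOCKLEN];              /* stand-in for PyObject *data[BLOCKLEN] */
    } block;

    int main(void)
    {
        /* 2 link pointers + 62 data pointers = 64 pointers total, so
           sizeof(block) is 64 * sizeof(void *): 256 bytes on a 32-bit
           build, 512 bytes on a 64-bit build. */
        printf("sizeof(block) = %zu = 64 * %zu\n",
               sizeof(block), sizeof(void *));
        return 0;
    }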

I've backed this one out, too.

2013/6/22 Scott Dial <scott+python-dev@scottdial.com>:
On 6/22/2013 2:17 PM, Benjamin Peterson wrote:
Many people have raised concerns about this change, so I've now backed it out.
I think that change also goes with this change:
http://hg.python.org/cpython/rev/f1dc30a1be72
changeset 84248:f1dc30a1be72 2.7 Arrange structure to match the common access patterns.
--- a/Modules/_collectionsmodule.c
+++ b/Modules/_collectionsmodule.c
@@ -48,8 +48,8 @@

 typedef struct BLOCK {
     struct BLOCK *leftlink;
+    PyObject *data[BLOCKLEN];
     struct BLOCK *rightlink;
-    PyObject *data[BLOCKLEN];
 } block;

 #define MAXFREEBLOCKS 10
, which seems like a strange micro-optimization, at best.
Based on that structure, it would seem that neither BLOCKLEN being 62 (previous value) nor 64 (the new value) makes much sense. It would seem best that sizeof(block) == 64, so BLOCKLEN should be (64 - 2*sizeof(PyObject *)). Nevertheless, I am skeptical that any tuning of this structure provides any meaningful performance improvement.
--
Scott Dial
scott@scottdial.com
--
Regards,
Benjamin

On Jun 18, 2013, at 12:00 AM, Gregory P. Smith <greg@krypto.org> wrote:
Why did you do this in the 2.7 branch?
It hasn't been done in 3.3 or default
I worked on the 2.7 branch first because that was the one I had loaded and the one where I did timings and code disassembly. I intended to post it to 3.3 and 3.4 as well over the weekend. Ideally, it makes maintenance simpler for me if I keep the branches as similar as possible.

I viewed the one-line struct transposition, comment correction, and one-line blocklen change as being somewhat innocuous non-algorithmic changes. The struct change fixed an unintended cache miss for left links, and the blocksize change causes the deque_index code to compile more efficiently (using a right-shift and bitwise-and rather than a measurably more expensive division and modulo calculation).

I truly wasn't expecting the Spanish Inquisition :-)

On Jun 22, 2013, at 1:43 PM, Scott Dial <scott+python-dev@scottdial.com> wrote:
changeset 84248:f1dc30a1be72 2.7 Arrange structure to match the common access patterns.
--- a/Modules/_collectionsmodule.c
+++ b/Modules/_collectionsmodule.c
@@ -48,8 +48,8 @@

 typedef struct BLOCK {
     struct BLOCK *leftlink;
+    PyObject *data[BLOCKLEN];
     struct BLOCK *rightlink;
-    PyObject *data[BLOCKLEN];
 } block;

 #define MAXFREEBLOCKS 10
, which seems like a strange micro-optimization, at best.
Yes, this is a micro-optimization. In working on implementing deque slicing for 3.4, I restudied the block access patterns. On an appendleft(), popleft() or extendleft() operation, the left link is accessed immediately before or after the leftmost entry in the data block. The old struct arrangement can cause an unnecessary cache miss when jumping leftward. This was something I overlooked when I originally wrote the code almost a decade ago.

On Jun 23, 2013, at 11:38 AM, Benjamin Peterson <benjamin@python.org> wrote:
I've backed this one out, too.
Really, you reverted my one-line change within 24 hours of it being posted? I can't be on-line every day and sometimes it takes a little while to catch up with python email.

On Jun 22, 2013, at 2:55 PM, Guido van Rossum <guido@python.org> wrote:
Actually the data buffer is an array of pointers too, so with the original BLOCKLEN value of 62, sizeof(block) would be 64 times sizeof(PyObject *). In the Py3 version of the source there's even a comment explaining this right before the #define BLOCKLEN. Raymond, can you explain?
I also tried to fix that comment so it would stop emphasizing the blocklen being a multiple of the cache line. As long as there is a reasonably long data block, it matters less whether the data block size is an exact multiple of the cache line length (62 or 64 words of data versus a typical 8-word, 64-byte cache line).

The data block size does matter elsewhere though. The benefit of having the data block be a power-of-two is that the deque_index computation can use bit shifts rather than division and modulo calculations. The benefit of switching the data block size from 62 to 64 was measurable (a few percent) and observable in the disassembly of the code.

I experimented with one other ordering as well (putting the data block first and the links afterwards). That saved the scaled-index byte in the generated code but produced no measurable speed-up.

In short, I believe the following should go in:

* The comment fix (it incorrectly suggested that a 64-bit Py_ssize_t would overflow). The revised comment is simpler, more general, and correct.
* Putting the leftlink before the data block in the structure.

The following is up for debate:

* Changing the BLOCKLEN from 62 to 64. It measurably improved deque_index without an observable negative effect on the other operations.

Lastly, there was a change I just put in to Py 3.4 replacing the memcpy() with a simple loop and replacing the "deque->" references with local variables. Besides giving a small speed-up, it made the code more clear and less at the mercy of various implementations of memcpy(). Ideally, I would like 2.7 and 3.3 to replace their use of memcpy() as well, but the flavor of this thread suggests that is right out.

Raymond
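To make the indexing point concrete, here is a minimal sketch; the locate() function and its signature are invented for the illustration and are not the actual deque_index() code:

    #include <stddef.h>

    #define BLOCKLEN 64                     /* power of two */

    /* Map a non-negative deque index onto (block number, offset within
       block).  Because BLOCKLEN is a power of two, the compiler lowers
       the division/modulo pair to "i >> 6" and "i & 63".  With the old
       value of 62 it has to emit a genuine divide/modulo (or a
       multiply-by-reciprocal sequence), which is slower. */
    static void
    locate(size_t i, size_t *blocknum, size_t *offset)
    {
        *blocknum = i / BLOCKLEN;
        *offset   = i % BLOCKLEN;
    }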

On 6/23/2013 8:16 PM, Raymond Hettinger wrote:
Yes, this is a micro-optimization. In working on implementing deque slicing for 3.4, I restudied the block access patterns. On an appendleft(), popleft() or extendleft() operation, the left link is accessed immediately before or after the leftmost entry in the data block. The old struct arrangement can cause an unnecessary cache miss when jumping leftward. This was something I overlooked when I originally wrote the code almost a decade ago.
A decade ago, cache lines were 64 bytes, pointers were 4 bytes, and allocations were 16-byte aligned, so there could never be a cache miss. Nowadays, cache lines are still 64 bytes but pointers are 8 bytes, and we are still allocating on 16-byte alignment, so you have a 25% chance of a cache miss now.

--
Scott Dial
scott@scottdial.com
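A small sketch of where the 25% figure comes from; this is an illustration only, assuming the old layout (leftlink, rightlink, then data[]) and the 16-byte allocation alignment Scott mentions:

    #include <stdio.h>

    int main(void)
    {
        const int cacheline = 64;             /* bytes per cache line */
        const int ptrsize   = 8;              /* was 4 a decade ago */
        const int data0     = 2 * ptrsize;    /* offset of data[0] in the old layout */
        int split = 0, cases = 0;

        /* 16-byte allocation alignment means a block can start at offset
           0, 16, 32 or 48 within a cache line; count how often leftlink
           (offset 0) and data[0] land in different cache lines. */
        for (int start = 0; start < cacheline; start += 16, cases++) {
            if (start / cacheline != (start + data0) / cacheline)
                split++;
        }
        printf("%d of %d alignments split the fields: %d%%\n",
               split, cases, 100 * split / cases);  /* 1 of 4 -> 25% with 8-byte pointers */
        return 0;
    }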

On Jun 23, 2013, at 6:52 PM, Scott Dial <scott+python-dev@scottdial.com> wrote:
Nowadays, cache lines are still 64 bytes but pointers are 8 bytes, and we are still allocating on 16-byte alignment, so you have a 25% chance of a cache miss now.
Honestly, I'm not sure what you're arguing for or against.

The struct should be reordered so that the leftmost data element and left link are positioned side-by-side, and the same goes for the rightmost element and right link.

Whether the blocksize should change is debatable. It causes an unfortunate odd size at the end (not good for the final cache line), but it does improve the speed of the deque_index() code. The former seems to make no difference in timings while the latter gives a measurable speed-up.

Unfortunately, I'm losing interest in this part of the deque work. I've already invested substantial time reading papers (such as http://www.akkadia.org/drepper/cpumemory.pdf and http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-...), analyzing the code, reading disassembly, and making timings. But it isn't worth all the second guessing (and what feels like sniping). I've worked on this code for almost a decade. As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object. It is discouraging to have a simple parameter change and struct reordering reverted. This thread has offered zero support or encouragement for my work. The python-dev social environment appears to be degrading over time. I only have a few hours of development time each week and now I'm wasting that time responding to these emails (it may go with the territory, but it is a waste none-the-less).

If that is what you guys want, then leave the code as is (with an incorrect comment, a blocklen that is unfavorable to indexing, and a struct order that doesn't exploit cache locality by following the natural access patterns in the code).

I understand that the code for Py2.7 is sensitive and understand if you all prefer to leave it untouched.

Raymond

P.S. This has all arisen in the context of my working on a patch for implementing slicing for deques. In that context, the code for deque_item() and deque_rotate() will become more important than they were before. Along the way, I am re-examining all my existing code.

On Sun, Jun 23, 2013 at 11:37 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object. It is discouraging to have a simple parameter change and struct reordering reverted. This thread has offered zero support or encouragement for my work. The python-dev social environment appears to be degrading over time.
FWIW, I was surprised to see Raymond's patch reverted in what looked like a knee-jerk reaction. In my view Raymond through his contributions has earned the presumption of validity for his commits. A risk of introducing regressions may ultimately outweigh the benefit of optimization and keeping 2.x and 3.x code in sync, but there is no need in rushing the decision. This patch could be reverted at any time before the 2.7.6 release which as far as I know has not even been scheduled.

2013/6/23 Alexander Belopolsky <alexander.belopolsky@gmail.com>:
On Sun, Jun 23, 2013 at 11:37 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object. It is discouraging to have a simple parameter change and struct reordering reverted. This thread has offered zero support or encouragement for my work. The python-dev social environment appears to be degrading over time.
FWIW, I was surprised to see Raymond's patch reverted in what looked like a knee-jerk reaction. In my view Raymond through his contributions has earned the presumption of validity for his commits. A risk of introducing regressions may ultimately outweigh the benefit of optimization and keeping 2.x and 3.x code in sync, but there is no need in rushing the decision. This patch could be reverted at any time before the 2.7.6 release which as far as I know has not even been scheduled.
The first concern about 5accb0ac8bfb was raised more than a week ago and more piled up over the course of the week. I knew Raymond wasn't offline, since he continued to commit. Since we don't have mandatory reviews before push (which more and more I think we should), post-facto reviews on python-checkins need to be acknowledged. A backout is not the end of the world; it takes seconds to reland once a conclusion has been reached.

This incident should not be construed to diminish Raymond's long history of contribution or his technical ability. My second reversion (86d512e0ec66) was knee-jerk and shouldn't have happened.

--
Regards,
Benjamin

Your work is great! I even agree with the changes on a coding best practices level. It's just that it belongs on the default (3.4) branch as it is an enhancement, not a bug fix. We don't do new things on release branches. I agree that can be extremely frustrating at times, knowing that code won't see the light of day for most users for years, but it is what has always kept the Python project stable.

-gps

On Jun 23, 2013 8:38 PM, "Raymond Hettinger" <raymond.hettinger@gmail.com> wrote:
On Jun 23, 2013, at 6:52 PM, Scott Dial <scott+python-dev@scottdial.com> wrote:
Nowadays, cache lines are still 64 bytes but pointers are 8 bytes, and we are still allocating on 16-byte alignment, so you have a 25% chance of a cache miss now.
Honestly, I'm not sure what you're arguing for or against.
The struct should be reordered so that the leftmost data element and left link are positioned side-by-side, and the same goes for the rightmost element and right link.
Whether the blocksize should change is debatable. It causes an unfortunate odd size at the end (not good for the final cache line), but it does improve the speed of the deque_index() code. The former seems to make no difference in timings while the latter gives a measurable speed-up.
Unfortunately, I'm losing interest in this part of the deque work. I've already invested substantial time reading papers (such as http://www.akkadia.org/drepper/cpumemory.pdf and http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-... ), analyzing the code, reading disassembly, and making timings. But it isn't worth all the second guessing (and what feels like sniping). I've worked on this code for almost a decade. As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object. It is discouraging to have a simple parameter change and struct reordering reverted. This thread has offered zero support or encouragement for my work. The python-dev social environment appears to be degrading over time. I only have a few hours of development time each week and now I'm wasting that time responding to these emails (it may go with the territory, but it is a waste none-the-less).
If that is what you guys want, then leave the code as is (with an incorrect comment, a blocklen that is unfavorable to indexing, and a struct order that doesn't exploit cache locality by following the natural access patterns in the code).
I understand that the code for Py2.7 is sensitive and understand if you all prefer to leave it untouched.
Raymond
P.S. This has all arisen in the context of my working on a patch for implementing slicing for deques. In that context, the code for deque_item() and deque_rotate() will become more important than they were before. Along the way, I am re-examining all my existing code.

On 24 June 2013 13:37, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
But it isn't worth all the second guessing (and what feels like sniping). I've worked on this code for almost a decade. As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object. It is discouraging to have a simple parameter change and struct reordering reverted. This thread has offered zero support or encouragement for my work. The python-dev social environment appears to be degrading over time. I only have a few hours of development time each week and now I'm wasting that time responding to these emails (it may go with the territory, but it is a waste none-the-less).
The problem wasn't the change in and of itself, it was:

- the fact it happened in a stable branch
- the fact there was no explanation, even a week after one was requested (not even a quick "I do have the data to back up this assertion, please leave the commit alone until I have time to provide it").

The mailing lists, commit history, source code comments and files like dict_notes.txt are an important part of understanding *why* various parts of CPython are the way they are. So, no, "responding to emails" is *not* a waste of anybody's time. It helps to ensure important knowledge is transferred to more than the person that was responsible for the commit.

This is *especially* important for areas where one person has substantially more knowledge and experience than others (alleviating that kind of problem is one of the reasons Brett, myself and others have embarked on a multi-year campaign to make the import system less arcane and mysterious, and why I want to make the startup code more comprehensible).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, 23 Jun 2013 20:37:37 -0700 Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
But it isn't worth all the second guessing (and what feels like sniping). I've worked on this code for almost a decade. As far as I can tell, none of the participants in this thread has ever previously shown any interest in the deque object.
How do you measure "showing any interest"? I certainly use deques regularly (thank you for implementing them) and I'm sure others do too. I'm also interested in how the type is implemented and what the tradeoffs are there. Otherwise I wouldn't have noticed your commit.

(If you remember, I was also interested enough to propose grafting deque-like functionality onto OrderedDict: http://bugs.python.org/issue17100)

Regards,

Antoine.

On 24.06.13 03:16, Raymond Hettinger wrote:
I truly wasn't expecting the Spanish Inquisition :-)
I only asked about the reasons. The previous reasons seemed reasonable to me and I wanted to know why the new code is better than the old. It will help me to use the best style in other places. Thank you for your clarification.

Hi Raymond,

Thank you for your long explanation, it is exactly what Antoine asked for :-) I like micro-optimization, even if I know that some other developers only see such work as noise, not providing a real speed-up :-) So it was interesting to read your email!

I'm sorry that you were hurt by the revert of your commits (even though I didn't do the revert myself, I agreed with Benjamin, and he's the maintainer of Python 2.7). I hope that you understood why they were rejected after reading this mail thread: the last Python 2.7 release (which one exactly?) contains regressions, and changes on the 2.7 stable branch are sensitive. From experience, there is always someone in the world relying on very specific implementation details. Cython, PyPy CPyExt, ctypes, etc. are sensitive to such changes. People should not rely on such implementation details, but they do :-)

For this specific case, I "hope" that nobody relies on the exact BLOCK structure (since it is a private structure). But it is just safer not to change it in a minor Python version. If you compare the risk of introducing a regression versus the speedup, I guess that the risk is higher.

As I wrote, I'm interested in micro-optimization, so please continue your work! It is just safer to only modify the active development branch (Python 3.4).

Victor

2013/6/24 Raymond Hettinger <raymond.hettinger@gmail.com>:
On Jun 18, 2013, at 12:00 AM, Gregory P. Smith <greg@krypto.org> wrote:
Why did you do this in the 2.7 branch?
It hasn't been done in 3.3 or default
I worked on the 2.7 branch first because that was the one I had loaded and the one where I did timings and code disassembly. I intended to post it to 3.3 and 3.4 as well over the weekend.
Ideally, it makes maintenance simpler for me if I keep the branches as similar as possible. I viewed the one-line struct transposition, comment correction, and one-line blocklen change as being somewhat innocuous non-algorithmic changes. The struct change fixed an unintended cache miss for left links, and the blocksize change causes the deque_index code to compile more efficiently (using a right-shift and bitwise-and rather than a measurably more expensive division and modulo calculation).
I truly wasn't expecting the Spanish Inquisition :-)
On Jun 22, 2013, at 1:43 PM, Scott Dial <scott+python-dev@scottdial.com> wrote:
changeset 84248:f1dc30a1be72 2.7 Arrange structure to match the common access patterns.
--- a/Modules/_collectionsmodule.c
+++ b/Modules/_collectionsmodule.c
@@ -48,8 +48,8 @@

 typedef struct BLOCK {
     struct BLOCK *leftlink;
+    PyObject *data[BLOCKLEN];
     struct BLOCK *rightlink;
-    PyObject *data[BLOCKLEN];
 } block;

 #define MAXFREEBLOCKS 10
, which seems like a strange micro-optimization, at best.
Yes, this is a micro-optimization. In working on implementing deque slicing for 3.4, I restudied the block access patterns. On an appendleft(), popleft() or extendleft() operation, the left link is accessed immediately before or after the leftmost entry in the data block. The old struct arrangement can cause an unnecessary cache miss when jumping leftward. This was something I overlooked when I originally wrote the code almost a decade ago.
On Jun 23, 2013, at 11:38 AM, Benjamin Peterson <benjamin@python.org> wrote:
I've backed this one out, too.
Really, you reverted my one-line change within 24 hours of it being posted? I can't be on-line every day and sometimes it takes a little while to catch up with python email.
On Jun 22, 2013, at 2:55 PM, Guido van Rossum <guido@python.org> wrote:
Actually the data buffer is an array of pointers too, so with the original BLOCKLEN value of 62, sizeof(block) would be 64 times sizeof(PyObject *). In the Py3 version of the source there's even a comment explaining this right before the #define BLOCKLEN. Raymond, can you explain?
I also tried to fix that comment so it would stop emphasizing the blocklen being a multiple of the cache line. As long as there is a reasonably long data block, it matters less whether the data block size is an exact multiple of the cache line length (62 or 64 words of data versus a typical 8-word, 64-byte cache line).
The data block size does matter elsewhere though. The benefit of having the data block be a power-of-two is that the deque_index computation can use bit shifts rather than division and modulo calculations. The benefit of switching the data block size from 62 to 64 was measurable (a few percent) and observable in the disassembly of the code.
I experimented with one other ordering as well (putting the data block first and the links afterwards). That saved the scaled-index byte in the generated code but produced no measurable speed-up.
In short, I believe the following should go in:
* The comment fix. (Incorrectly suggesting that a 64-bit Py_ssize_t would overflow). The revised comment is simpler, more general, and correct.
* Putting the leftlink before the data block in the structure.
The following is up for debate:
Changing the BLOCKLEN from 62 to 64 is debatable. It measurably improved deque_index without an observable negative effect on the other operations.
Lastly, there was a change I just put in to Py 3.4 replacing the memcpy() with a simple loop and replacing the "deque->" references with local variables. Besides giving a small speed-up, it made the code more clear and less at the mercy of various implementations of memcpy().
Ideally, I would like 2.7 and 3.3 to replace their use of memcpy() as well, but the flavor of this thread suggests that is right out.
Raymond

2013/6/24 Raymond Hettinger <raymond.hettinger@gmail.com>:
Lastly, there was a change I just put in to Py 3.4 replacing the memcpy() with a simple loop and replacing the "deque->" references with local variables. Besides giving a small speed-up, it made the code more clear and less at the mercy of various implementations of memcpy().
Ideally, I would like 2.7 and 3.3 to replace their use of memcpy() as well, but the flavor of this thread suggests that is right out.
The memcpy() function is usually highly optimized with assembler code for each architecture. The GNU libc now does better: it can choose the fastest version depending on the CPU (MMX, SSE, etc.) at runtime. If I understood correctly, the glibc contains different versions of memcpy, and the dynamic linker (ld.so) chooses the version depending on the CPU.

GCC is also able to inline memcpy() when the size is known at compile time. I also saw two code paths when the size is only known at runtime: an inline version for small sizes, and a function call for larger copies. Python has a Py_MEMCPY macro which implements exactly that, but only for Visual Studio. I suppose that Visual Studio does not implement this optimization. By the way, Py_MEMCPY() is only used in a few places.

So it's surprising to read that a dummy loop is faster than memcpy()... even if I have already seen this in my own micro-benchmarks :-)

Do you have an idea of how we can decide between the dummy loop and memcpy()? Using a benchmark? Or can it be decided just by reading the C code? What is the policy for using Py_MEMCPY() vs memcpy()?

Victor
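For illustration, a minimal sketch of the pattern Victor describes: a plain loop for small copies and a call into the library's optimized memcpy() for larger ones. This is not the actual Py_MEMCPY definition from pyport.h; the function name and the 16-byte cutoff are invented for the sketch:

    #include <stddef.h>
    #include <string.h>

    /* Sketch only: small copies use a simple byte loop to avoid the call
       overhead of memcpy(); larger copies defer to the library's
       optimized memcpy().  The 16-byte cutoff is an illustrative guess. */
    static void
    copy_bytes(void *target, const void *source, size_t n)
    {
        if (n >= 16) {
            memcpy(target, source, n);
        }
        else {
            char *t = (char *)target;
            const char *s = (const char *)source;
            size_t i;
            for (i = 0; i < n; i++)
                t[i] = s[i];
        }
    }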

2013/6/24 Raymond Hettinger <raymond.hettinger@gmail.com>:
Changing the BLOCKLEN from 62 to 64 is debatable. It measurably improved deque_index without an observable negative effect on the other operations.
Out of curiosity, do you know (remember) how the number 62 was chosen? Is it a compromise between memory usage and performance? 62 is surprising because it is not a power of two :-) Is it just to have 64 (2+62) pointers in the structure? (64 is a power of 2.)

Does it help the memory allocator (reduce memory fragmentation) to have a structure of 256 bytes (32-bit pointers) or 512 bytes (64-bit pointers), which are powers of 2, instead of 264 or 528 bytes? It would be interesting to test the pymalloc memory allocator on deque: I plan to test Python globally with PyMem_Malloc using pymalloc. pymalloc has a threshold of 512 bytes (and 528 bytes is greater than 512!). I suppose that the memory allocator is (much) more expensive than integer divisions.

Would it be possible to change BLOCKLEN dynamically, depending on the total size of the deque? The Python dict does something like that, but for other reasons (to reduce the number of hash collisions). According to your comment, "Larger numbers reduce the number of calls to the memory allocator, give faster indexing and rotation, and reduce the link::data overhead ratio." With a dynamic BLOCKLEN, it would also be possible to reduce the memory usage for small deques (fewer than 62 items).

Victor
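A small sketch of the size arithmetic in Victor's question; the 512-byte figure is the pymalloc small-request threshold he cites, and the program is an illustration rather than anything from the CPython sources:

    #include <stdio.h>

    int main(void)
    {
        const int threshold   = 512;              /* pymalloc small-request limit cited above */
        const int blocklens[] = {62, 64};
        const int ptrsizes[]  = {4, 8};           /* 32-bit and 64-bit builds */

        /* A block holds 2 link pointers plus BLOCKLEN data pointers. */
        for (int b = 0; b < 2; b++) {
            for (int p = 0; p < 2; p++) {
                int size = (2 + blocklens[b]) * ptrsizes[p];
                printf("BLOCKLEN=%d, %d-bit: %d bytes (%s the %d-byte threshold)\n",
                       blocklens[b], 8 * ptrsizes[p], size,
                       size <= threshold ? "within" : "over", threshold);
            }
        }
        return 0;
    }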

On Jun 24, 2013, at 4:07 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Out of curiosity, do you know (remember) how the number 62 was chosen? Is it a compromise between memory usage and performance? 62 is surprising because it is not a power of two :-)
Is it to just have 64 (2+62) pointers in the structure?
Yes, the goal was to have the struct size be an exact multiple of the cache line length (always a power-of-two, typically 64 bytes). What was different then is that deques weren't indexable. When indexing was added, the size of 62 became an unfavorable choice because it made the division and modulo calculation in deque_index() slower than for a power of two.
For this specific case, I "hope" that nobody relies on the exact BLOCK structure (since it is a private structure).
I don't know what you're talking about. This structure isn't externally visible.

Raymond

On Mon, Jun 24, 2013 at 7:47 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Jun 24, 2013, at 4:07 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Out of curiosity, do you know (remember) how the number 62 was chosen? Is it a compromise between memory usage and performance? 62 is surprising because it is not a power of two :-)
Is it to just have 64 (2+62) pointers in the structure?
Yes, the goal was to have the struct size be an exact multiple of the cache line length (always a power-of-two, typically 64 bytes). What was different then is that deques weren't indexable. When indexing was added, the size of 62 became an unfavorable choice because it made the division and modulo calculation in deque_index() slower than for a power of two.
A-ha! Finally an explanation of the change. It makes intuitive sense now. I think the general feeling is that folks overreacted (perhaps confused by your silence) and that the reversal will be rolled back. Benjamin?

--
--Guido van Rossum (python.org/~guido)

On Jun 24, 2013 9:11 PM, "Guido van Rossum" <guido@python.org> wrote:
On Mon, Jun 24, 2013 at 7:47 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Jun 24, 2013, at 4:07 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Out of curiosity, do you know (remember) how the number 62 was chosen? Is it a compromise between memory usage and performance? 62 is surprising because it is not a power of two :-)
Is it to just have 64 (2+62) pointers in the structure?
Yes, the goal was to have the struct size be an exact multiple of the cache line length (always a power-of-two, typically 64 bytes). What was different then is that deques weren't indexable. When indexing was added, the size of 62 became an unfavorable choice because it made the division and modulo calculation in deque_index() slower than for a power of two.
A-ha! Finally an explanation of the change. It makes intuitive sense now. I think the general feeling is that folks overreacted (perhaps confused by your silence) and that the reversal will be rolled back. Benjamin?
Seems likely to me.
--
--Guido van Rossum (python.org/~guido)

2013/6/24 Guido van Rossum <guido@python.org>:
On Mon, Jun 24, 2013 at 7:47 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Jun 24, 2013, at 4:07 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Out of curiosity, do you know (remember) how the number 62 was chosen? Is it a compromise between memory usage and performance? 62 is surprising because it is not a power of two :-)
Is it to just have 64 (2+62) pointers in the structure?
Yes, the goal was to have the struct size be an exact multiple of the cache line length (always a power-of-two, typically 64 bytes). What was different then is that deques weren't indexable. When indexing was added, the size of 62 became an unfavorable choice because it made the division and modulo calculation in deque_index() slower than for a power of two.
A-ha! Finally an explanation of the change. It makes intuitive sense now. I think the general feeling is that folks overreacted (perhaps confused by your silence) and that the reversal will be rolled back. Benjamin?
Raymond, go ahead and reapply your change.

--
Regards,
Benjamin

On Jun 24, 2013, at 10:12 PM, Benjamin Peterson <benjamin@python.org> wrote:
Raymond, go ahead and reapply your change.
If you don't mind, I think you should be the one to undo your own reversions.

Thank you,
Raymond

2013/6/25 Raymond Hettinger <raymond.hettinger@gmail.com>:
On Jun 24, 2013, at 10:12 PM, Benjamin Peterson <benjamin@python.org> wrote:
Raymond, go ahead and reapply your change.
If you don't mind, I think you should be the one to undo your own reversions.
Done.

--
Regards,
Benjamin
participants (12):

- Alexander Belopolsky
- Antoine Pitrou
- Benjamin Peterson
- Eli Bendersky
- Ethan Furman
- Gregory P. Smith
- Guido van Rossum
- Nick Coghlan
- Raymond Hettinger
- Scott Dial
- Serhiy Storchaka
- Victor Stinner