optimization required: .format() is much slower than %
Dear fellow Python developers!

Ten minutes ago I raised a concern about speed differences between the old-style % formatting and the new .format() code. Some quick benchmarking by Benjamin and me showed that it's even worse than I expected.

$ ./python -m timeit "'%s'.format('nothing')"
100000 loops, best of 3: 2.63 usec per loop
$ ./python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.188 usec per loop

$ ./python -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 4.34 usec per loop
$ ./python -m timeit "'some text with %s' % 'nothing'"
100000 loops, best of 3: 2.04 usec per loop

$ ./python -m timeit "'some text with {0} {1}'.format('nothing', 'more')"
100000 loops, best of 3: 6.77 usec per loop
$ ./python -m timeit "'some text with %s %s' % ('nothing', 'more')"
100000 loops, best of 3: 2.22 usec per loop

As you can clearly see, the new .format() code is *much* slower than the old-style % code. I recommend we spend some time optimizing the common code paths of the new .format() code.

As a first step I propose moving the __format__ method to a new type slot. __format__() is called for every object, and my gut feeling says that a slot method will speed up the most common usage, "{0}".format(some_string).

Christian
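The common case Christian mentions ultimately routes every replacement field through the argument's __format__ method, which is what a dedicated type slot would accelerate; a minimal illustration:

# Minimal illustration (not the C implementation): formatting a single field
# with an empty format spec goes through the argument's __format__ method,
# which is why a dedicated type slot for __format__ could matter.
s = 'hello'
assert '{0}'.format(s) == format(s, '') == s.__format__('')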
On Tue, May 27, 2008 at 12:43 AM, Christian Heimes <lists@cheimes.de> wrote:
Ten minutes ago I raised a concern about speed differences between the old style % formatting and the new .format() code. Some quick benchmarking from Benjamin and me showed, that it's even worse than I expected.
My own tests show a less dire difference:

$ ./python -m timeit "'some text with %s'.format('nothing')"
1000000 loops, best of 3: 0.578 usec per loop
$ ./python -m timeit "'some text with %s' % 'nothing'"
1000000 loops, best of 3: 0.472 usec per loop
$ ./python -m timeit "'%s'.format('nothing')"
100000 loops, best of 3: 2.63 usec per loop
$ ./python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.188 usec per loop
It struck me as odd that this one case shows such a big difference while the others show less of one. My tests show that the old-style % formatting is much faster when the final string is 20 characters or less:

$ ./python -m timeit "'....|....|....|...%s' % '12'"
10000000 loops, best of 3: 0.0764 usec per loop
$ ./python -m timeit "'....|....|....|...%s' % '123'"
1000000 loops, best of 3: 0.481 usec per loop

A read through stringobject.c didn't give me any clue as to why 20 is the magic number. %d shows an even bigger jump:

$ ./python -m timeit "'....|....|....|...%d' % 12"
10000000 loops, best of 3: 0.0764 usec per loop
$ ./python -m timeit "'....|....|....|...%d' % 123"
1000000 loops, best of 3: 1.28 usec per loop

Schiavo
Simon
Simon Cross wrote:
It struck me as odd that this one case shows such a big difference while the others show less of one.
I believe the %-formatting code has some optimisations in place where it realises it can just increment the reference count of the passed-in string and return that, rather than having to build a new string object.

As for why you didn't see any differences in a couple of your tests: the .format() call wasn't actually substituting anything. Unfortunately, the reasons why .format() doesn't complain about extra arguments never made it into the PEP. They can be found here:

http://mail.python.org/pipermail/python-dev/2006-May/065062.html

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
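The second point is easy to check: a template with no {} replacement fields is returned unchanged by .format(), so the '%s'-style .format() benchmarks above measured little more than call overhead. For example:

# A template without {} replacement fields is returned as-is by .format(),
# so no substitution actually happened in those benchmarks.
print('some text with %s'.format('nothing'))   # -> some text with %s
print('some text with {0}'.format('nothing'))  # -> some text with nothing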
Simon Cross <hodgestar <at> gmail.com> writes:
My tests show that the old-style % formatting is much faster when the final string is 20 characters or less:
$ ./python -m timeit "'....|....|....|...%s' % '12'"
10000000 loops, best of 3: 0.0764 usec per loop
You are the victim of a constant-folding optimization:

$ ./python -m timeit "'....|....|....|...%s' % '12'"
10000000 loops, best of 3: 0.0926 usec per loop
$ ./python -m timeit -s "s='12'" "'....|....|....|...%s' % s"
1000000 loops, best of 3: 0.525 usec per loop
>>> def f(): return '....|....|....|...%s' % '12'
...
>>> dis.dis(f)
  1           0 LOAD_CONST               3 ('....|....|....|...12')
              3 RETURN_VALUE
cheers Antoine.
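The folding is easy to see with the dis module (a sketch; exact bytecode varies between Python versions, and the peephole optimizer of that era declined to fold results longer than roughly 20 characters, which would explain the "magic number" Simon observed):

import dis

def folded():
    # Both operands are constants, so the peephole optimizer can fold the
    # whole expression into one pre-built string at compile time.
    return '....|....|....|...%s' % '12'

def not_folded(s):
    # With a variable operand the formatting really happens at run time,
    # which is what the timeit variant with -s "s='12'" measures.
    return '....|....|....|...%s' % s

dis.dis(folded)      # expect a single LOAD_CONST of the folded string
dis.dis(not_folded)  # expect the formatting operation in the bytecode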
On Mon, May 26, 2008 at 3:43 PM, Christian Heimes <lists@cheimes.de> wrote:
As you can clearly see, the new .format() code is *much* slower than the old-style % code. I recommend we spend some time optimizing the common code paths of the new .format() code.
As a first step I propose moving the __format__ method to a new type slot. __format__() is called for every object, and my gut feeling says that a slot method will speed up the most common usage, "{0}".format(some_string).
My gut feels the same way... How about first comparing profile data for the new vs. old string formatting functions? If the profiles disagree with the timings above, then the ".format()" name lookup and function-call overhead is likely consuming the time missing from the profile.

-gps
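A rough way to collect the profile data Gregory asks for (a sketch using cProfile, which at least reports built-in calls such as {method 'format' of 'str' objects}; the numbers are illustrative only):

import cProfile

def new_style(n=200000):
    # Exercise the new formatting path repeatedly so it dominates the profile.
    for _ in range(n):
        'some text with {0} {1}'.format('nothing', 'more')

def old_style(n=200000):
    # Same amount of work through the old %-formatting path.
    for _ in range(n):
        'some text with %s %s' % ('nothing', 'more')

cProfile.run('new_style()')
cProfile.run('old_style()')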
Gregory P. Smith wrote:
My gut feels the same way... How about first comparing profile data for the new vs. old string formatting functions? If the profiles disagree with the timings above, then the ".format()" name lookup and function-call overhead is likely consuming the time missing from the profile.
Because it's a more general and extensible mechanism, .format() will never be as fast as %-formatting. But I agree it could no doubt be improved.

When implementing this, my gut feeling was that the performance difference mostly comes from:

- the "__format__" lookup, and
- the fact that each __format__() call returns an object; that overhead is non-trivial compared to the hard-coded types in %-formatting.

Not that this means anything without actual profiling. Here's about the best direct comparison I could think of:

$ ./python -m timeit "'nothing'.__format__('')"
1000000 loops, best of 3: 1.19 usec per loop
$ ./python -m timeit "'{0}'.format('nothing')"
100000 loops, best of 3: 3.74 usec per loop

This shows there is some overhead to the "format" lookup and result computation, but it's still much slower than:

$ ./python -m timeit "'%s' % 'nothing'"
1000000 loops, best of 3: 0.223 usec per loop
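To make the "more general mechanism" point concrete, here is a rough Python-level sketch of the work each replacement field implies, using string.Formatter for the parsing; the real implementation is C code, so this illustrates the steps, not their cost:

from string import Formatter

def toy_format(template, *args, **kwargs):
    # Roughly the per-field work str.format implies: parse the template,
    # fetch each argument, then route it through format() (i.e. the object's
    # __format__ method) before joining the pieces. Illustrative only.
    f = Formatter()
    out = []
    for literal, field, spec, conversion in f.parse(template):
        out.append(literal)
        if field is not None:
            obj, _ = f.get_field(field, args, kwargs)
            if conversion == 'r':
                obj = repr(obj)
            elif conversion == 's':
                obj = str(obj)
            out.append(format(obj, spec))
    return ''.join(out)

print(toy_format('some text with {0} {1}', 'nothing', 'more'))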
"Gregory P. Smith" <greg@krypto.org> wrote in message news:52dc1c820805261717t3842df91ubc44111872d19b98@mail.gmail.com... | On Mon, May 26, 2008 at 3:43 PM, Christian Heimes <lists@cheimes.de> wrote: | > Dear fellow Python developers! | > | > Ten minutes ago I raised a concern about speed differences between the | > old style % formatting and the new .format() code. Some quick | > benchmarking from Benjamin and me showed, that it's even worse than I | > expected. | > | > $ ./python -m timeit "'%s'.format('nothing')" | > 100000 loops, best of 3: 2.63 usec per loop | > $ ./python -m timeit "'%s' % 'nothing'" | > 10000000 loops, best of 3: 0.188 usec per loop | > | > $ ./python -m timeit "'some text with {0}'.format('nothing')" | > 100000 loops, best of 3: 4.34 usec per loop | > $ ./python -m timeit "'some text with %s' % 'nothing'" | > 100000 loops, best of 3: 2.04 usec per loop | > | > $ ./python -m timeit "'some text with {0} {1}'.format('nothing', 'more')" | > 100000 loops, best of 3: 6.77 usec per loop | > $ ./python -m timeit "'some text with %s %s' % ('nothing', 'more')" | > 100000 loops, best of 3: 2.22 usec per loop | > | > As you can clearly see the new .format() code is *much* slower than the | > old style % code. I recommend we spend some time on optimizing common | > code paths of the new .format() code. | > | > As first step I propose the move the __format__ method to a new type | > slot. __format__() is called for every object. My gut feeling says that | > a slot method is going to speed up the most common usage | > "{0}".format(some_string). | | My gut feels the same way... How about seeing the profile data of the | new vs old string formatting functions first as a comparison. If | those disagree vs the above then the ".format()" name lookup In real usages, the format string is typically a constant. To mirror this, the .format name lookup and creation of the bound method object could be factored out of the testing loop and put in a setup string. IOW, I write code like emsg = 'Bad input {0}'.format for i in inputs: if squiggle(i): emsg(i) else: realwork(i) However, if I have done the following correctly (my first use of timeit), this seems not to be the slowdown.
>>> t = timeit.Timer(stmt="'{0}'.format(None)")
>>> t.timeit()
3.769231005616632
>>> t2 = timeit.Timer(setup="m='{0}'.format", stmt='m(None)')
>>> t2.timeit()
3.7023311622007213
tjr
Following Terry's suggestion, I repeated Christian's tests with the lookup of the format method taken outside the loop (since one advantage of the new approach is to make that micro-optimisation easier), and also tried a few additional examples.

First, there seems to be about a 0.25 microsecond hit (on my machine) which reflects the impact of the lookup of the "format" attribute itself.

Secondly, the string % operator appears to have an explicit optimisation for the 'just return str(self)' case. This optimisation is missing from the new string format method.

The only optimisation idea I came up with (other than the addition of a tp_format slot) is for string objects to be able to cache their (lookup:subformat) pairs rather than having to parse themselves every time. That has obvious memory consumption implications though.

Cheers,
Nick.

Test results:

# No formatting at all, just the method invocation:
$ ./python -m timeit "''.format()"
1000000 loops, best of 3: 0.635 usec per loop
$ ./python -m timeit -s "fmt = '%s'.format" "fmt()"
1000000 loops, best of 3: 0.48 usec per loop
$ ./python -m timeit "'' % ()"
1000000 loops, best of 3: 0.389 usec per loop

# Format one argument by position
$ ./python -m timeit "'{0}'.format('nothing')"
1000000 loops, best of 3: 1.56 usec per loop
$ ./python -m timeit -s "fmt = '{0}'.format" "fmt('nothing')"
1000000 loops, best of 3: 1.44 usec per loop
$ ./python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.0736 usec per loop

# Format one non-string argument by position
$ ./python -m timeit "'{0}'.format(0)"
100000 loops, best of 3: 1.74 usec per loop
$ ./python -m timeit -s "fmt = '{0}'.format" "fmt(0)"
1000000 loops, best of 3: 1.51 usec per loop
$ ./python -m timeit "'%s' % 0"
10000000 loops, best of 3: 0.0987 usec per loop

# Format one argument by position with additional text
$ ./python -m timeit "'some text with {0}'.format('nothing')"
1000000 loops, best of 3: 1.7 usec per loop
$ ./python -m timeit -s "fmt = 'some text with {0}'.format" "fmt('nothing')"
1000000 loops, best of 3: 1.45 usec per loop
$ ./python -m timeit "'some text with %s' % 'nothing'"
1000000 loops, best of 3: 0.612 usec per loop

# Format two arguments by position with additional text
$ ./python -m timeit "'some text with {0} {1}'.format('nothing', 'more')"
100000 loops, best of 3: 2.63 usec per loop
$ ./python -m timeit -s "fmt = 'some text with {0} {1}'.format" "fmt('nothing', 'more')"
100000 loops, best of 3: 2.31 usec per loop
$ ./python -m timeit "'some text with %s %s' % ('nothing', 'more')"
1000000 loops, best of 3: 0.804 usec per loop

# Format two non-string arguments by position with additional text
$ ./python -m timeit "'some text with {0} {1}'.format(0, 1)"
100000 loops, best of 3: 2.81 usec per loop
$ ./python -m timeit -s "fmt = 'some text with {0} {1}'.format" "fmt(0, 1)"
100000 loops, best of 3: 2.46 usec per loop
$ ./python -m timeit "'some text with %s %s' % (0, 1)"
1000000 loops, best of 3: 0.851 usec per loop

# Format two non-string arguments by name with additional text
$ ./python -m timeit -s "d = dict(a=0, b=1)" "'some text with {a} {b}'.format(**d)"
100000 loops, best of 3: 2.81 usec per loop
$ ./python -m timeit -s "d = dict(a=0, b=1)" -s "fmt = 'some text with {a} {b}'.format" "fmt(**d)"
100000 loops, best of 3: 2.58 usec per loop
$ ./python -m timeit -s "d = dict(a=0, b=1)" "'some text with %(a)s %(b)s' % (d)"
1000000 loops, best of 3: 1 usec per loop

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
Nick Coghlan <ncoghlan <at> gmail.com> writes:
The only optimisation idea I came up with (other than the addition of a tp_format slot) is for string objects to be able to cache their (lookup:subformat) pairs rather than having to parse themselves every time. That has obvious memory consumption implications though.
If a string is used as a formatting template, it is bound to be used again as such. So IMHO it sounds quite reasonable. In order to avoid memory consumption issues there could be a centralized cache as for regular expressions. It makes it easier to handle eviction based on various parameters, and it saves a few bytes for string objects which are never used as a formatting template. Regards Antoine.
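A Python-level sketch of what such a centralized cache might look like (hypothetical names and eviction policy; the real thing would live in C and cache the parsed field descriptions):

from string import Formatter

# Hypothetical sketch of a centralized template cache in the spirit of the
# re module's pattern cache; the names and eviction policy are made up here.
_MAXCACHE = 100
_template_cache = {}

def _parse_template(fmt):
    # Parse the format string once into (literal, field, spec, conversion) chunks.
    return list(Formatter().parse(fmt))

def cached_parse(fmt):
    try:
        return _template_cache[fmt]
    except KeyError:
        parsed = _parse_template(fmt)
        if len(_template_cache) >= _MAXCACHE:
            _template_cache.clear()   # crude eviction, much like re's cache of that era
        _template_cache[fmt] = parsed
        return parsed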
On Tue, May 27, 2008 at 3:12 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Nick Coghlan <ncoghlan <at> gmail.com> writes:
The only optimisation idea I came up with (other than the addition of a tp_format slot) is for string objects to be able to cache their (lookup:subformat) pairs rather than having to parse themselves every time. That has obvious memory consumption implications though.
If a string is used as a formatting template, it is bound to be used again as such. So IMHO it sounds quite reasonable.
In order to avoid memory consumption issues there could be a centralized cache as for regular expressions. It makes it easier to handle eviction based on various parameters, and it saves a few bytes for string objects which are never used as a formatting template.
Regards
Antoine.
+1 to that idea.
Antoine Pitrou schrieb:
In order to avoid memory consumption issues there could be a centralized cache as for regular expressions. It makes it easier to handle eviction based on various parameters, and it saves a few bytes for string objects which are never used as a formatting template.
Good idea! I suggest you hook into the string interning code and use a similar approach. Christian
Christian Heimes wrote:
Antoine Pitrou schrieb:
In order to avoid memory consumption issues there could be a centralized cache as for regular expressions. It makes it easier to handle eviction based on various parameters, and it saves a few bytes for string objects which are never used as a formatting template.
Good idea! I suggest you hook into the string interning code and use a similar approach.
I don't think parsing the strings is where it spends its time. I think the time is spent in object creation, __format__ lookup, and building the result. I'm looking into some of these other optimizations now.
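One way to tease those costs apart is to time the pieces separately (a sketch; the absolute numbers vary by machine and build, only the relative ordering is interesting):

import timeit

# Rough micro-benchmarks separating some of the costs mentioned above:
# the "format" attribute lookup, the __format__ dispatch, and the old path.
cases = [
    ("str.format, lookup included", "'{0}'.format('nothing')", None),
    ("str.format, lookup hoisted ", "fmt('nothing')", "fmt = '{0}'.format"),
    ("__format__ call only       ", "'nothing'.__format__('')", None),
    ("old-style %                ", "'%s' % 'nothing'", None),
]
N = 1000000
for label, stmt, setup in cases:
    t = timeit.Timer(stmt, setup=setup or "pass")
    total = t.timeit(number=N)
    print("%s %.3f usec per loop" % (label, total / N * 1e6))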
Nick> $ ./python -m timeit "'' % ()"
Nick> 1000000 loops, best of 3: 0.389 usec per loop

vs.

Nick> $ ./python -m timeit "'%s' % 'nothing'"
Nick> 10000000 loops, best of 3: 0.0736 usec per loop

I think you need to use a tuple for the second case to make it comparable to the first.

Skip
skip@pobox.com wrote:
Nick> $ ./python -m timeit "'' % ()"
Nick> 1000000 loops, best of 3: 0.389 usec per loop
vs.
Nick> $ ./python -m timeit "'%s' % 'nothing'"
Nick> 10000000 loops, best of 3: 0.0736 usec per loop
I think you need to use a tuple for the second case to make it comparable to the first.
It doesn't actually make that big a difference - I'm guessing a Py_INCREF shortcut ends up getting used either way:

$ ./python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.0848 usec per loop
$ ./python -m timeit "'%s' % 'nothing',"
10000000 loops, best of 3: 0.133 usec per loop
$ ./python -m timeit "'' % ()"
1000000 loops, best of 3: 0.513 usec per loop

(times are a bit variable at this very moment since I have a few different apps open)

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
Nick Coghlan wrote:
skip@pobox.com wrote:
Nick> $ ./python -m timeit "'' % ()"
Nick> 1000000 loops, best of 3: 0.389 usec per loop
vs.
Nick> $ ./python -m timeit "'%s' % 'nothing'"
Nick> 10000000 loops, best of 3: 0.0736 usec per loop
I think you need to use a tuple for the second case to make it comparable to the first.
It doesn't actually make that big a difference - I'm guessing a Py_INCREF shortcut ends up getting used either way:
$ ./python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.0848 usec per loop
$ ./python -m timeit "'%s' % 'nothing',"
10000000 loops, best of 3: 0.133 usec per loop
Technically there you aren't using a tuple to provide the substitution argument, you are doing the substitution from a single string then putting the result in a tuple:
>>> '%s' % 'nothing',
('nothing',)
On the basis of evidence like this:

sholden@lifeboy ~
$ python -m timeit "'%s' % 'nothing'"
10000000 loops, best of 3: 0.0705 usec per loop

sholden@lifeboy ~
$ python -m timeit "'%s' % ('nothing',)"
1000000 loops, best of 3: 0.691 usec per loop

I'd say not using a tuple is the way to go if you ever need to optimize the code for speed.
$ ./python -m timeit "'' % ()"
1000000 loops, best of 3: 0.513 usec per loop
If you want an even stranger result, it appears to take the interpreter longer to substitute nothing from an empty tuple than to substitute an empty string from a string:

sholden@lifeboy ~
$ python -m timeit "'' % ()"
1000000 loops, best of 3: 0.454 usec per loop

sholden@lifeboy ~
$ python -m timeit "'%s' % ''"
10000000 loops, best of 3: 0.0715 usec per loop
(times are a bit variable at this very moment since I have a few different apps open)
Me too. Timings are not benchmarks, I am not a lawyer, etc. Our machines seem to be of comparable speed.

regards
Steve

--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC      http://www.holdenweb.com/
Eric Smith wrote:
Nick Coghlan wrote:
Secondly, the string % operator appears to have an explicit optimisation for the 'just return str(self)' case. This optimisation is missing from the new string format method.
I'll see if I can optimize this case.
3.0, from svn:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 3.14 usec per loop

3.0, with PyUnicode_ExactCheck, skipping the __format__ lookup:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 2.32 usec per loop

I could probably improve this some more by skipping the creation of the object used to describe the format_spec.

Compare with:
$ ./python.exe -m timeit "'some text with %s' % 'nothing'"
1000000 loops, best of 3: 1.37 usec per loop
Eric Smith wrote:
Eric Smith wrote:
Nick Coghlan wrote:
Secondly, the string % operator appears to have an explicit optimisation for the 'just return str(self)' case. This optimisation is missing from the new string format method.
I'll see if I can optimize this case.
3.0, from svn:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 3.14 usec per loop
3.0, with PyUnicode_ExactCheck, skipping the __format__ lookup:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 2.32 usec per loop
I could probably improve this some more, by skipping the creation of the object used to describe the format_spec.
Compare with:
$ ./python.exe -m timeit "'some text with %s' % 'nothing'"
1000000 loops, best of 3: 1.37 usec per loop
r63826 has an optimization when formatting types that are exactly unicode, long, or float (not subclasses). Unfortunately it's only for 3.0. For 2.6 it's messier. I'm still pondering what to do with that. I'll get to it sometime, but maybe after the beta. There's much more that can be done, but I'd like to profile it before I just start guessing.
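The shape of that optimization, expressed as a Python-level sketch (the real change is in the C formatting code; this only illustrates the idea of bypassing the generic __format__ dispatch for exact built-in types):

def format_field(obj, spec):
    # Sketch of the fast-path idea: for objects whose type is *exactly* a
    # known built-in (here just str) and an empty format spec, skip the
    # generic __format__ lookup entirely.
    if spec == '' and type(obj) is str:
        return obj                     # fast path: no __format__ dispatch
    return format(obj, spec)           # general path: dispatches to __format__

# A subclass must still take the general path, e.g.:
class S(str):
    def __format__(self, spec):
        return 'custom!'

print(format_field('nothing', ''))     # 'nothing' via the fast path
print(format_field(S('nothing'), ''))  # 'custom!' via the general path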
Eric Smith wrote:
Eric Smith wrote:
Eric Smith wrote:
Nick Coghlan wrote:
Secondly, the string % operator appears to have an explicit optimisation for the 'just return str(self)' case. This optimisation is missing from the new string format method.
I'll see if I can optimize this case.
3.0, from svn:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 3.14 usec per loop
3.0, with PyUnicode_ExactCheck, skipping the __format__ lookup:
$ ./python.exe -m timeit "'some text with {0}'.format('nothing')"
100000 loops, best of 3: 2.32 usec per loop
I could probably improve this some more, by skipping the creation of the object used to describe the format_spec.
Compare with:
$ ./python.exe -m timeit "'some text with %s' % 'nothing'"
1000000 loops, best of 3: 1.37 usec per loop
r63826 has an optimization when formatting types that are exactly unicode, long, or float (not subclasses). Unfortunately it's only for 3.0. For 2.6 it's messier. I'm still pondering what to do with that. I'll get to it sometime, but maybe after the beta.
I finally backported this to 2.6 in r65814. There's a similar 30% speedup for the simplest cases. Unicode optimization is worse than string optimization, because of the way int, long, and float formatters work. This can be fixed, but I'm not sure the gain justifies the effort. I doubt I'll have time to look into this before beta3. Eric.
Eric Smith <eric <at> trueblade.com> writes:
I finally backported this to 2.6 in r65814. There's a similar 30% speedup for the simplest cases. Unicode optimization is worse than string optimization, because of the way int, long, and float formatters work. This can be fixed, but I'm not sure the gain justifies the effort. I doubt I'll have time to look into this before beta3.
Given the choice, I would think it's more important to further speed up the implementation in 3.0 than to backport the existing speedups to 2.6. In 2.6 the standard formatting facility is still the % operator. (Just my two suboptimal cents)
Participants (10)

- Antoine Pitrou
- Christian Heimes
- Eric Smith
- Eric Smith
- Gregory P. Smith
- Nick Coghlan
- Simon Cross
- skip@pobox.com
- Steve Holden
- Terry Reedy