Equivalent of Perl chomp?

Bengt Richter bokr at oz.net
Mon Feb 4 03:15:11 EST 2002


On Sat, 2 Feb 2002 00:09:15 -0500, "Tim Peters" <tim.one at home.com> wrote:

>[Steve Holden]
>> 1,000,000 iterations.
>>
>> Testing for length 1:
>> Endswith 0.00000505
>> sliceIt  0.00000209
>>
>> Testing for length 2:
>> Endswith 0.00001137
>> sliceIt  0.00001148
>>
>> Length 3:
>> Endswith 0.00001055     [!]
>> sliceIt  0.00001264
>>
>> Of course, these tests are with the tests succeeding. I'm
>> presuming failure
>> would shorten the timings, but I'm not going to bother checking that
>> assumption.
>>
>> semi-believab-ly y'rs  - steve
>
>This is the kind of fun I was hoping the OP would have.  Good job!  A lesson
>to take is that timing even seemingly trivial things can turn up major
>surprises with seemingly trivial changes in exactly what it is you're
>timing.  The results are almost never accidental, but determining the true
>causes can leave you crawling on your belly for weeks begging for relief.
>
>A few things going on in this test under the covers:
>
>+ endswith does indeed pay a large price for building a bound
>  method object every time.  It's not only the dict lookup to
>  resolve "endswith", but also malloc and free to build the
>  bound method object and discard it afterwards.
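(An aside on feeling that cost -- a sketch of mine, not from Steve's
test: hoisting the attribute lookup out of the loop builds the bound
method object just once:

    s = 'some line\n'
    ends = s.endswith          # one lookup, one bound method object
    for i in xrange(1000000):
        ends('\n')             # vs. s.endswith('\n'), which rebuilds it

The hoisted form skips the per-call dict lookup and the malloc/free
for the bound method.)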
>
>+ The huge increase in sliceit's time going from 1 to 2 is due to
>  that s[:-1] creates a one-character string, and Python interns
ITYM s[-1:] ;-)
>  those by magic:  every instance of, e.g., 'a', no matter how
>  computed, is exactly the same object (that's not actually true,
>  but it's close <wink>).  But 2-character computed strings are
>  not interned, so s[:-2] has to malloc a fresh string object each
...similarly, ITYM s[-2:]
>  time to hold the two characters, and free it later.  endswith
>  doesn't have to build new string objects, so doesn't care as much.
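(You can watch the interning difference interactively -- a sketch;
whether 'is' holds here is a CPython implementation detail, not a
language guarantee:

    >>> s = 'xa'
    >>> s[-1:] is 'a'          # 1-char results come from a shared cache
    1
    >>> t = 'xab'
    >>> t[-2:] is 'ab'         # 2-char computed strings are fresh objects
    0
)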
>
>Now for some real fun:  I have no idea why endswith took a huge jump going
I don't think _it_ really did (my results don't show it ;-).

>from 1 to 2.  The saving grace for me is that it's so easy to find timing
>mysteries that I'd go mad if I felt compelled to chase them all down, so I
>can walk away without even looking back.  I hope not everyone is so
>well-adjusted <wink>, though, as I'd like to know why.
>
I tried to eliminate spurious big times to get a better measure. My goal
was to find a minimum time, hoping that spurious time-consuming effects
in the scaffolding code (loops, hidden memory allocations/deallocations,
interrupt processing, etc.) would be nearly equivalent in the best case.
Then I tried to make the best case more likely for the timed segment.

It is interesting how hard it is to time things on the order of a few
microseconds properly in this kind of context.

Code to try to counter some of these effects is at the end of this post.
Suggestions for improvements welcome.
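(If you don't have an rdtsc extension handy, the same minimum-of-many
idea works, much more coarsely, with time.clock() -- a rough sketch,
separate from the code below:

    import time
    def best_of(n, func, arg):
        # take the minimum over n runs; interference only ever
        # inflates a timing, so the minimum is the cleanest sample
        best = 1e9
        for i in xrange(n):
            t0 = time.clock()
            func(arg)
            best = min(best, time.clock() - t0)
        return best
)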

Much as I like to count cycles for an algorithm, it _is_ easy to let the
interpreter fool you. Scaffolding code has cache effects, and interpreted
code easily generates a lot of cache-smashing activity.

Putting just the thing of interest inside a loop of a million iterations
introduces other effects; you hope they average out, but a few large
outliers can skew the average.

In C/C++, timing 'nothing' takes 32 cycles on my box. Using an extension
to do the same pattern in Python takes about 450 cycles. Which of those
extra cycles are invalidating cache, and which aren't? It's pretty hard
to say. OTOH, if there were a VM bytecode that logged a timestamp in an
internal, preconfigured circular buffer, it could be done in very few
instructions, minimizing the disturbance.
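(What such a bytecode might look like from the Python side -- purely
hypothetical, and a Python-level stand-in can't approach the
few-instruction cost the real thing could get:

    import time
    NSLOTS = 1024
    ring = [0.0] * NSLOTS          # preconfigured circular buffer
    nxt = 0
    def stamp():
        global nxt
        ring[nxt] = time.clock()   # the bytecode would store rdtsc here
        nxt = (nxt + 1) % NSLOTS
)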

Anyway, there's more to getting net time than subtracting out the _apparent_ cost
of "timing nothing."

Say the minimum number of cycles for executing x is X, and that sampling
the time takes T cycles, with Tb cycles before the instantaneous sample
point (say, when the rdtsc instruction completes) and Ta cycles after,
so T = Ta + Tb. Let's write a non-optimum time as T' (or X') to show some
unknown cache-miss effect etc. Then simple timing of x gives us
   dt = Ta' + X' + Tb'
I'm thinking we can insert dummy T and X operations, on the assumption
that doing X' will ensure as good a cache situation as possible for a
following X (therefore written without the [']) -- so we get
   dt  = T'*0 + Ta + X' + Tb'
(the '*0' meaning we ignore the value, just initializing the cache so
we get a predictable Ta at the beginning of the interval), and
   dt2 = T'*0 + Ta + X' + X + Tb'
and then, subtracting,
   X = dt2 - dt
We shouldn't have to know what the ['] effects are, except to hope
that their _minimums_ are closely repeatable and comparable in
the two sequences.
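(Concretely, with made-up numbers: if the best dt observed is 1200
cycles and the best dt2 is 1900, then X = dt2 - dt = 700 cycles for one
clean execution of x, with the Ta, X', and Tb' terms cancelling.)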

On this theory, I timed s.endswith and the slice expression for
1- to 4-character tails, plus a "nothing" routine to subtract out
the call/return time. I added a gc.collect() just before each of the
two timing loops, and that seemed to narrow the variability a bit.
I also added an outer loop that averages the minimums found in the
inner loops and also tracks global minimums along the way.

BTW, these are ticks at 300MHz on a P2 with 320MB RAM, under NT4.

[22:59] C:\pywk\crdtsc>python endsvslice.py 0 10   0 1000
10 loops, 1000 avgloops, 0 tadj
   nothing:   850, avg   856

(I.e., ten passes in the inner loop finding a minimum estimated X, and
1000 of those in the outer loop averaging the minimums, as well as
finding the global minimums whose difference might be a better estimate
of X, where X is the theoretical best time for the function call.)

(The values below are adjusted by 850 for 'nothing', which is a function
returning the constant 1; so we're estimating how much longer it takes
to prepare the 1 for return using endswith or a slice == comparison,
compared to just returning the constant. Non-result lines snipped.)
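(So, e.g., endswith1's raw minimum difference was 1312 + 850 = 2162
ticks; subtracting the 850-tick 'nothing' baseline leaves the 1312
shown.)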

[23:00] C:\pywk\crdtsc>python endsvslice.py 1 10 850 1000
10 loops, 1000 avgloops, 850 tadj
 endswith1:  1312, avg  1327
 endswith2:  1335, avg  1366
 endswith3:  1368, avg  1407
 endswith4:  1432, avg  1388
    slice1:   669, avg   740
    slice2:  1898, avg  1942
    slice3:  1937, avg  1934
    slice4:  1915, avg  1907

IOW, endswith did not take a huge jump measured my way, but the slice
certainly was affected going from 1 to 2.

The little dip for slice4 is curious. Here are some more trials of that:
    slice4:  1898, avg  1916
    slice4:  1909, avg  1913
    slice4:  1909, avg  1913
Hm, maybe slice3 above is off?
    slice3:  1945, avg  1934
    slice3:  1926, avg  1933
    slice3:  1928, avg  1942
    slice3:  1927, avg  1941
Curiouser and curiouser... anyway, no huge anomalies ;-)

Regards,
Bengt Richter

=======================================================================

Usage: python endsvslice.py id nloops adjust avgloops
  where id is:
  0: nothing
  1: endswith1
  2: endswith2
  3: endswith3
  4: endswith4
  5: slice1
  6: slice2
  7: slice3
  8: slice4

================ endsvslice.py ========================================
# endsvslice.py
import readtsc   #experimental, provides access to pentium rdtsc
import gc

def nothing(s):
    return 1

# brute force cut/paste to minimize scaffolding code
def endswith1(s):
    return s.endswith('\n')
def endswith2(s):
    return s.endswith(' \n')
def endswith3(s):
    return s.endswith('  \n')
def endswith4(s):
    return s.endswith('   \n')

def slice1(s):
    return s[-1:] == '\n'
def slice2(s):
    return s[-2:] == ' \n'
def slice3(s):
    return s[-3:] == '  \n'
def slice4(s):
    return s[-4:] == '   \n'
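
# Sanity check (an added sketch, not part of the timing runs): each
# endswithN should agree with its sliceN counterpart on a matching tail.
def _selftest():
    s = 'x' * 10 + '   \n'
    assert endswith1(s) == slice1(s)
    assert endswith2(s) == slice2(s)
    assert endswith3(s) == slice3(s)
    assert endswith4(s) == slice4(s)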

# Notation:
#  X is minimum time, X' may be disturbed by cache miss etc.
#  Tb is time before time sampling instant[1], Ta is time after
#  T is total time in time sample code, hence  T=Tb+Ta
#  [1] e.g., instant might be defined as when rdtsc instruction copies counter
def t(loops,avgloops,func,arg):
#    print loops,avgloops,func,len(arg)
    r = readtsc.readtsc_dt # returns delta since last call, from the lower 32 bits of the 64-bit time stamp
    sum = 0; gmin = gmin2 = 0x7fffffff  # gmin/gmin2 track global minimums of dt/dt2
    for aloop in xrange(avgloops):
        dt = 0x7fffffff         # not likely min
        gc.collect()            # attempt to normalize state somewhat
        for i in xrange(loops):
            r()    # T'*0       # warm the timing code's cache; value ignored
            r()    # Ta         # interval starts here; Ta follows the sample instant
            func(arg) # X'      # x code not fully cached
            e=r()  # Tb'        # Tb' before this sample instant may see cache misses
            dt = min(dt,e)      # min (Ta + X' + Tb')
        dt2 = 0x7fffffff        # not likely min
        gc.collect()
        for i in xrange(loops): # same as above, except x runs a second time, cached
            r()    # T'*0
            r()    # Ta
            func(arg) # X'
            func(arg) # X       # x code hopefully cached after the preceding X'
            e=r()   # Tb'
            dt2 = min(dt2,e)    # min (Ta + X' + X + Tb')
        ddt = dt2-dt            # X, more or less
        sum += ddt
        gmin = min(gmin,dt)
        gmin2 = min(gmin2,dt2)    # finding min X
    return gmin2-gmin, sum/avgloops # two estimates of X: diff of global mins, avg of per-pass diffs

if __name__=='__main__':
    import sys
    selections = [nothing,endswith1,endswith2,endswith3,endswith4,
        slice1,slice2,slice3,slice4]
    try:
        s='s'*1000000+'\n'; tadj = 0; loops = 10; avgloops = 1
        id = int(sys.argv[1])
        if len(sys.argv) > 2: loops = int(sys.argv[2])
        if len(sys.argv) > 3: tadj = int(sys.argv[3])
        if len(sys.argv) > 4: avgloops = int(sys.argv[4])
        tm, tav = t(loops, avgloops, selections[id], s)
        print loops, 'loops,', avgloops, 'avgloops,', tadj, 'tadj'
        print '%10s: %5d, avg %5d' % (selections[id].__name__, tm-tadj, tav-tadj)
    except (IndexError, ValueError):    # bad or missing args; don't mask other bugs
        print 'Usage: python endsvslice.py id nloops adjust avgloops'
        print '  where id is:'
        for i in xrange(len(selections)):
            print '  %d: %s' %(i,selections[i].__name__)
=======================================================================



