[Python-ideas] string codes & substring equality

Fri Nov 29 03:08:20 CET 2013

On 11/28/2013 7:45 PM, Steven D'Aprano wrote:
> On Thu, Nov 28, 2013 at 12:43:05PM +0100, spir wrote:
>> All right, thank you all for the exchange, the issue of substring
>> comparison for equality is solved, with either .startswith(substr, i) or
>> .find(substr, i,j). But there remains the problem of getting codes (Unicode
>> code points) at arbitrary indexes in a string.
>
> ord(s[i]) is the accepted solution to that.
>
>> Is it weird to consider a .code(i) string method?
>
> Such a method should not be called "code", since "ordinal" or "ord" is
> the accepted term for it.
>
> Should strings have an ord() method?
>
> Disadvantages:
>
> - another piece of code to be written, debugged, maintained, documented;
>
> - another thing for users to learn;
>
> - cognitive load of having to decide whether to use the ord() method or
>    the ord() function.
>
>
> Advantage:
>
> - you save the cost of extracting a one-character string before passing
>    it to the ord() function.
>
>
> In this case, both the disadvantages and advantages are tiny. That being
> the case, I would expect that unless somebody else goes "Yes! That's
> exactly what I need too!" and is motivated to write the patch for you,
> the only way this has *any* chance of happening is for you to write the
> patch yourself. That means:
>
> - write the code;
> - test that it doesn't break anything;
> - write tests for it;
> - write documentation for it;
>
> and most importantly:
>
> - write benchmarks that demonstrate that calling your str.ord(i) method
>    really is faster than calling ord(s[i]).
>
> When you have to do all that work yourself, you will soon see that it's
> perhaps not as "tiny & simple" as when somebody else does the work.
>
> On balance, is the benefit greater than the cost? I think it is a close
> call, balanced on a knife-edge, but having benchmarked it in Python 3.3
> I think that there could be, on balance, a tiny net benefit.
>
> Here is my benchmark:
>
> py> from timeit import Timer
> py> setup = "s = 'abcdef'"
> py> t1 = Timer("ord('c')")  # establish a base-mark of calling ord
> py> t2 = Timer("ord(s[2])", setup)
> py> min(t1.repeat(repeat=5))
> 0.13925810158252716
> py> min(t2.repeat(repeat=5))
> 0.2207092922180891

Thanks for real data.
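
For anyone who wants to rerun the quoted measurement, here is a minimal
sketch of it as a standalone script. The helper names best() and
per_call_overhead() are made up for illustration; only timeit.Timer and
its repeat() method come from the standard library, and the numbers will
differ by machine and Python version.

```python
from timeit import Timer

# Same statements as in the quoted interactive session.
setup = "s = 'abcdef'"
baseline = Timer("ord('c')")         # cost of calling ord() alone
indexed = Timer("ord(s[2])", setup)  # cost of indexing, then ord()


def best(timer, repeat=5, number=1000000):
    """Return the fastest total time (seconds) over `repeat` runs."""
    return min(timer.repeat(repeat=repeat, number=number))


def per_call_overhead(repeat=5, number=1000000):
    """Estimate the extra seconds per call attributable to s[i]."""
    base = best(baseline, repeat=repeat, number=number)
    idx = best(indexed, repeat=repeat, number=number)
    return (idx - base) / number


if __name__ == "__main__":
    print("ord('c')  :", best(baseline))
    print("ord(s[2]) :", best(indexed))
    print("extra seconds per call:", per_call_overhead())
```

Taking the minimum of several repeats, as the quoted session does, is the
usual way to reduce noise from other activity on the machine.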

> The difference is the cost of creating a single character string before
> taking the ordinal value of it.
>
> Still, that cost is tiny: less than 0.1 microseconds on my machine. On
> my PC, I could extract ten million such ordinals before the total cost
> exceeded one second. I find it difficult to see that this cost could be
> a bottleneck in any real-world application, but still, in Python 3.3 it
> seems to be a reasonable micro-optimization to have an ord method.

From my reading of developer discussions on the tracker (and pydev), I
believe most would consider 0.1 microsecond too little gain for adding a
new (duplicate) string method.
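
To make the "duplicate" point concrete, here is a small sketch of the
existing spellings mentioned in this thread. The functions matches_at(),
code_at() and codes_of() are hypothetical wrappers written only for this
example; str.startswith, str.find and ord() are the real built-ins.

```python
def matches_at(text, sub, i):
    """True if `sub` occurs in `text` starting exactly at index i."""
    # startswith() accepts a start offset, so no slice is needed.
    return text.startswith(sub, i)


def matches_at_via_find(text, sub, i):
    """Same check using find() limited to the window [i, i + len(sub))."""
    return text.find(sub, i, i + len(sub)) == i


def code_at(text, i):
    """Code point of the character at index i, i.e. ord(s[i])."""
    return ord(text[i])


def codes_of(text):
    """All code points of `text`, for when many are needed at once."""
    return [ord(ch) for ch in text]


s = "h\u00e9llo w\u00f6rld"
print(matches_at(s, "w\u00f6rld", 6))           # True
print(matches_at_via_find(s, "w\u00f6rld", 6))  # True
print(code_at(s, 1))                            # 233, i.e. U+00E9
```

When many code points are needed, iterating over the string once (as in
codes_of) avoids repeated indexing altogether.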

> But even if there is such a benefit, the benefit is so small that I have
> no interest in pushing for it. I have more important things to work on.
>
> +0 on a str.ord method.

-- 
Terry Jan Reedy