[Python-ideas] string codes & substring equality

Fri Nov 29 01:45:27 CET 2013

On Thu, Nov 28, 2013 at 12:43:05PM +0100, spir wrote:
> All right, thank you all for the exchange, the issue of substring 
> comparison for equality is solved, with either .startswith(substr, i) or 
> .find(substr, i,j). But there remain the problem of getting codes (unicodes 
> code point) at arbitrary indexes in a string?

ord(s[i]) is the accepted solution to that.
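
For illustration, here is a minimal sketch of the idioms mentioned so far.
The sample string and the indexes are made up for the example; only
ord(), str.startswith() and str.find() come from the discussion.

```python
# Sample data, chosen for the example (not from the original thread).
s = "h\u00e9llo w\u00f6rld"

# Code point at an arbitrary index: index first, then call ord().
code = ord(s[1])
print(code)                      # 233, i.e. U+00E9

# Substring equality at a given offset, without building a slice.
at_offset = s.startswith("w\u00f6r", 6)
print(at_offset)                 # True

# find() limited to the window [6, 9) returns 6 only if the
# substring sits exactly at that position.
found_at = s.find("w\u00f6r", 6, 9)
print(found_at == 6)             # True
```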

> Is it weird to consider a .code(i) string method? 

Such a method should not be called "code", since "ordinal" or "ord" is 
the accepted term for it.

Should strings have an ord() method?

Disadvantages:

- another piece of code to be written, debugged, maintained, documented;

- another thing for users to learn;

- cognitive load of having to decide whether to use the ord() method or 
  the ord() function.

Advantage:

- you save the cost of extracting a one-character string before passing 
  it to the ord() function.
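
To make the proposal concrete, the behaviour of such a method can be
sketched in pure Python. The name ord_at is hypothetical, and this
version still builds the one-character string internally, so it only
shows the intended semantics, not the saving a C-level method might
offer.

```python
def ord_at(s, i):
    """Return the code point of s at index i (hypothetical helper).

    Behaves like ord(s[i]): negative indexes count from the end and
    an out-of-range index raises IndexError.
    """
    return ord(s[i])


print(ord_at("abcdef", 2))    # 99, same as ord('c')
print(ord_at("abcdef", -1))   # 102, same as ord('f')
```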

In this case, both the disadvantages and advantages are tiny. That being 
the case, I would expect that unless somebody else goes "Yes! That's 
exactly what I need too!" and is motivated to write the patch for you, 
the only way this has *any* chance of happening is for you to write the 
patch yourself. That means:

- write the code;
- test that it doesn't break anything;
- write tests for it;
- write documentation for it;

and most importantly:

- write benchmarks that demonstrate that calling your str.ord(i) method 
  really is faster than calling ord(s[i]).

When you have to do all that work yourself, you will soon see that it's 
perhaps not as "tiny & simple" as when somebody else does the work.

On balance, is the benefit greater than the cost? I think it is a close 
call, balanced on a knife-edge, but having benchmarked it in Python 3.3 
I think there could perhaps be a tiny net benefit.

Here is my benchmark:

py> from timeit import Timer
py> setup = "s = 'abcdef'"
py> t1 = Timer("ord('c')")  # establish a baseline for calling ord
py> t2 = Timer("ord(s[2])", setup)
py> min(t1.repeat(repeat=5))
0.13925810158252716
py> min(t2.repeat(repeat=5))
0.2207092922180891

The difference is the cost of creating a single-character string before 
taking its ordinal value.
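
As a rough check, the per-call overhead can be worked out from the two
timings reported in the session above. The sketch assumes timeit's
default of one million iterations per timing; the measure_overhead
helper is a made-up name for re-running the same comparison, and its
result will vary by machine and Python version.

```python
from timeit import Timer

# Timings copied from the session above, in seconds per 1,000,000
# calls (timeit's default iteration count).
BASELINE = 0.13925810158252716   # ord('c')
INDEXED = 0.2207092922180891     # ord(s[2])
CALLS = 1000000

per_call = (INDEXED - BASELINE) / CALLS   # seconds per indexing step
print("%.4f microseconds per call" % (per_call * 1e6))
print("%.1f million extractions per second of overhead"
      % (1 / per_call / 1e6))


def measure_overhead(number=CALLS, repeat=5):
    """Re-run the comparison; return the estimated seconds per call.

    Hypothetical helper: the result is noisy and may even come out
    negative on a busy machine.
    """
    setup = "s = 'abcdef'"
    base = min(Timer("ord('c')").repeat(repeat=repeat, number=number))
    indexed = min(
        Timer("ord(s[2])", setup).repeat(repeat=repeat, number=number)
    )
    return (indexed - base) / number
```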

Still, that cost is tiny: less than 0.1 microseconds per call on my 
machine (timeit runs each statement a million times by default). I could 
extract ten million such ordinals before the total cost exceeded one 
second. I find it difficult to see how this cost could be a bottleneck 
in any real-world application, but still, in Python 3.3 an ord method 
seems to be a reasonable micro-optimization.

But even if there is such a benefit, the benefit is so small that I have 
no interest in pushing for it. I have more important things to work on.

+0 on a str.ord method.

-- 
Steven