[Python-ideas] string codes & substring equality
Steven D'Aprano
steve at pearwood.info
Fri Nov 29 01:45:27 CET 2013
On Thu, Nov 28, 2013 at 12:43:05PM +0100, spir wrote:
> All right, thank you all for the exchange, the issue of substring
> comparison for equality is solved, with either .startswith(substr, i) or
> .find(substr, i, j). But there remains the problem of getting codes (Unicode
> code points) at arbitrary indexes in a string?
ord(s[i]) is the accepted solution to that.
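For concreteness, here is a minimal sketch of the three calls mentioned so far. The sample string, the substring and the indexes are arbitrary values chosen for illustration; they are not taken from the thread.

```python
# Illustrative values only: any string and index would do.
s = "h\u00e9llo world"   # the second character is U+00E9
i = 1

# Code point at an arbitrary index: index first, then take the ordinal.
code = ord(s[i])

# Substring equality at a given position, without slicing out a copy first.
sub = "llo"
matches_with_startswith = s.startswith(sub, 2)

# The same check with find(): restrict the search window to exactly
# len(sub) characters, so a match can only be reported at position 2.
matches_with_find = s.find(sub, 2, 2 + len(sub)) == 2

print(code, matches_with_startswith, matches_with_find)
```

Note that ord(s[i]) still builds a temporary one-character string, which is the only cost a dedicated method could avoid.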
> Is it weird to consider a .code(i) string method?
Such a method should not be called "code", since "ordinal" or "ord" is
the accepted term for it.
Should strings have an ord() method?
Disadvantages:
- another piece of code to be written, debugged, maintained, documented;
- another thing for users to learn;
- cognitive load of having to decide whether to use the ord() method or
the ord() function.
Advantage:
- you save the cost of extracting a one-character string before passing
it to the ord() function.
In this case, both the disadvantages and advantages are tiny. That being
the case, I would expect that unless somebody else goes "Yes! That's
exactly what I need too!" and is motivated to write the patch for you,
the only way this has *any* chance of happening is for you to write the
patch yourself. That means:
- write the code;
- test that it doesn't break anything;
- write tests for it;
- write documentation for it;
and most importantly:
- write benchmarks that demonstrate that calling your str.ord(i) method
really is faster than calling ord(s[i]).
When you have to do all that work yourself, you will soon see that it's
perhaps not as "tiny & simple" as when somebody else does the work.
On balance, is the benefit greater than the cost? I think it is a close
call, balanced on a knife-edge, but having benchmarked it in Python 3.3,
I think there may be a tiny net benefit.
Here is my benchmark:
py> from timeit import Timer
py> setup = "s = 'abcdef'"
py> t1 = Timer("ord('c')")  # baseline: the cost of calling ord alone
py> t2 = Timer("ord(s[2])", setup)
py> min(t1.repeat(repeat=5))
0.13925810158252716
py> min(t2.repeat(repeat=5))
0.2207092922180891
The difference is the cost of creating a single character string before
taking the ordinal value of it.
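For anyone who wants to repeat the measurement as a script rather than at the interactive prompt, something like the following should work. It is only a sketch: the helper name per_call_seconds and the reduced iteration count are my own choices for illustration, and the numbers will differ between machines and Python versions.

```python
from timeit import repeat


def per_call_seconds(stmt, setup="pass", number=200_000, rounds=5):
    """Return the best observed time for one execution of stmt, in seconds.

    Hypothetical helper: takes the minimum over several rounds, as in the
    session above, then divides by the number of executions per round.
    """
    return min(repeat(stmt, setup, repeat=rounds, number=number)) / number


# Baseline: calling ord() on a string that already exists.
baseline = per_call_seconds("ord('c')")

# Same call, but the one-character string must first be extracted from s.
indexed = per_call_seconds("ord(s[2])", setup="s = 'abcdef'")

# The difference estimates the cost of building the temporary string.
# On a noisy machine it can come out very small or even negative.
overhead = indexed - baseline

print("ord('c')  : %.4f microseconds per call" % (baseline * 1e6))
print("ord(s[2]) : %.4f microseconds per call" % (indexed * 1e6))
print("difference: %.4f microseconds per call" % (overhead * 1e6))
```

Taking the minimum rather than the mean follows the usual timeit advice: the fastest run is the one least disturbed by other activity on the machine.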
Still, that cost is tiny: less than 0.1 microseconds on my machine. On
my PC, I could extract ten million such ordinals before the total cost
exceeded one second. I find it difficult to see that this cost could be
a bottleneck in any real-world application, but still, in Python 3.3 it
seems to be a reasonable micro-optimization to have an ord method.
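The arithmetic behind those figures can be checked directly from the two timings above. The only outside fact used is that Timer.repeat() runs each statement one million times per round by default; the variable names are mine.

```python
# Each figure reported above is the total time for one round of
# 1,000,000 executions (the default `number` for Timer.repeat()).
NUMBER = 1000000

t_literal = 0.13925810158252716   # min time for ord('c')
t_indexed = 0.2207092922180891    # min time for ord(s[2])

# Extra cost per call attributable to creating the one-character string.
extra_per_call = (t_indexed - t_literal) / NUMBER   # seconds
extra_microseconds = extra_per_call * 1e6

# Total extra cost for ten million calls, and the number of calls needed
# before the extra cost alone adds up to one second.
extra_for_ten_million = extra_per_call * 10000000
calls_per_second_of_overhead = 1 / extra_per_call

print("extra cost per call: %.4f microseconds" % extra_microseconds)
print("extra cost for 10 million calls: %.3f seconds" % extra_for_ten_million)
print("calls per second of overhead: %.0f" % calls_per_second_of_overhead)
```

So the overhead is roughly 0.08 microseconds per call, about 0.8 seconds for ten million calls, which is where the "ten million" figure comes from.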
But even if there is such a benefit, the benefit is so small that I have
no interest in pushing for it. I have more important things to work on.
+0 on a str.ord method.
--
Steven