[Python-ideas] string codes & substring equality

Wed Nov 27 15:32:48 CET 2013

Hello,

Coming back to python after a long time.
My present project is about (yet another) top-down matching / parsing lib. There 
are 2 issues that, I guess, may be rather easily solved by simple string 
methods. The core point is that any scanning / parsing process ends up, at the 
lowest level, constantly comparing either single-char (rather single-code) 
substrings or constant (literal) substrings of the source string. This is the 
only operation which, when successful, actually advances in the source. Thus, it 
is certainly worth having it efficient, or at the minimum not having it 
needlessly inefficient. I suppose the same functionalities can be highly useful 
in various other use cases of text processing.
Note again that I'm rediscovering Python (with some pleasure :-), thus may miss 
known solutions -- but I asked on the tutor mailing list.

In both cases, I guess ordinary idiomatic Python code actually _creates_ a new 
string object, as a substring of length 1 or more, which is otherwise useless; 
for instance:

     if s[i] == char:
         # match ok -- object s[i] unneeded

     if s[i:j] == substr:
         # match ok -- object s[i:j] unneeded

What is actually needed is just to check for equality (or another check about a 
code, see below).

The case of single-code checking appears when (1) a substring happens to hold a 
single code (meaning it represents a simple or precomposed unicode char) (2) 
when matching a char from a given set, range, or more complex class (eg in regex 
[a-zA-Z0-9_-']). In all cases, what we want is to check the code: compare it to a 
constant value, check whether it belongs to a set of values, or lies inside a 
given range. We need the code --not a single-code string. Ideally, I'd like 
expressions like:

     c = s.code(i)    # or s.ord(i) or s.ucode(i) [3]

     # and then one of:
     if c == code:
         # match ok

     if c in codes:
         # match ok

     if code1 <= c <= code2:
         # match ok

The builtin function ord(char) does not do the job, since it only works for a 
single-char string. We would again need to create a new string, with ord(s[i]). 
The right solution apparently is a string method like code(self, i) giving the 
code at an arbitrary index. I guess this is trivial.
I'm surprised it does not exist; maybe some may think this is a symptom that there 
is no strong need for it; instead, I guess people routinely use a typical Python 
idiom without even noticing it creates an unneeded string object. [2] [3]
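
For comparison, here is a minimal sketch of how the three checks can be written 
in current Python, assuming it is acceptable to convert the source into a list 
of integer codes once, before scanning. The names to_codes, DIGITS and 
NAME_START are hypothetical, chosen only for illustration; none of this is an 
existing str method.

```python
# Hypothetical helper names; nothing here is an existing str method.

def to_codes(s):
    """Convert s once, up front, into a list of integer code points."""
    return [ord(ch) for ch in s]

# Classes of codes, built once: a range check and a set check.
DIGITS = range(ord("0"), ord("9") + 1)
NAME_START = frozenset(map(ord, "_abc"))

source = "x1_"
codes = to_codes(source)

# At match time, indexing the list yields an int; no one-char string is built.
is_x = codes[0] == ord("x")            # compare to a constant code
is_digit = codes[1] in DIGITS          # code lies inside a range
is_name_start = codes[2] in NAME_START # code belongs to a set
```

The cost is one extra list per source string, so this only trades memory for 
the per-index string creation; a real s.code(i) method would avoid both.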

What do you think?

A second need is checking substring equality against constant substrings of 
arbitrary sizes. This is similar to startswith & endswith, except at any code 
index in the source string; a generalisation. In a C implementation, it would 
probably delegate to memcmp, with a start pointer set to p_source+i. On the 
Python side, it may be a string method like sub_equals(self, substr, i). Choose 
your preferred name ;-). [1] [4]

     if s.sub_equals(substr, i):
         # match ok
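
For what it is worth, str.startswith already accepts an optional start index, 
so the check can be sketched in current Python as below. Here sub_equals is 
only the hypothetical name used above, written as a plain function rather than 
a method; whether the comparison avoids an intermediate string internally is an 
implementation detail I have not verified.

```python
def sub_equals(s, substr, i):
    """Hypothetical helper: True if substr occurs in s starting at index i."""
    # str.startswith(prefix, start) tests the prefix at position start,
    # so no explicit slice s[i:j] is written on the Python side.
    return s.startswith(substr, i)

source = "let x = 42"
found_keyword = sub_equals(source, "let", 0)
found_assign = sub_equals(source, "x =", 4)
found_too_late = sub_equals(source, "42", 9)   # only "2" remains at index 9
```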

What do you think? (bis)

Thank you,
Denis

[1] I am unsure whether an end index is useful; actually, I don't really 
understand its usage for startswith & endswith either.

[2] Actually, the compiler, if smart enough, may eliminate this object 
construction and just check the code; does it? Anyway, I think it is not that 
easy in the cases of ranges & sets.

[3] As a side-note, 'ord' is in my view a misnomer, since character codes are 
not ordinals, with significant order, but nominals, plain numerical codes which 
only need to be all distinct; they are kinds of id's. For unicode, I call them 
'ucodes', an idea I stole somewhere. But I would be happy if the method is 
called 'ord' anyway, since the term is established in the Python community.

[4] Would such a new method make startswith & endswith unneeded?