[Python-ideas] string codes & substring equality
spir
denis.spir at gmail.com
Wed Nov 27 15:32:48 CET 2013
Hello,
Coming back to python after a long time.
My present project is about (yet another) top-down matching / parsing lib. There
are 2 issues that, I guess, may be rather easily solved by simple string
methods. The core point is that any scanning / parsing process ends up, at the
lowest level, constantly comparing either single-char (rather single-code)
substrings or constant (literal) substrings of the source string. This is the
only operation which, when successful, actually advances in the source. Thus, it
is certainly worth making it efficient, or at least not needlessly inefficient.
I suppose the same functionality can be highly useful
in various other use cases of text processing.
Note again that I'm rediscovering Python (with some pleasure :-), thus I may miss
known solutions -- but I did ask on the tutor mailing list.
In both cases, I guess ordinary idiomatic Python code actually _creates_ a new
string object, as a substring of length 1 or more, which is otherwise useless;
for instance:
    if s[i] == char:
        # match ok -- object s[i] unneeded
    if s[i:j] == substr:
        # match ok -- object s[i:j] unneeded
What is actually needed is just to check for equality (or another check about a
code, see below).
The case of single-code checking appears (1) when a substring happens to hold a
single code (meaning it represents a simple or precomposed Unicode char), or (2)
when matching a char from a given set, range, or more complex class (e.g. in regex
[a-zA-Z0-9_-']). In all cases, what we want is to check the code: compare it to a
constant value, check whether it belongs to a set of values, or check whether it
lies inside a given range. We need the code -- not a single-code string. Ideally,
I'd like expressions like:
    c = s.code(i)   # or s.ord(i) or s.ucode(i) [3]
    # and then one of:
    if c == code:
        # match ok
    if c in codes:
        # match ok
    if code1 <= c <= code2:
        # match ok
The builtin function ord(char) does not do the job, since it only works for a
single-char string. We would again need to create a new string, with ord(s[i]).
The right solution apparently is a string method like code(self, i) giving the
code at an arbitrary index. I guess this is trivial.
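For comparison, here is a minimal sketch of what can be written today. The names code_at, DIGITS, EXTRA and is_name_code are hypothetical, made up for illustration only. The helper still goes through ord(s[i]), so it does not avoid the temporary one-character string; it only shows what the three kinds of checks (constant, set, range) look like once an integer code is in hand.

```python
def code_at(s, i):
    """Hypothetical stand-in for the proposed s.code(i)."""
    # Still builds the temporary one-character string s[i] before ord().
    return ord(s[i])

DIGITS = range(ord("0"), ord("9") + 1)   # a range of codes
EXTRA = frozenset(map(ord, "_-'"))       # a set of codes

def is_name_code(c):
    """True if c is a code from the class [a-zA-Z0-9_-']."""
    return (
        ord("a") <= c <= ord("z")
        or ord("A") <= c <= ord("Z")
        or c in DIGITS
        or c in EXTRA
    )

source = "it's_ok now"
i = 0
# Advance while the code at i belongs to the class.
while i < len(source) and is_name_code(code_at(source, i)):
    i += 1
print(source[:i])   # the matched prefix

# Indexing a bytes object already yields an integer code in Python 3.
print(source.encode("ascii")[0] == code_at(source, 0))
```

If the source can be handled as bytes, indexing it gives the integer directly, which is close to the proposed behaviour, at the price of working on encoded bytes rather than on Unicode codes.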
I'm surprised it does not exist; maybe some may think this is a symptom that
there is no strong need for it; instead, I guess people routinely use a typical
Python idiom without even noticing it creates an unneeded string object. [2] [3]
What do you think?
A second need is checking substring equality against constant substrings of
arbitrary sizes. This is similar to startswith & endswith, except at any code
index in the source string; a generalisation. In a C implementation, it would
probably delegate to memcmp, with a start pointer set to p_source+i. On the
Python side, it may be a string method like sub_equals(self, substr, i). Choose
your preferred name ;-). [1] [4]
    if s.sub_equals(substr, i):
        # match ok
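As a rough sketch of the intended behaviour: str.startswith accepts an optional start index, so the check can be approximated with it. The function name sub_equals below is the hypothetical one from this post, not an existing method, and whether the comparison happens in place without building a substring is an implementation detail I have not verified.

```python
def sub_equals(s, substr, i):
    """Hypothetical helper: does substr occur in s exactly at index i?"""
    # str.startswith(prefix, start) tests the prefix from position start.
    return s.startswith(substr, i)

source = "let x = 10"
print(sub_equals(source, "x =", 4))   # substr sits at index 4
print(sub_equals(source, "x =", 3))   # one position too early
print(source[4:7] == "x =")           # the slicing idiom, for comparison
```

Note that a negative i is interpreted as in slice notation, which a dedicated method might want to reject.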
What do you think? (bis)
Thank you,
Denis
[1] I am unsure whether an end index is useful; actually, I don't really
understand its usage for startswith & endswith either.
[2] Actually, the compiler, if smart enough, may eliminate this object
construction and just check the code; does it? Anyway, I think it is not that
easy in the cases of ranges & sets.
[3] As a side-note, 'ord' is in my view a misnomer, since character codes are
not ordinals, with significant order, but nominals, plain numerical codes which
only need to be all distinct; they are kinds of id's. For unicode, I call them
'ucodes', an idea I stole somewhere. But I would be happy if the method is
called 'ord' anyway, since the term is established in the Python community.
[4] Would such a new method make startswith & endswith unneeded?