Hello,

Coming back to Python after a long time. My present project is about (yet another) top-down matching / parsing lib. There are 2 issues that, I guess, may be rather easily solved by simple string methods.

The core point is that any scanning / parsing process ends up, at the lowest level, constantly comparing either single-char (rather single-code) substrings or constant (literal) substrings of the source string. This is the only operation which, when successful, actually advances in the source. Thus, it is certainly worth making it efficient, or at the minimum not needlessly inefficient. I suppose the same functionality can be highly useful in various other use cases of text processing.

Note again that I'm rediscovering Python (with some pleasure :-), thus may miss known solutions -- but I asked on the tutor mailing list.

In both cases, I guess ordinary idiomatic Python code actually _creates_ a new string object, as a substring of length 1 or more, which is otherwise useless; for instance:

    if s[i] == char:
        pass  # match ok -- object s[i] unneeded
    if s[i:j] == substr:
        pass  # match ok -- object s[i:j] unneeded

What is actually needed is just to check for equality (or another check about a code, see below).

The case of single-code checking appears (1) when a substring happens to hold a single code (meaning it represents a simple or precomposed unicode char), (2) when matching a char from a given set, range, or more complex class (eg in regex [a-zA-Z0-9_-']). In all cases, what we want is to check the code: compare it to a constant value, check whether it belongs to a set of values, or lies inside a given range. We need the code -- not a single-code string. Ideally, I'd like expressions like:

    c = s.code(i)    # or s.ord(i) or s.ucode(i) [3]
    # and then one of:
    if c == code:
        pass  # match ok
    if c in codes:
        pass  # match ok
    if c >= code1 and c <= code2:
        pass  # match ok

The builtin function ord(char) does not do the job, since it only works for a single-char string. We would again need to create a new string, with ord(s[i]). The right solution apparently is a string method like code(self, i) giving the code at an arbitrary index. I guess this is trivial. I'm surprised it does not exist; some may think this is a symptom that there is no strong need for it; instead, I guess people routinely use a typical Python idiom without even noticing it creates an unneeded string object. [2] [3]

What do you think?

A second need is checking substring equality against constant substrings of arbitrary sizes. This is similar to startswith & endswith, except at any code index in the source string; a generalisation. In a C implementation, it would probably delegate to memcmp, with a start pointer set to p_source+i. On the Python side, it may be a string method like sub_equals(self, substr, i). Choose your preferred name ;-). [1] [4]

    if s.sub_equals(substr, i):
        pass  # match ok

What do you think? (bis)

Thank you,
Denis

[1] I am unsure whether an end index is useful; actually I don't really understand its usage for startswith & endswith either.

[2] Actually, the compiler, if smart enough, may eliminate this object construction and just check the code; does it? Anyway, I think it is not that easy in the cases of ranges & sets.

[3] As a side-note, 'ord' is in my view a misnomer, since character codes are not ordinals, with significant order, but nominals, plain numerical codes which only need to be all distinct; they are kinds of id's. For unicode, I call them 'ucodes', an idea I stole somewhere. But I would be happy if the method is called 'ord' anyway, since the term is established in the Python community.

[4] Would such a new method make startswith & endswith unneeded?
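
For concreteness, here is a minimal sketch of how the two proposed operations can be approximated with today's str API. The names code_at and sub_equals are hypothetical helpers made up for this illustration; they are not existing str methods. The first one still goes through s[i], so it only shows the intended semantics, not the saving being asked for. The second relies on the optional start argument that str.startswith already accepts.

```python
def code_at(s: str, i: int) -> int:
    """Hypothetical stand-in for the proposed s.code(i): code point at index i."""
    # Note: s[i] still produces a one-character string before ord() is applied.
    return ord(s[i])


def sub_equals(s: str, substr: str, i: int) -> bool:
    """Hypothetical stand-in for the proposed s.sub_equals(substr, i)."""
    # str.startswith takes an optional start index, so no explicit slice is written.
    return s.startswith(substr, i)


source = "foo_bar = 42"
word_codes = {ord(ch) for ch in "_-'"}  # a small set of codes to match against

c = code_at(source, 3)                   # code of the character at index 3
is_underscore = c == ord("_")            # compare to a constant code
in_set = c in word_codes                 # membership in a set of codes
in_range = ord("a") <= code_at(source, 0) <= ord("z")  # range check
matches_bar = sub_equals(source, "bar", 4)             # literal match at index 4
```

Whether either helper avoids a temporary object in practice depends on the interpreter; the sketch only pins down the behaviour the two methods would have.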