On 6/4/2014 5:08 PM, Glenn Linderman wrote:
On 6/4/2014 5:03 PM, Greg Ewing wrote:
Serhiy Storchaka wrote:
re.compile and tokenize.tokenize don't use iterators. They use
indices, str.find, and/or regular expressions. A common use case
is to quickly find a substring starting from the current position
using str.find or re.search, process the found token, advance the position
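For illustration, the index-driven scanning pattern described here might look roughly like the sketch below. The TOKEN pattern and the scan() helper are made up for this example; they are not taken from re or tokenize. Only the standard pos argument of a compiled pattern's match() is relied on.

```python
import re

# Hypothetical token pattern, for illustration only:
# optional whitespace, then an integer, a name, or a single operator character.
TOKEN = re.compile(r"\s*(\d+|\w+|[^\s\w])")

def scan(text):
    """Yield (start, token) pairs by repeatedly matching at an integer position."""
    pos = 0
    while True:
        m = TOKEN.match(text, pos)   # matching resumes at the current index
        if m is None:
            break                    # nothing but whitespace (or nothing) is left
        yield m.start(1), m.group(1) # process the found token
        pos = m.end()                # advance past it

tokens = list(scan("x = 42 + y1"))
```

Every step hands an integer position back to the regex engine, which is why cheap integer indexing matters for this style of code.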
For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.
I think you meant codepoint index, rather than character index.
This starts to diverge from Python's codepoint indexing via
integers. Calculating or caching the mapping from codepoint index to
byte offset as part of the str implementation stays compatible with
Python. Introducing StringPosition makes it a Python-like language.
Or so it seems to me.
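To make the "calculate or cache" alternative concrete, here is a minimal sketch of a UTF-8 backed string that keeps integer indexing. The class name Utf8Text, the STRIDE value, and the checkpoint scheme are all assumptions for this example, not how any real implementation does it.

```python
class Utf8Text:
    """Hypothetical UTF-8 backed string with cached codepoint -> byte checkpoints."""

    STRIDE = 4  # record one checkpoint every STRIDE codepoints (tiny, for demonstration)

    def __init__(self, s):
        self.data = s.encode("utf-8")
        self.checkpoints = []        # byte offsets of codepoints 0, STRIDE, 2*STRIDE, ...
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte, so a codepoint starts here
                if count % self.STRIDE == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count          # number of codepoints

    def byte_offset(self, index):
        """Map a codepoint index (0..length) to a byte offset."""
        if not 0 <= index <= self.length:
            raise IndexError(index)
        if index == self.length:
            return len(self.data)
        offset = self.checkpoints[index // self.STRIDE]
        for _ in range(index % self.STRIDE):  # walk at most STRIDE - 1 codepoints
            offset += 1
            while self.data[offset] & 0xC0 == 0x80:
                offset += 1
        return offset

    def __getitem__(self, index):
        """Return the single codepoint at a non-negative integer index."""
        start = self.byte_offset(index)
        end = self.byte_offset(index + 1)
        return self.data[start:end].decode("utf-8")
```

Callers still see plain integer indices; the cost is the extra checkpoint table plus a short bounded walk per lookup.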
Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.
StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.
In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.
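A rough sketch of how such an object could behave over a UTF-8 buffer follows. Everything in it is hypothetical: the PositionedText wrapper, the offset attribute, and the choice to store the codepoint index as the int value, which makes position-minus-position a plain subtraction rather than a count. It is not part of any Python implementation.

```python
class StringPosition(int):
    """Hypothetical position: int value = codepoint index, plus a UTF-8 byte offset."""

    def __new__(cls, data, index, offset):
        self = super().__new__(cls, index)
        self.data = data      # UTF-8 bytes of the string this position belongs to
        self.offset = offset  # byte offset of the codepoint at this index
        return self

    def __add__(self, n):
        if not isinstance(n, int):
            return NotImplemented
        data, offset = self.data, self.offset
        for _ in range(abs(n)):          # count codepoints from here, not from the start
            if n > 0:
                if offset >= len(data):
                    raise IndexError("position past end of string")
                offset += 1
                while offset < len(data) and data[offset] & 0xC0 == 0x80:
                    offset += 1
            else:
                if offset <= 0:
                    raise IndexError("position before start of string")
                offset -= 1
                while offset > 0 and data[offset] & 0xC0 == 0x80:
                    offset -= 1
        return StringPosition(data, int(self) + n, offset)

    def __sub__(self, other):
        if isinstance(other, StringPosition):
            return int(self) - int(other)  # position - position gives a plain int
        if isinstance(other, int):
            return self + (-other)
        return NotImplemented


class PositionedText:
    """Hypothetical string wrapper whose find() returns StringPositions."""

    def __init__(self, s):
        self.data = s.encode("utf-8")

    def start(self):
        return StringPosition(self.data, 0, 0)

    def find(self, sub, start=None):
        if start is None:
            start = self.start()
        byte_pos = self.data.find(sub.encode("utf-8"), start.offset)
        if byte_pos < 0:
            return None
        # Count only the codepoints between the old position and the match.
        skipped = len(self.data[start.offset:byte_pos].decode("utf-8"))
        return StringPosition(self.data, int(start) + skipped, byte_pos)

    def slice(self, a, b):
        return self.data[a.offset:b.offset].decode("utf-8")  # direct byte slicing


s = "héllo wörld ☃ end"
t = PositionedText(s)
p = t.find("wörld")
q = t.find("☃", p)
```

Because StringPosition subclasses int, an expression such as s[p:p + 5] on an ordinary str still works. The fast path only applies when the position is handed back to the text it came from.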
Another thought is that a StringPosition only works (quickly, at
least), as you point out, for the string it was derived from, so
algorithms that walk two strings at a time cannot use the same
StringPosition for both. Yep, this is quite divergent from
CPython and Python.