On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
>> The language semantics says that a string is an array of code points. Every index relates to a single code point, no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"?
> We're discussing the behaviour under a different (hypothetical) design decision than a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.
I'm open to different implementations. I earlier even suggested that the choice of O(1) indexing versus O(N) indexing was a quality of implementation issue, not a make-or-break issue for whether something can call itself Python (or even "99% compatible with Python"). But I don't believe that exposing that implementation at the Python level is valid: regardless of whether it is efficient or not, I should be able to write code like this:

    a = [mystring[i] for i in range(len(mystring))]
    b = list(mystring)
    assert a == b

That is not the case if you expose the underlying byte-level implementation at the Python level and treat strings as an array of *bytes*. Paul seems to want to do this, or at least he wants Python 4 to do this. I think it is *completely* inappropriate to do so.

I *think* you may agree with me (correct me if I'm wrong), because you go on to agree with me:
>> e.g.
>>
>>     s = "---ÿ---"
>>     offset = s.index('ÿ')
>>     assert s[offset] == 'ÿ'
>>
>> That cannot fail with Python's semantics.

> Agreed, and it shouldn't
but I'm not actually sure.
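To make the O(1)-versus-O(N) trade-off concrete, here is a rough sketch (purely illustrative -- not MicroPython's actual implementation, and `codepoint_at` is a made-up name) of how code-point indexing over a UTF-8 byte buffer might work:

```python
def codepoint_at(buf: bytes, index: int) -> str:
    """Return the index-th code point of a UTF-8 encoded buffer.

    UTF-8 continuation bytes match the pattern 0b10xxxxxx, so every
    byte that does NOT match it marks the start of a new code point.
    Note that each lookup scans from the front: O(N) per index.
    """
    count = -1
    start = None
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:          # lead byte: a new code point begins
            if start is not None:     # previous code point ended here
                return buf[start:i].decode('utf-8')
            count += 1
            if count == index:
                start = i
    if start is not None:             # code point ran to end of buffer
        return buf[start:].decode('utf-8')
    raise IndexError("string index out of range")
```

Every lookup walks the buffer from the start, which is where the O(N) cost comes from; a real implementation could cache the last (index, byte offset) pair so that the common sequential-access pattern stays cheap.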
> (I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.
You don't seem to be talking about sys.getsizeof, so I guess you're talking about something at the C level (or other underlying implementation), ignoring the object overhead. I don't know why you think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code points in a single byte, so whether you use UTF-32, UTF-16, UTF-8, Python 3.3's FSR or some other implementation, at least some code points are going to use more than one byte.
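For what it's worth, the sizes in question are easy to check from Python itself (the getsizeof line assumes CPython 3.3+):

```python
import sys

# 'ÿ' is U+00FF: it fits in one byte only in Latin-1, not in UTF-8.
assert len('ÿ'.encode('latin-1')) == 1
assert len('ÿ'.encode('utf-8')) == 2
assert len('ÿ'.encode('utf-16-le')) == 2
assert len('ÿ'.encode('utf-32-le')) == 4

# sys.getsizeof reports the whole object, including header overhead,
# so it is much larger than the per-code-point storage figure.
assert sys.getsizeof('ÿ') > 4
```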
> But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes.
I'm trying to understand what point you are trying to make, because I'm afraid I don't quite get it. [...]
> If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
finditer returns a bunch of MatchObjects, which give you the indexes of the found substring. Whether you do it yourself, or get the re module to do it, you're indexing somewhere.
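For example (a small illustrative snippet, not from the thread):

```python
import re

s = "---ÿ--- spam ÿ eggs"
for m in re.finditer(r'\S+', s):
    # Each MatchObject records where its substring was found, so
    # m.start()/m.end() are exactly the indexes under discussion.
    assert s[m.start():m.end()] == m.group()

# The first match starts at index 0 and is the token '---ÿ---'.
first = next(re.finditer(r'\S+', s))
assert (first.start(), first.group()) == (0, '---ÿ---')
```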
> The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding.
I don't have a problem with different implementations, so long as that implementation isn't exposed at the Python level with changes of semantics such as breaking the promise that a string is an array of code points, not of bytes.
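That promise can be written down as executable code. In this illustrative snippet the string mixes code points needing 1, 2, 3 and 4 bytes in UTF-8, yet indexing still yields exactly one code point per index (Python 3.3+, where narrow builds are gone):

```python
s = "aÿ€𝄞"                            # U+0061, U+00FF, U+20AC, U+1D11E
assert len(s) == 4                    # four code points, not ten bytes
assert len(s.encode('utf-8')) == 10   # 1 + 2 + 3 + 4 bytes
assert [s[i] for i in range(len(s))] == list(s) == ['a', 'ÿ', '€', '𝄞']
```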
> In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython - having official (and hopefully flexible) guidance on deviations from the reference implementation would, I think, help other implementations provide even more value, which is only a good thing for Python.
Yes, agreed.


-- 
Steven