Slice as a copy... by design?

Hi! A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string. I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet? Thank you very much! -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

Facundo Batista wrote:
Hi!
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet?
Thank you very much!
In fact, a slice is *not* always a copy! In at least some (simple) cases, a slice references the original string:
s = 'abc' t = s[:] s is t True id(s) 3081872000L id(t) 3081872000L
Gary Herron

On Thu, 22 May 2008, Gary Herron wrote:
In fact, a slice is *not* always a copy! In at least some (simple) cases, a slice references the original string:
s = 'abc' t = s[:] s is t True id(s) 3081872000L id(t) 3081872000L
I think the more interesting case is where the string objects are not the same object but use (portions of) the same underlying array in memory. If I understand correctly, Python does not do this, and I thought I read something about why not but I can't remember the details. Sharing contents is an obvious optimization which in some circumstances can dramatically reduce the amount of copying that goes on, but without a reasonably clever algorithm to decide when to let the underlying storage go (copying any part still in use), extremely bad behaviour can result - imagine reading in lots of long lines, then keeping just a short piece of each one. By contrast, the worst that can happen with no sharing is that performance and memory use is what you expect - the only "bad" is the apparent missed opportunity for optimization. I wonder if a "shared slice" object would be useful? That is, an object which behaves like a string obtained from a slicing operation except that it shares storage. It could have a .release method to go ahead and copy the underlying storage. One complexity comes to mind immediately - what happens if one takes a shared slice of a shared slice? Presumably it shares the original string's storage, but if the first shared slice is .released what happens to the second shared slice? It would be nice if it shared with the first shared slice, but keeping track of everything could get tricky. I'd be interested in pointers to any existing discussion on this issue. Trivia - right now there are *no* Google hits for 'python shared slice', although there are lots for 'python shared slices'. They don't appear to be talking about the same thing, however (without being exhaustive). Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist

2008/5/22 Isaac Morland <ijmorlan@cs.uwaterloo.ca>:
By contrast, the worst that can happen with no sharing is that performance and memory use is what you expect - the only "bad" is the apparent missed opportunity for optimization.
Exactly, "apparent". Also, this could be handled like a "good writing tip". For example, right now everybody knows that appending a letter to a string a zillion times is not efficient, you should store them in a list, and then .join() them. Similarly, we could know that slicing zillions of long lines and keeping small portion of them is not memory efficient, you should do everytime "shortstring = str(longstring[:2])", for example. Note that this "special coding" will be for an "special case"... in your normal life the code just will be more efficient... Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

On Thu, May 22, 2008 at 12:28:47PM -0300, Facundo Batista wrote:
considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
I remember some discussions... let me see... google to help... aha: http://mail.python.org/pipermail/python-3000/2006-August/003224.html http://mail.python.org/pipermail/python-3000/2006-August/003242.html 2006, August... I don't remember what was the resolution of the discussion. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

2008/5/22 Oleg Broytmann <phd@phd.pp.ru>:
I remember some discussions... let me see... google to help... aha:
http://mail.python.org/pipermail/python-3000/2006-August/003224.html http://mail.python.org/pipermail/python-3000/2006-August/003242.html
These descussions are too general, and implies some work regarding the unification between slice and range, and discusses sequences in general... I'm just asking about strings (the discussion could apply similarly to bytes, but not bytearrays), and why slicing them returns a copy. Of course that when the slice has the step component different that 1, or reverses the string, this could not be done. But when the slice is mystring[x:y], being x < y, we *could* return just a view of the string (llot of emphasis to that "could"). Thanks. -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

Facundo Batista wrote:
I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet?
If we changed Python to slice-by-reference, then tomorrow someone would be asking why it isn't slice-by-copy. There are pros and cons to both that are dependent on your application. It's not hard to imagine applications where you want to hold onto a small portion of a large string, thereby forcing the entire string to remain in memory. If a slices had a copy method, then I suppose this would be moot. -Scott -- Scott Dial scott@scottdial.com scodial@cs.indiana.edu

2008/5/22 Scott Dial <scott+python-dev@scottdial.com>:
If we changed Python to slice-by-reference, then tomorrow someone would be asking why it isn't slice-by-copy. There are pros and cons to both that are
Which are the cons of slice-by-reference of an immutable string?
dependent on your application. It's not hard to imagine applications where you want to hold onto a small portion of a large string, thereby forcing the entire string to remain in memory. If a slices had a copy method, then I
This is a garbage collection issue. It's real, and maybe could be optimized somehow... but I think that this un-optimization is by far smaller than the optimization of not copying it in the first place. -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

On Thu, 2008-05-22 at 13:27 -0300, Facundo Batista wrote:
2008/5/22 Scott Dial <scott+python-dev@scottdial.com>:
If we changed Python to slice-by-reference, then tomorrow someone would be asking why it isn't slice-by-copy. There are pros and cons to both that are
Which are the cons of slice-by-reference of an immutable string?
You have to consider the ramifications of such a design choice. There are two cases to consider: either slices return strings, or they return a different types. If they return strings, then all strings must grow three additional fields: start, end, and the reference to the actual string. That is 16 more bytes for *every* string, hardly a net win. You could argue that string slicing should return a separate type. But then, every piece of code that handles strings must be taught to handle this type in addition. PyString_Check and PyString_CheckExact must go. Instead of writing PyString_Foo, the C code would have to use PyObject_CallMethod and friends for the simplest tasks. Time-saving macros like PyString_AS_STRING and PyString_GET_SIZE become obsolete. Where performance matters (as it often does in low-level C code dealing with strings), this is a real problem. Besides, you lose various special optimizations, such as dicts being much faster when all keys are strings.

On Mon, May 26, 2008 at 4:21 AM, Hrvoje Nikšić <hrvoje.niksic@avl.com> wrote:
On Thu, 2008-05-22 at 13:27 -0300, Facundo Batista wrote:
2008/5/22 Scott Dial <scott+python-dev@scottdial.com>:
If we changed Python to slice-by-reference, then tomorrow someone would be asking why it isn't slice-by-copy. There are pros and cons to both that are
Which are the cons of slice-by-reference of an immutable string?
You have to consider the ramifications of such a design choice. There are two cases to consider: either slices return strings, or they return a different types.
If they return strings, then all strings must grow three additional fields: start, end, and the reference to the actual string. That is 16 more bytes for *every* string, hardly a net win.
A lot of dynamic language implementations have a complex string representation, where individual bits of the string tell what the rest of the representation is. Mozilla's JavaScript implementation is like this. At the moment, a string in JavaScript is two pointer-sized words, and JavaScript has O(1) slicing and, in many cases, O(len(s2)) string concatenation. There's a rather dense comment here explaining it: http://hg.mozilla.org/mozilla-central/index.cgi/file/79924d3b5bba/js/src/jss... The equivalent of PyString_AS_STRING and PyString_GET_SIZE contains a branch. I don't think the implementation avoids the worst cases Guido was talking about; tiny substrings can keep huge strings alive. -j

On Thu, May 22, 2008, Facundo Batista wrote:
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
Someone did a patch for this at one point, but I don't remember what happened. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Need a book? Use your library!

On Thu, May 22, 2008 at 10:01 AM, Aahz <aahz@pythoncraft.com> wrote:
On Thu, May 22, 2008, Facundo Batista wrote:
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
Someone did a patch for this at one point, but I don't remember what happened.
I rejected it because I came up with a worst-case scenario where the behavior with the patch in place was much, much worse than the behavior without it -- improved performance in other scenarios notwithstanding. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Facundo Batista schrieb:
Hi!
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
Because the reference approach is more complicated, harder to implement and may lead to unexpected behavior. For example:
a = "a long string with 500,000 chars ..." b = a[0] del a
With the slice-as-copy design the string 'a' is immediately freed. The slice-as-reference design would keep the 500kB string in memory although you are only interested in the first character. The buffer interface was designed for the slice-as-copy use case:
a = "abcdefg" b = buffer(a, 2, 3) b <read-only buffer for 0x839c2e0, size 3, offset 2 at 0x8391c40> str(b) 'cde' sys.getrefcount(a) 3 del b sys.getrefcount(a) 2
Christian

On Thu, 22 May 2008, Christian Heimes wrote:
The buffer interface was designed for the slice-as-copy use case:
a = "abcdefg" b = buffer(a, 2, 3) b <read-only buffer for 0x839c2e0, size 3, offset 2 at 0x8391c40> str(b) 'cde' [....]
This answers my musing about shared slices. But it points me at another question: why is buffer() listed in "Non-essential Built-in Functions"? While it is obviously not essential like str() or list(), it isn't deprecated like apply(). On the other hand, some other built-in functions listed under "Built-in Functions" are probably also not essential (I'm not going to go any further out on the limb by giving an example!). Perhaps I misunderstand the intent of this manual page. http://docs.python.org/lib/non-essential-built-in-funcs.html#l2h-88 Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist

Isaac Morland wrote:
On Thu, 22 May 2008, Christian Heimes wrote:
The buffer interface was designed for the slice-as-copy use case:
a = "abcdefg" b = buffer(a, 2, 3) b <read-only buffer for 0x839c2e0, size 3, offset 2 at 0x8391c40> str(b) 'cde' [....]
This answers my musing about shared slices. But it points me at another question: why is buffer() listed in "Non-essential Built-in Functions"? While it is obviously not essential like str() or list(), it isn't deprecated like apply().
Even worse, it's gone in Py3: Python 3.0a5 (r30a5:62856, May 9 2008, 11:26:14) [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
buffer Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'buffer' is not defined
Stefan

Hi, Christian Heimes wrote:
Stefan Behnel wrote:
Even worse, it's gone in Py3:
No, it has been replaced by a better system.
Try "memoryview"
I know. We are already discussing the buffer protocol on the Cython list. http://comments.gmane.org/gmane.comp.python.cython.devel/1763?set_lines=1000... Stefan

I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet?
There are two problems with that approach: a) you may hold onto very long strings for a long time, even though you don't "see" the larger string anymore, e.g. when you have only a single line of an entire file that you hold onto. That, in turn, may cause the program to consume much more memory than the copying slicing. b) it is very difficult to implement and maintain. At any point in the code that deals with string representations, you have to make the distinction of whether you have a "real" string, or a slice. Nobody has contributed a correct and efficient implementation of that yet, let alone one that is also easy to maintain. c) (related to b) Currently, the string object's internal representation might be used in extension modules. We would need to find a way to protect against errors that may occur when we silently change the representation semantically. (as any good list of two items, this has three :-) Regards, Martin

Facundo Batista wrote:
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
Because it would make it too easy to accidentally keep a large string alive via a reference to a small part of it. Some way of explicitly requesting a view into another string might be desirable, but it shouldn't be the default behaviour for string slicing. -- Greg

Greg Ewing wrote:
Some way of explicitly requesting a view into another string might be desirable, but it shouldn't be the default behaviour for string slicing.
Backporting 3.0's memoryview to 2.6 would be the way to go about that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org

2008/5/22 Facundo Batista <facundobatista@gmail.com>:
I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet?
Thank you all for the information provided! -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

If I may, I'd like to describe the D programming language's slice and array semantics. This may seem like a digression (and it probably is), but this thread is already quite lengthy, and it may be interesting to look at a language which implements the slice-by-reference semantics we're talking about here. If you're unfamiliar with D, here are its relevant features: It is statically typed, it is garbage collected, it has dynamic arrays as a builtin type, and strings are simply arrays of characters. (In fact, strings are slightly more complicated than this, as D is fairly Unicode-aware, but this is beside the point.) It also styles itself as a systems programming language, meaning you can play directly with pointers if you wish to. A dynamic array in D is a sort of fat pointer, which looks like this: struct array { size_t length; void* ptr; } That is, it consists of two words, which are passed around by value. The language makes no distinction between an array and a slice of an array. They are both just a pointer and a length. In D, which is garbage collected, any pointer into a GC-allocated chunk of memory will keep that memory alive. Here is some code (its meaning should not be hard to divine): int[] a = [1, 2, 3, 4, 5]; int[] b = a[2..4]; Here we have allocated a dynamic array on the heap, and assigned a slice over the whole array to 'a'. Then we have assigned a sub-slice of the array to 'b'. The important point is that slicing is a very cheap operation in D. If we alter an element of b, the changes will be reflected in a: b[0] = 10; assert(a[2] == 10); (Already this is drifting off topic. Arrays in D are mutable, and strings in Python are not. But please bear with me.) If we get rid of the original reference to the entire array: a = null; Then b will still keep the entire array alive. This is probably the most common criticism of this behavior. If the original array is quite large, then even a single tiny slice of it will keep the entire thing alive. However, these slice semantics can be quite useful. Imagine an XML parser in D. One way you might implement it is to read the entire file into a buffer, then parse out the relevant sub-strings. These sub-strings are slices, and are very cheap to make. The important point here is that, outside of the original buffer, no heap allocations of any sort are made, and this can lead to a very efficient parser. And because this is the language's regular, default behavior, the code which does this looks very natural. This same behavior exists for any similar parsing problem, and can be particularly useful when the application is long-lived, such as an HTTP server parsing HTTP requests. While D's slice semantics are generally nice, I do not necessarily advocate their use in Python, since I don't think there's any good way to get this behavior in Python. (At least not without radically changing it in other ways.) It is still a worthy example of how slice-as-reference semantics can be useful. On Thu, May 22, 2008 at 8:28 AM, Facundo Batista <facundobatista@gmail.com> wrote:
Hi!
A thread in PyAr raised the question that, considering that strings are immutable, why a slice of a string is a copy and not a reference to a part of that string.
I couldn't answer why, so I'm asking here...Is it because the reference counting will be complicated? Is it because it'd be inefficient in other way? It's something else? Or is something that could be done... but is not done yet?
Thank you very much!
-- . Facundo
Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/kirklin.mcdonald%40gmail.c...
participants (15)
-
"Martin v. Löwis"
-
Aahz
-
Christian Heimes
-
Facundo Batista
-
Gary Herron
-
Greg Ewing
-
Guido van Rossum
-
Hrvoje Nikšić
-
Isaac Morland
-
Jason Orendorff
-
Kirk McDonald
-
Nick Coghlan
-
Oleg Broytmann
-
Scott Dial
-
Stefan Behnel