Re: [Patches] Readline replacement under QNX in myreadline.c

I seem to recall that this came up recently but I don't remember where. Can anybody jog my memory? What did we decide in the end? --Guido van Rossum (home page: http://www.python.org/~guido/)

when hacking on SRE's substitution code, I stumbled upon a problem. to do a substitution, SRE needs to merge slices from the target strings and from the sub- stitution pattern. here's a simple example: re.sub( "(perl|tcl|java)", "python (not \\1)", "perl rules" ) contains a "substitution pattern" consisting of three parts: "python (not " (a slice from the substitution string) group 1 (a slice from the target string) ")" (a slice from the substitution string) PCRE implements this by doing the slicing (thus creating three new strings), and then doing a "join" by hand into a PyString buffer. this isn't very efficient, and it also doesn't work for uni- code strings. in other words, this needs to be fixed. but how? ... here's one proposal, off the top of my head: 1. introduce a PySliceListObject, which behaves like a simple sequence of strings, but stores them as slices. the type structure looks something like this: typedef struct { PyObject* string; int start; int end; } PySliceListItem; typedef struct { PyObject_VAR_HEAD PySliceListItem item[1]; } PySliceListObject; where start and end are normalized (0..len(string)) __len__ returns self->ob_size __getitem__ calls PySequence_GetSlice() PySliceListObjects are only used internally; they have no Python-level interface. 2. tweak string.join and unicode.join to look for PySliceListObject's, and have special code that copies slices directly from the source strings. (note that a slice list can still be used with any method that expects a sequence of strings, but at a cost) ... give the above, the substitution engine can now create a slice list by combining slices from the match object and the substitution object, and hand the result off to the string implementation; e.g: sep = PySequence_GetSlice(subst_string, 0, 0): result = PyObject_CallMethod(sep, "join", "O", slice_list) Py_DECREF(sep); (can anyone come up with something more elegant than the [0:0] slice?) comments? better ideas? </F>

On Sun, 27 Feb 2000, Fredrik Lundh wrote:
It occurred to me when i read this that *all* slices could be references within the original string, since strings are immutable. That is, s[x:y] could return a PyStringRefObject that behaves just like a string, but contains a length y - x and a pointer to &(s->ob_sval) + x instead of the character data itself. The creation of this PyStringRefObject would increment the reference count of s by 1. Perhaps this has been suggested before. The string methods could transparently work on such PyStringRefObjects, and any extensions that were polite enough to use only the Python API for playing with strings could continue to work fine; but things which directly manipulated the ob_sval field would break. Perhaps a possibility for Python 3K? Anyway -- as a solution for your particular problem, Fredrik, the PySliceListObject sounds like a good idea. -- ?!ng "To be human is to continually change. Your desire to remain as you are is what ultimately limits you." -- The Puppet Master, Ghost in the Shell

Ka-Ping Yee wrote:
The general approach is "cords" (in Hans Boehm's GC, garbage-collected), and "ropes" (in SGI's STL, http://www.sgi.com/Technology/STL/Rope.html, reference-counted). It's a great idea, IMO. Why create and copy strings all the time? -jcw

Each of these "improvements" slows things down in the common case. Believe me, I thought about this a lot when I designed Python's string object. ABC had an extremely complicated string implementation that used tricks like this and was proven to be asymptotically optimal. Unfortunately, the constant factor was large, and it was very slow for typical string ops. KISS, folks! --Guido van Rossum (home page: http://www.python.org/~guido/)

Ka-Ping Yee wrote:
as an experiment, I actually implemented this for the original unicode string type (where "split" and "slice" returned slice references, not string copies). here are some arguments against it: a) bad memory behaviour if you slice small strings out of huge input strings -- which may surprise newbies. b) harder to interface to underlying C libraries -- the current string implementation guarantees that a Python string is also a C string (with a trailing null). personally, I don't care much about (a) (e.g. match objects already keep references to the input string, and if this is a real problem, you can always use a more elaborate data structure...). (b) is a bit harder to ignore, though. </F>

Fredrik Lundh wrote:
... [b strings with NULL]
(b) is a bit harder to ignore, though.
I think the explicit string slice object is a good thing, but changing [:] behavior is not. With special stuff made explicit, it is easy to write optimized code, while the average user is not affected. cheers - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home

[/F, upon the reinvention of substring descriptors]
Experts too. Dragon has gobs of code that copies little strings via loops in Java and C++, because Java's and MFC's descriptor-based string classes routinely keep a megabyte string alive after you've sliced out the 3 bytes <0.5 wink> you needed. Last year my group finally wrote its own string classes, to just copy the damn things. Performance improvement was significant (both space & time). Boehm's "cords"/"ropes" (he's the primary author of both pkgs JC mentioned) were specifically designed to support efficient random & repeated editing of giant mutable strings -- agree with Guido that it's overall major loss for pedestrian uses. Heck, why not implement strings as giant B-trees like the Tcl text widget does <wink>.
c) For apps that use oodles of short strings, the space overhead of maintaining descriptors exceeds that of making copies. A buddy in Sun's Java development group tells me Java is despised for this by Major Players in the DB world; so don't be surprised if Java eventually drops the descriptor idea too (or, more Java-like, introduces 5 new flavors of strings <0.7 wink>). So there's no pure win here. Python's current scheme is at least predictable, and by everyone, with finite effort. Agree you have a particular good but limited use it for it, though, and Greg's suggestion of using buffer objects under the covers is almost certainly "the right" idea.

Tim Peters <tim_one@email.msn.com> wrote:
hmm. I'm not so sure about that... with Greg's scheme, SRE needs to create a list full of buffer objects, while the SliceList scheme involves creating *one* object per substitution -- and to create that object, SRE only needs to copy slots from the match and substitution objects, and bump the reference counts. (...and btw, using raw buffer objects to point into a set of strings of mixed types doesn't sound right to me...) I think I'll stick to the SliceList model, with or without explicit support in the string join methods. </F>

On Sun, 27 Feb 2000, Ka-Ping Yee wrote:
This is exactly what the PyBufferObject does. I just documented the thing in api.tex a week ago or so. Regardless, the thing can operate exactly like a lightweight slice object. It it very similar at the Python level to a string, but it doesn't have the new string methods (yet) :-( If you want a temporary object for your slices (before recomposition with a "".join), then you should be able to use the buffer objects. [ unfortunately, the "".join method is nowhere near as optimal as it could be... it converts elems to string objects during the concatenation; it should have a variant that uses the buffer interface to precalculate the joined size, then use the interface to fetch the data ] Cheers, -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
Why not ? The Unicode implementation has an API PyUnicode_Join() which does eaxctly this: extern DL_IMPORT(PyObject*) PyUnicode_Join( PyObject *separator, /* Separator string */ PyObject *seq /* Sequence object */ ); Note that the PyUnicode_Join() API takes a sequence of Unicode objects, strings or objects providing the charbuf interface, coerces all of these into a Unicode object and then does the joining. There is also a _PyUnicode_Resize() API. It is currently not exported though... but that's easy to fix. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

when hacking on SRE's substitution code, I stumbled upon a problem. to do a substitution, SRE needs to merge slices from the target strings and from the sub- stitution pattern. here's a simple example: re.sub( "(perl|tcl|java)", "python (not \\1)", "perl rules" ) contains a "substitution pattern" consisting of three parts: "python (not " (a slice from the substitution string) group 1 (a slice from the target string) ")" (a slice from the substitution string) PCRE implements this by doing the slicing (thus creating three new strings), and then doing a "join" by hand into a PyString buffer. this isn't very efficient, and it also doesn't work for uni- code strings. in other words, this needs to be fixed. but how? ... here's one proposal, off the top of my head: 1. introduce a PySliceListObject, which behaves like a simple sequence of strings, but stores them as slices. the type structure looks something like this: typedef struct { PyObject* string; int start; int end; } PySliceListItem; typedef struct { PyObject_VAR_HEAD PySliceListItem item[1]; } PySliceListObject; where start and end are normalized (0..len(string)) __len__ returns self->ob_size __getitem__ calls PySequence_GetSlice() PySliceListObjects are only used internally; they have no Python-level interface. 2. tweak string.join and unicode.join to look for PySliceListObject's, and have special code that copies slices directly from the source strings. (note that a slice list can still be used with any method that expects a sequence of strings, but at a cost) ... give the above, the substitution engine can now create a slice list by combining slices from the match object and the substitution object, and hand the result off to the string implementation; e.g: sep = PySequence_GetSlice(subst_string, 0, 0): result = PyObject_CallMethod(sep, "join", "O", slice_list) Py_DECREF(sep); (can anyone come up with something more elegant than the [0:0] slice?) comments? better ideas? </F>

On Sun, 27 Feb 2000, Fredrik Lundh wrote:
It occurred to me when i read this that *all* slices could be references within the original string, since strings are immutable. That is, s[x:y] could return a PyStringRefObject that behaves just like a string, but contains a length y - x and a pointer to &(s->ob_sval) + x instead of the character data itself. The creation of this PyStringRefObject would increment the reference count of s by 1. Perhaps this has been suggested before. The string methods could transparently work on such PyStringRefObjects, and any extensions that were polite enough to use only the Python API for playing with strings could continue to work fine; but things which directly manipulated the ob_sval field would break. Perhaps a possibility for Python 3K? Anyway -- as a solution for your particular problem, Fredrik, the PySliceListObject sounds like a good idea. -- ?!ng "To be human is to continually change. Your desire to remain as you are is what ultimately limits you." -- The Puppet Master, Ghost in the Shell

Ka-Ping Yee wrote:
The general approach is "cords" (in Hans Boehm's GC, garbage-collected), and "ropes" (in SGI's STL, http://www.sgi.com/Technology/STL/Rope.html, reference-counted). It's a great idea, IMO. Why create and copy strings all the time? -jcw

Each of these "improvements" slows things down in the common case. Believe me, I thought about this a lot when I designed Python's string object. ABC had an extremely complicated string implementation that used tricks like this and was proven to be asymptotically optimal. Unfortunately, the constant factor was large, and it was very slow for typical string ops. KISS, folks! --Guido van Rossum (home page: http://www.python.org/~guido/)

Ka-Ping Yee wrote:
as an experiment, I actually implemented this for the original unicode string type (where "split" and "slice" returned slice references, not string copies). here are some arguments against it: a) bad memory behaviour if you slice small strings out of huge input strings -- which may surprise newbies. b) harder to interface to underlying C libraries -- the current string implementation guarantees that a Python string is also a C string (with a trailing null). personally, I don't care much about (a) (e.g. match objects already keep references to the input string, and if this is a real problem, you can always use a more elaborate data structure...). (b) is a bit harder to ignore, though. </F>

Fredrik Lundh wrote:
... [b strings with NULL]
(b) is a bit harder to ignore, though.
I think the explicit string slice object is a good thing, but changing [:] behavior is not. With special stuff made explicit, it is easy to write optimized code, while the average user is not affected. cheers - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home

[/F, upon the reinvention of substring descriptors]
Experts too. Dragon has gobs of code that copies little strings via loops in Java and C++, because Java's and MFC's descriptor-based string classes routinely keep a megabyte string alive after you've sliced out the 3 bytes <0.5 wink> you needed. Last year my group finally wrote its own string classes, to just copy the damn things. Performance improvement was significant (both space & time). Boehm's "cords"/"ropes" (he's the primary author of both pkgs JC mentioned) were specifically designed to support efficient random & repeated editing of giant mutable strings -- agree with Guido that it's overall major loss for pedestrian uses. Heck, why not implement strings as giant B-trees like the Tcl text widget does <wink>.
c) For apps that use oodles of short strings, the space overhead of maintaining descriptors exceeds that of making copies. A buddy in Sun's Java development group tells me Java is despised for this by Major Players in the DB world; so don't be surprised if Java eventually drops the descriptor idea too (or, more Java-like, introduces 5 new flavors of strings <0.7 wink>). So there's no pure win here. Python's current scheme is at least predictable, and by everyone, with finite effort. Agree you have a particular good but limited use it for it, though, and Greg's suggestion of using buffer objects under the covers is almost certainly "the right" idea.

Tim Peters <tim_one@email.msn.com> wrote:
hmm. I'm not so sure about that... with Greg's scheme, SRE needs to create a list full of buffer objects, while the SliceList scheme involves creating *one* object per substitution -- and to create that object, SRE only needs to copy slots from the match and substitution objects, and bump the reference counts. (...and btw, using raw buffer objects to point into a set of strings of mixed types doesn't sound right to me...) I think I'll stick to the SliceList model, with or without explicit support in the string join methods. </F>

On Sun, 27 Feb 2000, Ka-Ping Yee wrote:
This is exactly what the PyBufferObject does. I just documented the thing in api.tex a week ago or so. Regardless, the thing can operate exactly like a lightweight slice object. It it very similar at the Python level to a string, but it doesn't have the new string methods (yet) :-( If you want a temporary object for your slices (before recomposition with a "".join), then you should be able to use the buffer objects. [ unfortunately, the "".join method is nowhere near as optimal as it could be... it converts elems to string objects during the concatenation; it should have a variant that uses the buffer interface to precalculate the joined size, then use the interface to fetch the data ] Cheers, -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
Why not ? The Unicode implementation has an API PyUnicode_Join() which does eaxctly this: extern DL_IMPORT(PyObject*) PyUnicode_Join( PyObject *separator, /* Separator string */ PyObject *seq /* Sequence object */ ); Note that the PyUnicode_Join() API takes a sequence of Unicode objects, strings or objects providing the charbuf interface, coerces all of these into a Unicode object and then does the joining. There is also a _PyUnicode_Resize() API. It is currently not exported though... but that's easy to fix. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
participants (8)
-
Christian Tismer
-
Fredrik Lundh
-
Greg Stein
-
Guido van Rossum
-
Jean-Claude Wippler
-
Ka-Ping Yee
-
M.-A. Lemburg
-
Tim Peters