Mailman 3 Re: [Patches] Readline replacement under QNX in myreadline.c - Python-Dev

Fredrik Lundh

February 2000

7:01 a.m.

New subject: RFD: how to build strings from lots of slices?

when hacking on SRE's substitution code, I stumbled upon a problem. to do a substitution, SRE needs to merge slices from the target strings and from the substitution pattern.

here's a simple example:

    re.sub(
        "(perl|tcl|java)",
        "python (not \\1)",
        "perl rules"
        )

contains a "substitution pattern" consisting of three parts:

    "python (not " (a slice from the substitution string)
    group 1 (a slice from the target string)
    ")" (a slice from the substitution string)

PCRE implements this by doing the slicing (thus creating three new strings), and then doing a "join" by hand into a PyString buffer.

this isn't very efficient, and it also doesn't work for unicode strings.

in other words, this needs to be fixed. but how?

...

here's one proposal, off the top of my head:

1. introduce a PySliceListObject, which behaves like a simple sequence of strings, but stores them as slices. the type structure looks something like this:

    typedef struct {
        PyObject* string;
        int start;
        int end;
    } PySliceListItem;

    typedef struct {
        PyObject_VAR_HEAD
        PySliceListItem item[1];
    } PySliceListObject;

where start and end are normalized (0..len(string))

    __len__ returns self->ob_size
    __getitem__ calls PySequence_GetSlice()

PySliceListObjects are only used internally; they have no Python-level interface.

2. tweak string.join and unicode.join to look for PySliceListObject's, and have special code that copies slices directly from the source strings.

(note that a slice list can still be used with any method that expects a sequence of strings, but at a cost)

...

given the above, the substitution engine can now create a slice list by combining slices from the match object and the substitution object, and hand the result off to the string implementation; e.g:

    sep = PySequence_GetSlice(subst_string, 0, 0);
    result = PyObject_CallMethod(sep, "join", "O", slice_list);
    Py_DECREF(sep);

(can anyone come up with something more elegant than the [0:0] slice?)

comments? better ideas?

</F>
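The proposal above can be modelled in pure Python to make the data flow concrete. This is only a rough sketch, not SRE's actual code: `SliceList`, `triples` and `join_slices` are hypothetical names, the example works on bytes rather than str so that slices can be copied into a preallocated buffer, and `memoryview` stands in for the direct memory copy a C implementation would do.

```python
import re
from typing import List, Tuple


class SliceList:
    """Hypothetical model of the proposed PySliceListObject:
    a sequence of strings stored as (source, start, end) triples."""

    def __init__(self) -> None:
        self._items: List[Tuple[bytes, int, int]] = []

    def append(self, source: bytes, start: int, end: int) -> None:
        # Normalize the bounds to 0..len(source), as the proposal requires.
        start, end, _ = slice(start, end).indices(len(source))
        self._items.append((source, start, max(start, end)))

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> bytes:
        # Generic sequence access: works everywhere, but creates a copy.
        source, start, end = self._items[index]
        return source[start:end]

    def triples(self) -> List[Tuple[bytes, int, int]]:
        return list(self._items)


def join_slices(sep: bytes, parts) -> bytes:
    """Join that special-cases SliceList and copies each slice once."""
    if not isinstance(parts, SliceList):
        return sep.join(parts)  # ordinary path for any other sequence
    triples = parts.triples()
    if not triples:
        return b""
    total = sum(end - start for _, start, end in triples)
    total += len(sep) * (len(triples) - 1)
    out = bytearray(total)  # result buffer sized up front
    pos = 0
    for i, (source, start, end) in enumerate(triples):
        if i:
            out[pos:pos + len(sep)] = sep
            pos += len(sep)
        n = end - start
        # memoryview slicing avoids an intermediate bytes object.
        out[pos:pos + n] = memoryview(source)[start:end]
        pos += n
    return bytes(out)


# The re.sub example from the message, assembled by hand from slices.
target = b"perl rules"
template = b"python (not \\1)"
match = re.search(b"(perl|tcl|java)", target)
ref = template.index(b"\\1")  # position of the \1 group reference

parts = SliceList()
parts.append(template, 0, ref)                  # "python (not "
parts.append(target, *match.span(1))            # group 1
parts.append(template, ref + 2, len(template))  # ")"
parts.append(target, match.end(), len(target))  # unmatched tail
result = join_slices(b"", parts)
```

The fourth slice (the unmatched tail of the target) is not part of the substitution pattern itself; it is included here so that `result` can be compared with what `re.sub` returns for the same inputs.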


Jean-Claude Wippler

7:23 a.m.

New subject: RFD: how to build strings from lots of slices?

Ka-Ping Yee wrote:

...

The general approach is "cords" (in Hans Boehm's GC, garbage-collected), and "ropes" (in SGI's STL, http://www.sgi.com/Technology/STL/Rope.html, reference-counted).

It's a great idea, IMO. Why create and copy strings all the time?

-jcw


Fredrik Lundh

7:41 a.m.

New subject: RFD: how to build strings from lots of slices?

Ka-Ping Yee wrote:

...

as an experiment, I actually implemented this for the original unicode string type (where "split" and "slice" returned slice references, not string copies). here are some arguments against it:

a) bad memory behaviour if you slice small strings out of huge input strings -- which may surprise newbies.

b) harder to interface to underlying C libraries -- the current string implementation guarantees that a Python string is also a C string (with a trailing null).

personally, I don't care much about (a) (e.g. match objects already keep references to the input string, and if this is a real problem, you can always use a more elaborate data structure...).

(b) is a bit harder to ignore, though.

</F>
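Argument (a) can be reproduced in current Python with `memoryview`, whose slices are references into the source buffer rather than copies. This is a small illustration under that assumption, not the experimental unicode type described above; the helper names are made up for the example.

```python
def slice_by_reference(data: bytes, start: int, end: int) -> memoryview:
    # No bytes are copied; the view holds a reference to all of `data`.
    return memoryview(data)[start:end]


def slice_by_copy(data: bytes, start: int, end: int) -> bytes:
    # Ordinary slicing: a new, independent object holding only the slice.
    return data[start:end]


big = bytes(1_000_000)  # stand-in for a huge input string
ref = slice_by_reference(big, 0, 3)
copied = slice_by_copy(big, 0, 3)

# The 3-byte view still pins the full megabyte: dropping the name `big`
# would not release the buffer while `ref` is alive.
retained = len(ref.obj)
```

Here `retained` is the size of the buffer that the three-byte view keeps alive, while `copied` owns only its own three bytes.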


Tim Peters

6:19 p.m.

New subject: RFD: how to build strings from lots of slices?

[/F, upon the reinvention of substring descriptors]

...

Experts too. Dragon has gobs of code that copies little strings via loops in Java and C++, because Java's and MFC's descriptor-based string classes routinely keep a megabyte string alive after you've sliced out the 3 bytes <0.5 wink> you needed. Last year my group finally wrote its own string classes, to just copy the damn things. Performance improvement was significant (both space & time). Boehm's "cords"/"ropes" (he's the primary author of both pkgs JC mentioned) were specifically designed to support efficient random & repeated editing of giant mutable strings -- agree with Guido that it's overall major loss for pedestrian uses. Heck, why not implement strings as giant B-trees like the Tcl text widget does <wink>.

...

c) For apps that use oodles of short strings, the space overhead of maintaining descriptors exceeds that of making copies. A buddy in Sun's Java development group tells me Java is despised for this by Major Players in the DB world; so don't be surprised if Java eventually drops the descriptor idea too (or, more Java-like, introduces 5 new flavors of strings <0.7 wink>).

So there's no pure win here. Python's current scheme is at least predictable, and by everyone, with finite effort. Agree you have a particular good but limited use for it, though, and Greg's suggestion of using buffer objects under the covers is almost certainly "the right" idea.
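Point (c) can be checked roughly on CPython by comparing the size of a slice descriptor with the size of a plain copy of a short string. The sketch below uses `memoryview` as the descriptor, which is an assumption made for illustration: it is a heavier object than a bare (pointer, start, end) triple, and `sys.getsizeof` reports implementation-specific numbers that vary between versions and platforms.

```python
import sys

short = b"abc"

descriptor = memoryview(short)[0:3]  # reference to a slice of `short`
copy = bytes(short[0:3])             # plain 3-byte copy

# Bytes occupied by each object itself; the descriptor also keeps
# `short` alive, which is not counted here.
descriptor_size = sys.getsizeof(descriptor)
copy_size = sys.getsizeof(copy)
```

For strings this short, the fixed bookkeeping of the descriptor would be expected to outweigh the payload of the copy, which is the trade-off the message describes.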


Greg Stein

1:42 p.m.

New subject: RFD: how to build strings from lots of slices?

On Sun, 27 Feb 2000, Ka-Ping Yee wrote:

...

This is exactly what the PyBufferObject does. I just documented the thing in api.tex a week ago or so.

Regardless, the thing can operate exactly like a lightweight slice object. It is very similar at the Python level to a string, but it doesn't have the new string methods (yet) :-(

If you want a temporary object for your slices (before recomposition with a "".join), then you should be able to use the buffer objects.

[ unfortunately, the "".join method is nowhere near as optimal as it could be... it converts elems to string objects during the concatenation; it should have a variant that uses the buffer interface to precalculate the joined size, then use the interface to fetch the data ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
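The join variant suggested in the bracketed note might look roughly like the following in current Python, where `memoryview` has replaced the old buffer object. `join_buffers` is a hypothetical helper, not an existing API, and it assumes every element exposes a contiguous byte buffer.

```python
from typing import Iterable


def join_buffers(sep: bytes, parts: Iterable) -> bytes:
    """Join bytes-like objects through the buffer interface."""
    # First pass: take a view of each element (no data copied) and
    # precalculate the size of the joined result.
    views = [memoryview(part) for part in parts]
    sep_view = memoryview(sep)
    total = sum(view.nbytes for view in views)
    total += sep_view.nbytes * max(len(views) - 1, 0)

    # Second pass: fetch the data from each buffer into the result.
    out = bytearray(total)
    pos = 0
    for i, view in enumerate(views):
        if i:
            out[pos:pos + sep_view.nbytes] = sep_view
            pos += sep_view.nbytes
        out[pos:pos + view.nbytes] = view
        pos += view.nbytes
    return bytes(out)


source = b"perl rules"
pieces = [memoryview(source)[0:4], b"python", bytearray(b"!")]
joined = join_buffers(b" ", pieces)
```

No element is converted to a string object before the copy; the only extra allocation is the final `bytes(out)`, which a C implementation could avoid by writing straight into the result string.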


