Fredrik Lundh wrote:
when hacking on SRE's substitution code, I stumbled upon a problem. to do a substitution, SRE needs to merge slices from the target strings and from the substitution pattern.
here's a simple example:
re.sub( "(perl|tcl|java)", "python (not \\1)", "perl rules" )
contains a "substitution pattern" consisting of three parts:
    "python (not " (a slice from the substitution string)
    group 1 (a slice from the target string)
    ")" (a slice from the substitution string)
PCRE implements this by doing the slicing (thus creating three new strings), and then doing a "join" by hand into a PyString buffer.
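As a rough illustration of that slice-and-join approach, here is a minimal Python sketch. The helper name expand_template is hypothetical and is not part of SRE or PCRE; it only handles single-digit group references such as \1, and it is not meant to reflect how either engine is actually written.

```python
import re

def expand_template(match, template):
    """Hypothetical helper: expand single-digit \\N references in template."""
    pieces = []
    pos = 0
    for ref in re.finditer(r"\\(\d)", template):
        # literal text: a slice from the substitution string
        pieces.append(template[pos:ref.start()])
        # group reference: a slice from the target string
        pieces.append(match.group(int(ref.group(1))))
        pos = ref.end()
    # trailing literal text after the last group reference
    pieces.append(template[pos:])
    # a single join over all collected slices
    return "".join(pieces)

target = "perl rules"
m = re.search("(perl|tcl|java)", target)
expanded = expand_template(m, "python (not \\1)")
result = target[:m.start()] + expanded + target[m.end():]
```

For the example above, the collected pieces are the three slices listed earlier, and result should match what re.sub returns for the same arguments.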
this isn't very efficient, and it also doesn't work for unicode strings.
Why not? The Unicode implementation has an API PyUnicode_Join() which does exactly this:
    extern DL_IMPORT(PyObject*) PyUnicode_Join(
        PyObject *separator,    /* Separator string */
        PyObject *seq           /* Sequence object */
        );
Note that the PyUnicode_Join() API takes a sequence of Unicode objects, strings or objects providing the charbuf interface, coerces all of these into a Unicode object and then does the joining.
There is also a _PyUnicode_Resize() API. It is currently not exported though... but that's easy to fix.