[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 19:26:34 CEST 2005

On Wed, 2005-08-24 at 07:33, "Martin v. Löwis" wrote:
> Walter Dörwald wrote:
> > Martin v. Löwis wrote:
> > 
> >> Walter Dörwald wrote:
[...]
> Actually, on a second thought - it would not remove the quadratic
> aspect. You would still copy the rest string completely on each
> split. So on the first split, you copy N lines (one result line,
> and N-1 lines into the rest string), on the second split, N-2
> lines, and so on, totalling N*N/2 line copies again. The only
> thing you save is the join (as the rest is already joined), and
> the IsLineBreak calls (which are necessary only for the first
> line).
[...]
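
To make the arithmetic in the quoted paragraph concrete, here is a rough
model of the "peel off one line, keep the rest" approach. The function name
and the way copies are counted are my own assumptions for illustration; it
only counts the characters that each split would have to copy, it does not
measure what CPython actually does internally.

```python
def copies_when_peeling_lines(text):
    """Split off one line at a time and count the characters copied.

    Hypothetical helper: 'copied' is a model of the work done, assuming
    each split builds a new string for the line and for the rest.
    """
    copied = 0
    lines = []
    rest = text
    while rest:
        # partition() returns the first line, the separator and the rest
        line, sep, rest = rest.partition("\n")
        # model: the line, the separator and the whole rest are all copied
        copied += len(line) + len(sep) + len(rest)
        lines.append(line)
    return lines, copied


lines, copied = copies_when_peeling_lines("ab\n" * 4)
```

For four lines of three characters each, this model counts
12 + 9 + 6 + 3 = 30 characters copied, i.e. it grows with N*(N+1)/2 rather
than with N.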

In the past, I've avoided the string copy overhead inherent in split()
by using buffers...
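
As a minimal sketch of the idea (not the code I actually used), the
following uses Python 3's memoryview as a stand-in for the old buffer type.
The function name is made up for this example, and it only treats "\n" as a
line break, so it is not a replacement for splitlines().

```python
def iter_line_views(data):
    """Yield one memoryview slice per line of a bytes object.

    Hypothetical helper: each slice references 'data' instead of
    copying it. Only b"\\n" is treated as a line break.
    """
    view = memoryview(data)
    start = 0
    while start < len(data):
        end = data.find(b"\n", start)
        if end == -1:
            # last line has no trailing newline
            end = len(data)
        # slicing a memoryview does not copy the underlying bytes
        yield view[start:end]
        start = end + 1


data = b"one\ntwo\nthree"
views = list(iter_line_views(data))
```

Each element of views still points at the original data object; a copy is
only made if you explicitly ask for one, e.g. with bytes(views[0]).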

I've always wondered why Python didn't use buffer type tricks internally
for split-type operations. I haven't looked at Python's string
implementation, but the fact that strings are immutable surely means
that you can safely and efficiently reference an implementation-level
"data" object for all strings... i.e. all strings are "buffers".

The only problem I can see with this is that huge "data" objects might hang
around just because some small fragment of one is still referenced by a
string. Surely a simple heuristic or two like "if len(string) <
len(data)/8: copy data; else: reference data" would go a long way
towards avoiding that.
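
A rough sketch of that heuristic, again using bytes and memoryview as
stand-ins since I haven't looked at the real string implementation. The
function name and the ratio parameter are assumptions for illustration
only; the default of 8 is just the number from the sentence above.

```python
def substring(data, start, stop, ratio=8):
    """Return data[start:stop], copying only when the fragment is small.

    Hypothetical helper: small fragments are copied so they do not keep
    a huge 'data' object alive; larger ones just reference it.
    """
    fragment_len = stop - start
    if fragment_len < len(data) / ratio:
        # small fragment: copy it and let 'data' be freed independently
        return bytes(data[start:stop])
    # large fragment: reference the existing data, no copy
    return memoryview(data)[start:stop]


data = bytes(range(256)) * 4      # 1024 bytes
small = substring(data, 0, 10)    # copied into a new bytes object
large = substring(data, 0, 512)   # memoryview referencing data
```

The threshold trades memory retention against copy cost, so the right
ratio would have to come from benchmarking rather than from this sketch.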

In my limited playing around with string manipulation and benchmarking,
the biggest overhead is nearly always the copies.

-- 
Donovan Baarda <abo at minkirri.apana.org.au>