[Python-3000] Making more effective use of slice objects in Py3k

Mon Aug 28 22:07:55 CEST 2006

Those are all microbenchmarks. It's easy to prove the superiority of
an approach that way. But what about realistic applications? What if
your views don't end up saving memory or time for an application, but
still cost in terms of added complexity in all string operations?

Anyway, let me begin with your  microbenchmark.

The first one pits a linear algorithm against a quadratic algorithm
with the expected result.

The second one is more interesting; your version doesn't copy while
the split() version copies, and that gives your version the expected
speedup. I never doubted this.

But your code has a worst-case problem: if you take a single short
view of a really long string and then drop the long string, the view
keeps it around. Something like this:

rest = ... # your mailbox file
results = []
for i in range(1000):
  x = rest + "." # Just to force a copy
  results.append(x.partition("\r\n.\r\n")[0]) Save the *first* message
over and over

Now watch the memory growth with your version vs. with standard partition.

Now fix this in your code and re-run your benchmark.

Then I come with another worst-case scenario, etc.

Then I ask you to make it so that string views are 99.999%
indistinguishable from strings -- they have all the same methods, are
usable everywhere else, etc.

--Guido

On 8/28/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Guido van Rossum" <guido at python.org> wrote:
> >
> > Josiah (and other supporters of string views),
> >
> > You seem to be utterly convinced of the superior performance of your
> > proposal without having done any measurements.
> >
> > You appear to have a rather naive view on what makes code execute fast
> > or slow (e.g. you don't seem to appreciate the savings due to a string
> > object header and its data being consecutive in memory).
> >
> > Unless you have serious benchmark data (for realistic Python code) I
> > can't continue to participate in this discussion, where you have said
> > nothing new in many posts.
>
> Put up or shut up, eh?
>
> I have written a simple extension module using Pyrex (my manual C
> extension writing is awful).  Here are some sample interactions showing
> that string views are indeed quite fast.  In all of these examples, a
> naive implementation using only stringview.partition() was able to beat
> Python 2.5 str.partition, str.split, and re.finditer.
>
> Attached you will find the implementation of stringview I used, along
> with sufficient build scripts to get it working using Python 2.3 and
> Pyrex 0.9.3 .  Aside from replacing int usage with Py_ssize_t for 2.5,
> and *nix users performing a dos2unix call, it should work without change
> with the most recent Python and Pyrex versions.
>
>  - Josiah
>
>
> Using 2.3 :
>     >>> x = stringview(40000*' ')
>     >>> if 1:
>     ...     t = time.time()
>     ...     while x:
>     ...             _1, _2, x = x.partition(' ')
>     ...     print time.time()-t
>     ...
>     0.18700003624
>     >>>
>
> Compared with Python 2.5 beta 2
>     >>> x = 40000*' '
>     >>> if 1:
>     ...     t = time.time()
>     ...     while x:
>     ...             _1, _2, x = x.partition(' ')
>     ...     print time.time()-t
>     ...
>     0.625
>     >>>
>
> But that's about as bad for Python 2.5 as it can get.  What about
> something else?  Like a mail file?  In my 21.5 meg archive of py3k,
> which contains 3456 messages, I wanted to discover all messages.
>
> Python 2.3.5 (#62, Feb  8 2005, 16:23:02) [MSC v.1200 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from stringview import *
> >>> rest = stringview(open('mail', 'rb').read())
> >>> import time
> >>> if 1:
> ...     x = []
> ...     t = time.time()
> ...     while rest:
> ...         cur, found, rest = rest.partition('\r\n.\r\n')
> ...         x.append(cur)
> ...     print time.time()-t, len(x)
> ...
> 0.0780000686646 3456
> >>>
>
> What about Python 2.5 using split?  That should be fast...
>
> Python 2.5b2 (r25b2:50512, Jul 11 2006, 10:16:14) [MSC v.1310 32 bit (Intel)] on
>  win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> rest = open('mail', 'rb').read()
> >>> import time
> >>> if 1:
> ...     t = time.time()
> ...     x = rest.split('\r\n.\r\n')
> ...     print time.time()-t, len(x)
> ...
> 0.109999895096 3457
> >>>
>
> Hrm...what about using re?
> >>> import re
> >>> pat = re.compile('\r\n\.\r\n')
> >>> rest = open('mail', 'rb').read()
> >>> import time
> >>> if 1:
> ...     x = []
> ...     t = time.time()
> ...     for i in pat.finditer(rest):
> ...         x.append(i)
> ...     print time.time()-t, len(x)
> ...
> 0.125 3456
> >>>
>
> Even that's not as good as Python 2.3 + string views.
>
>
>
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)