[Python-3000] Making more effective use of slice objects in Py3k

Mon Aug 28 04:43:36 CEST 2006

"Guido van Rossum" <guido at python.org> wrote:
> 
> On 8/27/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > [1] When I say "tree persistence", I mean those cases like a -> b -> c,
> > where view b persists because view a persists, even though b doesn't have
> > a reference otherwise.  Making both views a and b reference c directly
> > allows for b to be freed when it is no longer used.
> 
> Yeah, but you're still keeping c alive, which is the real memory waste.

It depends on the application.

1. Let us say I was parsing XML.  Rather than allocating a bunch of small
strings for the various tags, attributes, and data, I could instead
allocate a bunch of string views with pointers into the one larger XML
string.

Because all of the views are the same size, we can use a free list and
optimize allocation, deallocation, etc.  Small strings, on the other
hand, can't have such optimizations, and we would end up fragmenting
memory over a long series of XML parsings (possibly leading to an
eventual MemoryError).

Even better, if the underlying parsing mechanism expects to receive a
string and we pass it a string view instead, then with a proper
string+view implementation it would never need to know that it is
working on views.  It would just work, and we would receive the parsed
results as views instead of sliced strings.
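
As a rough sketch of the XML case, here is what it could look like using
a memoryview over a bytes buffer as a stand-in for the string views I am
describing.  The item_views helper, the regular expression, and the sample
document are all made up for illustration; a real parser would do far more.

```python
import re

# One large buffer; every tag attribute and text node below is a view into it.
xml = b"<root><item id='1'>alpha</item><item id='2'>beta</item></root>"

# Hypothetical pattern for this toy document only.
TAG = re.compile(rb"<item id='(\d+)'>([^<]*)</item>")


def item_views(data):
    """Yield (id_view, text_view) pairs as memoryview slices of `data`.

    Slicing a memoryview does not copy the bytes; each slice refers
    back to the original buffer.
    """
    buf = memoryview(data)
    for m in TAG.finditer(data):
        yield buf[m.start(1):m.end(1)], buf[m.start(2):m.end(2)]


items = list(item_views(xml))
for id_view, text_view in items:
    # tobytes() copies; it is only used here to display the result.
    print(id_view.tobytes(), text_view.tobytes())
```

Note that every slice refers to the one underlying buffer directly (its
.obj attribute is the original bytes object), so no view keeps another
view alive.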

2. Another example is the parsing of email or any other [header, blank
line, body] structured data (and even MIME-like headers).  Say you have
read in a single email, you can have a view (or views) of the various
headers, with the multipart body, etc., and wouldn't need to copy
anything. Never mind that one could easily handle the insertion of
headers, body portions, etc., all without slicing the original (possibly
large) email, allowing for the easy manipulation of data with little
memory overhead.

Heck, one could even read in an entire mbox-formatted file, pull out all
of the original emails, rearrange them (re-sort the folder by sent
date or received time), and write them back to disk, again without ever
slicing up the original mailbox file, resulting in roughly 1/2 the
memory overhead of an equivalent operation using string slicing.
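
A minimal sketch of the [header, blank line, body] case, again assuming a
memoryview over bytes in place of a real string view.  The split_message
helper, the sample message, and the X-Seen header are hypothetical; real
email needs CRLF handling, folded headers, and so on.

```python
def split_message(raw):
    """Return (headers_view, body_view) for a header/blank-line/body message.

    Both results are memoryview slices of `raw`; nothing is copied.
    """
    view = memoryview(raw)
    sep = raw.find(b"\n\n")
    if sep == -1:
        # No blank line: treat everything as headers, with an empty body view.
        return view, view[len(raw):]
    return view[:sep], view[sep + 2:]


raw = b"From: a@example.org\nSubject: hi\n\nHello\n"
headers, body = split_message(raw)

# "Inserting" a header is just adding a small piece to a list of pieces;
# the original message is only copied once, when the output is assembled.
pieces = [headers, b"\nX-Seen: yes", b"\n\n", body]
out = b"".join(pieces)
print(out)
```

Reordering the messages of an mbox would work the same way: keep a list
of views, sort the list, and join or write the pieces at the end.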

3. In the 2.x byte string case (str, not unicode), we have seen from the
various str.find() to str.partition() conversions that chopping up data
isn't uncommon, and that generally most pieces are used, meaning that
roughly the memory of the original string is going to persist in memory
anyway.

Also, I would just like to state that I am not advocating the automatic
creation of views by string operations; one should always construct
the views explicitly, with something like view = stringview(st).
Then the operations on the view should return further views and perhaps
occasionally strings, but operations on strings should never return
views.
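
To make the intended semantics concrete, here is a toy sketch of such a
view type.  StringView and everything in it is hypothetical, not an
existing or proposed API; it only supports step-1 slices, find(), and
partition().  Each view stores the base string plus offsets, so a view of
a view points at the base string directly rather than at its parent view.

```python
class StringView:
    """Hypothetical read-only view over a str (illustration only)."""

    __slots__ = ("base", "start", "stop")

    def __init__(self, base, start=0, stop=None):
        self.base = base
        self.start = start
        self.stop = len(base) if stop is None else stop

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        lo, hi, step = index.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports step 1")
        hi = max(hi, lo)
        # The new view references self.base, never self.
        return StringView(self.base, self.start + lo, self.start + hi)

    def find(self, sub):
        pos = self.base.find(sub, self.start, self.stop)
        return -1 if pos == -1 else pos - self.start

    def partition(self, sep):
        """Like str.partition(), but the three results are views."""
        if not sep:
            raise ValueError("empty separator")
        pos = self.find(sep)
        if pos == -1:
            empty = self[len(self):]
            return self, empty, empty
        return self[:pos], self[pos:pos + len(sep)], self[pos + len(sep):]

    def __str__(self):
        # Explicitly materialize a real string (this copies).
        return self.base[self.start:self.stop]


st = "key=value; other"
view = StringView(st)                # views are always created explicitly
head, sep, tail = view.partition("=")
print(str(head), str(tail[:5]))
```

Plain string operations such as st[:3] or st.partition("=") still return
ordinary strings; only operations on a view return further views.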

---
Speaking of the 2.x byte strings and using str.partition() in 3.x, if
2.x strings are going away in 3.x, shouldn't we be transitioning
everything to use either bytes or unicode?  An initial translation of the
standard library to use partition/index seems like a huge time
investment, unless it is planned to be backported to the trunk for
2.6.

Which reminds me, on August 28, 2005, Raymond sent me an initial
find -> partition patch for the full 2.5 standard library as it stood at
the time.  I can provide everyone with that patch along with my comments,
which may or may not be enough to transition most of the standard
library today.

 - Josiah