[Python-Dev] RFC: Add a new builtin strarray type to Python?

Sun Oct 2 15:00:01 CEST 2011

On Saturday 1 October 2011 at 22:21:01, Antoine Pitrou wrote:
> So, since people are confused at the number of possible options, you
> propose to add a new option and therefore increase the confusion?

The idea is to provide an API very close to the str type. So if your program 
becomes slow in some functions and these functions are manipulating strings: 
just try to replace str() by strarray() at the beginning of your loop, and 
redo your benchmark.

I don't know if we really need all the str methods (ljust(), endswith(), 
isspace(), lower(), strip(), ...) or if a UnicodeBuilder supporting in-place 
a+=b would be enough. I suppose it would simply be more practical to have 
the same methods.
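
To make the second option concrete, here is a minimal sketch of what an
append-only builder could look like. UnicodeBuilder is a hypothetical name
here, not an existing type, and the list-of-chunks storage is only an
assumption chosen for illustration.

```python
class UnicodeBuilder:
    """Hypothetical append-only string builder (not an existing Python type)."""

    def __init__(self, initial=""):
        # Assumption: keep the pieces in a list and join them on demand.
        self._chunks = [initial] if initial else []

    def __iadd__(self, text):
        # In-place a += b: remember the chunk, copy nothing yet.
        self._chunks.append(text)
        return self

    def __str__(self):
        # Join once and keep the result as a single chunk.
        result = "".join(self._chunks)
        self._chunks = [result] if result else []
        return result


builder = UnicodeBuilder("abc")
same_object = builder
for word in ("def", "ghi"):
    builder += word
result = str(builder)
```

Such a builder only covers concatenation; none of the other str methods are
available until the final str() call.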

Another useful use case is replacing a substring: with strarray, you can 
use the standard array[a:b] = newsubstring to insert, replace or 
delete. Excerpt from the strarray unit tests:

        abc = strarray('abc')
        abc[:1] = '123' # replace
        self.assertEqual(abc, '123bc')
        abc[3:3] = '45' # insert
        self.assertEqual(abc, '12345bc')
        abc[5:] = '' # delete
        self.assertEqual(abc, '12345')

But only "replace" would be O(1). ("insert" requires less work than a replace 
in a classic str if the replaced string is near the end.) You cannot 
insert/delete using StringIO, str.join, or StringBuilder/UnicodeBuilder, but 
you can with array('u'). Of course, you can also replace a single character: 
strarray[i] = 'x'.

(Using array[a:b]=newstr and array.index(), you can implement your own 
in-place .replace() function.)
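
A rough sketch of such an in-place replace, using a plain list of characters
as a stand-in for strarray (which is only a proposal). The function name
replace_in_place and its return value are assumptions made for this example.

```python
def replace_in_place(buf, old, new):
    """Replace every occurrence of old by new inside buf, in place.

    buf is a mutable list of single characters (stand-in for strarray).
    Returns the number of replacements performed.
    """
    old = list(old)
    new = list(new)
    if not old:
        raise ValueError("empty search string")
    count = 0
    start = 0
    while True:
        try:
            # index() locates the next candidate position.
            start = buf.index(old[0], start)
        except ValueError:
            break
        if buf[start:start + len(old)] == old:
            # Slice assignment replaces the match, resizing if needed.
            buf[start:start + len(old)] = new
            start += len(new)
            count += 1
        else:
            start += 1
    return count


chars = list("abcabc")
replaced = replace_in_place(chars, "bc", "XYZ")
text = "".join(chars)
```

The list object is modified directly, so every reference to it sees the new
content.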

> I don't understand why StringIO couldn't simply be optimized a little
> more, if it needs to.

Honestly, I didn't know that StringIO.write() is more efficient than str+=str, 
and it is surprising to use the io module (which is supposed to be related to 
files) to manipulate strings. But maybe we can document this "trick" (is it a 
trick or not?) in the str documentation (and in the FAQ, and on 
stackoverflow.com, and ...).
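
For reference, the two approaches look like this. It is a small sketch that
only shows both produce the same string; the helper names are made up for the
example, and no timing is claimed here.

```python
import io


def build_with_concat(parts):
    # Repeated str += str: each step may allocate a new string.
    text = ""
    for part in parts:
        text += part
    return text


def build_with_stringio(parts):
    # The StringIO "trick": write each part, read the result once.
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part)
    return buffer.getvalue()


parts = ["spam", " ", "eggs"] * 3
concat_result = build_with_concat(parts)
stringio_result = build_with_stringio(parts)
```

To compare speed on a given Python version, both functions can be passed to
timeit with a much longer list of parts.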

> Or, if straightforward string concatenation really needs to be fast,
> then str + str should be optimized (like it used to be).

We cannot have the best performance and the lowest memory usage at the same 
time with the new str implementation (PEP 393). The new implementation is 
even more focused on read-only (constant) strings than the previous one 
(Py_UNICODE array using two memory blocks).

PEP 393 uses one memory block, so you cannot resize a str object anymore. The 
old str type, StringIO, array (and strarray) use two memory blocks, so it is 
possible to resize them (objects keep their identity after the resize).
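
The "same object after resize" behaviour can be observed from Python with any
resizable builtin. The sketch below uses bytearray as a stand-in, since
strarray does not exist; it shows only the visible behaviour, not the memory
layout.

```python
buf = bytearray(b"abc")
ident_before = id(buf)
buf += b"def" * 1000        # grows the internal buffer in place
ident_after = id(buf)

text = "abc"
try:
    text[0] = "x"           # str offers no in-place modification
    str_is_mutable = True
except TypeError:
    str_is_mutable = False
```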

It *might* be possible to implement a strarray that is fast on concatenation 
and has a small memory footprint, but we cannot use it for the str type 
because str is immutable in Python.

--

On second thought, it may be easy to implement strarray if it reuses 
unicodeobject.c. For example, strarray can be a special (mutable) case of 
PyUnicodeObject (which uses two memory blocks): the string would always be 
ready, but never compact.

By the way, bytesobject.c and bytearrayobject.c are a fiasco: most functions 
are duplicated even though the code is very similar. A big refactor is 
required to remove the duplicate code there.

Victor