[Python-Dev] RFC: Add a new builtin strarray type to Python?
Victor Stinner
victor.stinner at haypocalc.com
Sun Oct 2 15:00:01 CEST 2011
On Saturday 1 October 2011 22:21:01, Antoine Pitrou wrote:
> So, since people are confused at the number of possible options, you
> propose to add a new option and therefore increase the confusion?
The idea is to provide an API very close to the str type. So if your program
becomes slow in some functions and these functions are manipulating strings:
just try to replace str() by strarray() at the beginning of your loop, and
redo your benchmark.
I don't know if we really need all the str methods (ljust(), endswith(),
isspace(), lower(), strip(), ...) or if a UnicodeBuilder supporting in-place
a+=b would be enough. I suppose it would simply be more practical to have
the same methods.
Another use case is replacing a substring: with strarray, you can use the
standard array[a:b] = newsubstring to insert, replace or delete. Extract
from the strarray unit tests:
abc = strarray('abc')
abc[:1] = '123' # replace
self.assertEqual(abc, '123bc')
abc[3:3] = '45' # insert
self.assertEqual(abc, '12345bc')
abc[5:] = '' # delete
self.assertEqual(abc, '12345')
But only "replace" would be O(1). ("insert" requires less work than a replace
in a classic str if the replaced string is near the end.) You cannot
insert/delete using StringIO, str.join, or StringBuilder/UnicodeBuilder, but
you can using array('u'). Of course, you can replace a single character:
strarray[i] = 'x'.
(Using array[a:b]=newstr and array.index(), you can implement your own
in-place .replace() function.)
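As a rough sketch of that idea: strarray does not exist in the standard library, so the code below uses a plain list of characters as a stand-in, since a list supports the same slice assignment and index() operations. The function name replace_inplace is hypothetical.

```python
def replace_inplace(chars, old, new):
    """Replace every occurrence of `old` by `new` inside the list `chars`.

    `chars` is a list of 1-character strings (stand-in for a strarray).
    Returns the number of replacements; the list object itself is modified.
    """
    if not old:
        raise ValueError("old must not be empty")
    old_chars = list(old)
    start = 0
    count = 0
    while True:
        try:
            # index() only finds a single item, so look for the first
            # character of `old` and then compare the whole slice.
            pos = chars.index(old_chars[0], start)
        except ValueError:
            return count
        if chars[pos:pos + len(old_chars)] == old_chars:
            # Slice assignment grows or shrinks the list in place.
            chars[pos:pos + len(old_chars)] = new
            start = pos + len(new)
            count += 1
        else:
            start = pos + 1


buf = list("one two one")
replaced = replace_inplace(buf, "one", "three")
result = "".join(buf)
```

Each replacement that changes the length still has to shift the tail of the buffer, so this is not O(1) per replacement.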
> I don't understand why StringIO couldn't simply be optimized a little
> more, if it needs to.
Honestly, I didn't know that StringIO.write() is more efficient than str+=str,
and it is surprising to use the io module (which is supposed to be related to
files) to manipulate strings. But maybe we could document some "trick" (is it a
trick or not?) in the str documentation (and in the FAQ, and on
stackoverflow.com, and ...).
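For reference, the StringIO "trick" looks roughly like this. It is only a sketch: the helper name build_with_stringio is made up for illustration, and whether it actually beats str+=str should be benchmarked on the Python version at hand.

```python
import io


def build_with_stringio(parts):
    """Concatenate an iterable of str using an in-memory text buffer."""
    buf = io.StringIO()
    for part in parts:
        buf.write(part)      # appends to the buffer, no new str per step
    return buf.getvalue()    # one final str is built here


text = build_with_stringio(["spam", " ", "eggs"])
```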
> Or, if straightforward string concatenation really needs to be fast,
> then str + str should be optimized (like it used to be).
We cannot have the best performance and the lowest memory usage at the same
time with the new str implementation (PEP 393). The new implementation is even
more focused on read-only (constant) strings than the previous one (Py_UNICODE
array using two memory blocks).
PEP 393 uses one memory block, so you cannot resize a str object anymore. The
old str type, StringIO, array (and strarray) use two memory blocks, so it is
possible to resize them (objects keep their identity after the resize).
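The "keeps its identity after a resize" point can be observed from Python with an existing mutable type. The sketch below uses bytearray as a stand-in for strarray (an assumption, since strarray is only a proposal) and contrasts it with str, where += rebinds the name to a different object whenever another reference to the old string exists.

```python
# Mutable buffer: += resizes the same object in place.
buf = bytearray(b"abc")
alias = buf
buf += b"def"
same_object = alias is buf   # both names still refer to one object

# Immutable str: += produces another object, the old one is unchanged.
s = "abc"
t = s
s += "def"
```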
It *might* be possible to implement a strarray that is fast on concatenation
and has a small memory footprint, but we cannot use it for the str type because
str is immutable in Python.
--
On second thought, it may be easy to implement strarray if it reuses
unicodeobject.c. For example, strarray can be a special (mutable) case of
PyUnicodeObject (which uses two memory blocks): the string would always be
ready, but never compact.
By the way, bytesobject.c and bytearrayobject.c are a fiasco: most functions are
duplicated even though the code is nearly identical. A big refactoring is
required to remove the duplicate code there.
Victor