Use of StringIO vs cStringIO in standard modules

Thu Jun 3 14:13:50 EDT 1999

Hrvoje Niksic <hniksic at srce.hr> wrote:
: Guido van Rossum <guido at CNRI.Reston.VA.US> writes:

:> Hrvoje Niksic <hniksic at srce.hr>:
:> 
:> > I noticed that many standard modules use StringIO and not
:> > cStringIO, although they don't need subclassing.  Is this
:> > intentional?
:> > 
:> > For example, base64.py uses StringIO to implement encodestring()
:> > and decodestring().  Since both functions write to output line by
:> > line, I imagine the performance hit of StringIO vs cStringIO might
:> > be non-negligible.
:> 
:> Have you noticed any speed difference?

: Yes, quite a bit.  Trivially replacing StringIO with cStringIO in
: base64.py makes encoding 2.3 times and decoding 3.6 times faster.
: That's on my system (Ultra 2 under Solaris 2.6), measured repeatedly
: with time.clock() and an approx. 1M sample string.  I can post the
: script if there is interest.

: Maybe the correct solution for base64.py would be to use something
: like this at top-level:

: try:
:     from cStringIO import StringIO
: except ImportError:
:     from StringIO import StringIO

:> cPickle, because calling it from C is much faster than calling
:> StringIO from C; however I believe that for calls from Python,
:> StringIO isn't that much slower.

: I've looked at the code, and to me it seems that the slowness comes
: from creating new strings on each write, where cStringIO just resizes
: its internal buffer and creates the string only at the end.
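
To make that point concrete, here is a rough sketch of the two strategies.
Neither class is the actual module code; ConcatBuffer and ChunkBuffer are
hypothetical names, and the list-and-join approach only approximates what a
resizable C buffer does (pay for building the final string once, at the end).

```python
class ConcatBuffer:
    """Hypothetical buffer that builds a brand-new string on every write."""

    def __init__(self):
        self.buf = ""

    def write(self, s):
        # Each call copies everything written so far into a new string.
        self.buf = self.buf + s

    def getvalue(self):
        return self.buf


class ChunkBuffer:
    """Hypothetical buffer that keeps the pieces and joins them once."""

    def __init__(self):
        self.chunks = []

    def write(self, s):
        # Appending to a list does not copy the earlier data.
        self.chunks.append(s)

    def getvalue(self):
        # The final string is created only here.
        return "".join(self.chunks)


# Both buffers produce the same result; only the amount of copying differs.
concat, chunked = ConcatBuffer(), ChunkBuffer()
for i in range(1000):
    line = "line %d\n" % i
    concat.write(line)
    chunked.write(line)
same_output = concat.getvalue() == chunked.getvalue()
```

With many small writes, as in a line-by-line encoder, the first approach
copies the accumulated data over and over, which is where I would expect
the extra time to go.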

:> > Furthermore, is there a particular reason for maintaining two
:> > parallel StringIO implementations?  If subclassing is the reason,
:> > I assume it would be trivial to rewrite StringIO to encapsulate
:> > cStringIO the same way that UserDict encapsulates dictionary
:> > objects.
:> 
:> That's not the reason; it's got more to do with not requiring a C
:> extension where plain Python code will do.  Also to have a reference
:> implementation.

: But then you have to maintain both, *and* you get much slower code.
: Is it worth it?

There are some interface differences between cStringIO.StringIO and
StringIO.StringIO which I think make StringIO more "usable" in some
cases.  Specifically, if you create an instance of cStringIO.StringIO
with initializing data, then the instance becomes unwritable:

  >>> import cStringIO, StringIO
  >>> f = StringIO.StringIO("Hi there")
  >>> f.seek(0, 2)
  >>> f.write("\n")
  >>> f.getvalue()
  'Hi there\012'
  >>> g = cStringIO.StringIO("Hi there")
  >>> g.seek(0, 2)
  >>> g.write("\n")
  Traceback (innermost last):
    File "<stdin>", line 1, in ?
  AttributeError: write
  >>>

This difference could surely break some existing code if StringIO were
simply replaced with cStringIO.  But I will mention that:
  f = StringIO.StringIO(str)
is equivalent to:
  f = cStringIO.StringIO()
  f.write(str)
  f.seek(0)
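
As a small helper, that workaround might look like the sketch below.  The
name make_writable_buffer is made up for illustration, and the last import
fallback (io.StringIO) is my assumption so the snippet also runs where
neither of the two modules discussed here is available.

```python
try:
    from cStringIO import StringIO          # C implementation, if present
except ImportError:
    try:
        from StringIO import StringIO       # pure-Python implementation
    except ImportError:
        from io import StringIO             # assumption: fallback for other Pythons


def make_writable_buffer(data):
    """Hypothetical helper: return a buffer preloaded with data that stays writable."""
    f = StringIO()      # created empty, so it is the writable kind
    f.write(data)       # load the initial contents
    f.seek(0)           # rewind, matching StringIO.StringIO(data)
    return f


g = make_writable_buffer("Hi there")
g.seek(0, 2)            # move to the end of the existing data
g.write("\n")           # succeeds, unlike cStringIO.StringIO("Hi there")
```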

I would want to see cStringIO mimic StringIO before the two are merged.
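
Until then, a thin wrapper in the UserDict style could paper over the
difference.  This is only a sketch under my own assumptions: the class names
WritableStringIO and CountingStringIO are hypothetical, and the io.StringIO
fallback is there just so the snippet runs where neither module exists.

```python
try:
    from cStringIO import StringIO as _Buffer       # C implementation, if present
except ImportError:
    try:
        from StringIO import StringIO as _Buffer    # pure-Python implementation
    except ImportError:
        from io import StringIO as _Buffer          # assumption: fallback


class WritableStringIO:
    """Hypothetical wrapper: subclassable, and writable even with initial data."""

    def __init__(self, data=None):
        self._buf = _Buffer()           # always start from the writable kind
        if data:
            self._buf.write(data)
            self._buf.seek(0)           # same starting position as StringIO(data)

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself;
        # everything else (read, write, seek, getvalue, ...) is delegated.
        return getattr(self._buf, name)


class CountingStringIO(WritableStringIO):
    """Example subclass: counts how many times write() is called."""

    def __init__(self, data=None):
        WritableStringIO.__init__(self, data)
        self.writes = 0

    def write(self, s):
        self.writes = self.writes + 1
        return self._buf.write(s)
```

The delegation through __getattr__ costs an extra Python-level lookup per
call, so this would not be as fast as using the C object directly; it only
shows that subclassing and a fast buffer need not be mutually exclusive.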

And yes, Guido, I have noticed some speed differences at times.

  -Arcege