[Python-3000] Poll: Lazy Unicode Strings For Py3k

Josiah Carlson jcarlson at uci.edu
Thu Feb 1 01:18:17 CET 2007


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> 
> On 1/31/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > On 1/31/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > > > Do you remember my "string view" post from last September/October or so?
> > > > It implemented almost all of the string API exactly as the string API
> > > > did, except that rather than returning strings, it returned views.
> 
> > > So there would be places where you couldn't safely use it, even though
> > > it had all the required functionality.
> 
> > Almost certainly, but the point is that you could get back to what you
> > wanted via str(obj), unicode(obj), etc., which would incur (in the worst
> > case) the overhead you saved before, or raise a MemoryError exception
> > (unless its linux, in which case you will likely segfault).
> 
> > > How would you feel if it also
> 
> > > (1)  Claimed to be a subclass of str (though it might not actually
> > > inherit anything)
> > > (2)  Implemented the rest of the methods by delegation.  (Call str on
> > > itself, switch its "real" object to the new string, and delegate to
> > > that.)
> 
> > I'm not terribly concerned about the implementation details of an object
> > I don't need to use.  As long as it works, it is fine.  I am concerned
> > about the implementation details of objects I will use.
> 
> The reason to ask for these is that then you could use it anywhere a
> str could be used (unless they explicitly did CheckExact).  Since the
> object itself would be in charge of creating a "normal" str when
> needed, you wouldn't have to do it pre-emptively before passing it to
> a library.

Certainly, but a well-behaved C extension or library should be using the
single segment buffer interface anyways, to allow for the passing of
array.array, numpy.array, buffer(...), etc.  For views, this works great,
you just return a pointer and length into the original object.  For
concatenation objects, one needs to render the string, but that is
expected.

For reference, I have written quite a bit of code that expects
string-like things to be passed to C extensions, and by using the buffer
interface, I have been able to use str, array and mmap instances
interchangeably depending on what I want as a result, or what I'm using
as temporary memory.


> > I believe the base type included with Python should allocate the memory
> > on creation.  Why?  Because the implementation is simple, and I believe
> > that a base type implementation should be as simple as possible.
> 
> Do you think it should happen to do that as an implementation detail,
> or that it should *promise* to do so, and bind all string-alikes to
> the same promise?

What subtypes do are their own business.  I only have an opinion for the
str and unicode types allocating memory on creation.  Aside from being
simpler, it keeps the base types from delaying a MemoryError in low
memory conditions.  An implementation of string views (or concatenations)
that is a subclass of the string type, which defers creation until
necessary (or never), is perfectly reasonable. Whether it is in the
stdlib or 3rd party, I don't care.  (I swear, this is at least the
second or third time I've said this)


 - Josiah



More information about the Python-3000 mailing list