[Python-3000] Poll: Lazy Unicode Strings For Py3k
Josiah Carlson
jcarlson at uci.edu
Thu Feb 1 01:18:17 CET 2007
"Jim Jewett" <jimjjewett at gmail.com> wrote:
>
> On 1/31/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > On 1/31/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> > > > Do you remember my "string view" post from last September/October or so?
> > > > It implemented almost all of the string API exactly as the string API
> > > > did, except that rather than returning strings, it returned views.
>
> > > So there would be places where you couldn't safely use it, even though
> > > it had all the required functionality.
>
> > Almost certainly, but the point is that you could get back to what you
> > wanted via str(obj), unicode(obj), etc., which would incur (in the worst
> > case) the overhead you saved before, or raise a MemoryError exception
> > (unless its linux, in which case you will likely segfault).
>
> > > How would you feel if it also
>
> > > (1) Claimed to be a subclass of str (though it might not actually
> > > inherit anything)
> > > (2) Implemented the rest of the methods by delegation. (Call str on
> > > itself, switch its "real" object to the new string, and delegate to
> > > that.)
>
> > I'm not terribly concerned about the implementation details of an object
> > I don't need to use. As long as it works, it is fine. I am concerned
> > about the implementation details of objects I will use.
>
> The reason to ask for these is that then you could use it anywhere a
> str could be used (unless they explicitly did CheckExact). Since the
> object itself would be in charge of creating a "normal" str when
> needed, you wouldn't have to do it pre-emptively before passing it to
> a library.
Certainly, but a well-behaved C extension or library should be using the
single segment buffer interface anyways, to allow for the passing of
array.array, numpy.array, buffer(...), etc. For views, this works great,
you just return a pointer and length into the original object. For
concatenation objects, one needs to render the string, but that is
expected.
For reference, I have written quite a bit of code that expects
string-like things to be passed to C extensions, and by using the buffer
interface, I have been able to use str, array and mmap instances
interchangeably depending on what I want as a result, or what I'm using
as temporary memory.
> > I believe the base type included with Python should allocate the memory
> > on creation. Why? Because the implementation is simple, and I believe
> > that a base type implementation should be as simple as possible.
>
> Do you think it should happen to do that as an implementation detail,
> or that it should *promise* to do so, and bind all string-alikes to
> the same promise?
What subtypes do are their own business. I only have an opinion for the
str and unicode types allocating memory on creation. Aside from being
simpler, it keeps the base types from delaying a MemoryError in low
memory conditions. An implementation of string views (or concatenations)
that is a subclass of the string type, which defers creation until
necessary (or never), is perfectly reasonable. Whether it is in the
stdlib or 3rd party, I don't care. (I swear, this is at least the
second or third time I've said this)
- Josiah
More information about the Python-3000
mailing list