Mailman 3 What's a PyStructSequence ? - Python-Dev

What's a PyStructSequence ?

M.-A. Lemburg

Nov. 26, 2001

8:08 p.m.

A bug report on SF made me aware of an apparently new type in Python called PyStructSequence. There are no docs on the type (at least not in the usual places).

Is it official yet ?

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/


Martin v. Loewis

November 2001

10:31 p.m.

...

A bug report on SF made me aware of an apparently new type in Python called PyStructSequence. There are no docs on the type (at least not in the usual places).

Is it official yet ?

It will ship as part of Python 2.2, if that is what you are asking. os.stat is documented to return one of these (if you read it carefully). Regards, Martin

M.-A. Lemburg

10:26 a.m.

"Martin v. Loewis" wrote:

...

...
A bug report on SF made me aware of an apparently new type in Python called PyStructSequence. There are no docs on the type (at least not in the usual places).

Is it official yet ?

It will ship as part of Python 2.2, if that is what you are asking. os.stat is documented to return one of these (if you read it carefully).

Wouldn't it make sense to expose this object in Python, e.g. by constructing it from a dictionary of string mappings ?

(The type constructor is not made available in bltinmodule.c.)

Martin v. Loewis

8:05 p.m.

...

Wouldn't it make sense to expose this object in Python, e.g. by constructing it from a dictionary of string mappings ?

(The type constructor is not made available in bltinmodule.c.)

No. AFAIR, this idea was explicitly rejected at the time the patch was designed (see the comments on the stat patch for the exact history). The rationale was that it is easy enough to create a class that doubles as tuple in Python yourself (perhaps through inheritance from tuple), so there would be no need to expose this type. Regards, Martin

M.-A. Lemburg

8:28 p.m.

"Martin v. Loewis" wrote:

...

...
Wouldn't it make sense to expose this object in Python, e.g. by constructing it from a dictionary of string mappings ?

(The type constructor is not made available in bltinmodule.c.)

No. AFAIR, this idea was explicitly rejected at the time the patch was designed (see the comments on the stat patch for the exact history).

The rationale was that it is easy enough to create a class that doubles as tuple in Python yourself (perhaps through inheritance from tuple), so there would be no need to expose this type.

Hmm, isn't the trick with this type that you can access the various elements as attributes *and* using index notation ?

Also, why should we hide something useful from the Python programmer if it's there anyway ? (One thing I've always wondered about is why Python doesn't expose Py_True and Py_False through the builtin module...)

I guess, I'll add a constructor to mx.Tools, my repository for missing builtins ;-)

Michel Pelletier

8:38 p.m.

"M.-A. Lemburg" wrote:

...

Also, why should we hide something useful from the Python programmer if it's there anyway ? (One thing I've always wondered about is why Python doesn't expose Py_True and Py_False through the builtin module...)

I have no idea why they are not exposed, of course, but my guess would be that, because there is no boolean type, there is no need for them. I myself have never needed a boolean type; "zero" or "empty" have always worked for me as a boolean false.

What are they used at the C level for?

...

I guess, I'll add a constructor to mx.Tools, my repository for missing builtins ;-)

Does mx.Tools offer a boolean type? -Michel

M.-A. Lemburg

8:54 p.m.

Michel Pelletier wrote:

...

"M.-A. Lemburg" wrote:

...
Also, why should we hide something useful from the Python programmer if it's there anyway ? (One thing I've always wondered about is why Python doesn't expose Py_True and Py_False through the builtin module...)

I have no idea why they are not exposed, of course, but my guess would be that, because there is no boolean type, there is no need for them. I myself have never needed a boolean type; "zero" or "empty" have always worked for me as a boolean false.

What are they used at the C level for?

All simple compares return either Py_True or Py_False (e.g. 1==1 returns a reference to Py_True).

...

...
I guess, I'll add a constructor to mx.Tools, my repository for missing builtins ;-)

Does mx.Tools offer a boolean type?

No, but I'm thinking about adding a Boolean number type to mxNumber. I'll also need some form of a binary type to make the set complete for XML-RPC. Currently, I can work around this by using True and False (which mx.Tools adds) and using buffer objects as wrappers to mean "this is a binary type".

barry＠zope.com

8:54 p.m.

>>>>> "MP" == Michel Pelletier <michel@zope.com> writes:

    MP> I have no idea why they are not exposed, of course, but my
    MP> guess would be that, because there is no boolean type, there
    MP> is no need for them. I myself have never needed a boolean
    MP> type; "zero" or "empty" have always worked for me as a
    MP> boolean false.

There have been many implementations of a boolean type; I know 'cause I've done at least 3. For grins I might even try a fourth for Python 2.2. But you're right, in practice there isn't much of a need.

-Barry

Skip Montanaro

8:57 p.m.

    Michel> I have no idea why they are not exposed, of course, but my
    Michel> guess would be that, because there is no boolean type, there
    Michel> is no need for them. I myself have never needed a boolean
    Michel> type; "zero" or "empty" have always worked for me as a
    Michel> boolean false.

XML-RPC defines a Boolean data type, so xmlrpclib defines a Boolean class with True and False instances. Programmers wanting to send boolean values must pass one of them (and expect to receive them when data arrives). Given that Py_True and Py_False are sitting there just below the surface, it seems a (small) shame that /F had to do that, more so now that xmlrpclib is part of the core distribution.

Interestingly (or oddly, not sure which) enough, Lib/test/test_iter.py also defines a small Boolean class.

-- 
Skip Montanaro (skip@pobox.com - http://www.mojam.com/)
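For readers on a current Python: the language has had a built-in bool type since 2.3, and the standard library's XML-RPC module (now xmlrpc.client) marshals it directly, so a dedicated wrapper class should no longer be needed. A minimal sketch, assuming Python 3 and using only dumps() and loads():

```python
import xmlrpc.client

# Marshal a one-element parameter tuple holding a built-in bool.
payload = xmlrpc.client.dumps((True,))

# Round-trip it back; loads() returns (params, methodname).
params, method = xmlrpc.client.loads(payload)

print("<boolean>1</boolean>" in payload)
print(params, method)
```

The value comes back as an ordinary bool, not as an instance of a module-specific class.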

Martin v. Loewis

9:21 p.m.

...

What are they used at the C level for?

To return them to Python. You write

    Py_INCREF(Py_True);
    return Py_True;

or

    result = c ? Py_True : Py_False;
    Py_INCREF(result);
    return result;

just as you return None. That saves at least one function call.

Py_True, of course, *is* 1. There is no proper boolean type.

Regards,
Martin

Mark Hammond

2:20 a.m.

...

Py_True, of course, *is* 1. There is no proper boolean type.

Seeing you added emphasis on the *is*, I assume you meant *is* :)

    >>> true=(1==1)
    >>> true is 1
    0

Py_True == 1, but is *not* 1 Pedantically, Mark.

Martin v. Loewis

6:22 a.m.

...

...
Py_True, of course, *is* 1. There is no proper boolean type.

Seeing you added emphasis on the *is*, I assume you meant *is* :)

Indeed, that's what I really meant.

...

Py_True == 1, but is *not* 1

Thanks for clarifying it; that was surprising since I assumed 1 was a proper singleton under all circumstances. Regards, Martin

M.-A. Lemburg

9:15 a.m.

Mark Hammond wrote:

...

...
Py_True, of course, *is* 1. There is no proper boolean type.

Seeing you added emphasis on the *is*, I assume you meant *is* :)

    >>> true=(1==1)
    >>> true is 1
    0

Py_True == 1, but is *not* 1

Py_True is a singleton and all 1 integers in Python are cached and shared, so we end up having exactly two different objects for the number 1 in Python: Py_True and 1 (or 3-2 or 4/4 or ...).
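The equality-versus-identity point can still be checked today, with one difference: since Python 2.3 comparisons return the bool singletons rather than a plain integer. A small sketch, assuming a current CPython (the sharing of small integers is an implementation detail, not a language guarantee):

```python
true = (1 == 1)
one = 3 - 2
also_one = 4 // 4

print(true == one)            # equal in value
print(true is one)            # but a different object
print(one is also_one)        # small ints are shared in CPython
print(isinstance(true, int))  # bool is a subclass of int
```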

Martin v. Loewis

9:09 p.m.

...

Hmm, isn't the trick with this type that you can access the various elements as attributes *and* using index notation ?

Indeed. In Py 2.2, you can do that two ways:

A)
    indexmap = {'st_dev':0, 'st_ino':1}  # etc

    class StatResult(tuple):
        def __getattr__(self, name):
            return self[indexmap[name]]

B)
    fields = ['st_dev', 'st_ino']  # etc

    class StatResult(UserList.UserList):
        def __init__(self, dev, ino):
            self.st_dev = dev
            self.st_ino = ino

        def __getattr__(self, name):
            if name == "data":
                return [getattr(self, fname) for fname in fields]
            raise AttributeError, name

        def __setattr__(self, name, value):
            if name == "data":
                raise AttributeError, "data is read-only"
            self.__dict__[name] = value
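Approach A carries over to current Python almost unchanged. The sketch below is a hypothetical Python 3 port, not code from the patch; the one addition is translating an unknown name into AttributeError so that hasattr() and getattr() with a default behave as expected. The field values are made up for illustration.

```python
# Hypothetical name-to-index table; a real stat result has more fields.
INDEX = {"st_dev": 0, "st_ino": 1}

class StatResult(tuple):
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[INDEX[name]]
        except KeyError:
            raise AttributeError(name) from None

r = StatResult((2049, 131))
print(r.st_dev, r[0], len(r))
print(hasattr(r, "st_size"))
```

Because the mapping lives outside the instance, the object itself stays a plain tuple with no per-instance storage added.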

...

Also, why should we hide something useful from the Python programmer if it's there anyway ?

Because it has unknown limitations (at least, they are unknown to me at the moment; I probably could report them if I searched long enough).

Regards,
Martin

nickm＠alum.mit.edu

1:27 a.m.

Marc-Andre Lemburg wrote:

...

"Martin v. Loewis" wrote:

...
...
...
A bug report on SF made me aware of an apparently new type in Python called PyStructSequence. There are no docs on the type (at least not in the usual places).

Is it official yet ? It will ship as part of Python 2.2, if that is what you are asking. os.stat is documented to return one of these (if you read it carefully).

Wouldn't it make sense to expose this object in Python, e.g. by constructing it from a dictionary of string mappings ?

(The type constructor is not made available in bltinmodule.c.)

Hi, all. I'm not subscribed to python-dev, but I'm the author of the original patch, and I thought I should comment.

If you look closely, you'll find that PyStructSequence is not a type itself, but rather a tool used to construct new tuple/struct hybrid types, like the results of os.stat and time.gmtime.

In reality, PyStructSequence is only a set of common implementation logic for a set of other types, including os.stat_result, os.statvfs_result, and time.struct_time.

There are a few possible objections to this scheme:

Q. Nick, why didn't you make it a _real_ metatype?

A. Writing a real metatype in C was beyond my Python abilities. If anybody wants to, I'd be thrilled.

Q. Okay, so why not expose it to python?

A. Because it isn't a real metatype. Every type that uses it _is_ exposed to python.

I think this isn't a problem, because it's way easier to re-implement PyStructSequence in Python than it is to turn it into a metatype.

Q. If it's so easy to write in Python, why not do it that way?

A. Because there are fringe benefits to doing it in C.

For example, on some Unix machines (such as Linux), struct stat has some attributes that don't correspond to any elements of the old tuple view. To expose (say) st_rdev to Python code at all, you'd need to change the result of posix.stat... but this would break code that used posix.stat directly.

But because PyStructSequence is written in C, posix.stat can return an augmented tuple/struct hybrid that (when accessed as a tuple) still has 10 elements, but also exposes st_rdev as an attribute.

HTH,
-- 
Nick Mathewson <nickm@alum.mit.edu>
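The tuple/struct split described here can be observed on a current interpreter. The following is a small sketch, assuming Python 3; it deliberately avoids st_rdev, which is platform-dependent, and just compares the sequence length with the number of st_* attributes.

```python
import os
import time

st = os.stat(os.curdir)
t = time.gmtime(0)

# As a sequence, the stat result still has the classic 10 items.
print(len(st), st[0] == st.st_mode)

# The struct side exposes more st_* names than the tuple has slots.
attrs = [name for name in dir(st) if name.startswith("st_")]
print(len(attrs) > len(st))

# time.struct_time follows the same pattern with 9 sequence items.
print(len(t), t[0] == t.tm_year == 1970)
```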

M.-A. Lemburg

9:14 a.m.

nickm@alum.mit.edu wrote:

...

Marc-Andre Lemburg wrote:

...
"Martin v. Loewis" wrote:

...
...
...
A bug report on SF made me aware of an apparently new type in Python called PyStructSequence. There are no docs on the type (at least not in the usual places).

Is it official yet ? It will ship as part of Python 2.2, if that is what you are asking. os.stat is documented to return one of these (if you read it carefully).

Wouldn't it make sense to expose this object in Python, e.g. by constructing it from a dictionary of string mappings ?

(The type constructor is not made available in bltinmodule.c.)

Hi, all. I'm not subscribed to python-dev, but I'm the author of the original patch, and I thought I should comment.

If you look closely, you'll find that PyStructSequence is not a type itself, but rather a tool used to construct new tuple/struct hybrid types, like the results of os.stat and time.gmtime.

Indeed -- and I have a question there: why did you have to implement this as meta-type ? It seems that the same thing could have been done using a normal type which then gets initialized after instantiation. Or was it to get used to the new type system :-?

...

In reality, PyStructSequence is only a set of common implementation logic for a set of other types, including os.stat_result, os.statvfs_result, and time.struct_time.

There are a few possible objections to this scheme:

Q. Nick, why didn't you make it a _real_ metatype?

A. Writing a real metatype in C was beyond my Python abilities. If anybody wants to, I'd be thrilled.

Q. Okay, so why not expose it to python?

A. Because it isn't a real metatype. Every type that uses it _is_ exposed to python.

I think this isn't a problem, because it's way easier to re-implement PyStructSequence in Python than it is to turn it into a metatype.

Q. If it's so easy to write in Python, why not do it that way?

A. Because there are fringe benefits to doing it in C.

For example, on some Unix machines (such as Linux), struct stat has some attributes that don't correspond to any elements of the old tuple view. To expose (say) st_rdev to Python code at all, you'd need to change the result of posix.stat... but this would break code that used posix.stat directly.

But because PyStructSequence is written in C, posix.stat can return an augmented tuple/struct hybrid that (when accessed as a tuple) still has 10 elements, but also exposes st_rdev as an attribute.

This would have also been possible using the "normal" approach; I'm still not convinced -- it looks too much like an academic experiment ;-).

Martin v. Loewis

10:22 a.m.

...

Indeed -- and I have a question there: why did you have to implement this as meta-type ?

MAL, please do read the patch discussion first, at http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=462296 Regards, Martin

M.-A. Lemburg

11:56 a.m.

"Martin v. Loewis" wrote:

...

...
Indeed -- and I have a question there: why did you have to implement this as meta-type ?

MAL, please do read the patch discussion first, at

http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=462296

The discussion on SF doesn't really answer my question. What Nick did is fascinating: he reused the type object implementation to mimic a sequence ! That's cool, but looks like an awfully tricky way of doing something straight forward such as sub-classing the tuple type to extend it with an additional dictionary. So the question remains: why did Nick *have* to implement this as meta-type ?

BTW, Nick's stuff is a nice intro to the more complicated capabilities of the new type system and I think people can learn a lot from it. I certainly want to learn from it, because I haven't really internalized the details behind all the new C type slots yet.

Thomas Heller

5:23 p.m.

From: "M.-A. Lemburg" <mal@lemburg.com>

...

"Martin v. Loewis" wrote:

...
...
Indeed -- and I have a question there: why did you have to implement this as meta-type ?

MAL, please do read the patch discussion first, at

http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=462296

The discussion on SF doesn't really answer my question. What Nick did is fascinating: he reused the type object implementation to mimic a sequence ! That's cool, but looks like an awfully tricky way of doing something straight forward such as sub-classing the tuple type to extend it with an additional dictionary. So the question remains: why did Nick *have* to implement this as meta-type ?

As I understand it, PyStructSequence_InitType() is a factory for types aka metaclasses. The above statement 'he reused the type object to mimic a sequence' is IMO wrong.

*My* question would be (maybe this is what MAL meant): why aren't the created types subclasses of PyTupleType?

Thomas

Martin v. Loewis

10:22 p.m.

...

*My* question would be (maybe this is what MAL meant): why aren't the created types subclasses of PyTupleType?

How would you inherit from PyTupleType in C? E.g. by doing

    struct PyStatType {
        PyTupleType foo;
        PyObject *additional_field;
    };

That reads well, but it is wrong: PyTupleType ends in a flexible array member, so it cannot be used as the member of another struct.

Regards,
Martin
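The layout constraint is visible from Python as well (in the CPython headers the instance struct is named PyTupleObject). The sketch below assumes CPython: a tuple stores its items inline after the object header, so the type reports a non-zero per-item size, while a list points to a separately allocated array and reports zero.

```python
import struct
import sys

pointer_size = struct.calcsize("P")

# tuple: variable-length object, one pointer per item stored inline
print(tuple.__itemsize__ == pointer_size)

# list: fixed-size object that points to a separate item array
print(list.__itemsize__)

# each extra tuple element grows the object itself
growth = sys.getsizeof((1, 2, 3)) - sys.getsizeof(())
print(growth == 3 * tuple.__itemsize__)
```

Anything appended after such an object in C would land where the inline items are expected, which is the problem with the struct above.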

Martin v. Loewis

9:51 p.m.

...

The discussion on SF doesn't really answer my question. What Nick did is fascinating: he reused the type object implementation to mimic a sequence ! That's cool, but looks like an awfully tricky way of doing something straight forward such as sub-classing the tuple type to extend it with an additional dictionary. So the question remains: why did Nick *have* to implement this as meta-type ?

For one thing, you'll see from the discussion that extending the tuple type with an additional dict is non-trivial: You cannot define a C data type that does this. You'll also see that there was a version that did it, and that it was rejected precisely because of this problem. Regards, Martin

M.-A. Lemburg

9:31 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

"Martin v. Loewis" wrote:

...

...
The discussion on SF doesn't really answer my question. What Nick did is fascinating: he reused the type object implementation to mimic a sequence ! That's cool, but looks like an awfully tricky way of doing something straight forward such as sub-classing the tuple type to extend it with an additional dictionary. So the question remains: why did Nick *have* to implement this as meta-type ?

For one thing, you'll see from the discussion that extending the tuple type with an additional dict is non-trivial: You cannot define a C data type that does this. You'll also see that there was a version that did it, and that it was rejected precisely because of this problem.

Ok, now we're getting somewhere... you're saying that Python types using the PyObject struct itself to store variable size data cannot be subclassed in C. Even though it's not trivial as you indicate, this should well be possible by appending new object data to the end of the allocated data field -- not very elegant, but still a way to cope with the problem.

However, Nick's approach of creating a new type from a template by using the fact that the list of known name-to-index mappings is not going to change for instances of the type makes things a lot cleaner, since now the mapping can live in the type definition rather than the instance (which may very well be a varying length PyObject type).

Hmm, this makes me wonder: perhaps we should start thinking about phasing out varying length PyObjects in the interpreter... esp. the inability to subclass strings looks like a bummer for future extensions of this particular type. Unicode doesn't have this problem, BTW.

Or we need to come up with a fairly nice way of making subclassing varying length types a lot easier, e.g. by adding a special pointer ob_ext to PyObject_VAR_HEAD which then allows declaring type extensions in a malloced buffer.

Thoughts ?

Tim Peters

11:35 p.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

[M.-A. Lemburg]

...

... Hmm, this makes me wonder: perhaps we should start thinking about phasing out varying length PyObjects in the interpreter...

No chance, IMO: the memory savings is too great.

...

esp. the inability to subclass strings looks like a bummer for future extensions of this particular type. Unicode doesn't have this problem, BTW.

OTOH, I know someone at Zope Corp who could testify with force about the memory burden of switching to Unicode strings -- if you've got gobs of 'em, it's much worse than a factor of 2 blowup. Moving to obmalloc.c should help that a lot (two malloc overheads per Unicode string, and obmalloc overheads are much lower).

...

Or we need to come up with a fairly nice way of making subclassing varying length types a lot easier, e.g. by adding a special pointer ob_ext to PyObject_VAR_HEAD which then allows declaring type extensions in a malloced buffer.

Thoughts ?

Not a one <wink>.

M.-A. Lemburg

December 2001

11:27 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

Tim Peters wrote:

...

[M.-A. Lemburg]

...
... Hmm, this makes me wonder: perhaps we should start thinking about phasing out varying length PyObjects in the interpreter...

No chance, IMO: the memory savings is too great.

...
esp. the inability to subclass strings looks like a bummer for future extensions of this particular type. Unicode doesn't have this problem, BTW.

OTOH, I know someone at Zope Corp who could testify with force about the memory burden of switching to Unicode strings -- if you've got gobs of 'em, it's much worse than a factor of 2 blowup. Moving to obmalloc.c should help that a lot (two malloc overheads per Unicode string, and obmalloc overheads are much lower).

Perhaps we should start thinking about optimizing at least one of the Unicode malloc away in Python 2.3: the Unicode object itself can well be kept on a free list with the Py_UNICODE buffer freed and set to NULL. Doesn't save any memory but would improve the performance.

BTW, is the memory burden really such a big argument these days ? I can imagine this being an argument on resource restrained platforms such as Palms (thanks to Martin, the Palm guys can now switch off Unicode completely), but hardly on gigabyte machines with access to 100s of GBs swap-space :-)

...

...
Or we need to come up with a fairly nice way of making subclassing varying length types a lot easier, e.g. by adding a special pointer ob_ext to PyObject_VAR_HEAD which then allows declaring type extensions in a malloced buffer.

Thoughts ?

Not a one <wink>.

Any idea how we could make subclassing these types less hackish, then ?

Martin v. Loewis

1:53 p.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

...

BTW, is the memory burden really such a big argument these days ?

You should consider that malloc overhead is often 16 bytes per object. Given that PyUnicodeObject is 24 bytes in 2.2, system malloc will allocate 48 bytes per Unicode object on modern architectures. I would think 100% overhead *is* a big argument.

If you relate this to the actual data, it gets worse: A Unicode string of length 1 would still require 32 bytes on an allocator that aligns to 16. Therefore, to store 2 bytes of real data, you need 80 bytes of memory.

I don't know how much overhead pymalloc adds, though; I believe it is significantly less expensive.

Regards,
Martin
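The arithmetic behind the 48, 32 and 80 byte figures can be reproduced with a simple model. This is a sketch under stated assumptions, not a description of any particular malloc: each allocation is taken to cost the requested size plus a 16-byte header, rounded up to a multiple of 16. The function name allocated is made up for illustration.

```python
def allocated(request, header=16, align=16):
    """Bytes consumed by one allocation under the assumed model."""
    total = request + header
    return -(-total // align) * align  # round up to a multiple of align

object_bytes = allocated(24)  # the 24-byte PyUnicodeObject struct
buffer_bytes = allocated(2)   # one 2-byte character of real data

print(object_bytes, buffer_bytes, object_bytes + buffer_bytes)
```

Under this model the object costs 48 bytes and the character buffer 32, giving the 80 bytes quoted for 2 bytes of payload.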

M.-A. Lemburg

2:49 p.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

"Martin v. Loewis" wrote:

...

...
BTW, is the memory burden really such a big argument these days ?

You should consider that malloc overhead is often 16 bytes per object. Given that PyUnicodeObject is 24 bytes in 2.2, system malloc will allocate 48 bytes per Unicode object on modern architectures. I would think 100% overhead *is* a big argument.

If you relate this to the actual data, it gets worse: A Unicode string of length 1 would still require 32 bytes on an allocator that aligns to 16. Therefore, to store 2 bytes of real data, you need 80 bytes of memory.

I don't know how much overhead pymalloc adds, though; I believe it is significantly less expensive.

Oh, I wasn't arguing against using pymalloc :-)

I only think that nowadays, the trade-off "more flexibility vs. memory consumption" leans more towards the former than the latter. Not only because memory is cheap, but also because more flexibility tends to result in use of better algorithms and these can lead to better overall performance and lower total memory use.

Mark Hammond

12:52 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

[MAL]

...

I only think that nowadays, the trade-off "more flexibility vs. memory consumption" leans more towards the former than the latter. Not only because memory is cheap, but also because more flexibility tends to result in use of better algorithms and these can lead to better overall performance and lower total memory use.

This tends to sound great - until you have a large app that is consuming too many MBs :) Mozilla's use of strings is fascinating. They have a very complex C++ string API - all aimed at being as space-optimized as possible. They go to huge lengths to avoid string copies and extra allocation at the expense of an API nobody understands :) The point is that for fundamental types (including Unicode), medium size apps may use millions of these objects. Everything we can do to optimize their speed and size is beneficial. Mark.

Tim Peters

10:36 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

[Martin v. Loewis]

...

You should consider that malloc overhead is often 16 bytes per object. Given that PyUnicodeObject is 24 bytes in 2.2, system malloc will allocate 48 bytes per Unicode object on modern architectures. I would think 100% overhead *is* a big argument.

If you relate this to the actual data, it gets worse: A Unicode string of length 1 would still require 32 bytes on an allocator that aligns to 16.

I think that's unusual -- 8-byte alignment is most common even on 64-bit boxes. KSR had to align to 128-byte boundaries, but there's a reason KSR died <wink -- alas, gross alignment requirements wasn't really it>.

...

Therefore, to store 2 bytes of real data, you need 80 bytes of memory.

I don't know how much overhead pymalloc adds, though; I believe it is significantly less expensive.

Yes, much less. On a 32-bit box, using the current #define's, and ignoring "arena" overhead(*), pymalloc uses 32 bytes per 4096 for bookkeeping. The remaining 4064 bytes can all be user data, but subject to 8-byte alignment, and to how many whole chunks of a given size can fit in 4064 bytes. For the PyUnicodeObject example, 8-byte alignment is without cost, and for the rest

    >>> divmod(4096 - 32, 24)
    (169, 8)

That is, pymalloc can get 169 PyUnicodeObjects out of a 4KB "page", with 32 bytes for bookkeeping, and 8 bytes left over (unused) -- total overhead is about 1%.

(*) pymalloc gets "arenas" from the system malloc, where an arena is currently 256KB. Up to (worst case) 4KB of that is lost to align the start address to a 4KB boundary, and there's also the comparatively trivial (compared to 4KB!) overhead from the system malloc.
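The "about 1%" figure follows directly from the numbers in the post. A short sketch that redoes the calculation, using only the page size, bookkeeping size and object size quoted above (the constant names are chosen here for readability), and ignoring arena overhead as the post does:

```python
PAGE = 4096         # bytes per pymalloc page, as quoted
BOOKKEEPING = 32    # per-page bookkeeping bytes, as quoted
OBJECT_SIZE = 24    # sizeof(PyUnicodeObject) in 2.2, as quoted

count, leftover = divmod(PAGE - BOOKKEEPING, OBJECT_SIZE)
overhead = (BOOKKEEPING + leftover) / PAGE

print(count, leftover, f"{overhead:.2%}")
```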

Tim Peters

8:01 p.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

[MAL]

...

Perhaps we should start thinking about optimizing at least one of the Unicode malloc away in Python 2.3: the Unicode object itself can well be kept on a free list with the Py_UNICODE buffer freed and set to NULL. Doesn't save any memory but would improve the performance.

pymalloc would improve both, so I'd much rather pursue that in 2.3 than yet another type-specific free list.

...

BTW, is the memory burden really such a big argument these days ? I can imagine this being an argument on resource restrained platforms such as Palms (thanks to Martin, the Palm guys can now switch off Unicode completely), but hardly on gigabyte machines with access to 100s of GBs swap-space :-)

Most of us have machines between those extremes, and the difference between 100MB and 300MB can be make-or-break. I don't see that any "flexibility" is gained merely by wasting memory <wink>.

...

... Any idea how we could make subclassing these types less hackish, then ?

Subclassing seems easy enough to me from the Python level; I don't have time to revisit C-level subclassing here (and I don't know that it's hackish there either, but do think it's in need of docs).

M.-A. Lemburg

9:09 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

Tim Peters wrote:

...

[MAL]

...
Perhaps we should start thinking about optimizing at least one of the Unicode malloc away in Python 2.3: the Unicode object itself can well be kept on a free list with the Py_UNICODE buffer freed and set to NULL. Doesn't save any memory but would improve the performance.

pymalloc would improve both, so I'd much rather pursue that in 2.3 than yet another type-specific free list.

Have you tried disabling all free lists and using pymalloc instead ? If this pays off, I agree, we should get rid of all of them.

...

...
BTW, is the memory burden really such a big argument these days ? I can imagine this being an argument on resource restrained platforms such as Palms (thanks to Martin, the Palm guys can now switch off Unicode completely), but hardly on gigabyte machines with access to 100s of GBs swap-space :-)

Most of us have machines between those extremes, and the difference between 100MB and 300MB can be make-or-break. I don't see that any "flexibility" is gained merely by wasting memory <wink>.

I would consider moving from 8-bit strings to Unicode an improvement in flexibility. It also results in better algorithms (== simpler, less error-prone, etc. in this case).

As I said, it's a tradeoff flexibility vs. memory consumption. Whether it pays off depends on your application environment. It certainly does for companies like Micron and pays off stock-wise for a lot of people... uhm, getting off-topic here :-)

...

...
... Any idea how we could make subclassing these types less hackish, then ?

Subclassing seems easy enough to me from the Python level; I don't have time to revisit C-level subclassing here (and I don't know that it's hackish there either, but do think it's in need of docs).

It is beautifully easy for non-varying-length types. Unfortunately, it happens that some of the basic types which would be attractive for subclassing are varying length types (such as string and tuples).

In my case, I'm looking for a way to subclass strings, but I haven't yet found an elegant solution to the problem of adding extra data to the instances.

Tim Peters

5:22 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

[MAL]

...

Have you tried disabling all free lists and using pymalloc instead ?

No, but I haven't tried anything -- it's a 2.3 issue.

...

If this pays off, I agree, we should get rid of all of them.

When I do try it <wink>, it will be slower but more memory-efficient (both data and code) than the type-specific free lists, and faster and much more memory-efficient than using malloc().
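For readers unfamiliar with the term, the following is a toy Python sketch of what a type-specific free list does: released objects are kept in a per-type cache and handed back on the next request instead of going through the general allocator. The names `Node` and `NodeFreeList` are hypothetical, and this is not CPython's C implementation; it only illustrates the speed-versus-retained-memory tradeoff being discussed.

```python
class Node:
    """Stand-in for a small, frequently allocated object (hypothetical)."""
    __slots__ = ("value",)

    def __init__(self, value=None):
        self.value = value


class NodeFreeList:
    """Toy type-specific free list: recycle released Node objects."""

    def __init__(self, max_cached=4):
        self._cache = []              # released objects waiting for reuse
        self._max_cached = max_cached

    def acquire(self, value):
        if self._cache:
            node = self._cache.pop()  # fast path: reuse, no new allocation
            node.value = value
            return node
        return Node(value)            # slow path: ask the general allocator

    def release(self, node):
        node.value = None
        if len(self._cache) < self._max_cached:
            # The object stays alive in the cache, so its memory remains
            # reserved for Node and cannot be used by other types.
            self._cache.append(node)

    def cached(self):
        return len(self._cache)


pool = NodeFreeList()
a = pool.acquire(1)
pool.release(a)
b = pool.acquire(2)
reused = b is a   # the released object was handed out again
```

The cache makes reuse cheap, but whatever sits in it is memory that only this one type can use, which is the cost a shared small-object allocator avoids.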

...

... I would consider moving from 8-bit strings to Unicode an improvement in flexibility.

Sure. Moving from one malloc to two is orthogonal.

...

It also results in better algorithms (== simpler, less error-prone, etc. in this case).

Unclear what "it" means; assuming it means using two mallocs instead of one for a Unicode string object, the 8-bit string algorithms haven't been a particular source of bugs. People mutating strings at the C level has been.

...

As I said, it's a tradeoff flexibility vs. memory consumption. Whether it pays off depends on your application environment. It certainly does for companies like Micron and pays off stock-wise for a lot of people... uhm, getting off-topic here :-)

I've got nothing against Unicode (apart from the larger issue that the whole world would obviously be a lot better off if they switched to American English <wink>).

...

...
Subclassing seems easy enough to me from the Python level; I don't have time to revisit C-level subclassing here (and I don't know that it's hackish there either, but do think it's in need of docs).

...

It is beautifully easy for non-varying-length types. Unfortunately, it happens that some of the basic types which would be attractive for subclassing are varying-length types (such as strings and tuples).

It's easy to subclass from str and tuple in Python -- even to add your own instance data.

...

In my case, I'm looking for a way to subclass strings, but I haven't yet found an elegant solution to the problem of adding extra data to the instances.

It's easy if you're willing to use a dict:

    class STR(str):
        def __new__(cls, strguts, n):
            self = str.__new__(cls, strguts)
            self.n = n
            return self

    s = STR('abc', 42)
    print s    # abc
    print s.n  # 42

__slots__ doesn't work here, though.

I admit I personally don't see much attraction to subclassing from str and tuple, apart from adding additional *methods*. I suppose someone could code up two-malloc variants ...
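The same dict-based trick and the `__slots__` limitation can be sketched for `tuple`, the other varying-length type mentioned in this thread. This is a minimal sketch assuming a current CPython 3 interpreter rather than the 2.2 release under discussion; the class names `Tagged` and `SlottedTuple` are hypothetical.

```python
class Tagged(tuple):
    """tuple subclass; instances get a __dict__, so extra data fits there."""

    def __new__(cls, items, n):
        # tuple is immutable, so the contents must be set in __new__.
        self = super().__new__(cls, items)
        self.n = n          # stored in the per-instance __dict__
        return self


t = Tagged((1, 2, 3), 42)

# Nonempty __slots__ would need fixed-offset storage after the
# variable-length item array, which CPython refuses to set up.
try:
    class SlottedTuple(tuple):
        __slots__ = ("n",)
    slots_error = None
except TypeError as exc:
    slots_error = str(exc)
```

Here `t` still behaves like the tuple `(1, 2, 3)` while carrying `t.n`, and the attempt to define `SlottedTuple` is rejected with a `TypeError` at class-creation time.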

M.-A. Lemburg

10:57 a.m.

New subject: Subclassing varying length types (What's a PyStructSequence ?)

Tim Peters wrote:

...

[MAL]

...
Have you tried disabling all free lists and using pymalloc instead ?

No, but I haven't tried anything -- it's a 2.3 issue.

...
If this pays off, I agree, we should get rid of all of them.

When I do try it <wink>, it will be slower but more memory-efficient (both data and code) than the type-specific free lists, and faster and much more memory-efficient than using malloc().

Well, let's do some pybench runs next year and see what the results look like.

...

...
... I would consider moving from 8-bit strings to Unicode an improvement in flexibility.

Sure. Moving from one malloc to two is orthogonal.

You know that I know that you knew what I was talking about :-)

...

...
It also results in better algorithms (== simpler, less error-prone, etc. in this case).

Unclear what "it" means; assuming it means using two mallocs instead of one for a Unicode string object, the 8-bit string algorithms haven't been a particular source of bugs. People mutating strings at the C level has been.

If you ever try to support more than ASCII text in a user program, you'll find that having to deal with only one encoding saves you a whole lot of trouble. I won't even start talking about variable-length encodings, encodings with built-in shift state and other goodies which are a complete nightmare to handle (e.g. various character properties such as title case, upper/lower mappings, different ways to encode a single character, collation,...).
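Two of the complications listed above can be made concrete with a short sketch, assuming Python 3 and its standard `unicodedata` module (neither is part of the original discussion): UTF-8 is a variable-length encoding, and a single character such as "é" can be represented by more than one code point sequence.

```python
import unicodedata

# Variable-length encoding: one code point, one to four bytes in UTF-8.
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]   # a, e-acute, euro sign, G clef
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]   # [1, 2, 3, 4]

# Different ways to encode a single character: precomposed vs. combining.
composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT
same_raw = composed == decomposed                                  # False
same_nfc = unicodedata.normalize("NFC", decomposed) == composed    # True
```

A naive comparison treats the two spellings as different strings; only after normalization do they compare equal.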

...

...
As I said, it's a tradeoff flexibility vs. memory consumption. Whether it pays off depends on your application environment. It certainly does for companies like Micron and pays off stock-wise for a lot of people... uhm, getting off-topic here :-)

I've got nothing against Unicode (apart from the larger issue that the whole world would obviously be a lot better off if they switched to American English <wink>).

I suppose Mandarin would reach a larger share of the world population ... and they *need* Unicode :-)

...

...
...
Subclassing seems easy enough to me from the Python level; I don't have time to revisit C-level subclassing here (and I don't know that it's hackish there either, but do think it's in need of docs).

...
It is beautifully easy for non-varying-length types. Unfortunately, it happens that some of the basic types which would be attractive for subclassing are varying-length types (such as strings and tuples).

It's easy to subclass from str and tuple in Python -- even to add your own instance data.

Yeah, but that's not the point. I want to do this in C...

...

...
In my case, I'm looking for a way to subclass strings, but I haven't yet found an elegant solution to the problem of adding extra data to the instances.

It's easy if you're willing to use a dict:

I would be willing to use a dictionary. It's only that even the dictionary trick doesn't seem to work at the C level.

...

    class STR(str):
        def __new__(cls, strguts, n):
            self = str.__new__(cls, strguts)
            self.n = n
            return self

    s = STR('abc', 42)
    print s    # abc
    print s.n  # 42

__slots__ doesn't work here, though.

I admit I personally don't see much attraction to subclassing from str and tuple, apart from adding additional *methods*. I suppose someone could code up two-malloc variants ...

If you look at mxURL you'll find an extension type which tries to play nice with strings -- it would be a good candidate for a string subtype. A string type which carries along an encoding attribute would be another good candidate for a string subtype. Both need extra attributes/data fields. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Consulting & Company: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
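The second candidate, a string that carries an encoding attribute, might look like this at the Python level. This is only a sketch assuming Python 3; the class name `EncodedStr` and its `to_bytes` helper are hypothetical, and it does not address the C-level subclassing problem raised in this message.

```python
class EncodedStr(str):
    """str subclass that remembers which encoding it should be written in."""

    def __new__(cls, text, encoding="utf-8"):
        self = super().__new__(cls, text)
        self.encoding = encoding      # extra data lives in the instance __dict__
        return self

    def to_bytes(self):
        # Encode using the encoding carried along with the text.
        return self.encode(self.encoding)


s = EncodedStr("caf\u00e9", encoding="latin-1")
raw = s.to_bytes()          # b"caf\xe9"
upper = s.upper()           # inherited method: returns a plain str
```

One design caveat: inherited `str` methods such as `upper()` return plain `str` objects, so the `encoding` attribute is lost unless each such method is wrapped to rebuild an `EncodedStr`.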


participants (9)

barry＠zope.com
M.-A. Lemburg
Mark Hammond
Martin v. Loewis
Michel Pelletier
nickm＠alum.mit.edu
Skip Montanaro
Thomas Heller
Tim Peters
