pickle currently puts tuples into the memo on pickling, but only ever uses the position field ([0]), never the object itself ([1]). I understand that the reference to the object is needed to keep it alive while pickling. Unfortunately, this means one needs to allocate 36 bytes for the tuple.

I think this memory consumption could be reduced by saving the objects in a list, and only saving the position in the memo dictionary. That would save roughly 32 bytes per memoized object, assuming there is no malloc overhead.

What do you think?

Regards, Martin

P.S. It would be even more efficient if there was an identity dictionary.
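Martin's scheme can be sketched as follows (hypothetical names; this is not the actual pickle code, just an illustration of the list-plus-position idea):

```python
class Memo:
    # Hypothetical sketch of the proposal: keep pickled objects alive
    # in a plain list, and let the memo dictionary map id(obj) to a
    # bare integer position instead of to a (position, obj) tuple.

    def __init__(self):
        self.positions = {}   # id(obj) -> position in self.keepalive
        self.keepalive = []   # holds references so the ids stay valid

    def memoize(self, obj):
        self.positions[id(obj)] = len(self.keepalive)
        self.keepalive.append(obj)

    def lookup(self, obj):
        # Returns the memo position, or None if obj was never memoized.
        return self.positions.get(id(obj))
```

The list entry costs one pointer per object instead of a whole tuple, which is where the claimed savings come from.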
pickle currently puts tuples into the memo on pickling, but only ever uses the position field ([0]), never the object itself ([1]).
I understand that the reference to the object is needed to keep it alive while pickling.
Unfortunately, this means one needs to allocate 36 bytes for the tuple.
I think this memory consumption could be reduced by saving the objects in a list, and only saving the position in the memo dictionary. That would save roughly 32 bytes per memoized object, assuming there is no malloc overhead.
What do you think?
Is it worth it? Have you made a patch? What use case are you thinking of?
Regards, Martin
P.S. It would be even more efficient if there was an identity dictionary.
Sorry, what's an identity dict? --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum <guido@python.org> writes:
Is it worth it?
If you believe that the problem is real, yes.
Have you made a patch?
Not yet, no. With Mark's objection, it is more difficult than I thought.
What use case are you thinking of?
People repeatedly complain that pickle consumes too much memory. The most recent instance was

http://groups.google.de/groups?selm=slrnal05v1.c05.Andreas.Leitgeb%40pc7499.gud.siemens.at&output=gplain

Earlier reports are

http://groups.google.de/groups?hl=de&lr=&ie=UTF-8&selm=mailman.1026940226.16076.python-list%40python.org
http://groups.google.de/groups?q=pickle+memory+group:comp.lang.python.*&hl=de&lr=&ie=UTF-8&selm=396B069A.9EBDD68B%40muc.das-werk.de&rnum=4
Sorry, what's an identity dict?
IdentityDictionary is the name of a Smalltalk class that uses identity instead of equality when comparing keys: http://minnow.cc.gatech.edu/squeak/1845

In Python, it would allow arbitrary objects as keys, and allow equal duplicates as different keys. For pickle, this would mean that we could save both the creation of the id() object (since the object itself is used as a key), and the creation of the tuple (since the value is only the position).

Regards, Martin
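A minimal sketch of such an identity dictionary in plain Python (hypothetical class, not in the stdlib; it trades the wrapper tuples back in, so it only illustrates the semantics, not the memory savings):

```python
class IdentityDict:
    # Keys are compared by identity (via id()), not by equality.
    # The key object itself is retained so its id() stays valid.

    def __init__(self):
        self._data = {}  # id(key) -> (key, value)

    def __setitem__(self, key, value):
        self._data[id(key)] = (key, value)

    def __getitem__(self, key):
        return self._data[id(key)][1]

    def __contains__(self, key):
        return id(key) in self._data
```

With this, two equal but distinct lists are different keys, which is exactly what the pickle memo wants.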
Martin v. Loewis wrote:
Guido van Rossum <guido@python.org> writes:
Is it worth it?
If you believe that the problem is real, yes.
I think that the tuple is not the problem here, it's the fact that so many objects are recorded in the memo to later rebuild recursive structures.

Now, I believe that recursive structures in pickles are not very common, so the memo is mostly useless in these cases.

Perhaps pickle could grow an option to assume that a data structure is non-recursive?! In that case, no data would be written to the memo (or only the id() mapped to 1 to double-check).

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
I think that the tuple is not the problem here, it's the fact that so many objects are recorded in the memo to later rebuild recursive structures.
Now, I believe that recursive structures in pickles are not very common, so the memo is mostly useless in these cases.
Use cPickle, it's much more frugal with the memo, and also has some options to control the memo (read the docs, I forget the details and am in a hurry).
Perhaps pickle could grow an option to assume that a data structure is non-recursive?! In that case, no data would be written to the memo (or only the id() mapped to 1 to double-check).
The memo is also for sharing. There's no recursion in this example, but the sharing may be important:

a = [1,2,3]
b = [a,a,a]

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
I think that the tuple is not the problem here, it's the fact that so many objects are recorded in the memo to later rebuild recursive structures.
Now, I believe that recursive structures in pickles are not very common, so the memo is mostly useless in these cases.
Use cPickle, it's much more frugal with the memo, and also has some options to control the memo (read the docs, I forget the details and am in a hurry).
Just to clarify: I don't have a problem with the memo in pickle at all :-) Martin brought up this issue.
Perhaps pickle could grow an option to assume that a data structure is non-recursive?! In that case, no data would be written to the memo (or only the id() mapped to 1 to double-check).
The memo is also for sharing. There's no recursion in this example, but the sharing may be important:
a = [1,2,3]
b = [a,a,a]
Right. I don't think these references are too common in pickles. Zope Corp should know much more about this, I guess, since ZODB is all about pickling.

-- Marc-Andre Lemburg
Perhaps pickle could grow an option to assume that a data structure is non-recursive?! In that case, no data would be written to the memo (or only the id() mapped to 1 to double-check).
The memo is also for sharing. There's no recursion in this example, but the sharing may be important:
a = [1,2,3]
b = [a,a,a]
Right. I don't think these references are too common in pickles.
I think they are.
Zope Corp should know much more about this, I guess, since ZODB is all about pickling.
Sharing object references is essential in Zope. But only to certain objects; sharing strings and numbers is not important, and I believe cPickle doesn't put those in the memo, while pickle.py puts essentially everything in the memo... --Guido van Rossum (home page: http://www.python.org/~guido/)
"M.-A. Lemburg" <mal@lemburg.com> writes:
Use cPickle, it's much more frugal with the memo, and also has some options to control the memo (read the docs, I forget the details and am in a hurry).
Just to clarify: I don't have a problem with the memo in pickle at all :-) Martin brought up this issue.
I don't have a problem with the memo, either. I have a problem with the tuples in the memo. Regards, Martin
"M.-A. Lemburg" <mal@lemburg.com> writes:
I think that the tuple is not the problem here, it's the fact that so many objects are recorded in the memo to later rebuild recursive structures.
It's not a matter of beliefs: each dictionary entry contributes 12 bytes. Each integer key contributes 12 bytes, each integer position contributes 12 bytes. Each tuple contributes 36 bytes. Assuming pymalloc and the integer allocator, this makes a total of 76 bytes per recorded object. The tuples contribute over 50% to that.
Perhaps pickle could grow an option to assume that a data structure is non-recursive?! In that case, no data would be written to the memo (or only the id() mapped to 1 to double-check).
That is already possible: You can pass a fake dictionary that records nothing. Regards, Martin
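The "fake dictionary that records nothing" Martin mentions could look like this (a sketch; in pickle.py one would have to get it installed as the Pickler's memo attribute, which only the pure-Python implementation would honor):

```python
class NullMemo(dict):
    # A memo that silently drops every write, so nothing is ever
    # recorded.  Caution: pickling a recursive structure with such a
    # memo would never terminate, since cycles go undetected.
    def __setitem__(self, key, value):
        pass  # discard the entry
```

This trades all memoization (cycle detection and sharing) for zero memo memory.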
[Martin v. Loewis]
It's not a matter of beliefs: each dictionary entry contributes 12 bytes. Each integer key contributes 12 bytes, each integer position contributes 12 bytes. Each tuple contributes 36 bytes.
I'm not in love with giant pickle memos myself, but to reduce expectations closer to reality, note that each dict entry consumes at least 18 bytes (we keep the load factor under 2/3, so there's at least one unused entry for every two real entries; it's an indirect overhead, but a real one).
I'm not in love with giant pickle memos myself, but to reduce expectations closer to reality, note that each dict entry consumes at least 18 bytes (we keep the load factor under 2/3, so there's at least one unused entry for every two real entries; it's an indirect overhead, but a real one).
Is there perhaps a more memory-efficient data structure that could be used here instead of a dict? A b-tree, perhaps, which with a suitable bucket size could use no more than about 8 bytes per entry -- 4 for the object reference and 4 for the integer index that it maps to.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+
[Greg Ewing]
Is there perhaps a more memory-efficient data structure that could be used here instead of a dict? A b-tree, perhaps, which with a suitable bucket size could use no more than about 8 bytes per entry -- 4 for the object reference and 4 for the integer index that it maps to.
The code to support BTrees would be quite a burden. Zope already has that, so it wouldn't be a new burden there, except to get away from comparing keys via __cmp__ we'd have to use IIBTrees, and those map 4-byte ints to 4-byte ints (i.e., they wouldn't work right for this purpose on a 64-bit box -- although Yet Another Flavor of BTree could be compiled that would). Judy tries look perfect for "this kind of thing" (fast, memory-efficient, and would likely get significant compression benefit from the fact that the high bits of user-space addresses tend to be the same): http://sf.net/projects/judy/
Tim Peters <tim.one@comcast.net>:
The code to support BTrees would be quite a burden.
It wouldn't be all that complicated, surely? And you'd only need about half of it, because you only need to be able to add keys, never delete them.
Is there any information available about this other than the 3 lines I managed to find amongst all the sourceforge crud?

Greg Ewing
The code to support BTrees would be quite a burden.
[Greg Ewing]
It wouldn't be all that complicated, surely? And you'd only need about half of it, because you only need to be able to add keys, never delete them.
Have you done a production-quality, portable B-Tree implementation? Could you prevail in arguing that memory reduction is more important than speed here? Etc. B-Trees entail a messy set of tradeoffs.
Is there any information available about this other than the 3 lines I managed to find amongst all the sourceforge crud?
It was a short-lived topic on Python-Dev about two weeks ago. Try, mmm, http://www.hp.com/go/judy/ for lots of info.
"M.-A. Lemburg" <mal@lemburg.com>:
Perhaps pickle could grow an option to assume that a data structure is non-recursive?
Then you'd probably want some means of detecting cycles, or you'd get infinite recursion when you got it wrong. That would mean keeping a stack of objects, I think -- probably less memory than keeping all of them at once.

But I think the idea of keeping the object references in a list is well worth trying first. 4 bytes per object instead of 36 sounds like a good improvement to me!

Greg Ewing
Perhaps pickle could grow an option to assume that a data structure is non-recursive?
Then you'd probably want some means of detecting cycles, or you'd get infinite recursion when you got it wrong. That would mean keeping a stack of objects, I think -- probably less memory than keeping all of them at once.
cPickle has an obscure option for this. You create a pickler object and set the attribute "fast" to True, I believe. It detects cycles by using a nesting counter (read the source to learn more).
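For the record, this switch survives in today's pickle module as the (now deprecated) Pickler.fast attribute; a quick sketch of its effect on sharing:

```python
import io
import pickle

buf = io.BytesIO()
p = pickle.Pickler(buf)
p.fast = True          # disables the memo entirely
shared = [1, 2, 3]
p.dump([shared, shared])

result = pickle.loads(buf.getvalue())
# Without the memo the shared list is simply written twice:
# equal values, but no longer the same object after unpickling.
assert result == [[1, 2, 3], [1, 2, 3]]
assert result[0] is not result[1]
```

With fast mode off (the default), both references would come back as the very same list object.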
But I think the idea of keeping the object references in a list is well worth trying first. 4 bytes per object instead of 36 sounds like a good improvement to me!
So maybe we need to create an identitydict... --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum <guido@python.org> writes:
But I think the idea of keeping the object references in a list is well worth trying first. 4 bytes per object instead of 36 sounds like a good improvement to me!
So maybe we need to create an identitydict...
In that case, the backwards compatibility problems are more serious. Of course, we could chose the type of dictionary based on whether this is a Pickler subclass or not (with some protocol to let the subclass make us aware that we should use the identitydict, anyway). This is, of course, a bigger change than I originally had in mind. Regards, Martin
Martin quoted a complaint about cPickle performance:

http://groups.google.de/groups?hl=en&lr=&ie=UTF-8&selm=mailman.1026940226.16076.python-list%40python.org

But if you read the full thread, it's clear that this complaint came about because the author wasn't using binary pickle mode. In binary mode his times became acceptable. I've run the test and I haven't seen abnormal memory behavior -- the process grows to 26 Mb just to create the test data, and then adds about 1 Mb during pickling. The loading almost doubles the process size, because another copy of the test data is read (the test data isn't thrown away).

The slowdown of text-mode pickle is due to the extremely expensive way of unpickling pickled strings in text mode: it invokes eval() (well, PyRun_String()) to parse the string literal! (After checking that there's really only a string literal there, to prevent trojan horses.)

So I'm not sure that the memo size is worth pursuing. I didn't look at the other complaints referenced by Martin, but I bet they're more of the same. What might be worth looking into:

(1) Make binary pickling the default (in cPickle as well as pickle). This would definitely give the most bang for the buck.

(2) Replace the PyRun_String() call in cPickle with something faster. Maybe the algorithm from parsestr() from compile.c can be exposed; although the error reporting must be done differently then.

--Guido van Rossum (home page: http://www.python.org/~guido/)
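In the modern pickle module, "text mode" corresponds to protocol 0 and the 2002-era "binary mode" to protocol 1; a small sketch of the size difference Guido is alluding to:

```python
import pickle

data = {i: "value %d" % i for i in range(100)}

text = pickle.dumps(data, protocol=0)    # old text-mode pickle
binary = pickle.dumps(data, protocol=1)  # "binary mode" in 2002 terms

# Both round-trip the data; the binary format is noticeably smaller.
assert pickle.loads(text) == data
assert pickle.loads(binary) == data
assert len(binary) < len(text)
```

Binary mode also avoids the string-literal parsing path that made text-mode unpickling so slow.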
[Guido]
... (2) Replace the PyRun_String() call in cPickle with something faster. Maybe the algorithm from parsestr() from compile.c can be exposed; although the error reporting must be done differently then.
Note that Martin has had a patch pending for this since, umm, January: http://www.python.org/sf/505705 Maybe he should review it himself <wink>.
One of the things I mentioned on the c.l.py thread is the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS calls around every fwrite() call. I looked into it, using Penrose's test case, and found that the locking alone added 25% overhead. I expect the layer or two of C function calls above fwrite() add overhead. I also expect that calling fwrite() repeatedly for very small strings is inefficient. If I were to suggest a cPickle project, it would be an efficient internal buffering scheme. Jeremy
One of the things I mentioned on the c.l.py thread is the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS calls around every fwrite() call. I looked into it, using Penrose's test case, and found that the locking alone added 25% overhead. I expect the layer or two of C function calls above fwrite() add overhead. I also expect that calling fwrite() repeatedly for very small strings is inefficient.
If I were to suggest a cPickle project, it would be an efficient internal buffering scheme.
Who's got time? It's fast enough for Zope. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido:
(1) Make binary pickling the default (in cPickle as well as pickle).
That would break a lot of programs that use pickle without opening the file in binary mode.
(2) Replace the PyRun_String() call in cPickle with something faster. Maybe the algorithm from parsestr() from compile.c can be exposed;
I like that idea -- it could be useful for other things, too. I could use something like that in Pyrex, for example.

Greg Ewing
(1) Make binary pickling the default (in cPickle as well as pickle).
That would break a lot of programs that use pickle without opening the file in binary mode.
Really? That's unfortunate. The example thread on Google shows that binary pickling isn't as widely known as it should be. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido:
Me:
That would break a lot of programs that use pickle without opening the file in binary mode.
Really? That's unfortunate.
Unfortunate, yes, and true, as far as I can see. It bit me recently -- I decided to change something to use binary pickling, and forgot to change the way I was opening the file.

If you must do this, I suppose you could start issuing warnings if pickling is done without specifying a mode, and then change the default later.

If there's a way of making non-binary unpickling dramatically faster, though -- even if only with cPickle -- that would be a *big* win, and shouldn't cause any compatibility problems.

Greg Ewing
[Greg E]
That would break a lot of programs that use pickle without opening the file in binary mode.
[Guido]
Really? That's unfortunate.
[Greg E]
Unfortunate, yes, and true, as far as I can see. It bit me recently -- I decided to change something to use binary pickling, and forgot to change the way I was opening the file.
If you must do this, I suppose you could start issuing warnings if pickling is done without specifying a mode, and then change the default later.
I thought of that. But probably not worth the upheaval.
If there's a way of making non-binary unpickling dramatically faster, though -- even if only with cPickle -- that would be a *big* win, and shouldn't cause any compatibility problems.
python/sf/505705 is close to acceptance, and reduced one particularly slow unpickling example 6-fold in speed. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Greg Ewing]
That would break a lot of programs that use pickle without opening the file in binary mode.
[Guido]
Really? That's unfortunate.
[Greg]
Unfortunate, yes, and true, as far as I can see. It bit me recently -- I decided to change something to use binary pickling, and forgot to change the way I was opening the file.
Greg, do you use Windows? If not, I suspect you're mis-remembering what you did -- "binary mode" versus "text mode" doesn't make any difference on Linux.
Greg, do you use Windows? If not, I suspect you're mis-remembering what you did -- "binary mode" versus "text mode" doesn't make any difference on Linux.
No, but I use a Mac (with Classic OS) where it certainly does make a difference!

Greg Ewing
Martin v. Loewis wrote:
pickle currently puts tuples into the memo on pickling, but only ever uses the position field ([0]), never the object itself ([1]).
I understand that the reference to the object is needed to keep it alive while pickling.
Unfortunately, this means one needs to allocate 36 bytes for the tuple.
I think this memory consumption could be reduced by saving the objects in a list, and only saving the position in the memo dictionary. That would save roughly 32 bytes per memoized object, assuming there is no malloc overhead.
While that may save you some bytes, wouldn't it break pickle subclasses using the memo as well?

-- Marc-Andre Lemburg
Martin v. Loewis wrote:
"M.-A. Lemburg" <mal@lemburg.com> writes:
While that may save you some bytes, wouldn't it break pickle subclasses using the memo as well ?
Yes. Are there such things?
Sure. I use pickle subclasses with hooks for various special object types a lot in my applications... would be nice if I could start subclassing cPickle sometime in the future :-)

-- Marc-Andre Lemburg
Martin v. Loewis wrote:
"M.-A. Lemburg" <mal@lemburg.com> writes:
Sure. I use pickle subclasses with hooks for various special object types a lot in my applications...
Can you provide the source of one such subclass?
No, they are closed-source. But the idea should be obvious: I want to pickle the various mx types faster than by relying on the reduce mechanism.

-- Marc-Andre Lemburg
"M.-A. Lemburg" <mal@lemburg.com> writes:
Sure. I use pickle subclasses with hooks for various special object types a lot in my applications... Can you provide the source of one such subclass?
No, they are closed-source. But the idea should be obvious: I want to pickle the various mx types faster than by relying on the reduce mechanism.
Ok, I think I could make this change without breaking your code: subclasses won't read the memo, they will only write to it -- Pickler.save is the only place that ever reads the memo. So subclasses could safely put tuples into the dictionary; the base class would then look for either tuples or numbers.

Regards, Martin
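The compatibility scheme could look like this (hypothetical helper, not the actual pickle code):

```python
def memo_position(memo, obj):
    # The base class writes bare integer positions into the memo, but
    # tolerates (position, obj) tuples written by legacy subclasses.
    entry = memo.get(id(obj))
    if entry is None:
        return None              # obj was never memoized
    if isinstance(entry, tuple):
        return entry[0]          # legacy subclass wrote (position, obj)
    return entry                 # base class wrote a bare position
```

Either memo layout then resolves to the same position, so old subclasses keep working.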
I think this memory consumption could be reduced by saving the objects in a list, and only saving the position in the memo dictionary.
Do you need the list at all? Won't the object be kept alive by the fact that it's a key in the dictionary?

Greg Ewing
The object's id() (address) is the key. Else only hashable objects could be pickled.
Hmmm, I see. It occurs to me that what you really want here is a special kind of dictionary that uses "is" instead of "==" to compare key values. Or is that what was meant by an "identity dictionary"?

Greg Ewing
[Tim]
The object's id() (address) is the key. Else only hashable objects could be pickled.
[Greg Ewing]
Hmmm, I see.
It occurs to me that what you really want here is a special kind of dictionary that uses "is" instead of "==" to compare key values.
Possibly. The *effect* of that could be gotten via a wrapper object, like

class KeyViaId:
    def __init__(self, obj):
        self.obj = obj
    def __hash__(self):
        return hash(id(self.obj))
    def __eq__(self, other):
        return self.obj is other.obj

but if Martin is worried about two-tuple sizes, he's not going to fall in love with that.
Or is that what was meant by an "identity dictionary"?
Guido asked, but if Martin answered that question I haven't seen it yet.
The *effect* of that could be gotten via a wrapper object ... but if Martin is worried about two-tuple sizes, he's not going to fall in love with that.
Indeed, which is why I suggested it.

I was wondering whether it would be worth putting one of these in the standard library. I could have used one in Plex, when I wanted to map a dictionary to something else by identity, and I wanted to do it as fast as possible.

Greg Ewing
Greg Ewing <greg@cosc.canterbury.ac.nz> writes:
It occurs to me that what you really want here is a special kind of dictionary that uses "is" instead of "==" to compare key values.
Or is that what was meant by an "identity dictionary"?
Yes; that would be a dictionary that uses identity instead of equality. Regards, Martin
participants (7)
- Greg Ewing
- Guido van Rossum
- Jeremy Hylton
- M.-A. Lemburg
- martin@v.loewis.de
- Tim Peters
- Tim Peters