On Fri, Mar 6, 2009 at 10:01 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
Antoine Pitrou wrote:
On Friday, 06 March 2009 at 13:44 +0100, Michael Haggerty wrote:
Antoine Pitrou wrote:
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.
If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084
If I understand correctly, this would not help:
- on writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.
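To make the memoizing behavior concrete, here is a minimal sketch of a single Pickler writing many records. It rests on a few assumptions that are not from the thread: `fractions.Fraction` stands in for an application class, protocol 2 is chosen arbitrarily, and the helper names are hypothetical. Because the memo is never cleared, the class reference (which carries the module and class name) should be written only once for the whole stream.

```python
import io
import pickle
from fractions import Fraction

# Assumption: Fraction stands in for an application class.
records = [Fraction(n, 7) for n in range(1, 101)]


def dump_with_one_pickler(objs, protocol=2):
    """Write every object through one Pickler whose memo is never cleared."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol)
    for obj in objs:
        pickler.dump(obj)  # the memo persists from one dump() to the next
    return buf.getvalue()


def load_with_one_unpickler(data, count):
    """Read the records back, in the same order, through one Unpickler."""
    unpickler = pickle.Unpickler(io.BytesIO(data))
    return [unpickler.load() for _ in range(count)]


stream = dump_with_one_pickler(records)

# Count how often the class name was written into the stream.
name_count = stream.count(b"Fraction")
restored = load_with_one_unpickler(stream, len(records))
print(name_count, restored == records)
```

The records must be read back in the order they were written, since later records refer to memo entries created by earlier ones.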
Then why did you say that "the space for these strings constituted a large fraction of the database size", if they are already shared? Are your objects so tiny that even the space taken by the pointer to the type name grows the size of the database significantly?
Sorry for the confusion. I thought you were suggesting the change to help the more typical use case, when a single Pickler is used for a lot of data. That use case will not be helped by interning the class __name__ and __module__ strings, for the reasons given in my previous email.
In my case, the strings are shared via the Pickler memoizing mechanism because I pre-populate the memo (using the API that the OP proposes to remove), so your suggestion won't help my current code, either. It was before I implemented the pre-populated memoizer that "the space for these strings constituted a large fraction of the database size". But your suggestion wouldn't help that case, either.
Here are the main use cases:
1. Saving and loading one large record. A class's __name__ string is the same string object every time it is retrieved, so it only needs to be stored once and the Pickler memo mechanism works. Similarly for the class's __module__ string.
2. Saving and loading lots of records sequentially. Provided a single Pickler is used for all records and its memo is never cleared, this works just as well as case 1.
3. Saving and loading lots of records in random order, as for example in the shelve module. It is not possible to reuse a Pickler with retained memo, because the Unpickler might not encounter objects in the right order. There are two subcases:
a. Use a clean Pickler/Unpickler object for each record. In this case the __name__ and __module__ of a class will appear once in each record in which the class appears. (This is the case regardless of whether they are interned.) On reading, the __name__ and __module__ are only used to look up the class, so interning them won't help. It is thus impossible to avoid wasting a lot of space in the database.
b. Use a Pickler/Unpickler with a preset memo for each record (my unorthodox technique). In this case the class __name__ and __module__ will be memoized in the shared memo, so in other records only their ID needs to be stored (in fact, only the ID of the class object itself). This allows the database to be smaller, but does not have any effect on the RAM usage of the loaded objects.
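Cases 3a and 3b can be compared with a short sketch. This is not the cvs2svn code; `PrimedSerializer` is a hypothetical helper, `fractions.Fraction` stands in for an application class, and protocol 2 is an assumption (its memo opcodes carry explicit indexes). It also assumes CPython's C-accelerated pickle, where assigning one pickler's `memo` to another copies the table, as far as I know; with the pure-Python fallback the memo dict would be shared rather than copied.

```python
import io
import pickle
from fractions import Fraction

PROTOCOL = 2  # assumption: protocol 2, with explicit memo indexes


class PrimedSerializer:
    """Hypothetical helper: each record starts from a memo preset by `primer`."""

    def __init__(self, primer):
        buf = io.BytesIO()
        # Pickle the primer once so its objects (here: classes) enter the memo.
        self._pickler = pickle.Pickler(buf, PROTOCOL)
        self._pickler.dump(primer)
        # Unpickle the same bytes to build the matching Unpickler memo.
        self._unpickler = pickle.Unpickler(io.BytesIO(buf.getvalue()))
        self._unpickler.load()

    def dumps(self, obj):
        buf = io.BytesIO()
        pickler = pickle.Pickler(buf, PROTOCOL)
        pickler.memo = self._pickler.memo  # start from the preset memo
        pickler.dump(obj)
        return buf.getvalue()

    def loads(self, data):
        unpickler = pickle.Unpickler(io.BytesIO(data))
        unpickler.memo = self._unpickler.memo  # same preset, reader side
        return unpickler.load()


records = [Fraction(n, 7) for n in range(1, 101)]

# Case 3a: a clean Pickler per record repeats the class reference every time.
plain = [pickle.dumps(r, PROTOCOL) for r in records]

# Case 3b: the class is already in the preset memo, so only its ID is written.
serializer = PrimedSerializer((Fraction,))
primed = [serializer.dumps(r) for r in records]

# Each record depends only on the preset memo, so any read order works.
restored = [serializer.loads(blob) for blob in reversed(primed)]

plain_size = sum(len(b) for b in plain)
primed_size = sum(len(b) for b in primed)
print(plain_size, primed_size)
```

A record written this way can only be read by an Unpickler primed with the same memo, so the primer effectively becomes part of the database format.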
If the OP's proposal is accepted, 3b will become impossible. The technique seems not to be well known, so maybe it doesn't need to be supported. It would mean some extra work for me on the cvs2svn project though :-(
I talked it over with Guido: support for the memo attribute will have to stay. I shall add it back to my patches.
Collin