save tuple of simple data types to disk (low memory foot print)
Gelonida N
gelonida at gmail.com
Sat Oct 29 12:44:14 EDT 2011
On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
> On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
>
>> Hi,
>>
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
>
> What do you mean "keys tuples"?
Corrected phrase:
I would like to save many dicts with a fixed (and known) number of keys
in a memory-efficient manner (no random, only sequential access is
required) to a file (which can later be sent over a slow, expensive
network to other machines).
Example:
Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
'message1', 'message2'
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a maximum length of 2000 characters, but can
often be very short
'message2' is the same as message1
so a typical dict will look like
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
  'message1' : '', 'message2' : '=' * 1999 }
>
> What do you call "many"? Fifty? A thousand? A thousand million? How many
> items in each dict? Ten? A million?
File size can be between 100 kB and over 100 MB per file. Files will be
accumulated over months.
I just want to use the smallest possible space, as the data is collected
over a certain time (days / months) and will be transferred over a UMTS /
EDGE / GSM network, where transferring even quite small data sets already
takes several minutes.
I want to reduce the transfer time when requesting files on demand (and
the amount of data, in order not to exceed the monthly quota).
>> As the keys are the same for each entry I considered converting them to
>> tuples.
>
> I don't even understand what that means. You're going to convert the keys
> to tuples? What will that accomplish?
Corrected phrase:
As the keys are the same for each entry, I considered converting them
(the before-mentioned dicts) to tuples.
So the dict
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
  'message1' : '', 'message2' : '=' * 1999 }
would become
( 12, 3.14159, 42, '', '=' * 1999 )
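
In other words, something like this little sketch (the helper names are
made up, the key order is just fixed once and known to both sides):

# Fixed key order, known to both the writer and the reader of the file.
KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

def dict_to_tuple(record):
    """Flatten a record dict into a tuple, relying on the fixed key order."""
    return tuple(record[key] for key in KEYS)

def tuple_to_dict(values):
    """Rebuild the original dict from a flat tuple."""
    return dict(zip(KEYS, values))

record = {'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
          'message1': '', 'message2': '=' * 1999}
assert tuple_to_dict(dict_to_tuple(record)) == record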
>
>
>> The tuples contain only strings, ints (long ints) and floats (double)
>> and the data types for each position within the tuple are fixed.
>>
>> The fastest and simplest way is to pickle the data or to use json. Both
>> formats however are not that optimal.
>
> How big are your JSON files? 10KB? 10MB? 10GB?
>
> Have you tried using pickle's space-efficient binary format instead of
> text format? Try using protocol=2 when you call pickle.Pickler.
No. This is probably already a big step forward.
As I know the data type of each element in the tuple, I would however
prefer a representation which does not store the data types for each
tuple over and over again (as they are the same for each dict / tuple).
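
Something along these lines is what I have in mind (just a rough sketch,
not code I already have; the format codes, the 2-byte length prefix and
the helper names are only assumptions):

import struct

# Fixed binary layout for one record: timestamp, floatvalue and intvalue
# in a fixed-size header, followed by two length-prefixed UTF-8 strings.
_HEADER = struct.Struct('<qdq')   # int64 timestamp, double, int64 intvalue

def pack_record(timestamp, floatvalue, intvalue, message1, message2):
    parts = [_HEADER.pack(timestamp, floatvalue, intvalue)]
    for text in (message1, message2):
        data = text.encode('utf-8')
        parts.append(struct.pack('<H', len(data)))   # 2-byte length prefix
        parts.append(data)
    return b''.join(parts)

def unpack_record(stream):
    timestamp, floatvalue, intvalue = _HEADER.unpack(stream.read(_HEADER.size))
    messages = []
    for _ in range(2):
        (length,) = struct.unpack('<H', stream.read(2))
        messages.append(stream.read(length).decode('utf-8'))
    return (timestamp, floatvalue, intvalue, messages[0], messages[1])

Records packed this way can be appended to a file opened in binary mode
and read back sequentially with unpack_record, without any per-record
type information being stored.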
>
> Or have you considered simply compressing the files?
Compression makes sense, but the initial file format should already be
rather 'compact'.
>
>> I could store ints and floats with pack. As strings have variable length
>> I'm not sure how to save them efficiently (except adding a length first
>> and then the string.
>
> This isn't 1980 and you're very unlikely to be using 720KB floppies.
> Premature optimization is the root of all evil. Keep in mind that when
> you save a file to disk, even if it contains only a single bit of data,
> the actual space used will be an entire block, which on modern hard
> drives is very likely to be 4KB. Trying to compress files smaller than a
> single block doesn't actually save you any space.
>
>
>> Is there already some 'standard' way or standard library to store such
>> data efficiently?
>
> Yes. Pickle and JSON plus zip or gzip.
>
pickle protocol 2 + gzip of the tuple derived from the dict might be
good enough for a start.
I have to create a little more typical data in order to see what
percentage of my payload would consist of repeating the data types for
each tuple.
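
For a start it would be something like this sketch (the file name and
the sample data are of course made up):

import gzip
import pickle

records = [
    (12, 3.14159, 42, '', '=' * 1999),
    (13, 2.71828, 7, 'short message', ''),
]

# Write each tuple as its own pickle so the file can later be read back
# sequentially, one record at a time.
with gzip.open('records.pkl.gz', 'wb') as out:
    for record in records:
        pickle.dump(record, out, protocol=2)

# Read the records back one by one until the end of the file.
loaded = []
with gzip.open('records.pkl.gz', 'rb') as inp:
    while True:
        try:
            loaded.append(pickle.load(inp))
        except EOFError:
            break

assert loaded == records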