save tuple of simple data types to disk (low memory foot print)
Gelonida N
gelonida at gmail.com
Sat Oct 29 12:44:14 EDT 2011
On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
> On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
>
>> Hi,
>>
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
>
> What do you mean "keys tuples"?
Corrected phrase:
I would like to save many dicts with a fixed (and known) number of keys
in a memory-efficient manner (no random, only sequential access is
required) to a file (which can later be sent over a slow, expensive
network to other machines).
Example:
Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
'message1', 'message2'
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a maximum length of 2000 characters, but can
often be very short
'message2' is the same as message1
so a typical dict will look like
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
  'message1' : '', 'message2' : '=' * 1999 }
>
> What do you call "many"? Fifty? A thousand? A thousand million? How many
> items in each dict? Ten? A million?
File size can be between 100 kB and over 100 MB per file. Files will be
accumulated over months.
I just want to use the smallest possible space, as the data is collected
over a certain time (days / months) and will be transferred over a UMTS /
EDGE / GSM network, where transferring even quite small data sets already
takes several minutes.
I want to reduce the transfer time when requesting files on demand (and
the amount of data, in order not to exceed the monthly quota).
>> As the keys are the same for each entry I considered converting them to
>> tuples.
>
> I don't even understand what that means. You're going to convert the keys
> to tuples? What will that accomplish?
Corrected phrase:
As the keys are the same for each entry, I considered converting them
(the before-mentioned dicts) to tuples.
So the dict
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
  'message1' : '', 'message2' : '=' * 1999 }
would become
( 12, 3.14159, 42, '', '=' * 1999 )
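
In other words, something like this little sketch (the helper names are
made up, the key order is just fixed once and known to both sides):

# Fixed key order, known to both the writer and the reader of the file.
KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

def dict_to_tuple(record):
    """Flatten a record dict into a tuple, relying on the fixed key order."""
    return tuple(record[key] for key in KEYS)

def tuple_to_dict(values):
    """Rebuild the original dict from a flat tuple."""
    return dict(zip(KEYS, values))

record = {'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
          'message1': '', 'message2': '=' * 1999}
assert tuple_to_dict(dict_to_tuple(record)) == record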
>
>
>> The tuples contain only strings, ints (long ints) and floats (double)
>> and the data types for each position within the tuple are fixed.
>>
>> The fastest and simplest way is to pickle the data or to use json. Both
>> formats however are not that optimal.
>
> How big are your JSON files? 10KB? 10MB? 10GB?
>
> Have you tried using pickle's space-efficient binary format instead of
> text format? Try using protocol=2 when you call pickle.Pickler.
No. This is probably already a big step forward.
As I know the data type of each element in the tuple, I would however
prefer a representation which does not store the data types for each
tuple over and over again (as they are the same for each dict / tuple).
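
Something along these lines is what I have in mind (just a rough sketch,
not code I already have; the format codes, the 2-byte length prefix and
the helper names are only assumptions):

import struct

# Fixed binary layout for one record: timestamp, floatvalue and intvalue
# in a fixed-size header, followed by two length-prefixed UTF-8 strings.
_HEADER = struct.Struct('<qdq')   # int64 timestamp, double, int64 intvalue

def pack_record(timestamp, floatvalue, intvalue, message1, message2):
    parts = [_HEADER.pack(timestamp, floatvalue, intvalue)]
    for text in (message1, message2):
        data = text.encode('utf-8')
        parts.append(struct.pack('<H', len(data)))   # 2-byte length prefix
        parts.append(data)
    return b''.join(parts)

def unpack_record(stream):
    timestamp, floatvalue, intvalue = _HEADER.unpack(stream.read(_HEADER.size))
    messages = []
    for _ in range(2):
        (length,) = struct.unpack('<H', stream.read(2))
        messages.append(stream.read(length).decode('utf-8'))
    return (timestamp, floatvalue, intvalue, messages[0], messages[1])

Records packed this way can be appended to a file opened in binary mode
and read back sequentially with unpack_record, without any per-record
type information being stored.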
>
> Or have you considered simply compressing the files?
Compression makes sense, but the initial file format should already be
rather 'compact'.
>
>> I could store ints and floats with pack. As strings have variable length
>> I'm not sure how to save them efficiently (except adding a length first
>> and then the string.
>
> This isn't 1980 and you're very unlikely to be using 720KB floppies.
> Premature optimization is the root of all evil. Keep in mind that when
> you save a file to disk, even if it contains only a single bit of data,
> the actual space used will be an entire block, which on modern hard
> drives is very likely to be 4KB. Trying to compress files smaller than a
> single block doesn't actually save you any space.
>
>
>> Is there already some 'standard' way or standard library to store such
>> data efficiently?
>
> Yes. Pickle and JSON plus zip or gzip.
>
pickle protocol 2 + gzip of the tuple derived from the dict might be
good enough for a start.
I have to create a little more typical data in order to see what
percentage of my payload would consist of repeating the data types for
each tuple.
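
For a start it would be something like this sketch (the file name and
the sample data are of course made up):

import gzip
import pickle

records = [
    (12, 3.14159, 42, '', '=' * 1999),
    (13, 2.71828, 7, 'short message', ''),
]

# Write each tuple as its own pickle so the file can later be read back
# sequentially, one record at a time.
with gzip.open('records.pkl.gz', 'wb') as out:
    for record in records:
        pickle.dump(record, out, protocol=2)

# Read the records back one by one until the end of the file.
loaded = []
with gzip.open('records.pkl.gz', 'rb') as inp:
    while True:
        try:
            loaded.append(pickle.load(inp))
        except EOFError:
            break

assert loaded == records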