save tuple of simple data types to disk (low memory foot print)
Tim Chase
python.list at tim.thechases.com
Sat Oct 29 13:47:42 EDT 2011
On 10/29/11 11:44, Gelonida N wrote:
> I would like to save many dicts with a fixed (and known) amount of keys
> in a memory efficient manner (no random, but only sequential access is
> required) to a file (which can later be sent over a slow expensive
> network to other machines)
>
> Example:
> Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
> 'message1', 'message2'
> 'timestamp' is an integer
> 'floatvalue' is a float
> 'intvalue' an int
> 'message1' is a string with a length of max 2000 characters, but can
> often be very short
> 'message2' the same as message1
>
> so a typical dict will look like
> { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
> 'message1' : '', 'message2' : '=' * 1999 }
>
>
>>
>> What do you call "many"? Fifty? A thousand? A thousand million? How many
>> items in each dict? Ten? A million?
>
> File size can be between 100kb and over 100Mb per file. Files will be
> accumulated over months.
If Steven's pickle-protocol2 solution doesn't quite do what you
need, you can do something like the code below. Gzip is pretty
good at addressing...
>> Or have you considered simply compressing the files?
> Compression makes sense but the initial file format should be
> already rather 'compact'
...by compressing out a lot of the duplicate aspects, which also
mitigates some of the verbosity of CSV.
The code below serializes the data to a gzipped CSV file and then
deserializes it. Just point it at the appropriate data source and
adjust the column names and data types.
-tkc
from gzip import GzipFile
from csv import writer, reader

data = [ # use your real data here
    {
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': 'hello world',
    'message2': '=' * 1999,
    },
    ] * 10000

f = GzipFile('data.gz', 'wb')
try:
    w = writer(f)
    for row in data:
        w.writerow([
            row[name] for name in (
            # use your real col-names here
            'timestamp',
            'floatvalue',
            'intvalue',
            'message1',
            'message2',
            )])
finally:
    f.close()

output = []
for row in reader(GzipFile('data.gz')):
    d = dict((
        (name, f(row[i]))
        for i, (f,name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
            ))))
    output.append(d)

# or

output = [
    dict((
        (name, f(row[i]))
        for i, (f,name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
            ))))
    for row in reader(GzipFile('data.gz'))
    ]
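
The code above targets Python 2, where the csv module writes to a
binary file. On Python 3 the csv module wants a text-mode file, so
here is a rough sketch of the same gzipped-CSV round trip for that
case. The names COLUMNS, write_rows and read_rows are made up for
illustration, the UTF-8 encoding is an assumption, and the reader is
written as a generator since only sequential access was asked for.

```python
import csv
import gzip

# Assumed column layout (converter, name), taken from the example above.
COLUMNS = (
    (int, 'timestamp'),
    (float, 'floatvalue'),
    (int, 'intvalue'),
    (str, 'message1'),
    (str, 'message2'),
)

def write_rows(path, rows):
    # 'wt' opens the gzip stream in text mode, which csv needs on
    # Python 3; newline='' leaves line endings to the csv module.
    with gzip.open(path, 'wt', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        for row in rows:
            w.writerow([row[name] for _, name in COLUMNS])

def read_rows(path):
    # Generator: yields one dict at a time, so only the current row
    # is held in memory while iterating.
    with gzip.open(path, 'rt', newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            yield dict(
                (name, convert(value))
                for (convert, name), value in zip(COLUMNS, row)
            )
```

Something like write_rows('data.gz', data) followed by
"for d in read_rows('data.gz'): ..." would then process the records
one by one without building the whole list first.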