Fastest way to store ints and floats on disk

Laszlo Nagy gandalf at shopzeus.com
Thu Aug 7 20:41:18 CEST 2008


  Hi,

I'm working on a pivot table. I would like to write it in Python. I 
know, I should be doing that in C, but I would like to create a cross 
platform version which can deal with smaller databases (not more than a 
million facts).

The data is first imported from a csv file: the user selects which 
columns contain dimension and measure data (and which columns to 
ignore). In the next step I would like to build up a database that is 
efficient enough to be used for making pivot tables. Here is my idea for 
the database:

Original CSV file with column header and values:

"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210

Using the GUI, it is converted to this:

dimensions = [
    { 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green', 
'Yellow' ], },
    { 'name':'Year', colindex:1, 'values':[ 
1995,1999,2000,2001,2002,2003,2007 ], },
    { 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford', 
'Lamborgini' ], },
]
measures = [
    { 'name', 'Price', 'colindex':3 },
    { 'name', 'Vmax', 'colindex':4 },
]
facts = [
    ( (3,2,0),(100000.0,254.0)  ), # ( dimension_value_indexes, 
measure_values )
    ( (1,5,1),(50000.0,210.0) ),
   .... # Some million rows or less
]


The core of the idea is that, when using a relatively small number of 
possible values for each dimension, the facts table becomes 
significantly smaller and easier to process. (Processing the facts would 
be: iterate over facts, filter out some of them, create statistical 
values of the measures, grouped by dimensions.)

The facts table cannot be kept in memory because it is too big. I need 
to store it on disk, be able to read incrementally, and make statistics. 
In most cases, the "statistic" will be simple sum of the measures, and 
counting the number of facts affected. To be effective, reading the 
facts from disk should not involve complex conversions. For this reason, 
storing in CSV or XML or any textual format would be bad. I'm thinking 
about a binary format, but how can I interface that with Python?

I already looked at:

- xdrlib, which throws me DeprecationWarning when I store some integers
- struct which uses format string for each read operation, I'm concerned 
about its speed

What else can I use?

Thanks,

   Laszlo






More information about the Python-list mailing list