Advice on optimum data structure for billion long list?
Alexandre Fayolle
alf at orion.logilab.fr
Sun May 13 10:38:52 EDT 2001
On Sat, 12 May 2001 17:12:22 +0100, Mark blobby Robinson
<m.1.robinson at herts.ac.uk> wrote:
>I'd just like to pick the best of the worlds python brains if thats ok.
I'm not sure I qualify here, but I'll throw in a few considerations.
>it would take litterally weeks to run.
Well, you want to process 1.4 billion elements. Assuming you can process
1000 elements per second, this still leaves you with roughly 16 days of
processing. There ain't no such thing as a free lunch.
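To make the arithmetic explicit, here is a minimal sketch of the estimate. The rate of 1000 elements per second is only the assumption made above, not a measurement, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def processing_days(n_elements, elements_per_second):
    """Return the days needed to process n_elements one after another."""
    return n_elements / elements_per_second / SECONDS_PER_DAY

# 1.4 billion elements at an assumed 1000 elements per second
days = processing_days(1_400_000_000, 1000)
print(f"{days:.1f} days")  # 16.2 days
```

Plugging in your own measured rate will tell you quickly whether a single machine is realistic at all.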
Regarding the storage concern you have, you may want to note the following
things:
* real-world DBMSs (such as PostgreSQL, DB2, Oracle...) are made to
handle tables with a size of the order of 10e7 rows. These are considered
'big' databases. Dealing with billions of rows means a really, really big
database, for which you generally need special support from the OS and
the DB vendor, not to mention the hardware (RAID, anyone?), in order to
get decent performance. Oracle and IBM will be happy to sell you support
to help you create and tune the DB.
* there is absolutely no way you'll be able to use gdbm on an average
workstation to store 1.4 billion rows. In my personal
experience, gdbm is OK up to about 10000 rows per file. I'm currently dealing
with 300000 rows for an application and I use 26 gdbm files to
pre-hash the data into reasonably sized files. If you want to go in this
direction, be aware that it means about 150000 files. Using Postgres will
ease things, but you'll have to deal with DBMS-specific problems (index
updates, datafile size...).
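The pre-hashing idea can be sketched roughly as follows. This is an illustration under assumptions: the CRC32-based bucket choice and the file naming are hypothetical and not the scheme used in the application mentioned above, and the 10000 rows per file is just the rule of thumb quoted there.

```python
import math
import zlib

def files_needed(total_rows, rows_per_file):
    """Number of bucket files required to keep each file at rows_per_file."""
    return math.ceil(total_rows / rows_per_file)

def bucket_for(key, n_buckets):
    """Map a string key to a stable bucket number in range(n_buckets)."""
    # CRC32 is an arbitrary choice here; any stable hash would do.
    return zlib.crc32(key.encode("utf-8")) % n_buckets

def bucket_filename(key, n_buckets, prefix="bucket"):
    """Hypothetical file name of the bucket that should hold this key."""
    return f"{prefix}_{bucket_for(key, n_buckets):06d}.db"

n_files = files_needed(1_400_000_000, 10_000)
print(n_files)  # 140000, i.e. the "about 150000 files" order of magnitude
print(bucket_filename("some-key", n_files))
```

Each bucket file would then be opened as its own small database, so a lookup touches only one file of manageable size.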
Nowadays, when you need to process such a huge amount of data, the way
people go is by parallelizing the code and distributing the computation
across several machines. If the lab you're working in has several hundred
workstations doing nothing at night, you may want to use them to do the
computation for you. Otherwise, you can talk your manager into buying you
one hundred Linux boxes and use them. Or you could try contacting
distributed.net to see if they're interested in your project.
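As a rough sketch of what distributing the work means, the index range can be cut into one contiguous chunk per machine. The figures are assumptions for illustration: 100 machines, each handling 1000 elements per second, with no coordination overhead, which is optimistic.

```python
def split_range(total, n_workers):
    """Split range(total) into n_workers contiguous (start, stop) chunks."""
    base, extra = divmod(total, n_workers)
    chunks = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers take one additional element each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks

chunks = split_range(1_400_000_000, 100)
first_start, first_stop = chunks[0]
# Assumed rate of 1000 elements per second on every machine
hours_per_machine = (first_stop - first_start) / 1000 / 3600
print(chunks[0], f"{hours_per_machine:.1f} hours")
```

Under those assumptions each machine gets 14 million elements and finishes in a few hours, which is why idle workstations at night are attractive.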
Now, whatever path you choose to follow, it will take time (rewriting
the code, setting it up on the machines, running it...), probably several
weeks, so I have to say that you'll probably have to wait for some time
before you get any result.
Alexandre Fayolle
--
http://www.logilab.com
Narval is the first software agent available as free software (GPL).
LOGILAB, Paris (France).
More information about the Python-list
mailing list