Using Python for processing of large datasets (convincing management)

Thomas Jensen spam at ob_scure.dk
Sun Jul 7 11:57:18 EDT 2002


Paul Rubin wrote:
> Thomas Jensen <spam at ob_scure.dk> writes:
>>The current job takes about 5 hours to complete! 
> 
> Is 5 hours acceptable, if your data doesn't get any bigger?
> It not, what's the maximum you can accept?

5 hours is about the maximum acceptable. Usually the time is a little 
shorter, but 5 hours happens when a lot of new data is added.
However, this is a case of "faster is better". 5 hours is acceptable, but 
1 minute would open up new business opportunities.

> OK, you're certain you can do it in 30 minutes.  Are you certain
> you CAN'T do it in 5 minutes?  If you can do it in 5 minutes, maybe
> you can stop worrying about scaling.

It would perhaps be possible to reach 5 minutes. However, I am quite 
certain that the effort and development time of going from 10 minutes to 
5 minutes would be far greater than that of writing a simple RPC 
client/server architecture. Since we have 8 (or is it 6, I don't 
remember) CPUs that are mostly idle at nighttime, why not utilize them?
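As a rough illustration of the idea (not our actual code - calculate() 
and the item list are invented placeholders), independent 
sub-calculations can be farmed out to a pool of local worker processes:

```python
# Hypothetical sketch: if each sub-calculation is independent, the
# nightly batch can simply be mapped over a pool of worker processes,
# one per (mostly idle) CPU. calculate() is a made-up stand-in.
from multiprocessing import Pool

def calculate(item):
    return item * item  # placeholder for one real sub-calculation

def run_batch(items, workers=8):
    with Pool(processes=workers) as pool:
        return pool.map(calculate, items)

if __name__ == "__main__":
    print(run_batch(range(10), workers=4))
```

The same map-over-workers shape carries over to workers on other 
machines, which is where the RPC layer would come in.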

> In another post you said you wanted to handle 10 times as much data
> as you currently handle.  Now you say it's not known exactly--do you
> have an idea or not?

AFAIR I said at least 10 times (that's what I meant, anyway).
Faster calculations => able to handle more data => new business 
opportunities.
If it were a big problem making the program distributed, I probably 
wouldn't consider it.

> If it's acceptable for the program to need 3 hours, and you can handle
> the current data size in 10 minutes, then you can handle 10x the data
> size with plenty of speed to spare (assuming no
> seriously-worse-than-linear-time processes).

Most sub-calculations currently scale almost linearly, assuming indices 
are set up correctly.
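To show what "indices set up correctly" buys (a made-up table, not our 
actual schema), SQLite's query planner reports an index search instead 
of a full-table scan for an equality lookup on an indexed column:

```python
# Hypothetical illustration: with an index in place, a lookup SELECT
# becomes a SEARCH using the index rather than a SCAN of the whole
# table, which is what keeps per-row cost roughly constant as the
# table grows. Table and index names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, value REAL)")
conn.execute("CREATE INDEX idx_results_id ON results (id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM results WHERE id = ?", (42,)
).fetchone()
print(plan)  # the plan row names idx_results_id
```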

> I think the bottleneck is going to be the database.  You might not get
> better throughput with multiple client CPU's than with just one.  If
> you do, maybe your client application needs more optimization.

We already have 2 DB Servers, a master replicating changes to a slave.
Our analysis shows that most database operations are/will be SELECTs.
Adding more DB servers is trivial, especially if we migrate to MySQL 
(well, cheaper at least :-)
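Since reads dominate, spreading the SELECTs over the replicas is simple 
too. A hypothetical sketch (host names invented; in real code the pool 
would hold DB-API connections rather than strings):

```python
# Hypothetical sketch of spreading read-only queries across replicas
# by round-robin. Any replica can answer a SELECT, so adding a DB
# server just means adding one more entry to the pool.
import itertools

class ReadBalancer:
    """Round-robin over a pool of (read-only) replica handles."""

    def __init__(self, connections):
        self._pool = itertools.cycle(connections)

    def connection(self):
        return next(self._pool)

balancer = ReadBalancer(["db-master", "db-slave-1", "db-slave-2"])
```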

> What is the application?  What is the data and what do you REALLY need
> to do with it?  How much is there ever REALLY likely to be?  Is an SQL
> database even really the best way to store and access it?  If there's
> not multiple processes updating it, maybe you don't need its overhead.

We have several applications accessing the DB via SQL. Migrating to 
something else is out of the question in the near future.

> Could a 1960's mainframe programmer deal with your problem, and if
> s/he could deal with it at all, why do you need multiple CPU's when
> each one is 1000 times faster than the 1960's computer?

I've never used/programmed a mainframe, so I don't know :-)

> Inside most complicated programs there's a simple program struggling
> to get out.

I agree, I really do. I usually rewrite most of my programs one or more 
times to make them less complicated.

Before going on with the distributed approach, I will probably write a 
"proof of concept" demo. Should this demo show that it is not worth the 
effort, I will put it aside for now.

The reason I mentioned XML-RPC earlier is that I've used it before and 
in my opinion it is extremely easy and intuitive to use.
The model I'm currently working with is characterized by having 
relatively few RPC calls, with each call having only one integer as 
input and output.
Should I require more complex data structures, I'd probably look for a 
binary protocol.
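A one-integer-in, one-integer-out call is about the simplest shape 
XML-RPC supports. A rough sketch with the standard library (modern 
module names; the function name, the arithmetic, and the loopback setup 
are invented for illustration):

```python
# Hypothetical sketch of the one-integer-in, one-integer-out RPC
# shape described above, using the stdlib XML-RPC modules.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def calculate(n):
    return n + 1  # stand-in for one server-side sub-calculation

# Server side: bind to an ephemeral port and serve in the background.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(calculate)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: call the remote function as if it were local.
proxy = ServerProxy("http://127.0.0.1:%d" % port)
result = proxy.calculate(41)
server.shutdown()
server.server_close()
```

With so little data per call, the XML encoding overhead hardly matters; 
it would only start to hurt with large or deeply nested structures.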

But all that aside - the distributed part is not really the hard or 
complex part of this project. I understand that as soon as the 
calculations take place in more than one thread (be it on one or more 
CPUs/machines), some complexity is added. However, designing the 
application in such a way that parallel computations are possible 
can't be that bad, I think.

I really see all this distribution talk as one among several 
optimization strategies.

An extreme example of another strategy: Develop the entire thing in 
assembler, using flat files or entirely bypassing the file-system.
If done correctly, it would probably outperform other strategies by far, 
but it would also be:
* Less maintainable
* Less readable
* a lot harder to use from ASP/PHP
* etc

Sometimes you just have to choose.

If you like I can come up with some less extreme examples :-)

-- 
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)
