Using Python for processing of large datasets (convincing management)
Thomas Jensen
spam at ob_scure.dk
Mon Jul 8 19:51:34 EDT 2002
Cameron Laird wrote:
> In article <3D2A078A.7040502 at ob_scure.dk>,
> I call SQL noodling "scalable" in the sense that good
> SQL queries can be hosted on bigger and bigger servers.
> We know how to do that--it's a commercial reality.
Ok, I understand.
I think it's often a question of choosing the right tool for the job.
Consider the following example: find the average of a series of values
found in a table. Of course(?) doing a "SELECT AVG(value) FROM T_MyTable
WHERE ..." would be much faster than retrieving all the values and doing
the calculations on the client/app-server side. However, if for some
reason the contents of T_MyTable were already in the client's memory
(perhaps they were calculated there), calculating the average on the
client would perhaps be faster.
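
A minimal sketch of the two approaches, assuming an in-memory SQLite database as a stand-in for the real server (the table name T_MyTable comes from the example above; the column layout, the sample data and the function names are made up for illustration):

```python
import sqlite3

# Assumption: SQLite stands in for the real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_MyTable (value REAL)")
conn.executemany(
    "INSERT INTO T_MyTable (value) VALUES (?)",
    [(float(v),) for v in range(1, 101)],  # hypothetical sample data: 1..100
)


def avg_in_sql(conn):
    """Let the database compute the average; only one row is transferred."""
    row = conn.execute("SELECT AVG(value) FROM T_MyTable").fetchone()
    return row[0]


def avg_on_client(values):
    """Compute the average from values that are already in client memory."""
    return sum(values) / len(values)


# Case 1: the data lives in the database.
sql_avg = avg_in_sql(conn)

# Case 2: the data is already on the client (here it is fetched once,
# but it could just as well have been calculated there).
values_in_memory = [row[0] for row in conn.execute("SELECT value FROM T_MyTable")]
client_avg = avg_on_client(values_in_memory)
```

Both calls return the same number; what differs is where the work happens and how much data has to cross the wire.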
Be assured, though, that for each calculation, both SQL and
Python (/C++/VB or whatever it ends up being) solutions will be written
and the fastest chosen. As has been noted, the result might be that
the SQL approach is the fastest; only time will tell.
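
Picking the fastest of two implementations could be sketched roughly like this, assuming SQLite as a stand-in database and the standard timeit module for measuring (the function names, the row count and the repeat counts are arbitrary choices for illustration, not measured recommendations):

```python
import sqlite3
import timeit


def make_db(n=10_000):
    """Build a throwaway in-memory table with n rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T_MyTable (value REAL)")
    conn.executemany(
        "INSERT INTO T_MyTable (value) VALUES (?)",
        ((float(i),) for i in range(n)),
    )
    return conn


def avg_sql(conn):
    """Average computed by the database."""
    return conn.execute("SELECT AVG(value) FROM T_MyTable").fetchone()[0]


def avg_fetch_then_python(conn):
    """Average computed in Python after fetching every row."""
    values = [row[0] for row in conn.execute("SELECT value FROM T_MyTable")]
    return sum(values) / len(values)


conn = make_db()
candidates = {"sql": avg_sql, "python": avg_fetch_then_python}

# Best-of-3 timing for each candidate; each run calls the function 5 times.
timings = {
    name: min(timeit.repeat(lambda f=func: f(conn), number=5, repeat=3))
    for name, func in candidates.items()
}
fastest = min(timings, key=timings.get)
print(f"fastest: {fastest}", timings)
```

The result depends heavily on the database, the network and the data volume, so the timings only mean something when taken against the real setup.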
> I *like* distributed computing. I've spent much of the
> last eighteen months promoting SOAP, XML-RPC, and CORBA.
> Your mention of Linda and its descendants, including
> T-Spaces, thrilled me. HOWEVER, I rarely recommend
> distribution for performance objectives, for reasons
> that have mostly appeared already in this thread. Com-
> mercial applications (as opposed to scientific ones)
> just don't find success that way.
Well, you might be right, I don't know.
I'm a little scared, though, about using SQL too extensively.
I might be too much of an SQL newbie, but there's just some stuff that's
hard to write in (portable) SQL.
For example I've done some quite fancy calculations using multiple
"DECLARE CURSOR", etc. in MSSQL. However, trying to run these through
MySQL is, well, problematic.
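
One way around vendor-specific cursor syntax is to keep the SQL down to a plain SELECT and do the row-by-row part on the client through the Python DB-API. A minimal sketch, assuming a running total as a hypothetical stand-in for the "fancy" calculation and SQLite as a stand-in for MSSQL/MySQL (the id column and the function name are also made up):

```python
import sqlite3


def running_totals(conn, query):
    """Walk the result set row by row and accumulate a running total.

    Only a plain SELECT is sent to the server; the cursor-style logic
    lives in Python, so it does not depend on a vendor's DECLARE CURSOR.
    """
    cursor = conn.cursor()
    cursor.execute(query)
    total = 0.0
    results = []
    for (value,) in cursor.fetchall():
        total += value
        results.append(total)
    return results


# Assumption: any DB-API 2.0 driver could supply the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_MyTable (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO T_MyTable (value) VALUES (?)",
    [(1.0,), (2.0,), (3.5,)],  # hypothetical sample data
)

totals = running_totals(conn, "SELECT value FROM T_MyTable ORDER BY id")
```

The price is that every row is transferred to the client, which is exactly the cost the AVG() example avoids.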
> Your situation might be an exception. It's hard to know.
> The computations you describe--DB retrievals, elementary
> statistics, ...--sound to me like ones that I've seen
> most successfully hosted on conventional architectures.
I think I'm currently planning on a 90% conventional design, with the
possibility of later expansion to distributed computing :-)
For the last 6 months I've been working almost exclusively on a (commercial)
project heavily based on SOAP (not for performance objectives though :-).
That part really doesn't scare me :-)
--
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)