python and very large data sets???

Andrae Muys amuys at shortech.com.au
Thu Apr 25 19:25:28 EDT 2002


aahz at pythoncraft.com (Aahz) wrote in message news:<aa7be1$i3v$1 at panix1.panix.com>...
> In article <mailman.1019682346.19715.python-list at python.org>,
> holger krekel  <pyth at devel.trillke.net> wrote:
> >
> >I just don't happen to see the advantages of bringing a database into
> >the picture. It seems like a classical batch job and it 'random access
> >many times' is not needed, so why?
> 
> From the original post:
> 
>     Things would afterwards get more complicated cause I will have to
>     pullout ID's from "sub_file1", remove duplicate ID's create
>     "no_dup_sub_file1", match those to ID's in remaining 3 main files and
>     pullout data linked with those ID's.
> 
> This screams "*JOIN*" to me.  Now, if sub_file1 is less than 100MB,
> *maybe* Python can handle it.  IMO, that is true IIF the records are
> strictly fixed-length.  But IME requirements will change such that joins
> over larger and larger datasets will be needed -- and why re-invent a
> database that's designed precisely for this purpose?

It screams dbm to me; however, given the tendency of software to
expand its requirements, I agree.  MySQL is definitely the approach I
would personally take.  It will almost certainly take less time to
learn basic SQL (and for this type of job, you don't need anything
fancy) than to write and debug what is really a mini-database.
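
To give a rough idea of just how little SQL is involved, here is an
untested sketch (table names, column names and connection details are
all made up for illustration) that does the dedup-and-match in a
single query from Python via the MySQLdb module.  It assumes the
files have already been loaded into tables sub_file1 and main_file2,
each with an "id" column (e.g. via LOAD DATA INFILE):

import MySQLdb

# Connect to the local server (made-up credentials and database name).
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="bigdata")
cur = conn.cursor()

# DISTINCT collapses the duplicate IDs coming from sub_file1, and the
# join pulls out every main_file2 record whose id matches one of them.
cur.execute("""SELECT DISTINCT m.*
               FROM main_file2 m, sub_file1 s
               WHERE m.id = s.id""")

for row in cur.fetchall():
    print row    # process each matched record here

cur.close()
conn.close()

Repeating the same query against the remaining main tables covers the
rest of the job; nothing fancier than DISTINCT and a two-table join
is needed.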

I suggest you take a look at

http://www.mysql.org
and
http://www.dcs.napier.ac.uk/~andrew/sql/

Andrae Muys

P.S. Actually I prefer PostgreSQL, but since in this case speed is
likely to be more useful than referential integrity, I favour MySQL.
If, however, the database might end up being used in its own right, I
would be more than willing to sacrifice the speed advantage of MySQL
for the referential integrity, constraint, and transaction features
of PostgreSQL.


