steve at holdenweb.com
Sun Mar 14 15:16:43 CET 2010
D'Arcy J.M. Cain wrote:
> On Sat, 13 Mar 2010 23:42:31 -0800
> Jonathan Gardner <jgardner at jonathangardner.net> wrote:
>> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.email at nospam.invalid> wrote:
>>> "D'Arcy J.M. Cain" <darcy at druid.net> writes:
>>>> Just curious, what database were you using that wouldn't keep up with
>>>> you? I use PostgreSQL and would never consider going back to flat
>>> Try making a file with a billion or so names and addresses, then
>>> compare the speed of inserting that many rows into a postgres table
>>> against the speed of copying the file.
> That's a straw man argument. Copying an already built database to
> another copy of the database won't be significantly longer than copying
> an already built file. In fact, it's the same operation.
>> Also consider how much work it is to partition data from flat files
>> versus PostgreSQL tables.
> Another straw man. I'm sure you can come up with many contrived
> examples to show one particular operation faster than another.
> Benchmark writers (bad ones) do it all the time. I'm saying that in
> normal, real world situations where you are collecting billions of data
> points and need to actually use the data that a properly designed
> database running on a good database engine will generally be better than
> using flat files.
>>>> The only thing I can think of that might make flat files faster is
>>>> that flat files are buffered whereas PG guarantees that your
>>>> information is written to disk before returning
>>> Don't forget all the shadow page operations and the index operations,
>>> and that a lot of these operations require reading as well as writing
>>> remote parts of the disk, so buffering doesn't help avoid every disk
> Not sure what a "shadow page operation" is but index operations are
> only needed if you have to have fast access to read back the data. If
> it doesn't matter how long it takes to read the data back then don't
> index it. I have a hard time believing that anyone would want to save
> billions of data points and not care how fast they can read selected
> parts back or organize the data though.
>> Plus the fact that your other DB operations slow down under the load.
> Not with the database engines that I use. Sure, speed and load are
> connected whether you use databases or flat files but a proper database
> will scale up quite well.
A common complaint about large database loads taking a long time comes
about because of trying to commit the whole change as a single
transaction. Such an approach can indeed put stress on the database
system, but it isn't usually necessary.
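
As a rough illustration of the point about transaction size, here is a minimal sketch that commits a bulk load in fixed-size batches rather than as one huge transaction. It uses the standard library's sqlite3 module purely as a stand-in for a real server. The table name "people", the helper names batches() and bulk_load(), and the batch size are all assumptions made for the example, not anything taken from PostgreSQL or Oracle.

```python
import sqlite3
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_load(conn, rows, batch_size=10000):
    """Insert rows, committing once per batch instead of once overall.

    Hypothetical helper: the table and column names are illustrative only.
    """
    total = 0
    for chunk in batches(rows, batch_size):
        conn.executemany(
            "INSERT INTO people (name, address) VALUES (?, ?)", chunk
        )
        conn.commit()  # each batch is its own, much smaller transaction
        total += len(chunk)
    return total


# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, address TEXT)")

# A generator, so the whole data set is never held in memory at once.
rows = (("name%d" % i, "address%d" % i) for i in range(25000))
loaded = bulk_load(conn, rows, batch_size=10000)
count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
```

The trade-off is that a failure part way through leaves the earlier batches committed, so the loader has to be restartable or the partial load has to be cleaned up by hand.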
I don't know about PostgreSQL's capabilities in this area but I do know
that Oracle (which claims to be all about performance, though in fact I
believe PostgreSQL is its equal in many applications) allows you to
switch off the various time-consuming features such as transaction
logging in order to make bulk updates faster.
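
To show the general idea of trading durability for load speed, here is a loose analogue in SQLite, not Oracle's or PostgreSQL's actual mechanism. It is a sketch under stated assumptions: the PRAGMA statements are SQLite's own, the helper name load_without_journal() and the table "people" are hypothetical, and no speed-up is measured or claimed here.

```python
import os
import sqlite3
import tempfile


def load_without_journal(path, rows):
    """Bulk-insert rows with SQLite's rollback journal and fsync turned off.

    Hypothetical helper. A crash during the load can corrupt the file,
    so this only makes sense when the load can simply be redone.
    """
    conn = sqlite3.connect(path)
    # Both settings must be changed outside a transaction, so they come
    # before any INSERT. Some SQLite builds may ignore journal_mode = OFF.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, address TEXT)"
    )
    conn.executemany(
        "INSERT INTO people (name, address) VALUES (?, ?)", rows
    )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
    conn.close()
    return count


# Use a throwaway file so the journal settings apply to an on-disk database.
with tempfile.TemporaryDirectory() as tmp:
    n = load_without_journal(
        os.path.join(tmp, "bulk.db"),
        [("name%d" % i, "address%d" % i) for i in range(1000)],
    )
```

Normal settings would be restored once the load is finished, since they are what protect the data during everyday use.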
I also question how many databases would actually find a need to store
addresses for a sixth of the world's population, but this objection is
made mostly for comic relief: I understand that tables of such a size
are necessary sometimes.
There was a talk at OSCON two years ago by someone who was using
PostgreSQL to process 15 terabytes of medical data. I'm sure he'd have
been interested in suggestions that flat files were the answer to his
Another talk a couple of years before that discussed how PostgreSQL was
superior to Oracle in handling a three-terabyte data warehouse (though
it conceded Oracle's superiority in handling the production OLTP system
on which the warehouse was based - but that's four years ago).
Of course if you only need sequential access to the data then the
relational approach may be overkill. I would never argue that relational
is the best approach for all data and all applications, but it's often
better than its less-informed critics realize.
Steve Holden +1 571 484 6266 +1 800 494 3119
See PyCon Talks from Atlanta 2010 http://pycon.blip.tv/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/