Looking for minimal SQL
Pierre-Frédéric Caillaud
peufeu at free.fr
Sat Jul 3 16:14:13 EDT 2004
> "If"??? I think they are still on the TBD list for MySQL, so at
> the moment it is not an "if", it is an "as". <G>
Take a look at the column types PostgreSQL offers.
Among other interesting things, like a datetime type which works, it has
arrays.
You can define a column as an array of integers for instance.
Example :
Imagine we create a table to hold the index of a book. We should have
this (in pseudo-sql):
table words:
id integer primary key
word varchar
table refs:
page integer not null
word_id integer not null references words(id)
index on all interesting columns.
To find all pages where a word is, we look in "words" (index lookup on
word) and find the word id ; then we look in "refs" all pages with the
corresponding word_id
To find all words on a page, we look in "refs" (index lookup on page),
then we group by word, and we look in words to list the words.
With an array type, we have :
table words:
id integer primary key
word varchar
page integer[]
Then you can use a GIST index on page and ask it "find all rows where the
page list contains this page". This is a lot faster than the above Joins.
Getting the list of pages for a word is also a lot faster (it's only one
select, maximal locality of reference).
The GIST indexes are so optimized in Postgres that the speed for these
lookups is amazing.
I tested this feature by building a list of 100.000 words which appeared
in anything from 1 to 10 different random pages (between 1 and 1000). When
finding all words on a page, the GIST index had acceptable timings (ie. a
few tens of milliseconds to return between a few hundred and a few
thousands records) ; the pivot table had horrendous timings on the verge
of a second (because the Join was looking in the pivot table for the
condition first, then making index lookups for each row in this large
table, which is the only way to do it... and makes a lot of disk seeks !)
If you want to use a "minimal subset of SQL", you'll have to forgo these
interesting features...
Postgresql has indexable geometric types. On a random population of
500.000 points, it takes 90 ms to get the points in a certain bounding box
(this particular query returned 5.000 points)... without clustering on the
index it took about 200 ms...
So, if you want performance, it is difficult to ignore database-specific
points.
I know MySQL has geometric types too. But only for MyISAM tables !
(argh). So you don't get transactions on them. And MyISAM tables have
terrible write concurrency.
More information about the Python-list
mailing list