Looking for minimal SQL

Pierre-Frédéric Caillaud peufeu at free.fr
Sat Jul 3 16:14:13 EDT 2004


> 	"If"???	I think they are still on the TBD list for MySQL, so at
> the moment it is not an "if", it is an "as". <G>

	Take a look at the column types PostgreSQL offers.
	Among other interesting things, like a datetime type which works, it has  
arrays.
	You can define a column as an array of integers for instance.
	Example :

	Imagine we create a table to hold the index of a book. We should have  
this (in pseudo-sql):

	table words:
		id integer primary key
		word varchar

	table refs:
		page integer not null
		word_id integer not null references words(id)

	index on all interesting columns.

	To find all pages where a word is, we look in "words" (index lookup on  
word) and find the word id ; then we look in "refs" all pages with the  
corresponding word_id
	To find all words on a page, we look in "refs" (index lookup on page),  
then we group by word, and we look in words to list the words.

	With an array type, we have :

	table words:
		id integer primary key
		word varchar
		page integer[]

	Then you can use a GIST index on page and ask it "find all rows where the  
page list contains this page". This is a lot faster than the above Joins.  
Getting the list of pages for a word is also a lot faster (it's only one  
select, maximal locality of reference).
	The GIST indexes are so optimized in Postgres that the speed for these  
lookups is amazing.

	I tested this feature by building a list of 100.000 words which appeared  
in anything from 1 to 10 different random pages (between 1 and 1000). When  
finding all words on a page, the GIST index had acceptable timings (ie. a  
few tens of milliseconds to return between a few hundred and a few  
thousands records) ; the pivot table had horrendous timings on the verge  
of a second (because the Join was looking in the pivot table for the  
condition first, then making index lookups for each row in this large  
table, which is the only way to do it... and makes a lot of disk seeks !)



	If you want to use a "minimal subset of SQL", you'll have to forgo these  
interesting features...

	Postgresql has indexable geometric types. On a random population of  
500.000 points, it takes 90 ms to get the points in a certain bounding box  
(this particular query returned 5.000 points)... without clustering on the  
index it took about 200 ms...


	So, if you want performance, it is difficult to ignore database-specific  
points.

	I know MySQL has geometric types too. But only for MyISAM tables !  
(argh). So you don't get transactions on them. And MyISAM tables have  
terrible write concurrency.	








More information about the Python-list mailing list