sqlite INSERT performance
duncan smith
buzzard at urubu.freeserve.co.uk
Wed May 30 21:57:40 EDT 2012
Hello,
I have been attempting to speed up some code by using an sqlite
database, but I'm not getting the performance gains I expected.
The use case:
I have text files containing data which may or may not include a header
in the first line. Each line (other than the header) is a record, so all
lines (when split on the relevant separator) should contain the same
number of values. I need to generate new files in a very specific
format: space-separated, with the header removed and integer codes
substituted for the values in the parent file. E.g. if the first column
(i.e. [line.strip('\r\n').split()[0] for line in file]) contained 15
distinct strings, then I would substitute the values in the parent file
with integers from 0 to 14 in the new file. The new file would contain a
non-empty subset of the 'columns' in the original file, and might be
conditioned on particular values of other columns.
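
To make that recoding concrete, here is roughly what I mean (the column
data below is made up purely for illustration; each column gets its own
value -> integer mapping, assigned here in order of first appearance,
though the particular assignment doesn't matter):

rows = [["male", "smoker"], ["female", "smoker"], ["female", "non-smoker"]]
codes = [{} for _ in rows[0]]          # one value -> code mapping per column
recoded = []
for row in rows:
    new_row = []
    for col, value in enumerate(row):
        code = codes[col].setdefault(value, len(codes[col]))
        new_row.append(str(code))
    recoded.append(" ".join(new_row))
print("\n".join(recoded))
# 0 0
# 1 0
# 1 1
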
My first effort read the parent file and generated a similar file
containing integer codes. New files were generated by iterating over the
lines of the file containing integer codes, splitting them, doing the
required selection and conditioning via list comprehensions, joining the
resultant lists, and writing to a new file. My test file has 67 columns
and over a million records, and just creating the file of integers took
a few minutes. (I also need to check for empty lines and skip them, and
check for records of incorrect length.)
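
For what it's worth, generating a new file from the file of integer codes
currently looks roughly like this (the column indices, the condition and
the file names here are invented; the real file has 67 columns):

wanted = [0, 3, 5]                     # columns to keep in the new file
cond_col, cond_val = 2, "1"            # only keep records where column 2 == "1"
with open("codes.txt") as infile, open("new_file.txt", "w") as outfile:
    for line in infile:
        fields = line.split()
        if not fields:                 # skip empty lines
            continue
        if len(fields) != 67:          # all records should be the same length
            raise ValueError("record of incorrect length: %r" % line)
        if fields[cond_col] == cond_val:
            outfile.write(" ".join([fields[i] for i in wanted]) + "\n")
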
I have partially implemented an alternative approach where I write the
data to an sqlite database. The idea is that I will add extra columns
for the integer codes and insert the integer codes only when required
for a new file. But I've been immediately hit with the cost of inserting
the data into the database. It takes around 80 seconds (compared to the
35 seconds needed to parse the original file and skip empty lines and
check the record lengths). I have tried iterating over the records
(lists of strings generated by csv.reader) and inserting each in turn. I
have also tried executemany(), passing the csv.reader as the second
argument, and I have tried executing "PRAGMA synchronous=OFF". It still
takes around 80 seconds.
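
In case it helps, the insert code currently looks something like this
(the table and column names are invented and cut down to three columns;
the real table has 67, and the separator is whatever the file uses):

import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("PRAGMA synchronous=OFF")
conn.execute("CREATE TABLE parent (c0, c1, c2)")   # really 67 columns

with open("parent.txt") as f:
    reader = csv.reader(f, delimiter=" ")
    conn.executemany("INSERT INTO parent VALUES (?, ?, ?)", reader)
conn.commit()
conn.close()
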
I'm a bit rusty with SQL, so I'd appreciate any advice on how to speed
this up. I seem to remember (using MySQL years ago) that there was a way
of dumping data in a text file to a table very quickly. If I could do
this and do my data integrity checks afterwards, then that would be
great. (Dumping data efficiently to a text file from an sqlite table
would also be handy for generating my new files.) Alternatively, if I
could substantially speed up the inserts themselves, that would also
work. Any advice appreciated. TIA.
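
For the dump from an sqlite table back out to a space-separated text
file, the sort of thing I have in mind (with invented column names, and
assuming the integer-code columns already exist in the table) is:

import sqlite3

conn = sqlite3.connect("data.db")
with open("new_file.txt", "w") as outfile:
    rows = conn.execute("SELECT code0, code3, code5 FROM parent WHERE c2 = ?", ("1",))
    for row in rows:
        outfile.write(" ".join(str(v) for v in row) + "\n")
conn.close()
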
Duncan