sqlite INSERT performance
duncan smith
buzzard at urubu.freeserve.co.uk
Wed May 30 21:57:40 EDT 2012
Hello,
I have been attempting to speed up some code by using an sqlite
database, but I'm not getting the performance gains I expected.
The use case:
I have text files containing data which may or may not include a header
in the first line. Each line (other than the header) is a record, so all
lines (when split on the relevant separator) should contain the same
number of values. I need to generate new files in a very specific
format: space-separated, with the header removed and integer codes
substituted for the values in the parent file. E.g. if the first column
(i.e. [line.strip('\r\n').split()[0] for line in file]) contained 15
distinct strings, then I would substitute the values in the parent file
with integers from 0 to 14 in the new file. The new file would contain a
non-empty subset of the 'columns' in the original file, and might be
conditioned on particular values of other columns.
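
To make that recoding concrete, here is roughly what I mean (the column
data below is made up purely for illustration; each column gets its own
value -> integer mapping, assigned here in order of first appearance,
though the particular assignment doesn't matter):

rows = [["male", "smoker"], ["female", "smoker"], ["female", "non-smoker"]]
codes = [{} for _ in rows[0]]          # one value -> code mapping per column
recoded = []
for row in rows:
    new_row = []
    for col, value in enumerate(row):
        code = codes[col].setdefault(value, len(codes[col]))
        new_row.append(str(code))
    recoded.append(" ".join(new_row))
print("\n".join(recoded))
# 0 0
# 1 0
# 1 1
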
My first effort read the parent file and generated a similar file
containing integer codes. New files were generated by iterating over the
lines of the file containing integer codes, splitting them, doing the
required selection and conditioning via list comprehensions, joining the
resultant lists, and writing to a new file. My test file has 67 columns
and over a million records, and just creating the file of integers took
a few minutes. (I also need to check for empty lines and skip them, and
check for records of incorrect length.)
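
For what it's worth, generating a new file from the file of integer codes
currently looks roughly like this (the column indices, the condition and
the file names here are invented; the real file has 67 columns):

wanted = [0, 3, 5]                     # columns to keep in the new file
cond_col, cond_val = 2, "1"            # only keep records where column 2 == "1"
with open("codes.txt") as infile, open("new_file.txt", "w") as outfile:
    for line in infile:
        fields = line.split()
        if not fields:                 # skip empty lines
            continue
        if len(fields) != 67:          # all records should be the same length
            raise ValueError("record of incorrect length: %r" % line)
        if fields[cond_col] == cond_val:
            outfile.write(" ".join([fields[i] for i in wanted]) + "\n")
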
I have partially implemented an alternative approach where I write the
data to an sqlite database. The idea is that I will add extra columns
for the integer codes and insert the integer codes only when required
for a new file. But I've been immediately hit with the cost of inserting
the data into the database. It takes around 80 seconds (compared to the
35 seconds needed to parse the original file and skip empty lines and
check the record lengths). I have tried iterating over the records
(lists of strings generated by csv.reader) and inserting each in turn. I
have also tried executemany(), passing the csv.reader as the second
argument, and I have tried executing "PRAGMA synchronous=OFF". It still
takes around 80 seconds.
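
In case it helps, the insert code currently looks something like this
(the table and column names are invented and cut down to three columns;
the real table has 67, and the separator is whatever the file uses):

import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("PRAGMA synchronous=OFF")
conn.execute("CREATE TABLE parent (c0, c1, c2)")   # really 67 columns

with open("parent.txt") as f:
    reader = csv.reader(f, delimiter=" ")
    conn.executemany("INSERT INTO parent VALUES (?, ?, ?)", reader)
conn.commit()
conn.close()
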
I'm a bit rusty with SQL, so I'd appreciate any advice on how to speed
this up. I seem to remember (using MySQL years ago) that there was a way
of dumping data in a text file to a table very quickly. If I could do
this and do my data integrity checks afterwards, then that would be
great. (Dumping data efficiently to a text file from an sqlite table
would also be handy for generating my new files.) Alternatively, if I
could substantially speed up the inserts themselves, that would also
work. Any advice appreciated. TIA.
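
For the dump from an sqlite table back out to a space-separated text
file, the sort of thing I have in mind (with invented column names, and
assuming the integer-code columns already exist in the table) is:

import sqlite3

conn = sqlite3.connect("data.db")
with open("new_file.txt", "w") as outfile:
    rows = conn.execute("SELECT code0, code3, code5 FROM parent WHERE c2 = ?", ("1",))
    for row in rows:
        outfile.write(" ".join(str(v) for v in row) + "\n")
conn.close()
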
Duncan