Searching for lots of similar strings (filenames) in sqlite3 database
Adam Funk
a24061 at ducksburg.com
Tue Jul 1 07:26:53 EDT 2014
I have some code that reads files in a leafnode2 news spool & needs to
check for new files periodically. The full paths are all like
'/var/spool/news/message.id/345/<123456 at example.com>' with a 3-digit
subdirectory & a Message-ID for the filename itself. I'm using Python
3 & sqlite3 in the standard library.
I have a table of filenames created with the following command:
cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')
To check for new files in one of the subdirectories, I run A, then
either B or C below (I've tried both).
A.
listing1 = os.listdir(directory)
listing = [os.path.join(directory, x) for x in listing1]
B.
cursor = db_conn.cursor()
for filename in listing:
    cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))
    row = cursor.fetchone()
    if not row:
        cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0))
        files_new += 1
db_conn.commit()
C.
cursor = db_conn.cursor()
subdir_like = directory + '/%'
cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', (subdir_like,))
rows = cursor.fetchall()
known_files = [row[0] for row in rows]
for filename in listing:
    if filename not in known_files:
        cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0))
        files_new += 1
db_conn.commit()
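Incidentally, the `filename not in known_files` test in C is a linear scan
of a list; building `known_files` as a set instead makes each membership
test effectively constant-time. A self-contained sketch of that variant
(the spool paths here are invented for illustration, not from the real
leafnode2 spool):

```python
import sqlite3

# Same schema as above, but with an in-memory database and made-up paths.
db_conn = sqlite3.connect(':memory:')
cursor = db_conn.cursor()
cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')
cursor.execute('INSERT INTO files VALUES (?, ?)',
               ('/spool/345/<old@example.com>', 1))

listing = ['/spool/345/<old@example.com>', '/spool/345/<new@example.com>']
files_new = 0

cursor.execute('SELECT filename FROM files WHERE filename LIKE ?',
               ('/spool/345/%',))
known_files = {row[0] for row in cursor.fetchall()}  # set, not list

for filename in listing:
    if filename not in known_files:  # O(1) lookup in a set
        cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0))
        files_new += 1
db_conn.commit()
print(files_new)  # 1
```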
A+B was the first method I came up with, because it looks like the
"keep it simple & let the database do its job" approach, but it was
very time-consuming, so I tried A+C. A is quick (a second); B
can drag on for over an hour to check 2000 filenames (for example) in
a subdirectory; C always takes less than a minute. So C is much
better than B, but it looks (to me) like one of those attempts to
bypass & ignore the database's built-in optimizations.
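For comparison, here's roughly what I mean by letting the database do its
job: one batch of INSERT OR IGNORE in a single transaction, with the
PRIMARY KEY on filename making duplicate inserts no-ops. This is only a
sketch with invented paths; `total_changes` stands in for my `files_new`
counter:

```python
import sqlite3

# Hypothetical paths, in-memory database; same schema as in the post.
db_conn = sqlite3.connect(':memory:')
cursor = db_conn.cursor()
cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')
cursor.execute('INSERT INTO files VALUES (?, ?)',
               ('/spool/345/<old@example.com>', 1))

listing = ['/spool/345/<old@example.com>', '/spool/345/<new@example.com>']

before = db_conn.total_changes
# Duplicates hit the PRIMARY KEY constraint and are silently skipped.
cursor.executemany('INSERT OR IGNORE INTO files VALUES (?, 0)',
                   [(f,) for f in listing])
db_conn.commit()
files_new = db_conn.total_changes - before  # only rows actually inserted
print(files_new)  # 1
```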
Comments?
--
No sport is less organized than Calvinball!