Searching for lots of similar strings (filenames) in sqlite3 database
Adam Funk
a24061 at ducksburg.com
Tue Jul 1 11:15:47 EDT 2014
On 2014-07-01, Chris Angelico wrote:
> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>> cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))
>
> Shouldn't this be an equality check rather than IS, which normally I'd
> expect to be "IS NULL" or "IS NOT NULL"?
Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', &
'IS' are interchangeable.
http://www.tutorialspoint.com/sqlite/sqlite_operators.htm
Looking at that page again, I see that 'GLOB' is a case-sensitive
version of 'LIKE'. I can't help but wonder if that makes it faster.
;-)
> As to your actual question: Your two database lookups are doing
> distinctly different things, so there's no surprise that they perform
> very differently. B asks the database "Do you have this? Do you have
> this?" for every file you have, and C asks the database "What do you
> have?", and then comparing that against the list of files. By the way
> - the A+C technique could be done quite tidily as a set difference:
>
> # assume you have listing1 and cursor set up
> # as per your above code
> listing = {os.path.join(directory, x) for x in listing1}
> cursor.execute(...) # as per above
> known_files = {row[0] for row in cursor} # cursors are iterable
> needed_files = listing - known_files
> cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename,
> 0) for filename in needed_files))
Oh, even better:
add_files = listing - known_files
delete_files = known_files - listing
and then I can remove files that have disappeared off the spool from
the table. Thanks very much!
> Anyway. The significant thing is the performance of the database on
> two different workloads: either "give me everything that matches this
> pattern" (where the pattern ends with a percent sign), or "do you have
> this? do you have this? do you have this?". Generally, database
> indexing is fairly efficient at handling prefix searches, so the first
> query will basically amount to an index search, which is a lot faster
> than the repeated separate searching; it takes advantage of the fact
> that all the strings you're looking at will have the same prefix.
>
> There is one critical consideration, though. What happens if the
> directory name contains an underscore or percent sign? Or can you
> absolutely guarantee that they won't? You may need to escape them, and
> I'm not sure how SQLite handles that. (Possibly \_ will match literal
> _, and \\ will match literal \, or something like that.)
I can guarantee that the directory names are all
'/var/spool/news/message.id/' then 3 digits. (The filenames are
pretty wild, since they are MIDs.) AIUI, using the '?' substitution
in the sqlite3 library is supposed to be safe.
> This is not bypassing the database's optimization; in fact, it's
> working tidily within it.
That's reassuring!
...
> But doing the set difference in Python is just as good a way of doing the job.
I like it. Thanks very much.
--
Specifications are for the weak & timid!
--- Klingon Programmer's Guide
More information about the Python-list
mailing list