Searching for lots of similar strings (filenames) in sqlite3 database

Chris Angelico rosuav at gmail.com
Tue Jul 1 17:57:21 CEST 2014


On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk <a24061 at ducksburg.com> wrote:
> On 2014-07-01, Chris Angelico wrote:
>
>> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>>>         cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))
>>
>> Shouldn't this be an equality check rather than IS, which normally I'd
>> expect to be "IS NULL" or "IS NOT NULL"?
>
> Oh, it probably should be in "heavy" SQL.  In SQLite, '==', '=', &
> 'IS' are interchangeable.
>
> http://www.tutorialspoint.com/sqlite/sqlite_operators.htm

Ah, okay. In that case, I'd advise going with either == for
consistency with the rest of Python, or (preferably) = for consistency
with other SQL engines. You wouldn't use "is" to test if two Python
strings are equal, so there's no particular reason to use it here :)

> Oh, even better:
>
>     add_files = listing - known_files
>     delete_files = known_files - listing
>
> and then I can remove files that have disappeared off the spool from
> the table.  Thanks very much!

Ah! Didn't know that was a valuable feature for you, but getting that
"for free" is an extra little bonus, so that's awesome!

>> There is one critical consideration, though. What happens if the
>> directory name contains an underscore or percent sign? Or can you
>> absolutely guarantee that they won't? You may need to escape them, and
>> I'm not sure how SQLite handles that. (Possibly \_ will match literal
>> _, and \\ will match literal \, or something like that.)
>
> I can guarantee that the directory names are all
> '/var/spool/news/message.id/' then 3 digits.  (The filenames are
> pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
> in the sqlite3 library is supposed to be safe.

This is nothing to do with question-mark substitution. There are two
separate levels of character significance here - it's like a quoted
string with a regex. Suppose you want to make a regex that searches
for an apostrophe. If you try to define that in a single-quoted
string, you need to escape it:

regex = '^\'$'

However, if you ask the user to enter a regex, that wouldn't be necessary:

regex = input("Enter a pattern: ") # raw_input in Python 2
Enter a pattern: ^'$

This is what the question mark substitution is like - it avoids the
need to carefully manage string delimiters and so on. However, if you
want to make a regex that searches for a backslash, then you need to
escape it, because the backslash is important to the regex itself. In
the same way, the underscore and percent sign are significant to the
LIKE operator. If it were possible to have a directory name with a
percent sign in it, it would match far too much - because you'd
construct a LIKE pattern something like (ahem)
"/var/spool/news/message%20id/142/%" - and as you can see, the percent
sign at the end is no different from the percent sign in the middle.

But you're safe because you know your data, unrelated to your
substitution method. Possibly merits a comment... but possibly not
worth it.

ChrisA



More information about the Python-list mailing list