Fixing and improving statement caching in sqlite3

Hi Python-Ideas, I was browsing the source code to the sqlite3 standard library module (as one does over the Christmas holidays) and was surprised to find that the eviction mechanism for cached prepared statements is fairly broken ... and has been for the last 15 years! This is just from reading the source code, so hopefully I've not made a fool of myself by misreading it. I've reported the problem to the upstream pysqlite repository: https://github.com/ghaering/pysqlite/issues/141 This got me thinking about potential fixes and improvements, which I thought might be worth discussing here, especially as maybe not all changes from that repo are still merged into mainline Python. 1. My first suggestion is to replace the odd cache logic with a straightforward classic LRU cache. I'm hoping this is uncontroversial, but I can imagine there might be programs that somehow depend on the current behaviour. But my suspicion is that many more suffer and just no one looked hard enough to notice. 2. My next suggestion is to allow disabling the cache (removing it entirely isn't an option for backwards compatibility) and instead cache one statement per cursor so long as consecutive calls to `execute()` or `executemany()` use identical SQL. This neatly avoids changing the interface while still allowing devs complete control over how they cache statements: you can put them in your own LRU cache structure, or cache specific ones by reusing just those cursors, or do anything else that you like. This API would also cleanly mirror how prepared statements work in the underlying SQLite C library. My first thought was to have a separate class representing a prepared statement, and allow that to be used as a factory for creating cursors; but SQLite prepared statements can only be executed and iterated once at a time before being reused, so using the Cursor object itself for caching the statement makes more sense. 3. My final suggestion is a small backwards-compatible change to the interface to make the previous idea particularly clear and easy to use: allow the SQL parameter to `execute()` and `executemany()` to be omitted (or None) to mean the same query as last used on that cursor, which guarantees it will use the cached prepared statement. Even better, also allow SQL to be passed in when creating the cursor. This leads to particularly straightforward statement reuse code: conn = sqlite3.connect(path_to_db, isolation_level=None) cur = conn.cursor(sql="SELECT * FROM Foo WHERE i > ?") print(cur.execute(parameters=(3,)).fetchall()) for row in cur.execute(parameters=(2,)): print(row) What do you think? Thanks, James

On Thu, Dec 31, 2020 at 8:43 AM James Oldfield <james.oldfield@cantab.net> wrote:
1. My first suggestion is to replace the odd cache logic with a straightforward classic LRU cache. I'm hoping this is uncontroversial, but I can imagine there might be programs that somehow depend on the current behaviour. But my suspicion is that many more suffer and just no one looked hard enough to notice.
LRU has worst scenario too. And implementing LRU requires some memory. For example, functools.lru_cache() uses doubly-linked list internally. There are some LRU-like algorithms, like clock or double chance. But all algorithms have worst scenario anyway. Another option is a random eviction algorithm. It chose random entry to evict. * Frequently used keys have a high chance for cache hit, like LFU (least frequently used). * When frequently used keys are changed in time, old frequently used keys are eventually evicted. (behaves like least recently frequently used?) * Minimum overhead for managing keys. So I think a random eviction algorithm is better than LRU. Off topic: Can we add random_cache to functools, like lru_cache? Regards, -- Inada Naoki <songofacandy@gmail.com>
participants (2)
-
Inada Naoki
-
James Oldfield