Fast lookup of bulky "table"
Thomas Passin
list1 at tompassin.net
Sun Jan 15 10:38:22 EST 2023
On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
> On 2023-01-14 23:26:27 -0500, Dino wrote:
>> Hello, I have built a PoC service in Python Flask for my work, and - now
>> that the point is made - I need to make it a little more performant (to be
>> honest, chances are that someone else will pick up from where I left off,
>> and implement the same service from scratch in a different language (GoLang?
>> .Net? Java?) but I am digressing).
>>
>> Anyway, my Flask service initializes by loading a big "table" of 100k rows
>> and 40 columns or so (memory footprint: order of 300 Mb)
>
> 300 MB is large enough that you should at least consider putting that
> into a database (Sqlite is probably simplest. Personally I would go with
> PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
> outlier).
>
> The main reason for putting it into a database is the ability to use
> indexes, so you don't have to scan all 100 k rows for each query.
I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too
had performance problems; adding indexes and smarter joins helped, but
not enough.
I changed the program so that at startup it imports much of the data
into Python dictionaries that are structured to support the kinds of
queries that need the help. Response time to queries dropped
dramatically. Some kinds of queries needed more help, so I collected
auxiliary sets of (usually highly pre-processed) data into ordinary
files, and those too get imported into dictionaries during startup.
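Here is a rough sketch of the kind of startup import I mean, using
sqlite3 from the standard library for illustration (my database is
MySQL); the "parts" table and "part_number" column are just invented
for the example:

import sqlite3

def load_lookup_table(db_path="service.db"):
    # Build a dictionary keyed on the column the queries actually
    # search by, so each lookup is a single hash probe instead of a
    # table scan.
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    by_part_number = {}
    for row in conn.execute("SELECT * FROM parts"):
        by_part_number[row["part_number"]] = dict(row)
    conn.close()
    return by_part_number

# At query time the handler does a dictionary lookup instead of SQL:
#     record = lookup.get(requested_part_number)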
Note that these dictionaries do not always match the table structures.
Some of them change the structure to make queries easier to process. You
may be able to do that with Python code, or by creating SQL views in the
database and importing directly from the views (database views take
almost no database memory).
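As a sketch of that kind of reshaping, here a hypothetical "orders"
table gets turned into a dict of lists keyed by customer id, which fits
an "all orders for this customer" query much better than the
row-per-order table layout does:

from collections import defaultdict

def load_orders_by_customer(conn):
    # Reshape row-per-order data into {customer_id: [(order_id, total), ...]}
    orders_by_customer = defaultdict(list)
    for cust_id, order_id, total in conn.execute(
            "SELECT customer_id, order_id, total FROM orders"):
        orders_by_customer[cust_id].append((order_id, total))
    return dict(orders_by_customer)

# Or push the reshaping into the database as a view and import from
# that instead (views take almost no storage):
#     CREATE VIEW customer_totals AS
#         SELECT customer_id, SUM(total) AS grand_total
#         FROM orders GROUP BY customer_id;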
The drawback is that all that data is now stored in memory while the
program is running. In my case, many hundreds of MB. But if that would
be too much memory for you (you would need to prototype it to know),
you should let the database engine do the work. It is more highly
optimized and efficient for searches than your code could ever be. But
there will be a price to pay. The price is denormalizing the database
table design: including redundant data, organized to match the kinds of
queries that will be made. No more 3rd normal form! Your SQL queries
will need to be designed to take advantage of this new structure. This
is a cost because the database will be larger, but also because the
redundancies will make it much harder to update the data correctly.
Fortunately you do not need to do that during normal operation (my
program's data was also static, like yours).
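To make that concrete, here is the flavor of thing I mean: one wide,
denormalized table with redundant columns copied in from what would
normally be separate tables, plus an index matching the columns the
service actually filters on. All names are invented, and sqlite3 is
again used just for illustration:

import sqlite3

conn = sqlite3.connect("service.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS part_lookup (
    part_number   TEXT,
    supplier_name TEXT,   -- redundant copy from a suppliers table
    warehouse     TEXT,   -- redundant copy from a locations table
    price         REAL,
    description   TEXT
);
CREATE INDEX IF NOT EXISTS idx_part_lookup
    ON part_lookup (part_number, warehouse);
""")

# Queries go against the one wide table, so a lookup hits the index
# and never needs a join:
row = conn.execute(
    "SELECT * FROM part_lookup WHERE part_number = ? AND warehouse = ?",
    ("ABC-123", "east")).fetchone()
conn.close()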
PostgreSQL would probably be a better choice than SQLite here, since it
enforces features such as foreign keys more strictly and has a built-in
function (stored procedure) definition capability.