Fast lookup of bulky "table"
Edmondo Giovannozzi
edmondo.giovannozzi at gmail.com
Mon Jan 16 13:18:57 EST 2023
On Sunday, January 15, 2023 at 05:26:50 UTC+1, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now
> that the point is made - I need to make it a little more performant (to
> be honest, chances are that someone else will pick up from where I left
> off, and implement the same service from scratch in a different language
> (GoLang? .Net? Java?) but I am digressing).
>
> Anyway, my Flask service initializes by loading a big "table" of 100k
> rows and 40 columns or so (memory footprint: order of 300 MB) and then
> accepts queries through a REST endpoint. Columns are strings, enums, and
> numbers. Once initialized, the table is read only. The endpoint will
> parse the query and match it against column values (equality,
> inequality, greater than, etc.). Finally, it will return a (JSON) list of
> all rows that satisfy all conditions in the query.
>
> As you can imagine, this is not very performant in its current form, but
> performance was not the point of the PoC - at least initially.
>
> Before I deliver the PoC to a more experienced software architect who
> will look at my code, though, I wouldn't mind looking a bit less lame
> and do something about performance in my own code first, possibly by
> bringing the average time for queries down from where it is now (order
> of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on
> average.
>
> To be honest, I was already able to bring the time down to a handful of
> microseconds thanks to a rudimentary cache that will associate the
> "signature" of a query to its result, and serve it the next time the
> same query is received, but this may not be good enough: 1) queries
> might be many and very different from one another each time, AND 2) I am
> not sure the server will have a ton of RAM if/when this thing - or
> whatever is derived from it - is placed into production.
>
> How can I make my queries generally more performant, ideally also in
> case of a new query?
>
> Here's what I have been considering:
>
> 1. making my cache more "modular", i.e. cache the result of certain
> (wide) queries. When a complex query comes in, I may be able to restrict
> my search to a subset of the rows (as determined by a previously cached
> partial query). This should keep the memory footprint under control.
>
> 2. Load my data into a numpy.array and use numpy.array operations to
> slice and dice my data.
>
> 3. load my data into sqlite3 and use a SELECT statement to query my table.
> I have never used sqlite, plus there's some extra complexity as
> comparing certain columns requires custom logic, but I wonder if this
> architecture would also work well when dealing with a 300 MB database.
>
> 4. Other ideas?
>
> Hopefully I made sense. Thank you for your attention
>
> Dino
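On option 3, sqlite3 can keep the whole table in RAM and do the filtering in C. A minimal sketch, assuming the rows are loaded once at startup into an in-memory database (the table name, column names and sample data below are only placeholders for the real 40 columns):

import sqlite3

# Placeholder data standing in for the real 100k-row, 40-column table
records = [
    ("alpha", "open", 10.5),
    ("beta", "closed", 3.2),
    ("gamma", "open", 7.8),
]

conn = sqlite3.connect(":memory:")   # keeps everything in RAM, no file needed
conn.execute("CREATE TABLE big_table (name TEXT, status TEXT, price REAL)")
conn.executemany("INSERT INTO big_table VALUES (?, ?, ?)", records)
conn.execute("CREATE INDEX idx_price ON big_table (price)")  # index the filtered columns
conn.commit()

def query(min_price, status):
    # SQLite evaluates the WHERE clause in C; the index avoids a full scan on price
    cur = conn.execute(
        "SELECT * FROM big_table WHERE price > ? AND status = ?",
        (min_price, status),
    )
    return cur.fetchall()

print(query(5.0, "open"))   # -> [('alpha', 'open', 10.5), ('gamma', 'open', 7.8)]

The custom comparison logic you mention can be registered with conn.create_function() and then used directly inside the WHERE clause.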
As a comparison with numpy, given the following lines:
import numpy as np
a = np.random.randn(400, 100_000)   # 400 x 100,000 table of random floats (~320 MB)
ia = np.argsort(a[0, :])            # sort order of the first row
a_elem = a[56, ia[0]]               # pick one element of the table
I have just taken an element at random from a numeric table of 400 x 100,000 elements.
To find it with numpy:
%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As the data are not sorted, it searches element by element, but at C level.
Of course, it depends on a lot of things...
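The same boolean-mask idea extends to the multi-column queries in your post: each condition gives one mask over the 100,000 rows, the masks are combined with &, and the whole filter runs in a single pass at C level. A rough sketch (the column indices and thresholds are made up; since the real table mixes strings and numbers, you would keep one array per column, or use a structured array):

import numpy as np

a = np.random.randn(400, 100_000)   # same 400 x 100,000 table as above

# One boolean mask per condition over the 100,000 "rows" (axis 1);
# & keeps only the rows that satisfy all conditions.
isel = (a[0, :] > 0.5) & (a[1, :] <= 0.0) & (a[2, :] != 1.0)

matching = a[:, isel]               # every column of the matching rows
print(matching.shape)               # (400, number_of_matching_rows)

Each mask touches only 100,000 elements instead of the whole 400 x 100,000 array, so it should be considerably cheaper than the full comparison timed above.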