memory consumption
Cameron Simpson
cs at cskk.id.au
Tue Mar 30 22:45:07 EDT 2021
Since everyone is talking about vague OS memory use and not at all about
working set size of Python objects, let me ...
On 29Mar2021 03:12, Alexey <zen.supagood at gmail.com> wrote:
>I'm experiencing problems with memory consumption.
>
>I have a class which is doing ETL job. What`s happening inside:
> - fetching existing objects from DB via SQLAchemy
Do you need to? Or do you only need to fetch their ids? Or do you only
need to fetch a subset of the objects?
It is easy to accidentally suck in way too many db session entity
objects, or at any rate, more than you need to.
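Roughly, fetching only the ids looks like this (a sketch only - the
table and column names are invented for illustration, and it uses the
SQLAlchemy 1.4+ select() style):

    from sqlalchemy import create_engine, select, MetaData, Table, Column, Integer, String

    engine = create_engine("postgresql:///mydb")   # URL is a placeholder
    metadata = MetaData()
    items = Table(
        "items", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String),
    )

    with engine.connect() as conn:
        # Selecting just the id column avoids building (and caching) full
        # ORM entities in a session - only plain integers come back.
        existing_ids = {row.id for row in conn.execute(select(items.c.id))}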
> - iterate over raw data
Can you prescan the data to determine which objects you care about,
reducing the number of objects you need to obtain?
> - create new/update existing objects
Depending on what you're doing, you may not need to "create new/update
existing objects". You could collate changes and do an UPSERT (the
incantation varies a little depending on the SQL dialect behind
SQLAlchemy).
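With SQLAlchemy's PostgreSQL dialect the incantation looks roughly like
this (a sketch, reusing the invented "items" table from above):

    from sqlalchemy.dialects.postgresql import insert

    def upsert_items(conn, rows):
        # rows is a list of dicts like {"id": 1, "name": "widget"}
        stmt = insert(items).values(rows)
        # On a primary-key collision, update the existing row in place
        # instead of raising an IntegrityError.
        stmt = stmt.on_conflict_do_update(
            index_elements=[items.c.id],
            set_={"name": stmt.excluded.name},
        )
        conn.execute(stmt)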
> - commit changes
Do you discard the SQLAlchemy session after this? Otherwise it may lurk
and hold onto the objects. Commit doesn't forget the objects.
For my current client we have a script to import historic data from a
legacy system. It has many of the issues you're dealing with: the naive
(ORM) way consumes gads of memory, and can be very slow too (updating
objects in an ad hoc manner tends to do individual UPDATE SQL commands,
very latency laden).
I wrote a generic batch UPSERT function which took an accrued list of
changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The
main script hands it the accrued updates and it runs batches (which lets
us do progress reporting). Orders of magnitude faster, _and_ does not
require storing the db objects.
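The general shape is just a chunker wrapped around the UPSERT sketch
above (this is not the real function, only the idea):

    from itertools import islice

    def batched(iterable, size):
        # Yield lists of up to `size` items from `iterable`.
        it = iter(iterable)
        while True:
            batch = list(islice(it, size))
            if not batch:
                return
            yield batch

    def upsert_in_batches(conn, rows, batch_size=10000):
        for n, batch in enumerate(batched(rows, batch_size), 1):
            upsert_items(conn, batch)   # the UPSERT sketch above
            print("upserted batch", n, "-", len(batch), "rows")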
On the subject of "fetching existing objects from DB via SQLAlchemy": you
may not need to do that, either. Can you identify _which_ objects are of
interest? Associated with the same script I've got a batch_select
function: it takes an iterable of object ids and collects them in
batches, where before we were really scanning the whole db because we
had an arbitrary scattering of relevant object ids from the raw data.
It basically collects ids into batches and runs a SELECT...WHERE id IN
(batch-of-ids). It's really fast considering, and also scales _way_ down
when the set of arbitrary ids is small.
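In sketch form (not the real batch_select; it reuses the invented
"items" table, the select() import and the batched() chunker above):

    def select_by_ids(conn, ids, batch_size=1000):
        # Fetch rows for an arbitrary set of ids in chunks, instead of
        # scanning the whole table or issuing one SELECT per id.
        for batch in batched(ids, batch_size):
            result = conn.execute(select(items).where(items.c.id.in_(batch)))
            yield from result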
I'm happy to walk through the mechanics of these with you; the code at
this end is Django's ORM, but I prefer SQLAlchemy anyway - the project
dictated the ORM here.
>Before processing data I create internal cache(dictionary) and store all existing objects in it.
>Every 10000 items I do bulk insert and flush. At the end I run commit command.
Yah. I suspect the session data are not being released. Also, SQLAlchemy
may be caching sessions or something across runs, since this is a celery
worker which survives from one task to the next.
You could try explicitly creating a new SQLAlchemy session around your
task.
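Something along these lines (the Celery app and engine URL are
placeholders; Session as a context manager needs SQLAlchemy 1.4+):

    from celery import Celery
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    app = Celery("tasks")                          # broker config omitted
    engine = create_engine("postgresql:///mydb")   # URL is a placeholder
    SessionFactory = sessionmaker(bind=engine)

    @app.task
    def etl_task(raw_data):
        # A fresh session per task run: its identity map is discarded
        # when the `with` block exits, instead of lingering in the
        # long-lived worker process.
        with SessionFactory() as session:
            # ... fetch, create/update, bulk insert every 10000 items ...
            session.commit()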
>Problem. Before executing, my interpreter process weighs ~100Mb, after first run memory increases up to 500Mb
>and after second run it weighs 1Gb. If I will continue to run this class, memory wont increase, so I think
>it's not a memory leak, but rather Python wont release allocated memory back to OS. Maybe I'm wrong.
I don't know enough about Python's "release OS memory" phase. But
reducing the task memory footprint will help regardless.
Cheers,
Cameron Simpson <cs at cskk.id.au>