memory consumption
Alexey
zen.supagood at gmail.com
Wed Mar 31 05:25:02 EDT 2021
Wednesday, March 31, 2021 at 05:45:27 UTC+3, cameron... at gmail.com:
> Since everyone is talking about vague OS memory use and not at all about
> working set size of Python objects, let me ...
> On 29Mar2021 03:12, Alexey <zen.su... at gmail.com> wrote:
> >I'm experiencing problems with memory consumption.
> >
> >I have a class which is doing an ETL job. What's happening inside:
> > - fetching existing objects from DB via SQLAlchemy
> Do you need to? Or do you only need to fetch their ids? Or do you only
> need to fetch a subset of the objects?
I really need all the objects because I'm performing update and create
operations. If I fetch them on the fly, it will take hours or even days
to complete.
> It is easy to accidentally suck in way too many db session entity
> objects, or at any rate, more than you need to.
> > - iterate over raw data
> Can you prescan the data to determine which objects you care about,
> reducing the number of objects you need to obtain?
In this case I still need to iterate over both the raw and the old data. As I said
before, if I try it without caching it'll take days.
> > - create new/update existing objects
> Depending on what you're doing, you may not need to "create new/update
> existing objects". You could collate changes and do an UPSERT (the
> incantation varies a little depending on the SQL dialect behind
> SQLAlchemy).
Good advice.
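For my own notes, a minimal sketch of that incantation with SQLAlchemy's
PostgreSQL dialect (the table and its "id"/"name" columns are made up, not
my real schema):

    from sqlalchemy.dialects.postgresql import insert

    def upsert_rows(session, table, rows):
        # rows: list of dicts, e.g. [{"id": 1, "name": "a"}, ...]
        stmt = insert(table).values(rows)
        stmt = stmt.on_conflict_do_update(
            index_elements=["id"],                  # conflict target
            set_={"name": stmt.excluded.name},      # columns to refresh on conflict
        )
        session.execute(stmt)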
> > - commit changes
>
> Do you discard the SQLAlchemy session after this? Otherwise it may lurk
> and hold onto the objects. Commit doesn't forget the objects.
I tried expire_all() and expunge_all(). Should I try rollback()?
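For reference, what I've tried plus what I understand by "discard" is
roughly this (simplified):

    session.commit()
    session.expire_all()     # mark loaded instances stale
    session.expunge_all()    # drop them from the identity map
    session.close()          # release the connection and per-session state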
> For my current client we have a script to import historic data from a
> legacy system. It has many of the issues you're dealing with: the naive
> (ORM) way consumes gads of memory, and can be very slow too (updating
> objects in an ad hoc manner tends to do individual UPDATE SQL commands,
> very latency laden).
>
> I wrote a generic batch UPSERT function which took an accrued list of
> changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The
> main script hands it the accrued updates and it runs batches (which lets
> us do progress reporting). Orders of magnitude faster, _and_ does not
> require storing the db objects.
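The batching part of that would presumably look something like this (chunk
size and names are illustrative, not Cameron's actual code):

    def batch_upsert(session, table, rows, size=10000):
        # run the upsert in fixed-size batches so progress can be reported
        for i in range(0, len(rows), size):
            batch = rows[i:i + size]
            upsert_rows(session, table, batch)   # the ON CONFLICT sketch above
            session.commit()
            print(f"upserted {i + len(batch)} of {len(rows)} rows")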
>
> On the subject of "fetching existing objects from DB via SQLAlchemy": you
> may not need to do that, either. Can you identify _which_ objects are of
> interest? Associated with the same script I've got a batch_select
> function: it takes an iterable of object ids and collects them in
> batches, where before we were really scanning the whole db because we
> had an arbitrary scattering of relevant object ids from the raw data.
I'll try to analyze if it's possible to rewrite code this way
> It basically collected ids into batches, and ran a SELECT...WHERE id in
> (batch-of-ids). It's really fast considering, and also scales _way_ down
> when the set of arbitrary ids is small.
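If I follow correctly, the shape of that is something like this (sketch,
with a made-up ORM model passed in):

    def batch_select(session, model, ids, size=1000):
        # fetch only the objects whose ids actually appear in the raw data
        ids = list(ids)
        for i in range(0, len(ids), size):
            batch = ids[i:i + size]
            for obj in session.query(model).filter(model.id.in_(batch)):
                yield obj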
>
> I'm happy to walk through the mechanics of these with you; the code at
> this end is Django's ORM, but I prefer SQLAlchemy anyway - the project
> dictated the ORM here.
> >Before processing the data I create an internal cache (dictionary) and store all existing objects in it.
> >Every 10000 items I do a bulk insert and flush. At the end I run commit.
> Yah. I suspect the session data are not being released. Also, SQLAlchemy
> may be caching sessions or something across runs, since this is a celery
> worker which survives from one task to the next.
I tried to dig in this direction. I created a few graphs with "objgraph",
but there are so many references under the hood. I'll try to measure the size
of the session object before and after building the cache.
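Roughly, I plan to do something like this (pympler is a third-party package,
and build_cache here is just a stand-in for my cache-building step):

    import objgraph
    from pympler import asizeof

    objgraph.show_most_common_types(limit=10)    # which types dominate the heap
    print("before:", asizeof.asizeof(session))   # deep size of the session
    cache = build_cache(session)                 # stand-in for my cache building
    print("after:", asizeof.asizeof(session))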
> You could try explicitly creating a new SQLAlchemy session around your
> task.
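I read that as something like the following (sketch; do_etl is a stand-in
for my task body, and the engine is created once at worker start-up):

    from sqlalchemy.orm import sessionmaker

    Session = sessionmaker(bind=engine)

    def run_task(raw_data):
        session = Session()        # fresh session per celery task
        try:
            do_etl(session, raw_data)
            session.commit()
        finally:
            session.close()        # nothing survives into the next task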
> >Problem. Before executing, my interpreter process weighs ~100Mb, after the first run memory increases up to 500Mb
> >and after the second run it weighs 1Gb. If I continue to run this class, memory won't increase, so I think
> >it's not a memory leak, but rather that Python won't release the allocated memory back to the OS. Maybe I'm wrong.
> I don't know enough about Python's "release OS memory" phase. But
> reducing the task memory footprint will help regardless.
Definitely. I'll think about it.
Thank you!