memory consumption
Alexey
zen.supagood at gmail.com
Wed Mar 31 05:25:02 EDT 2021
Wednesday, March 31, 2021 at 05:45:27 UTC+3, cameron... at gmail.com:
> Since everyone is talking about vague OS memory use and not at all about
> working set size of Python objects, let me ...
> On 29Mar2021 03:12, Alexey <zen.su... at gmail.com> wrote:
> >I'm experiencing problems with memory consumption.
> >
> >I have a class which is doing an ETL job. What's happening inside:
> > - fetching existing objects from DB via SQLAlchemy
> Do you need to? Or do you only need to fetch their ids? Or do you only
> need to fetch a subset of the objects?
I really need all the objects because I'm performing update and create
operations. If I fetch them on the fly, it will take hours or even days
to complete.
> It is easy to accidentally suck in way too many db session entity
> objects, or at any rate, more than you need to.
> > - iterate over raw data
> Can you prescan the data to determine which objects you care about,
> reducing the number of objects you need to obtain?
In this case I still need to iterate over both the raw and the old data. As I said
before, if I try it without caching it'll take days.
> > - create new/update existing objects
> Depending on what you're doing, you may not need to "create new/update
> existing objects". You could collate changes and do an UPSERT (the
> incantation varies a little depending on the SQL dialect behind
> SQLAlchemy).
Good advice.
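For my own notes, a minimal sketch of that incantation with SQLAlchemy's
PostgreSQL dialect (the table and its "id"/"name" columns are made up, not
my real schema):

    from sqlalchemy.dialects.postgresql import insert

    def upsert_rows(session, table, rows):
        # rows: list of dicts, e.g. [{"id": 1, "name": "a"}, ...]
        stmt = insert(table).values(rows)
        stmt = stmt.on_conflict_do_update(
            index_elements=["id"],                  # conflict target
            set_={"name": stmt.excluded.name},      # columns to refresh on conflict
        )
        session.execute(stmt)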
> > - commit changes
>
> Do you discard the SQLAlchemy session after this? Otherwise it may lurk
> and hold onto the objects. Commit doesn't forget the objects.
I tried expire_all() and expunge_all(). Should I try rollback()?
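For reference, what I've tried plus what I understand by "discard" is
roughly this (simplified):

    session.commit()
    session.expire_all()     # mark loaded instances stale
    session.expunge_all()    # drop them from the identity map
    session.close()          # release the connection and per-session state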
> For my current client we have a script to import historic data from a
> legacy system. It has many of the issues you're dealing with: the naive
> (ORM) way consumes gads of memory, and can be very slow too (updating
> objects in an ad hoc manner tends to do individual UPDATE SQL commands,
> very latency laden).
>
> I wrote a generic batch UPSERT function which took an accrued list of
> changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The
> main script hands it the accrued updates and it runs batches (which lets
> us do progress reporting). Orders of magnitude faster, _and_ does not
> require storing the db objects.
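The batching part of that would presumably look something like this (chunk
size and names are illustrative, not Cameron's actual code):

    def batch_upsert(session, table, rows, size=10000):
        # run the upsert in fixed-size batches so progress can be reported
        for i in range(0, len(rows), size):
            batch = rows[i:i + size]
            upsert_rows(session, table, batch)   # the ON CONFLICT sketch above
            session.commit()
            print(f"upserted {i + len(batch)} of {len(rows)} rows")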
>
> On the subject of "fetching existing objects from DB via SQLAlchemy": you
> may not need to do that, either. Can you identify _which_ objects are of
> interest? Associated with the same script I've got a batch_select
> function: it takes an iterable of object ids and collects them in
> batches, where before we were really scanning the whole db because we
> had an arbitrary scattering of relevant object ids from the raw data.
I'll try to analyze if it's possible to rewrite code this way
> It basically collected ids into batches, and ran a SELECT...WHERE id in
> (batch-of-ids). It's really fast considering, and also scales _way_ down
> when the set of arbitrary ids is small.
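If I follow correctly, the shape of that is something like this (sketch,
with a made-up ORM model passed in):

    def batch_select(session, model, ids, size=1000):
        # fetch only the objects whose ids actually appear in the raw data
        ids = list(ids)
        for i in range(0, len(ids), size):
            batch = ids[i:i + size]
            for obj in session.query(model).filter(model.id.in_(batch)):
                yield obj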
>
> I'm happy to walk through the mechanics of these with you; the code at
> this end is Django's ORM, but I prefer SQLAlchemy anyway - the project
> dictated the ORM here.
> >Before processing the data I create an internal cache (dictionary) and store all existing objects in it.
> >Every 10000 items I do a bulk insert and flush. At the end I run commit.
> Yah. I suspect the session data are not being released. Also, SQLAlchemy
> may be caching sessions or something across runs, since this is a celery
> worker which survives from one task to the next.
I tried to dig in this direction. I created a few graphs with "objgraph",
but there are so many references under the hood. I'll try to measure the size
of the session object before and after building the cache.
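Roughly, I plan to do something like this (pympler is a third-party package,
and build_cache here is just a stand-in for my cache-building step):

    import objgraph
    from pympler import asizeof

    objgraph.show_most_common_types(limit=10)    # which types dominate the heap
    print("before:", asizeof.asizeof(session))   # deep size of the session
    cache = build_cache(session)                 # stand-in for my cache building
    print("after:", asizeof.asizeof(session))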
> You could try explicitly creating a new SQLAlchemy session around your
> task.
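I read that as something like the following (sketch; do_etl is a stand-in
for my task body, and the engine is created once at worker start-up):

    from sqlalchemy.orm import sessionmaker

    Session = sessionmaker(bind=engine)

    def run_task(raw_data):
        session = Session()        # fresh session per celery task
        try:
            do_etl(session, raw_data)
            session.commit()
        finally:
            session.close()        # nothing survives into the next task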
> >Problem. Before executing, my interpreter process weighs ~100Mb, after the first run memory increases up to 500Mb
> >and after the second run it weighs 1Gb. If I continue to run this class, memory won't increase, so I think
> >it's not a memory leak, but rather that Python won't release the allocated memory back to the OS. Maybe I'm wrong.
> I don't know enough about Python's "release OS memory" phase. But
> reducing the task memory footprint will help regardless.
Definitely. I'll think about it.
Thank you!