In order to report usage on our PBSPro compute cluster at work I wrote a simple set of Python scripts to dump the accounting information into a MySQL database. This has been working fine for the last year, churning out reports every month.
This week I had cause to generate some statistics aggregated across the whole three years of data in the database. I'm using a mixture of Elixir and SQLAlchemy to talk to the database. Normally I would do something like this:
mybigtablequery = MyBigTable.query()

for job in mybigtablequery:
    if job.attribute == "thing":
        dosomething()
This worked fine when the database was quite small, but I was horrified to see that as the loop went on my Python process used more and more memory, because the database connection object never throws away a row once it has been loaded. Fortunately I found an answer on Stack Overflow.
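(As an aside, plain SQLAlchemy also has Query.yield_per(), which fetches rows in chunks rather than all at once. Something like the sketch below should work, assuming a SQLAlchemy Session called "session" and a MySQL driver that can stream results instead of buffering them client-side, though I haven't tested it against the Elixir layer.)

# Untested sketch: fetch rows in chunks with yield_per() instead of loading everything.
# Assumes a SQLAlchemy Session named "session" and a driver with server-side
# cursor support (the stream_results execution option).
bigquery = (
    session.query(MyBigTable)
    .execution_options(stream_results=True)  # don't buffer the whole result client-side
    .yield_per(10000)                         # hand rows to Python 10000 at a time
)

for job in bigquery:
    if job.attribute == "thing":
        dosomething()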
So I ended up doing the following:
def batch_query(query, batch=10000):
    offset = 0
    while True:
        r = False
        # Fetch one window of rows; r records whether this window returned anything.
        for elem in query.limit(batch).offset(offset):
            r = True
            yield elem
        offset += batch
        # An empty window means we have run out of rows.
        if not r:
            break

mybigtablequery = MyBigTable.query()

for job in batch_query(mybigtablequery, 50000):
    if job.attribute == "thing":
        dosomething()
"batch" is just an integer defining how many rows are fetched by each query. The larger it is, the more memory the Python interpreter will use, but the fewer queries have to be sent to the database, so the loop runs more efficiently.
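One caveat with this LIMIT/OFFSET approach: unless the query has an explicit ORDER BY, the database is free to return rows in a different order for each batch, which can skip or duplicate rows between windows. A small, untested tweak, assuming MyBigTable has an integer primary key column called id (hypothetical name):

# Order the batched query by the primary key so successive
# LIMIT/OFFSET windows walk the rows in a stable order.
mybigtablequery = MyBigTable.query().order_by(MyBigTable.id)

for job in batch_query(mybigtablequery, 50000):
    if job.attribute == "thing":
        dosomething()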