This is just a quick post. I realized this morning that not every infrastructure developer knows how to optimize their database calls, but it's something they need to be aware of in order to scale out a web application.
SQLAlchemy and other Object Relational Mappers give developers two new ways to view their interactions with a database:
- They can see it as transparently persisting data in an object into a database.
- They can look at data in a database as though they were simply objects in their programming language.
These two views are very powerful for quickly writing code that touches the database. However, there is a drawback — the ORM imposes a certain amount of overhead to accessing the data. When you’re dealing with bulk selects of large amounts of data, this overhead can add up unacceptably. When that happens, one of the ways to fix it is to get rid of the overhead of the ORM by dropping down to the SQL level. This may seem intimidating at first but if you know SQL then it’s actually quite straightforward:
- Identify what the current code does and how it formats the data it outputs. If other apps consume the JSON returned from the URL, examine that JSON to make sure you preserve the same data structure.
- Write an SQL query that retrieves all of the information needed to reproduce that data structure (hopefully in one query to the database). You can do this directly against the database. In general, it's okay to duplicate data in your query if that lets you make fewer queries (for instance, retrieving the same information about a person for every address they live at rather than making one query for the people data and another for the address data). The slowdown in querying a database on a LAN is usually in the number of queries that you make rather than the size of the data that is returned.
- Translate that query into SQLAlchemy calls. This can be the tricky part, as the Python method calls aren't a simple one-to-one mapping to SQL. They attempt to be more intelligent than that, doing things like deriving the FROM clauses from the rows you request and picking the columns to join on from where the foreign key relations are. When SQLAlchemy gets these right, your code may be smaller than the original SQL query. When SQLAlchemy gets these wrong, you have to figure out the syntax for overriding the default values.
- Do any filtering and reformatting of the data to match the old API in Python afterwards. Usually this means looping through your rows of data and constructing a nested structure of dicts that holds the information, deduplicating the data as you go.
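The last step might look like this sketch, where each row of the single joined query repeats the person columns once per address (the column layout here is hypothetical):

```python
# Hypothetical rows from one joined query: person columns are repeated
# once per address, so we deduplicate while building the nested structure.
rows = [
    # (person_id, name, address_id, street)
    (1, "Alice", 10, "12 Oak St"),
    (1, "Alice", 11, "3 Elm Ave"),
    (2, "Bob",   12, "77 Pine Rd"),
]

people = {}
for person_id, name, address_id, street in rows:
    # Only create each person dict once, even though that person's
    # columns appear on every one of their address rows.
    person = people.setdefault(person_id, {"name": name, "addresses": []})
    person["addresses"].append({"id": address_id, "street": street})

# people now holds the nested structure the old API returned.
```

One pass over the result set, no extra queries, and the dedup cost is a dict lookup per row.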
There are two reasons that dropping to the raw SQL level ends up being faster in most cases.
- Fewer queries to the database. What you're doing here is eliminating the latency of querying the database. Querying the database and waiting for the response takes time. Even if it's small, that time adds up if you have to do it 10,000 times. The network connection between your application and the database likely has enough bandwidth that you can be a little wasteful in how much data you query if it saves you the latency of making those extra 9,999 queries.
- The ORM layer has its own overhead. Although SQLAlchemy is highly optimized, the ORM does impose a performance penalty to give you the easy abstraction of using objects to refer to data in the database. Removing that penalty for each of the 10,000 objects that you were processing before can be a large win.
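As a sketch of what dropping below the ORM looks like, here is a Core-level query executed directly against a connection; the in-memory SQLite database, table, and column names are stand-ins for the real schema:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the real database (illustration only).
engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE people (id INTEGER, name TEXT)"))
    conn.execute(text("INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob')"))

    # One Core-level query returns lightweight row tuples -- no identity
    # map, no attribute instrumentation, no unit-of-work bookkeeping.
    names = [row[0] for row in
             conn.execute(text("SELECT name FROM people ORDER BY id"))]
```

You get plain rows back rather than mapped objects, which is exactly where the per-object savings come from.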
I prefer a hybrid approach: write queries using the ORM, but use eager loading.
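A minimal sketch of that hybrid approach, using hypothetical Person/Address models against an in-memory SQLite database (the imports assume a reasonably recent SQLAlchemy):

```python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import (declarative_base, relationship,
                            sessionmaker, joinedload)

Base = declarative_base()

# Hypothetical models; the real schema will differ.
class Person(Base):
    __tablename__ = "people"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship("Address")

class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("people.id"))
    street = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Person(name="Alice", addresses=[Address(street="12 Oak St")]))
session.commit()

# joinedload folds the addresses into the initial SELECT via a JOIN,
# so touching person.addresses afterwards issues no further queries.
people = session.query(Person).options(joinedload(Person.addresses)).all()
```

Without the `joinedload` option, each access to `person.addresses` would lazily issue its own query.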
This solves what I call the 1+N problem, where you issue a query, then N further queries, one per row of the initial query result (fetching attributes of the result objects).
With eager loading, SQLAlchemy can take the first query, and turn it into something that prefetches an entire graph of objects into RAM.
That way you have a single trip to the database, all of your database access stays in Python and the ORM, and you're using the ORM to do the grunt work of reconstructing objects with attributes from SQL results.
eagerload has proven to be an inconsistent performance enhancer for me. Some queries ended up faster; others took much more time. My theory is that eagerload fixes the latency problem when loading 10,000 records would otherwise take 10,000 lazy-loading queries, but when a large number of tables, each with a large amount of data, are involved, eager loading ends up computing a cross product that's very, very large. A handwritten SQL query can avoid some of those joins and yield a much faster result.
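That cross-product blowup is easy to demonstrate with plain SQL: joining two independent one-to-many collections for the same parent row multiplies their row counts (hypothetical schema, stdlib sqlite3 used just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (person_id INTEGER)")
conn.execute("CREATE TABLE phones (person_id INTEGER)")
# One person with 10 addresses and 10 phone numbers.
conn.executemany("INSERT INTO addresses VALUES (?)", [(1,)] * 10)
conn.executemany("INSERT INTO phones VALUES (?)", [(1,)] * 10)

# Eagerly loading both collections in a single joined query multiplies
# the collections: 10 addresses x 10 phones = 100 rows for one person.
rows = conn.execute(
    "SELECT * FROM addresses a JOIN phones p ON a.person_id = p.person_id"
).fetchall()
```

With more sibling collections or bigger ones, the multiplier grows quickly, which matches the slowdowns described above.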