To a child, a door is just a door, a hinge is just what holds it up, and the difference between brass and copper is less important than whether you’re going to get ice cream afterwards.
Archive for September, 2010
We’ve just deployed the Fedora PackageDB 0.5.4 to production. This is primarily a bugfix release but thanks to Frank Chiulli we have a few user visible changes. The package source links now point to the git repositories instead of the old cvs repos (where the web interface was broken) and the Package Acl pages no longer displays EOL releases by default.
This has been a couple weeks of releases starting with Mike McGrath spearheading a new Fedora Account System release with the invite-only group feature from Jason Tibbitts and an SQLAlchemy-0.5 port, getting a new version of kitchen out the door, and now the PackageDB. Getting everything that infrastructure depends on updated before the Fedora beta release puts us into a change freeze I suppose .
This is just a quick post since I realized this morning that not every infrastructure developer knows how to optimize their database calls but it’s something that they need to be aware of so they can scale out a web application.
SQLAlchemy and other Object Relational Mappers provide two new ways for developers to view their interactions with a database.
- They can see it as transparently persisting data in an object into a database.
- They can look at data in a database as though they were simply objects in their programming language.
These two views are very powerful for quickly writing code that touches the database. However, there is a drawback — the ORM imposes a certain amount of overhead to accessing the data. When you’re dealing with bulk selects of large amounts of data, this overhead can add up unacceptably. When that happens, one of the ways to fix it is to get rid of the overhead of the ORM by dropping down to the SQL level. This may seem intimidating at first but if you know SQL then it’s actually quite straightforward:
- Identify what the current code does and how it formats the data being output. If you have other apps using the json data returned from the URL, be sure to look at the json data to see that you preserve the same data structure.
- Write an sql query that retrieves all of the information needed to reproduce that data structure (hopefully in one query to the db). You can do this directly against the database. In general, it’s okay to duplicate data in your query if you can use that to make less queries (for instance, retrieving the same information about a person for every address that they live at rather than making a query for the people data and another query for the address data). The slow down in querying the db on a LAN is usually in the number of queries that you make rather than the size of the data that is returned.
- Translate that query into SQLAlchemy calls. This can be the tricky part as the python method calls aren’t a simple one to one mapping to SQL. They attempt to be more intelligent than that, doing things like figuring out
from clausesby what rows you request and the columns to perform joins on by where the foreign key relations are. When SQLAlchemy gets these right, your code may be smaller than the original SQL query. When SQLAlchemy gets these wrong, you have to figure out what the syntax is for overriding the default values.
- Do any filtering and reformatting of the data to match the old API in python afterwards. Usually this means looping through your rows of data and constructing a nested structure of dicts that hold the information; deduplicating the data as you go.
There are two reasons that dropping to the raw SQL level ends up being faster in most cases.
- Less queries to the database. What you’re doing here is eliminating the latency of querying the database. Querying the database and waiting for the response from it take time. Even if it’s small, that time can add up if you have to do it 10,000 times. The network connection between your application and the database likely has enough badwidth to be a little wasteful of how much data you query if it will save you the latency of making those extra 9,999 queries.
- The ORM layer has its own overhead. Although SQLAlchemy is highly optimised, the ORM does impose a performance penalty to give you the easy abstraction of using objects to refer to data in the database. Removing that penalty for each of the 10,000 objects that you were processing before can be a large win.
Those of you who hang out in the same IRC channels as me may have heard me mention the python module I’ve been working on with some other Fedora python programmers. For those who haven’t, the 0.2.1a1 release seems like the perfect time for me to invade the blogosphere!
Kitchen is a module that aims to collect small, useful pieces of code into one python package for install on your machine. It’s a kind of a library of miscellaneous, small python functions. Why is that something special? Well, what those of us working on kitchen were realizing is that small pieces of code are a strange beast. They’re so small that it feels like the overhead of creating a python package just for them will result in a
setup.py that’s larger than the module. On the other hand, the code is so useful that you end up reimplementing it in every project that you work on. And of course, once you start cutting and pasting and reimplementing between projects you have the problem of keeping those copies in sync; making sure that bugs that were fixed in the copy in one project are fixed in all of them.
What’s needed is to have one larger umbrella package that pulls all of those functions together. Then you can fix problems in a single place and also only have the overhead of writing setup.py files and making releases once for all of those functions. This is exactly what kitchen is.
So what do you get with the 0.2.1-alpha1 release? You get a bunch of code that originated in yum, python-fedora, a few snippets from the ActiveState Python Recipe Collection, and some backports from python-2.7 for earlier python. You get API documentation and a few tutorials on using kitchen to ease your programming burden.
What can you do with these shiny new tools? Let’s take a really brief tour of the modules included in the release:
- Functions and objects in here will help you setup gettext for internationalizing your application’s messages
- Have you ever been stumped by the difference between
strtypes in python? how many times have you written your own
to_bytesfunctions? Well, you can stop cutting and pasting that because this kitchen module has both of those functions builtin for you! As a bonus, we also throw in a
- Currently this only provides a dictionary implementation that keeps strict separation between str and unicode keys. ie: ‘a’ and u’a’ are different keys.
- Contains two functions at the moment:
- Currently has one function for assembling a PEP-386 compliant version string from a set of tuples (format of which is also in PEP-386)
- Compatibility modules for things added in Python-2.4, Python-2.5, and Python-2.7.
- The compatibility modules implement functionality that was merged into python at a later version than you might have. So if you need defaultdict that appeared in python-2.5, it’s here. The nice thing about these modules is that we take care to import from the python standard library if it’s available and only use the copied compat library if it’s not available.
As with any open source project, this is an evolving code base. We’re doing our best to have a long deprecation cycle for functions that we remove and clearly document how to replace the functionality with other functions in kitchen. If you have a favourite useful piece of code that you’d like to see merged, feel free to send us a message on the mailing list or open up a ticket in our trac instance.