Security FAD

All packed up and waiting for my plane to Raleigh. Going there to work on enabling two-factor authentication for the hosts that give shell access inside of Fedora’s Infrastructure. For the first round, I think we’re planning on going for simple and minimal to show what we can do. Briefly, the simplest, most minimal setup is:

* Server to verify a one time password (we already have one for yubikeys)
* CGI to take a username, password, and otp to verify in fas and the otp server
* pam module for sudo that verifies the user via the cgi
* database to store the secret keys for the otp generation and associate them with the fas username

We’re hoping to go a little beyond the minimal at the FAD:

* Have a web frontend to configure the secret keys that are stored for an account.
* Presently we’re thinking that this is a FAS frontend but we may end up re-evaluating this depending on what we decide to do for web apps and what to require for changing an auth source.
* Allow both yubikey and google-authenticator as otp
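
To make the "server to verify a one time password" piece concrete, here is a minimal sketch of the time-based check that google-authenticator style tokens use (TOTP, RFC 6238, built on HOTP from RFC 4226). The function names, the 30 second step, and the one-step drift window are my assumptions for illustration, not what the FAD will necessarily produce, and yubikey OTPs work differently and are not covered here.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Return the HOTP code (RFC 4226) for a shared secret and a counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, candidate: str, now=None, step: int = 30,
                window: int = 1) -> bool:
    """Check a time-based code, tolerating `window` steps of clock drift.

    `step` and `window` are assumed defaults, not a policy decision.
    """
    now = time.time() if now is None else now
    counter = int(now // step)
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), candidate)
        for drift in range(-window, window + 1)
        if counter + drift >= 0
    )
```

In the design above, the CGI would look up the stored secret for the fas username and call something like `verify_totp(secret, otp)` only after the password check has passed.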

I’m also hoping that since we’ll have most of the sysadmin side of infrastructure present that we’ll get a chance to discuss and write down a few OTP policies for the future:

* Do we want to make two-factor optional for some people and required for others?
* How many auth sources do we require in order to change a separate auth source (email address, password, secret for otp generation, phone number, gpg key, etc)?

If we manage to get through all of that work, there are a few other things we could work on as well:

* Design and implement OTP for our web apps

Account System 0.8.11 Release

We’ve just deployed a new Fedora Account System to production. This release just pulls in a few new features that didn’t quite make the 0.8.10 release:

  • Ian Cole (icole) added a feature that allows an email address to be used instead of a login name for logging in. Because of the way we do authentication, this means that email addresses can also be used on the other applications on admin.fedoraproject.org.
  • Pierre-Yves Chibon (pingou) implemented an audio captcha for people signing up for a new account. It generates a wav file that gets downloaded to your computer; you listen to it and then type in the proper answer to the captcha.
  • Adam M. Dutko (addutko) standardized some of the errors that can be returned from our JSON API.
  • Our translation team pointed out a few areas where we weren’t loading translations correctly and I fixed them. Look forward to more complete translations in the future.

That’s it for this minor update.

/me goes to play with the audio captcha some more.

My last release of the week

We’ve just deployed the Fedora PackageDB 0.5.4 to production. This is primarily a bugfix release but thanks to Frank Chiulli we have a few user visible changes. The package source links now point to the git repositories instead of the old cvs repos (where the web interface was broken) and the Package Acl pages no longer display EOL releases by default.

This has been a couple weeks of releases starting with Mike McGrath spearheading a new Fedora Account System release with the invite-only group feature from Jason Tibbitts and an SQLAlchemy-0.5 port, getting a new version of kitchen out the door, and now the PackageDB. Getting everything that infrastructure depends on updated before the Fedora beta release puts us into a change freeze I suppose :-).

Optimizing an SQLAlchemy call

This is just a quick post since I realized this morning that not every infrastructure developer knows how to optimize their database calls but it’s something that they need to be aware of so they can scale out a web application.

SQLAlchemy and other Object Relational Mappers provide two new ways for developers to view their interactions with a database.

  1. They can see it as transparently persisting data in an object into a database.
  2. They can look at data in a database as though they were simply objects in their programming language.

These two views are very powerful for quickly writing code that touches the database. However, there is a drawback — the ORM imposes a certain amount of overhead to accessing the data. When you’re dealing with bulk selects of large amounts of data, this overhead can add up unacceptably. When that happens, one of the ways to fix it is to get rid of the overhead of the ORM by dropping down to the SQL level. This may seem intimidating at first but if you know SQL then it’s actually quite straightforward:

  1. Identify what the current code does and how it formats the data being output. If you have other apps using the json data returned from the URL, be sure to look at the json data to see that you preserve the same data structure.
  2. Write an sql query that retrieves all of the information needed to reproduce that data structure (hopefully in one query to the db). You can do this directly against the database. In general, it’s okay to duplicate data in your query if you can use that to make fewer queries (for instance, retrieving the same information about a person for every address that they live at rather than making a query for the people data and another query for the address data). The slowdown in querying the db on a LAN is usually in the number of queries that you make rather than the size of the data that is returned.
  3. Translate that query into SQLAlchemy calls. This can be the tricky part as the python method calls aren’t a simple one to one mapping to SQL. They attempt to be more intelligent than that, doing things like figuring out FROM clauses from the columns you request and the columns to perform joins on from where the foreign key relations are. When SQLAlchemy gets these right, your code may be smaller than the original SQL query. When SQLAlchemy gets these wrong, you have to figure out what the syntax is for overriding the default values.
  4. Do any filtering and reformatting of the data to match the old API in python afterwards. Usually this means looping through your rows of data and constructing a nested structure of dicts that hold the information, deduplicating the data as you go.
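
As a rough illustration of steps 2 and 4, here is a small sketch using the people-and-addresses example from above. It uses Python’s built-in sqlite3 module rather than SQLAlchemy so that it runs anywhere, which means step 3 (the translation into SQLAlchemy calls) is deliberately left out. The table names, columns, and the `people_with_addresses` function are all made up for the example.

```python
import sqlite3

# Hypothetical schema and data, only for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES people(id),
        city TEXT
    );
    INSERT INTO people VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO addresses VALUES
        (1, 1, 'Raleigh'), (2, 1, 'Boston'), (3, 2, 'Paris');
""")


def people_with_addresses(conn):
    """Fetch everything in one query, then rebuild the nested structure."""
    # Step 2: one join; the person's name is repeated for every address row.
    rows = conn.execute(
        "SELECT p.id, p.name, a.city "
        "FROM people p JOIN addresses a ON a.person_id = p.id "
        "ORDER BY p.id, a.id"
    )
    # Step 4: deduplicate the person data while nesting the addresses.
    result = {}
    for person_id, name, city in rows:
        person = result.setdefault(person_id, {"name": name, "addresses": []})
        person["addresses"].append(city)
    return result
```

The naive alternative is one query for the people plus one query per person for their addresses; the join trades a little duplicated data on the wire for a single round trip.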

There are two reasons that dropping to the raw SQL level ends up being faster in most cases.

  1. Fewer queries to the database. What you’re doing here is eliminating the latency of querying the database. Querying the database and waiting for the response from it takes time. Even if it’s small, that time can add up if you have to do it 10,000 times. The network connection between your application and the database likely has enough bandwidth that you can afford to be a little wasteful in how much data you query if it will save you the latency of making those extra 9,999 queries.
  2. The ORM layer has its own overhead. Although SQLAlchemy is highly optimised, the ORM does impose a performance penalty to give you the easy abstraction of using objects to refer to data in the database. Removing that penalty for each of the 10,000 objects that you were processing before can be a large win.

Mailman, Fedora infrastructure, and involving non-software developers in open source (Part I)

Mizmo came to #fedora-admin yesterday to see about getting drupal with a specific plugin that puts a more web-forum type interface on top of mailman. This spawned a big discussion about a wide array of things. I’m posting a bit about it here to get it more exposure and also to try and separate out the different threads that ran through it.

Part 1: Fedora Infrastructure

First, something that has become very apparent within Fedora Infrastructure but isn’t so apparent to people outside.  Infrastructure is starved for people.  And unfortunately, simply throwing more people at infrastructure doesn’t help as much as in other parts of the project.  Here’s what happens:  Within infrastructure, we have very few people who are trusted to do work on all of the infrastructure boxes (the so-called sysadmin-main group).  These people can log in to all but one of the servers, make changes as needed, access the database, vote on change requests during release freeze, and basically have rights to fix any problems that may crop up.  With great power comes great responsibility and new members to this group need to be present in the general Fedora sysadmin community for a good while doing a lot of things before they come to be in this group.

Outside of this group we have several others that have varying degrees of power over various critical items.  We have the sysadmin-noc group that monitors all the servers and has a limited ability to diagnose and help fix routine issues that may crop up (although they often need to call on someone with more access to perform actual fixes), the sysadmin-hosted group that can work on the servers related to keeping fedorahosted up and running, the sysadmin-web group that can work on the main app servers that make infrastructure services go, sysadmin-cvs that deals with the cvs server for fedora packages, and sysadmin-db (which in practice is the same as sysadmin-main due to having access to sensitive information).  We also have satellite groups in the form of committers to the applications that we write (the packagedb, fas, python-fedora, bodhi, elections, and mirrormanager committers).  These applications are written by infrastructure coders to meet needs identified by the infrastructure group for Fedora.  In addition, there are a few groups that interact very closely with infrastructure: the releng team deploys the release and needs to coordinate closely with infrastructure on mirror space and times that we can make changes, and some of the groups that develop applications that we run (members of the transifex, fedoracommunity/moksha, and zikula communities) work with us to help solve issues and bugs in our deployments and, to greater and lesser extents, help us to maintain the apps.

So, where’s the bottleneck in infrastructure getting new people?  There are actually three places, two of which are related:

  1. We need more people involved who want to solve specific coding problems for infrastructure.  These people would need to be willing to be a jack-of-all-trades.  They need to be members of infrastructure that get involved with upstream projects.  Sometimes there might be a performance issue that we need to have addressed.  Sometimes we might identify a security problem and need to get a fix out quickly.  Other times we might identify a high value feature that would help fedora contributors and need someone to develop it.  The people doing any of this work would need to be able to sit down and involve themselves with both infrastructure and any upstreams to get commit rights or, at least, be trusted enough to get their patches looked at and added.  They’d need to be able to dive into unfamiliar code, get an idea of how it works, and produce working patches.
  2. We need more people to maintain (not just deploy) applications.  In many ways these people end up doing the same sorts of jobs as the people in #1.  However, the emphasis is slightly different.  Where the first set of people are primarily coders, these people are primarily system administrators.  Now, in Fedora Infrastructure we do have a lot of system administrators who come by to be a part of the team.  Where we end up with problems is that most of them aren’t able to commit to being part of the team over a long period.  This is difficult for us because we end up being able to deploy more projects but as we do, the people who are committed to maintaining the services get more and more stretched.  For a non-sysadmin, a question that often comes up here is: well, but isn’t deploying the application the hard part?  After it’s deployed, it should just work, right?  The answer to that is almost always no.  All software has bugs, so there’s always going to be the need to do updates.  Updates are not always backwards compatible so there’s always going to be the need to test updates and update configuration and code that we’ve built to help us manage the software when we do them.  Critical bugs (often security related) do get discovered after a piece of software has been deployed, which makes for some late nights rushing around on a fix that must be applied to the production instance ASAP.  As the service gets used more (or other software running on the same host gets used more) we can run into scaling issues that weren’t apparent when we first deployed.  All of these things contribute to the maintenance burden of deploying a new service to be used and all of them are helped immensely by having someone who can maintain the software as a long term commitment.
  3. We need to get more people who can gateway changes to many things.  These are the members of the core teams, sysadmin-web, sysadmin-db, cvsadmins, and ultimately, sysadmin-main.  This ties in heavily with #2.  In order to be sponsored into one of these groups you need to build up trust within the sysadmin community.  You’ll be given access to services that can bring down Fedora in any number of ways, from simply making mistakes that cause outages to maliciously causing problems for core services, so we have to trust that you’ll do the right thing with your responsibilities.  Building trust is not an instant thing.  It takes many man-hours of hard work, being in the right place to do something helpful, and generally showing that you are not just someone with valuable talents, but also someone that is responsible enough to use their talents for the benefit of everyone and not just a few.

Over the past couple years we’ve tried a variety of things to alleviate these issues, none with a great deal of success.  Commitment is hard when you have mouths to feed.  It’s hard to be effective at working on issues when you don’t have access to deploy your apps in a production environment.  Feel free to bring some suggestions (the best suggestions come with prototypes! :-) to #fedora-admin or the infrastructure@fp.o mailing list and we’ll see if next year we can look back and say this was the year we figured out how to grow infrastructure at a sustainable rate.