Mizmo came to #fedora-admin yesterday to see about getting Drupal with a specific plugin that puts a more web-forum-like interface on top of Mailman. This spawned a big discussion about a wide array of things. I’m posting a bit about it here to get it more exposure and also to try to separate out the different threads that ran through it.
Part 1: Fedora Infrastructure
First, something that has become very apparent within Fedora Infrastructure but isn’t so apparent to people outside: Infrastructure is starved for people. And unfortunately, simply throwing more people at infrastructure doesn’t help as much as it does in other parts of the project. Here’s what happens: within infrastructure, we have very few people who are trusted to do work on all of the infrastructure boxes (the so-called sysadmin-main group). These people can log in to all but one of the servers, make changes as needed, access the database, vote on change requests during release freeze, and basically have rights to fix any problems that may crop up. With great power comes great responsibility, and new members of this group need to be present in the general Fedora sysadmin community for a good while, doing a lot of things, before they come to be in this group.
Outside of this group we have several others that have varying degrees of power over various critical items. We have the sysadmin-noc group that monitors all the servers and has a limited ability to diagnose and help fix routine issues that may crop up (although they often need to call on someone with more access to perform actual fixes), the sysadmin-hosted group that can work on the servers related to keeping fedorahosted up and running, the sysadmin-web group that can work on the main app servers that make infrastructure services go, the sysadmin-cvs group that deals with the cvs server for Fedora packages, and sysadmin-db (which in practice is the same as sysadmin-main due to having access to sensitive information). We also have satellite groups in the form of committers to the applications that we write (the packagedb, fas, python-fedora, bodhi, elections, and mirrormanager committers). These applications are written by infrastructure coders to meet needs identified by the infrastructure group for Fedora. In addition, there are a few groups that interact very closely with infrastructure. The releng team deploys the release and needs to coordinate closely with infrastructure on mirror space and on the times when we can make changes. Some of the groups that develop applications that we run (members of the transifex, fedoracommunity/moksha, and zikula communities) work with us to help solve issues and bugs in our deployments and, to greater and lesser extents, help us to maintain the apps.
So, where’s the bottleneck in infrastructure getting new people? There are actually three places, two of which are related:
- We need more people involved who want to solve specific coding problems for infrastructure. These people would need to be willing to be a jack-of-all-trades. They need to be members of infrastructure who get involved with upstream projects. Sometimes there might be a performance issue that we need to have addressed. Sometimes we might identify a security problem and need to get a fix out quickly. Other times we might identify a high-value feature that would help Fedora contributors and need someone to develop it. The people doing any of this work would need to be able to sit down and involve themselves with both infrastructure and any upstreams to get commit rights or, at least, be trusted enough to get their patches looked at and added. They’d need to be able to dive into unfamiliar code, get an idea of how it works, and produce working patches.
- We need more people to maintain (not just deploy) applications. In many ways these people end up doing the same sorts of jobs as the people in the first group. However, the emphasis is slightly different. Where the first set of people are primarily coders, these people are primarily system administrators. Now, in Fedora Infrastructure we do have a lot of system administrators who come by to be a part of the team. Where we end up with problems is that most of them aren’t able to commit to being part of the team over a long period. This is difficult for us because we end up being able to deploy more projects but, as we do, the people who are committed to maintaining the services get more and more stretched. For a non-sysadmin, a question that often comes up here is: isn’t deploying the application the hard part? After it’s deployed, it should just work, right? The answer to that is almost always no. All software has bugs, so there’s always going to be the need to do updates. Updates are not always backwards compatible, so there’s always going to be the need to test updates and to update the configuration and code that we’ve built to help us manage the software when we do them. Critical bugs (often security related) do get discovered after a piece of software has been deployed, which makes for some late nights rushing around on a fix that must be applied to the production instance ASAP. As the service gets used more (or other software running on the same host gets used more) we can run into scaling issues that weren’t apparent when we first deployed. All of these things contribute to the maintenance burden of deploying a new service, and all of them are helped immensely by having someone who can maintain the software as a long-term commitment.
- We need to get more people who can gateway changes to many things. These are the members of the core teams: sysadmin-web, sysadmin-db, cvsadmins, and ultimately, sysadmin-main. This ties in heavily with the second point. In order to be sponsored into one of these groups you need to build up trust within the sysadmin community. You’ll be given access to services that can bring down Fedora in any number of ways, from simply making mistakes that cause outages to maliciously causing problems for core services, so we have to trust that you’ll do the right thing with your responsibilities. Building trust is not an instant thing. It takes many man-hours of hard work, being in the right place to do something helpful, and generally showing that you are not just someone with valuable talents, but also someone who is responsible enough to use those talents for the benefit of everyone and not just a few.
Over the past couple of years we’ve tried a variety of things to alleviate these issues, none with a great deal of success. Commitment is hard when you have mouths to feed. It’s hard to be effective at working on issues when you don’t have access to deploy your apps in a production environment. Feel free to bring some suggestions (the best suggestions come with prototypes! 🙂) to #fedora-admin or the infrastructure@fp.o mailing list and we’ll see if next year we can look back and say this was the year we figured out how to grow infrastructure at a sustainable rate.