There’s much talk about DevOps these days, with its differing interpretations. One aspect is to automate the configuration of your server infrastructure, by writing configuration code that is version controlled and tested. But how can this be applied to existing infrastructure, in particular, poorly maintained legacy systems that are already in production? This challenge was one that the Content Discovery Team at 7digital (which I lead), overcame.
The Challenge
We had inherited a legacy production server with no UAT version and no tests. Every day the system generated mission critical files which were populated from a database running queries that took around eight hours to complete.
There was a requirement to change the application, but doing so immediately would mean a perilous deploys to production without any tests, which I decided was not acceptable. We were only to deploy to production, once we had built a UAT environment and tested the code there, but first step was to get the code running on a developer machine.
Our first challenge was understand what exactly the program was doing. This was made more tricky since it ran out of date versions of ruby, on old versions of Linux.
After some frustrating period attempting to run the current code with it’s ancient ruby gems, we decided that “what the software did” trumped the “how it did it”. A new production server with the latest versions on ruby and gems would ultimately be created, with a matching UAT environment.
Replacing manual configuration with automated configuration
7digital have been moving to single configuration management system and CFEngine was the tool of choice. Rather than manually installing and configuring the new server, we wrote a CFEngine promise file with the bare minimum requirements for the new application, starting with the requirement that the correct version of ruby was installed.
As a team (myself, Dan Kalotay, Dan Watts & Matt Bailey) we were familiar with CFEngine, but not with testing configuration code. We were helped by fellow 7digital developers Sam Crang and James Lewis, who got us up and running with Test Kitchen, and Anna Kennedy whose CFEngine knowledge helped us greatly.
To test the configuration we used Test Kitchen. It uses Vagrant to spin up a new Debian server on the development machine. The Debian image was one of our production server images which had CFEngine agent pre-installed. Using the Vagrant file we slaved it to use the locally edited CFEngine file as its source of configuration, instead of a remote CFEngine hub. Once the virtual machine had been created it would call the Test Kitchen kitchen converge command, which would apply the CFEngine promises defined in the local promise file.
So far, so DevOps. But how does one automatically assert if the promises set the desired state?
Writing Infrastructure Code in a TDD Manner
Using Test Kitchen it is possible to run a suite of tests against the newly created virtual machine. We used Serverspec as it allows us to write Rspec style tests. Serverspec uses SSH to connect to the virtual machine and then it can run any command to assert that it was configured as desired.
We then started to write tests for the state of the server in a TDD fashion. We would for example:
- Write a test to assert that a cron job will run at a certain time,
- Run the test and see it fail, since the cron job has not been created yet
- Create a CF-Engine promise to create the cron job,
- Run kitchen converge && kitchen verify to apply the configuration and run the test again
- See the test pass, or if they fail, go back to step 3
In this way we added more configuration, by repeating this Red-Green-Refactor process, which is familiar to most modern programmers. Running the Serverspec tests allowed us to drive out configuration; accreting functionality of the virtual machine and building up the configuration file that set up the server state.
Deploying to UAT
Once we were happy with the configuration, we committed and pushed the CFEngine promise file up to the CFEngine Policy hub for our UAT environment. It then was straightforward to request that our infrastructure team create the new UAT server slaved to thay same policy hub. Once in UAT we could run more detailed tests overnight, since the SQL queries we were running took around eight hours to complete. Our QA team worked with us to assert that the product worked in accordance with our acceptance criteria.
Deploying to production, now a low-ceremony, low risk event
Once every party was happy with the end result, it was time for our infrastructure team to spin up a new production virtual machine slaved to our CFEngine hub. Within minutes our replacement server was in production, along with the changes our clients required. Within a few days the old server had been permanently retired.
This was consistent with the idea that organisations should treat their servers as cattle, not pets. In the event of a server failure, spinning up a new virtual machine with little human intervention which works well in cases where the server does not store mission critical data. We had replaced a delicate and temperamental “pet-like” server and replaced it with a more disposable “cattle-like” one. In case of server problems, prod.server0001 which could be replaced within minutes with an identically configured prod.server0002.
What we learnt along the way
An early test failure we experienced was around installing ruby gems. We attempted to use CFEngine to run a shell command gem install <gemname> but this always failed. It turned out to be simpler to create it as a deb package and install it, rather than using CFEngine to execute certain shell commands to install gems. This was due to the way CFEngine executes shell commands; the permissions were not appropriate for the application user.
Another problem we overcame was how the cp command on CFEngine not synonymous with Linux cp command.
Conclusion
Whatever your particular interpretation of what DevOps means, in this project I learnt that what allowed us to succeed was close collaboration between developers and infrastructure engineers. The developers learnt much about the infrastructure and infrastructure team members became proficient in understanding how the developer application worked and comfortable with using version control for infrastructure code. Clear communication was key and we all learnt how to get the job done.
We also learnt that sometimes it’s worth starting from scratch rather than attempting to retrofit CFEngine promises to an existing server. The retrofit idea was abandoned due to it being almost as risky as editing configuration files manually on the production server.
In future I’ll always try to get all existing production infrastructure I’m responsible for configured this way, as well as using it for new infrastructure.