Using TDD to write Infrastructure Configuration Code for Legacy Servers

There’s much talk about DevOps these days, with many differing interpretations. One aspect is automating the configuration of your server infrastructure by writing configuration code that is version controlled and tested. But how can this be applied to existing infrastructure, in particular poorly maintained legacy systems that are already in production? This was a challenge that the Content Discovery Team at 7digital (which I lead) overcame.

The Challenge

We had inherited a legacy production server with no UAT version and no tests. Every day the system generated mission-critical files, populated by database queries that took around eight hours to complete.

There was a requirement to change the application, but doing so immediately would have meant a perilous deploy to production without any tests, which I decided was not acceptable. We would only deploy to production once we had built a UAT environment and tested the code there, but the first step was to get the code running on a developer machine.

Our first challenge was to understand what exactly the program was doing. This was made trickier by the fact that it ran out-of-date versions of Ruby on old versions of Linux.

After a frustrating period attempting to run the current code with its ancient Ruby gems, we decided that what the software did trumped how it did it. A new production server with the latest versions of Ruby and gems would ultimately be created, with a matching UAT environment.

Replacing manual configuration with automated configuration

7digital have been moving to a single configuration management system, and CFEngine was the tool of choice. Rather than manually installing and configuring the new server, we wrote a CFEngine promise file with the bare minimum requirements for the new application, starting with the requirement that the correct version of Ruby was installed.
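As a sketch, that very first promise looked something like the following (the bundle name and package version are illustrative, and the syntax assumes a CFEngine 3 standard-library apt package method, not necessarily our exact file):

```cfengine3
body common control
{
  bundlesequence => { "ruby_install" };
  inputs         => { "libraries/cfengine_stdlib.cf" };
}

bundle agent ruby_install
{
  packages:
    debian::
      # Promise that the distribution's Ruby package is present
      "ruby1.9.3"
        package_policy => "add",
        package_method => apt;
}
```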

As a team (myself, Dan Kalotay, Dan Watts & Matt Bailey) we were familiar with CFEngine, but not with testing configuration code. We were helped by fellow 7digital developers Sam Crang and James Lewis, who got us up and running with Test Kitchen, and Anna Kennedy whose CFEngine knowledge helped us greatly.

To test the configuration we used Test Kitchen, which uses Vagrant to spin up a new Debian server on the development machine. The Debian image was one of our production server images, with the CFEngine agent pre-installed. Using the Vagrantfile we slaved it to the locally edited CFEngine file as its source of configuration, instead of a remote CFEngine hub. Once the virtual machine had been created, running Test Kitchen’s kitchen converge command would apply the CFEngine promises defined in the local promise file.
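Our Test Kitchen configuration was along these lines (a sketch only: the box, script and platform names are illustrative, and the exact wiring for running cf-agent differs between Test Kitchen versions):

```yaml
---
driver:
  name: vagrant
  box: debian-7-cfengine      # illustrative: a production base image with cf-agent pre-installed

provisioner:
  name: shell                 # apply the locally edited promises rather than polling a remote hub
  script: apply_promises.sh   # e.g. runs: cf-agent -KI -f ./promises.cf

platforms:
  - name: debian-7

suites:
  - name: default             # Serverspec tests live under test/integration/default/serverspec/
```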

So far, so DevOps. But how does one automatically assert that the promises set the desired state?

Writing Infrastructure Code in a TDD Manner

Using Test Kitchen it is possible to run a suite of tests against the newly created virtual machine. We used Serverspec, as it allows us to write RSpec-style tests. Serverspec connects to the virtual machine over SSH and can then run any command to assert that it was configured as desired.

We then started to write tests for the state of the server in a TDD fashion. We would, for example:

  1. Write a test asserting that a cron job will run at a certain time
  2. Run the test and see it fail, since the cron job has not yet been created
  3. Write a CFEngine promise to create the cron job
  4. Run kitchen converge && kitchen verify to apply the configuration and run the test again
  5. See the test pass, or if it fails, go back to step 3
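The cron-job test from step 1 might look like this in Serverspec (the schedule, command and user are all illustrative; the spec runs over SSH against the converged virtual machine, so it is not runnable on its own):

```ruby
require 'spec_helper'

# Assert the file-generation job is scheduled nightly for the app user
# (illustrative schedule, command and user)
describe cron do
  it { should have_entry('0 2 * * * /usr/local/bin/generate_files').with_user('appuser') }
end
```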

In this way we added more configuration by repeating this Red-Green-Refactor process, which is familiar to most modern programmers. Running the Serverspec tests allowed us to drive out the configuration, accreting functionality on the virtual machine and building up the promise file that set up the server state.

Deploying to UAT

Once we were happy with the configuration, we committed and pushed the CFEngine promise file up to the CFEngine policy hub for our UAT environment. It was then straightforward to request that our infrastructure team create the new UAT server, slaved to that same policy hub. Once in UAT we could run more detailed tests overnight, since the SQL queries we were running took around eight hours to complete. Our QA team worked with us to assert that the product worked in accordance with our acceptance criteria.

Deploying to production, now a low-ceremony, low-risk event

Once every party was happy with the end result, it was time for our infrastructure team to spin up a new production virtual machine slaved to our CFEngine hub. Within minutes our replacement server was in production, along with the changes our clients required. Within a few days the old server had been permanently retired.

This was consistent with the idea that organisations should treat their servers as cattle, not pets. In the event of a server failure, a new virtual machine can be spun up with little human intervention, which works well in cases where the server does not store mission-critical data. We had replaced a delicate and temperamental “pet-like” server with a more disposable “cattle-like” one: if prod.server0001 ever had problems, it could be replaced within minutes with an identically configured prod.server0002.

What we learnt along the way

An early test failure we experienced was around installing Ruby gems. We attempted to use CFEngine to run the shell command gem install <gemname>, but this always failed because of the way CFEngine executes shell commands: the permissions were not appropriate for the application user. It turned out to be simpler to package each gem as a deb and install that, rather than have CFEngine execute shell commands to install gems.

Another problem we overcame was that CFEngine’s cp command is not synonymous with the Linux cp command.

Conclusion

Whatever your particular interpretation of what DevOps means, in this project I learnt that what allowed us to succeed was close collaboration between developers and infrastructure engineers. The developers learnt much about the infrastructure, and infrastructure team members became proficient in understanding how the application worked and comfortable using version control for infrastructure code. Clear communication was key, and we all learnt how to get the job done.

We also learnt that sometimes it’s worth starting from scratch rather than attempting to retrofit CFEngine promises to an existing server. The retrofit idea was abandoned because it was almost as risky as editing configuration files manually on the production server.

In future I’ll always try to get all existing production infrastructure I’m responsible for configured this way, as well as using it for new infrastructure.


Using Real Time Application Metrics To Optimise Solr Indexing Throughput

In my previous blog post I wrote about 7digital’s use of real-time application monitoring. In this post I will expand upon how the team used application metrics to greatly improve a product that powered a critical part of 7digital’s infrastructure.

I’m a member of the Content Discovery Team, who are responsible for making 7digital’s catalogue of 25 million tracks available via the 7digital API.

One of the challenges of making the catalogue accessible is that the music industry lacks a universal unique identifier for artists, tracks and album releases, which means full-text searching is important for discovering content. At 7digital, to cope with the increase in both traffic and catalogue size, full-text searches were moved from a relational database to a Lucene-powered Solr database back in 2011.

Currently we extract the metadata for the music catalogue from a relational database, transform it into JSON DTOs and post them to our Solr servers, which perform full-text searches for music releases, tracks and artists. Placing the catalogue into Solr makes it searchable, and makes it possible to retrieve release and artist metadata by supplying a unique 7digital ID.

The datasets are relatively large; creating an index of the entire dataset of 25 million tracks with the schema used would take many days, so it was necessary to append the hourly changes of the catalogue to an existing index.

There was a relational database containing many millions of updates to the catalogue in time order; this was the typical amount of metadata generated per day. Each update was represented as a single row in a “changes” table in the database.

For example, a row storing metadata for a track would contain the track title, track artist and track price among its columns. Each row would also have a column denoting whether that metadata was being inserted into, updated in or deleted from the catalogue.

Solr acts as an HTTP wrapper around a Lucene index, so creating an index of the data was done via JSON posts to the Solr server. A piece of metadata was read from the relational database, mapped into a JSON document and then posted via HTTP to the Solr server. Solr would then add this document to its Lucene index, where it could be searched. We refer to this entire process as indexing.
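The read-map-post step can be sketched as follows (the column and field names are illustrative, not our real schema):

```ruby
require 'json'

# Map one row of the "changes" table (here a plain Hash standing in for a
# database record) to the Solr JSON document that will be posted.
def to_solr_doc(row)
  {
    'id'     => row[:track_id],
    'title'  => row[:track_title],
    'artist' => row[:track_artist],
    'price'  => row[:track_price]
  }
end

row = { track_id: 42, track_title: 'Example Track', track_artist: 'Someone', track_price: 0.99 }
puts JSON.generate(to_solr_doc(row))
```

In production the generated JSON would then be posted to the Solr server’s update handler over HTTP.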

The Problem

With the old code base, indexing this data often took hours. This meant that end-to-end testing in the UAT environment involved lengthy, multi-hour “code, deploy, run tests” cycles. Typically a change would be made early in the morning so the results could be inspected by lunchtime. One could look at the log files to check on progress, but it was awkward.

The Goal

In early summer 2013 we rewrote the legacy code which performed this Extract, Transform and Load process from the relational database to the Solr server. It was no longer fit for purpose and was causing data inconsistencies, which had come about in part from the long feedback cycle between making a code change and seeing the test result.

The goal was to improve the metadata indexing throughput to Solr whilst ensuring data correctness. After writing extensive unit and integration tests to verify data consistency, we then turned to performance tuning in our acceptance tests. We wanted to index as many documents to Solr as quickly as possible. Log files were available, but it wasn’t straightforward to parse meaning from them when looking for patterns across a time span.

This arrangement was clearly suboptimal, so we decided to incorporate real-time monitoring. We were already using StatsD to log, and Graphite to visualise, the web responses on our internal API endpoints, and realised we could put them to use here.

After improving the SQL queries, we discovered that the largest factor in indexing throughput was how many documents were posted per commit.

Each document contains a change, which could be a delete, an insert or an update of an existing piece of metadata (e.g. a price change of a track). In our scenario Lucene treats an update and an insert as the same thing: any document indexed with the same ID as an existing one will overwrite the original. Changes could therefore be boiled down to either a delete or an “upsert”.
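Boiling the changes down in this way can be sketched as a simple partition (the field names are illustrative):

```ruby
# Split a batch of change rows into deletes and upserts so that each kind
# can be posted and committed in its own batch.
def partition_changes(rows)
  rows.partition { |row| row[:change_type] == :delete }
end

rows = [
  { id: 1, change_type: :insert },
  { id: 2, change_type: :delete },
  { id: 3, change_type: :update }   # treated the same as an insert by Lucene
]
deletes, upserts = partition_changes(rows)
puts "#{deletes.size} deletes, #{upserts.size} upserts"  # prints "1 deletes, 2 upserts"
```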

After posting a certain number of documents, a commit command is sent, whereby the index commits the recent changes to disk. Once committed, the new documents are in the index and are searchable. While the commit phase is happening, no new documents can be posted to the Solr server. Depending on the size of the index and the number of documents, the commit phase can take many seconds, or even minutes.

Lucene works most effectively when only deletes are committed in one batch and only upserts are committed in another; this is much faster than interweaving deletes and upserts within the same period between commits.
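The batch-and-commit cycle can be sketched like this, with `post` and `commit` standing in for the HTTP calls to Solr (the batch size and stubbed calls are illustrative):

```ruby
# Post documents in fixed-size batches, sending one commit per batch.
# While a commit runs, no new documents can be posted, so batch size
# trades SQL/posting time against commit overhead.
def index_in_batches(docs, batch_size, post:, commit:)
  docs.each_slice(batch_size) do |batch|
    batch.each { |doc| post.call(doc) }
    commit.call  # the batch only becomes searchable after this returns
  end
end

posted  = []
commits = 0
index_in_batches((1..10).to_a, 4,
                 post:   ->(doc) { posted << doc },
                 commit: -> { commits += 1 })
puts "#{posted.size} docs posted, #{commits} commits"  # 10 docs posted, 3 commits
```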

The number of rows read mapped directly to the number of documents indexed.

Using the StatsD C# client we timed how long it took to read from the relational database, post the documents to Solr and then commit them, and sent these timings to a StatsD server. We then used Graphite to visualise this data.
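Our application used the C# client, but the StatsD wire format itself is simple enough to sketch in a few lines of Ruby (the metric name and port are illustrative):

```ruby
require 'socket'

# A StatsD timing metric is a single UDP packet of the form "name:millis|ms".
# UDP is fire-and-forget, so a lost packet never slows the application down.
def statsd_timing(name, millis, host: 'localhost', port: 8125)
  packet = "#{name}:#{millis}|ms"
  UDPSocket.new.send(packet, 0, host, port)
  packet
end

# Time a block of work and report it, e.g. around the Solr commit call
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
sleep 0.01  # stands in for the real work (reading rows, posting, committing)
elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000).round
puts statsd_timing('indexer.solr.commit', elapsed_ms)
```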

The graph below shows:

time taken to read the metadata changes denoting deleted tracks/releases from the relational database, within the given range of rows
time taken to post the delete documents to Solr
time taken for Solr to enact the commit (during which no new documents may be posted)

time taken to read the metadata changes denoting upserted tracks/releases from the relational database, within the given range of rows
time taken to post the new documents to Solr
time taken for Solr to enact the commit (which prevents any new documents from being posted)

Chart showing indexing timings

As we can see, committing the updates takes up to 72 seconds and is by far the greatest proportion of the time.

Large ranges made the SQL query component take a long time, occasionally timing out and needing to be repeated. Additionally, the commits would take longer.

Using smaller range sizes resulted in more frequent posts and commits to Solr, but commits seemed to have a minimum duration, so there came a point where very small ranges meant the commit times were the largest component of the indexing process and document throughput dropped.

How to measure indexing throughput?

We added some more StatsD logging, which shows the indexing throughput over the same time period as the chart above.

The chart below shows the row ID on the y axis; each row represents a change of metadata. The blue line shows the newest change in the relational database. The green line shows the row ID of the changes that have been posted to Solr. They follow each other with a small delay, and a measure of indexing throughput is how closely the green line follows the blue.

Indexing throughput

The right of this graph showed us that there were periods of time when no updates were written into the relational database, but that when there were writes, the indexer could index them into Solr in less than two minutes.

In the C# code that did the Extract, Transform and Load, we added additional StatsD metrics, which reported the maximum row ID of the changes table in the relational database and the latest row ID that had been posted to Solr. This graph is still used in production to monitor the application in real time. It lets us observe when many metadata changes are made (for instance, when new content is added), diagnose any issues with the database we read from, and see how long it may take for the content to become available in the Solr servers.

The Result

By being able to monitor how many new changes were being written to the database, and how quickly the indexer could commit them to Solr, we discovered that beyond a certain rate, increasing throughput had no noticeable effect, since there was no benefit in attempting to read more rows per hour than were being created. This level of monitoring allowed us to decide effectively on the direction to take when developing the product. It also gave us the unexpected benefit of visibility into a business-critical database process that was previously unmonitored: the rate of changes recorded was now known.

Ultimately, in order to maximise end-to-end indexing throughput, we found a sweet spot of 100k documents, and by using the metrics we only optimised where we needed to, thus saving valuable development time.


7 things I’ve learnt in 7 months at 7digital

In January this year I started working at 7digital, the leading global open music platform, in the Search & Content Discovery Team.

My role as a developer was quite a change from my previous jobs and I have relished the challenge of something new.

I’ve learnt not only new technologies but also better methodologies. Here are the 7:

1. Monitoring

Real-time application monitoring with New Relic. Install the New Relic agent on the server you wish to monitor and, with a little configuration, your web app will start logging a wealth of metrics such as web request and response times, database query times and error rates. The visualisation of this data is very slick too.

StatsD & Graphite allow monitoring of any custom metric and present it in easy-to-read graphs. 7digital take a similar philosophy to Etsy and favour monitoring production applications over trawling through log files. StatsD has a C# client library, written by fellow 7digital developer Goncalo Pereira, which we make extensive use of; it allows us to monitor any metric by sending UDP packets to a StatsD server. Unlike New Relic, metrics must be explicitly defined within the application by writing code that calls the library. This lets us monitor the areas that New Relic doesn’t, along with very specific events. Being able to see Zabbix stats is often a benefit too.

2. Continuous Delivery

TeamCity and continuous delivery make deploying to production a simple case of clicking a few buttons on the TeamCity project page, which allows 7digital to deploy to live dozens of times a day. Coupled with the extensive monitoring that we have, live deploys are low-drama affairs.

3. Linux

The use of Raspberry Pis for displaying the Graphite monitoring graphs encouraged me to have a tinker with a Pi at home too. 7digital now use Pis, affixed to monitors hanging from the ceiling around the office, to display all their production statistics.

I now work with Linux servers in production. I’d played with Linux in my own time at home, but now that our team is responsible for ensuring the performance of the Solr infrastructure, I’ve learnt an awful lot.

4. Solr

Solr is an HTTP-accessible search platform powered by Lucene. 7digital’s search and catalogue API endpoints are powered by Solr servers. A large part of my day-to-day activities relate to improving search performance in the 7digital API, understanding how best to insert documents into the Lucene index and understanding what makes a good schema. My colleague James Atherton gave a presentation at the Lucene/Solr Revolution conference in San Diego earlier this year, discussing what was learnt in the process of migrating from a SQL Server text search system to a scalable Solr infrastructure. As part of managing the infrastructure, I’ve just begun to learn about configuration management, using CFEngine to manage server configuration in an automated fashion.

5. RESTful APIs

7digital’s API is a RESTful one. Each endpoint on the public API routes to an internal API owned by a development team. OpenRasta is the web framework used in my team, which was quite different from working with ASP.NET MVC powered websites.

6. Methodologies inspired by The Goal, and The Phoenix Project

The Goal: A Process of Ongoing Improvement and The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win are books which everyone at 7digital has been encouraged to read. Both are novels about organisations struggling to improve their engineering processes, which succeed once they focus on finding the bottlenecks in the processes that are slowing them down. The Goal was originally written in the 1980s and focused on a manufacturing plant’s processes, which correlate surprisingly well with software delivery processes. The Phoenix Project is essentially a modern rewrite of The Goal, focused on an IT department within a larger organisation. Both books vividly describe problems many developers, sysadmins and managers have experienced first-hand. I liken them to Clean Code, but for organisations.

7. Team

Working with a smart team that can deliver results is a great experience. Compared to my previous workplaces, there’s a more collaborative approach to problem solving across the teams. The level of expertise within the company is inspiring and I learn something new every time I informally chat with a member of another team.

I’ve been impressed so far and I look forward to what the future projects will bring!
