Archive for category Cloud

Solving problems of scale using Kafka

At 7digital I was in a team tasked with solving a problem created by taking on a large client capable of pushing the 7digital API to its limits. The client had many users and expected their numbers to increase exponentially. Whilst 7digital’s streaming infrastructure could scale very well, the client also wanted to send the logs of those streams back to 7digital via the API, and the volume of log data would be proportional to the number of users. 7digital had no facility for accepting such data from a client, so this was a new problem to solve.

We needed to build a Web API which exposed an endpoint for a 7digital client to send large amounts of JSON-formatted data, and to generate periodic reports from that data. The expected volume was far higher than the infrastructure in the London data centre could support, and scaling up the data centre to meet the traffic requirements would have been slow, costly and difficult. Building the API in AWS and transporting the data back to the data centre asynchronously was deemed the best approach.

Kafka was used to decouple the AWS-hosted web service accepting incoming data from the London database storing it, acting as a message bus to transfer the data from the AWS region back to the London data centre. Kafka was already operational on the 7digital platform at the time, used for non-real-time reporting.

Since the application would not run in the London data centre, there was no advantage in writing another C# application to be hosted on the existing Windows web servers running IIS. Given the much faster boot times of Linux EC2 instances and the greater ease of using Docker on Linux, we elected to write the web application in Python, using Docker to speed up development.

The application used the Flask web framework. It was deployed in an AWS ECS cluster, alongside an nginx container to proxy requests to the Python API container and a Datadog container to monitor the application.

The API was very simple: once an inbound POST request was validated, the application wrote the JSON to a topic on a Kafka cluster. The topic was later consumed and written into a relational database from which reports could be generated. Decoupling the POST requests from the process that wrote to the database meant we could avoid locking the database, by consuming the data from the topic at a rate the database could sustain.
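
As a rough sketch of that pattern (the endpoint path, topic name and validation rules here are illustrative stand-ins rather than the real ones), the handler boiled down to a Flask route producing to Kafka with the kafka-python client:

```python
# Illustrative sketch only: the endpoint path, topic name and validation
# rules are hypothetical stand-ins, not the real ones.
import json
import os

from flask import Flask, request, jsonify
from kafka import KafkaProducer  # kafka-python client

app = Flask(__name__)

producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092"),
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)

@app.route("/stream-logs", methods=["POST"])
def accept_stream_logs():
    payload = request.get_json(silent=True)
    # Reject anything that is not valid JSON with the fields we expect.
    if payload is None or "events" not in payload:
        return jsonify(error="invalid payload"), 400

    # Hand the data straight to Kafka; the database write happens later,
    # in a separate consumer process.
    producer.send("stream-logs", payload)
    return jsonify(status="accepted"), 202
```

Returning 202 rather than 200 in this sketch reflects the asynchronous design: the request is accepted immediately, while the database write happens later.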

Since reports were only generated once a day, covering the data received during the previous day, a backlog of data on the Kafka topic was acceptable; there was no requirement for near-real-time data.
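
The consuming side of that decoupling would, in outline, look something like the sketch below (topic, table and connection details are invented), draining the backlog in batches at a pace the database could sustain:

```python
# Sketch of the consuming side: topic, table and connection details are
# hypothetical. The consumer drains the backlog at whatever pace the
# database can sustain.
import json

import mysql.connector
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stream-logs",
    bootstrap_servers="kafka.internal:9092",
    group_id="stream-log-writer",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

db = mysql.connector.connect(
    host="reports-db.internal", user="writer", password="secret", database="reports"
)

def write_batch(rows):
    cursor = db.cursor()
    cursor.executemany(
        "INSERT INTO stream_log_events (track_id, user_id, played_at) VALUES (%s, %s, %s)",
        rows,
    )
    db.commit()
    cursor.close()

batch = []
for message in consumer:
    for event in message.value.get("events", []):
        batch.append((event["track_id"], event["user_id"], event["played_at"]))
    if len(batch) >= 500:  # write in modest batches to keep database load steady
        write_batch(batch)
        batch = []
```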

In line with the usual techniques of software development at 7digital, we used tests to drive the design of this feature and were able to achieve continuous delivery. By creating the build pipeline early on, we could build the product in small increments and deploy them frequently.

A makefile run on a developer machine built a Docker container running the Python web app. We then used the Python test framework unittest to run unit, integration and smoke tests for the application. The integration tests checked that the app could write to Kafka topics, and the smoke tests were end-to-end tests which ran after a deploy to UAT or production to verify a successful deployment.
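
The smoke tests were essentially HTTP checks against the freshly deployed environment. A minimal, hypothetical example of the shape of such a unittest test (the environment variable, URL and payload are invented) is:

```python
# Hypothetical smoke test: the environment variable, URL and payload are
# illustrative only. It runs after a deploy to UAT or production and
# verifies that the deployed API accepts a well-formed request.
import json
import os
import unittest
import urllib.request


class SmokeTest(unittest.TestCase):
    base_url = os.environ.get("API_BASE_URL", "https://uat.example.com")

    def test_endpoint_accepts_valid_payload(self):
        payload = json.dumps({"events": []}).encode("utf-8")
        request = urllib.request.Request(
            self.base_url + "/stream-logs",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            self.assertIn(response.status, (200, 202))


if __name__ == "__main__":
    unittest.main()
```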

We successfully completed the project and the web application coped very well with the inbound traffic from the client. Since it was hosted on a cluster of EC2 instances, we could scale up both the number and the resources of the instances running our application. The database was also able to cope with the import of the voluminous user data. The project served as a good example of how to develop a scalable web API in the cloud which communicated with a database located in a data centre; it was 7digital’s first application capable of doing so and remains in use today.

 


Moving 7digital’s catalogue into the cloud

7digital provides a web API platform for its customers to discover, stream and download music. Every request to the public platform is first received by the Gateway API, which then redirects the request to one of a dozen or so web applications, each of which has a single responsibility such as search or payments. These web applications read catalogue data from a database and return it to customers in XML format via HTTP responses.

7digital’s customers are spread around the world, yet the API and its data are primarily located in London data centres. This means that latency, resilience and local targeting of content are not optimised for countries like India, where our largest music streaming service is located. In order to reduce latency for customers in Asia, our goal was to migrate the catalogue web applications and their corresponding database into an AWS region near those customers. Moving the APIs into the AWS cloud also made it possible to serve a greater volume of traffic by scaling up the number of web applications running in parallel; for example, more search servers running in parallel allow far more searches to be conducted at once.

Breaking down the problem

Cloud migration is such a large task that we decided to break the work down into smaller units of deliverable features, each of which would provide value to the customer.

The part of the platform our customers use the most is the ~/track/details endpoint, which provides the access rights and other details of a given track.

The first part of the migration was to move into AWS the parts of the platform that power this endpoint.

The overall goal was to reduce response times for customers around the world, particularly in Asia and the USA, for all requests to this endpoint.

As a prerequisite, we also needed to discover whether the Gateway API, which governs access to the 7digital API services, would need to move into the same AWS cloud too.

What is the Gateway API?

All requests to the 7digital API are first received by the Gateway API, which governs access to specific features of the platform.

Its duties include:

  • Identifying the client making the request
  • Checking that the client has not exceeded their daily request limit
  • Ascertaining if the client is allowed to access the service they are requesting

If all the above conditions are met, it will then redirect the request to the appropriate internal Web application which will provide the requested service.
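
As an illustration of that flow only (the function names, response codes and API-key identification below are assumptions, not the real implementation), the Gateway’s job amounts to three checks followed by a proxy step:

```python
# Purely illustrative sketch of the Gateway API's three checks; the helper
# functions and the API-key identification scheme are assumptions.
from dataclasses import dataclass


@dataclass
class Client:
    client_id: str
    daily_limit: int
    allowed_services: set


def handle_request(api_key, service, lookup_client, requests_today, proxy_to):
    # 1. Identify the client making the request.
    client = lookup_client(api_key)
    if client is None:
        return 401, "unknown client"

    # 2. Check the client has not exceeded their daily request limit.
    if requests_today(client.client_id) >= client.daily_limit:
        return 429, "daily request limit exceeded"

    # 3. Check the client is allowed to access the requested service.
    if service not in client.allowed_services:
        return 403, "service not permitted for this client"

    # All checks passed: forward to the appropriate internal web application.
    return proxy_to(service)
```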

Although Amazon provides a service similar to the Gateway API, it soon became clear that their product would not be sufficient to provide equivalent functionality, so a bespoke version was required.

The first deliverable feature from a customer perspective

The initial goal was to move the ~/track/details endpoint, with streaming data, to AWS, ultimately for a client in India. This can be broken down into the following steps:

  • Deploy a version of the Gateway API, which can authenticate requests from the public web
  • Deploy a portion of the 7digital API into AWS and have it operate in parallel with the existing 7digital API.
  • Move the web applications that provide data about tracks in the music catalogue into the AWS cloud
  • Create a new database used solely for storing catalogue data. This database will be queried by the Web applications that need to fetch track details.
  • Deploy the database into AWS with appropriate security and permissions so it can be accessed by 7digital’s web applications
  • Populate the catalogue database with the data from the London database.
  • Keep the catalogue database in the AWS region up to date with new content being ingested in the London data centre.

To cut this laundry list of requirements down to an even smaller slice of functionality, we decided to use our CDN, which sits in front of the Gateway API, as part of an experiment to test the APIs in the cloud.

It was possible to redirect a specific URL, from one customer and for one track, so that it was served from our cloud-based infrastructure instead of the London data centre.

Only that specific URL would be redirected; the remainder of the API web traffic would continue to the London data centre as normal.

This meant a request to that specific URL would use the following pieces of infrastructure, which we successfully deployed into AWS:

The MVP

  • A stub Gateway API in AWS, built with nginx, which merely redirected to:
      • a Metadata API
      • an Availability API
      • a Catalogue API
  • A MySQL database which could be accessed by the aforementioned APIs
  • A service to listen for messages containing catalogue data and save them to the MySQL database (the Catalogue Persister): a bespoke programme which listened for 7digital-formatted JSON messages and saved them into the 7digital-specific database schema
  • A Kafka messaging system hosted on a server in the AWS cloud

[Diagram: CDN and AWS Catalogue API, 2016]

The stub Gateway API was built with nginx, which could perform most of the functionality of the London Gateway API. The India-based client we intended to serve from the AWS-hosted version of the platform does not have a usage limit, so the stub Gateway API did not need to count requests.

Deploying the Metadata, Availability and Catalogue APIs to the AWS region proved that it was possible to run functioning APIs there. They queried the MySQL database which we deployed to the same AWS cloud.

The Catalogue Persister service, which listened for messages containing catalogue data, worked as a proof of concept but was not production-ready.

The idea of the persister was that we could send JSON-formatted web requests to this service, and it would persist them into the MySQL database in the AWS cloud. However, we soon realised that this would make it difficult to track what had been sent to the service, and the feature was abandoned. It did, however, fulfil its initial goal of populating the database with data sent to it from London, but it was superseded by the Kafka messaging approach.

In the London data centre there would need to be a service which read data from the London database and sent it to the AWS region. We used Kafka as the data integration point between London and the AWS region.

We opted to use a Kafka messaging system instead of the Catalogue Persister because it allows every message to be logged and stored in chronological sequence, making it easy to replay all of the messages that had been sent. This is important for ensuring that the databases in London and the AWS region stay consistent. The Kafka messaging service consisted of:

  • A server running Kafka, inside the AWS VPC (Virtual Private Cloud)
  • A server running the cluster management software Zookeeper, which managed the servers running Kafka
  • A VPC configuration which allowed the Zookeeper servers to receive requests from the public web and forward them on to the Kafka servers they were managing; the Kafka instances themselves were not accessible from the public web

In our London data centre we created a programme, the Catalogue Producer, which could read data from our existing shared database and send it to the Kafka service. This bespoke programme pulled out the data for a single album, encoded it as JSON, and sent it to the Kafka messaging service in the AWS cloud.

This experiment worked as a proof of concept. To make it ready for production we planned to change the programme so it could read the entire catalogue from the London database, convert it into many JSON-formatted messages, and send those to the Kafka instance in the AWS cloud, where they could be received and saved into the catalogue database.
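
A sketch of the shape of the Catalogue Producer (the SQL query, topic name and message format are hypothetical stand-ins; the real 7digital schema and JSON format are not reproduced here) might be:

```python
# Sketch of the Catalogue Producer: SQL, topic name and message shape are
# hypothetical stand-ins for the real schema and JSON format.
import json

import mysql.connector
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.aws.example.com:9092",
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"),
)

db = mysql.connector.connect(
    host="london-catalogue-db", user="reader", password="secret", database="catalogue"
)


def produce_album(album_id):
    cursor = db.cursor(dictionary=True)
    cursor.execute(
        "SELECT id, title, artist, track_number FROM tracks WHERE album_id = %s",
        (album_id,),
    )
    message = {"album_id": album_id, "tracks": cursor.fetchall()}
    cursor.close()

    # One JSON message per album, keyed so that updates for the same album
    # land on the same partition in order.
    producer.send("catalogue-updates", key=str(album_id).encode("utf-8"), value=message)


def produce_entire_catalogue(album_ids):
    # The production version would walk the whole catalogue this way.
    for album_id in album_ids:
        produce_album(album_id)
    producer.flush()
```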

It was proposed that we create a service in AWS which could read messages from Kafka and persist them into the AWS database, thus allowing data from the London database to be transmitted into the cloud database. However this was not started in 2015, and so uncertainty #7 remained an open question.
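
Such a consumer would essentially mirror the producer. A hedged sketch (topic, table and column names are invented) of what that service might look like:

```python
# Hedged sketch of the proposed AWS-side consumer: topic, table and column
# names are invented. Starting from the earliest offset is what makes
# Kafka's "replay everything" property useful for rebuilding the database.
import json

import mysql.connector
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "catalogue-updates",
    bootstrap_servers="kafka.aws.example.com:9092",
    group_id="catalogue-persister",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

db = mysql.connector.connect(
    host="aws-catalogue-db", user="writer", password="secret", database="catalogue"
)

for message in consumer:
    album = message.value
    cursor = db.cursor()
    for track in album["tracks"]:
        # REPLACE INTO keeps the write idempotent, so replaying old
        # messages simply overwrites rows with the same data.
        cursor.execute(
            "REPLACE INTO tracks (id, album_id, title, artist, track_number) "
            "VALUES (%s, %s, %s, %s, %s)",
            (track["id"], album["album_id"], track["title"],
             track["artist"], track["track_number"]),
        )
    db.commit()
    cursor.close()
```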

Once the data was in the AWS catalogue database, the portion of the 7digital API hosted in the AWS region could serve it, thus fulfilling the project goals.

[Diagram: AWS Catalogue API, June 2016]

Summary of Outcomes

The Catalogue API and the supporting Metadata and Availability APIs were deployed to an AWS region. A MySQL database was deployed there too, with the correct security settings so that the three APIs could query it. A stub Gateway API was created which could redirect external web requests to the internal Catalogue API, making the Catalogue API accessible via the public web. We then made a change at 7digital’s CDN provider, Fastly, to allow a specific request to be redirected to the AWS version of the Catalogue API instead of the London-based API.

We also set up a Kafka instance in the AWS region to act as the place where messages containing catalogue data were sent. This Kafka instance only accepted traffic from the 7digital London office.

A service which could read a single track from the existing London database and push a message containing the track data to the Kafka instance was also created. This was called the Catalogue Producer, and its goal was to send catalogue updates to the AWS database via the Kafka service.

The goal of moving the Gateway API was partially achieved by reducing the scope of its functionality, but questions remain around authentication and request limiting.

Although it was possible to serve live traffic, questions remain about how effectively the platform would perform. A/B testing would help us see what levels of traffic the new infrastructure could handle.

The challenge of moving catalogue data from London to the AWS database was partially solved: we showed that it is possible to transmit a message containing the data representing an album into a Kafka messaging service in the AWS cloud. This project served as a useful proof of concept and a starting point for the migration of 7digital’s API into an AWS region.


Using TDD to write Infrastructure Configuration Code for Legacy Servers

There’s much talk about DevOps these days, with its differing interpretations. One aspect is automating the configuration of your server infrastructure by writing configuration code that is version-controlled and tested. But how can this be applied to existing infrastructure, in particular poorly maintained legacy systems that are already in production? This is a challenge that the Content Discovery Team at 7digital (which I lead) overcame.

The Challenge

We had inherited a legacy production server with no UAT version and no tests. Every day the system generated mission-critical files, populated by database queries that took around eight hours to complete.

There was a requirement to change the application, but doing so immediately would have meant a perilous deploy to production without any tests, which I decided was not acceptable. We would only deploy to production once we had built a UAT environment and tested the code there, but the first step was to get the code running on a developer machine.

Our first challenge was to understand exactly what the program was doing. This was made trickier by the fact that it ran out-of-date versions of Ruby on old versions of Linux.

After a frustrating period attempting to run the current code with its ancient Ruby gems, we decided that “what the software did” trumped “how it did it”. A new production server with the latest versions of Ruby and its gems would ultimately be created, with a matching UAT environment.

Replacing manual configuration with automated configuration

7digital has been moving to a single configuration management system, and CFEngine was the tool of choice. Rather than manually installing and configuring the new server, we wrote a CFEngine promise file with the bare minimum requirements for the new application, starting with the requirement that the correct version of Ruby was installed.

As a team (myself, Dan Kalotay, Dan Watts & Matt Bailey) we were familiar with CFEngine, but not with testing configuration code. We were helped by fellow 7digital developers Sam Crang and James Lewis, who got us up and running with Test Kitchen, and Anna Kennedy whose CFEngine knowledge helped us greatly.

To test the configuration we used Test Kitchen, which uses Vagrant to spin up a new Debian server on the development machine. The Debian image was one of our production server images, which had the CFEngine agent pre-installed. Using the Vagrantfile we slaved it to use the locally edited CFEngine file as its source of configuration, instead of a remote CFEngine hub. Once the virtual machine had been created, running Test Kitchen’s kitchen converge command would apply the CFEngine promises defined in the local promise file.

So far, so DevOps. But how does one automatically assert that the promises set the desired state?

Writing Infrastructure Code in a TDD Manner

Using Test Kitchen it is possible to run a suite of tests against the newly created virtual machine. We used Serverspec, as it allowed us to write RSpec-style tests. Serverspec uses SSH to connect to the virtual machine, where it can run commands to assert that the machine was configured as desired.

We then started to write tests for the state of the server in a TDD fashion. We would for example:

  1. Write a test to assert that a cron job will run at a certain time,
  2. Run the test and see it fail, since the cron job has not been created yet
  3. Create a CFEngine promise to create the cron job,
  4. Run kitchen converge && kitchen verify to apply the configuration and run the test again
  5. See the test pass, or if it fails, go back to step 3

In this way we added more configuration by repeating this Red-Green-Refactor process, which is familiar to most modern programmers. Running the Serverspec tests allowed us to drive out the configuration, accreting functionality on the virtual machine and building up the promise file that set up the server state.

Deploying to UAT

Once we were happy with the configuration, we committed and pushed the CFEngine promise file up to the CFEngine policy hub for our UAT environment. It was then straightforward to request that our infrastructure team create the new UAT server slaved to that same policy hub. Once in UAT we could run more detailed tests overnight, since the SQL queries we were running took around eight hours to complete. Our QA team worked with us to assert that the product worked in accordance with our acceptance criteria.

Deploying to production, now a low-ceremony, low-risk event

Once everyone was happy with the end result, it was time for our infrastructure team to spin up a new production virtual machine slaved to our CFEngine hub. Within minutes our replacement server was in production, along with the changes our clients required. Within a few days the old server had been permanently retired.

This was consistent with the idea that organisations should treat their servers as cattle, not pets. In the event of a server failure, a new virtual machine can be spun up with little human intervention, which works well where the server does not store mission-critical data. We had replaced a delicate and temperamental “pet-like” server with a more disposable “cattle-like” one: in case of server problems, prod.server0001 could be replaced within minutes by an identically configured prod.server0002.

What we learnt along the way

An early test failure we experienced was around installing Ruby gems. We attempted to use CFEngine to run the shell command gem install <gemname>, but this always failed. It turned out to be simpler to package the gem as a deb and install that, rather than using CFEngine to execute shell commands to install gems. This was due to the way CFEngine executes shell commands; the permissions were not appropriate for the application user.

Another problem we overcame was that CFEngine’s cp is not synonymous with the Linux cp command.

Conclusion

Whatever your particular interpretation of DevOps, in this project I learnt that what allowed us to succeed was close collaboration between developers and infrastructure engineers. The developers learnt much about the infrastructure, while infrastructure team members came to understand how the application worked and became comfortable using version control for infrastructure code. Clear communication was key, and we all learnt how to get the job done.

We also learnt that sometimes it’s worth starting from scratch rather than attempting to retrofit CFEngine promises to an existing server. The retrofit idea was abandoned because it was almost as risky as editing configuration files manually on the production server.

In future I’ll always try to get all existing production infrastructure I’m responsible for configured this way, as well as using the same approach for new infrastructure.
