Using Real Time Application Metrics To Optimise Solr Indexing Throughput

In my previous blog post I wrote about 7digital’s use of real time application monitoring. In this post I will expand on how the team used application metrics to greatly improve a product that powered a critical part of 7digital’s infrastructure.

I’m a member of the Content Discovery Team, who are responsible for making 7digital’s catalogue of 25 million tracks available via the 7digital API.

One of the challenges of making the catalogue accessible is that the music industry lacks a universal unique identifier for artists, tracks and album releases, which means full text searching is important for discovering content. At 7digital, to cope with the increase in both traffic and catalogue size, full text searches were moved from a relational database to a Lucene-powered Solr index back in 2011.

Currently we extract the metadata for the music catalogue from a relational database, transform it into JSON DTOs and then post it to our Solr servers, which perform full text searches for music releases, tracks and artists. Placing the catalogue into Solr makes it searchable, and also makes it possible to retrieve release and artist metadata by supplying a unique 7digital ID.

The datasets are relatively large; creating an index of the entire dataset of 25 million tracks with the schema used would take many days, so it was necessary to  append the hourly changes of the catalogue to an existing index.

There was a relational database containing many millions of updates to the catalogue, stored in time order; this was the typical amount of metadata generated per day. Each update was represented as a single row in a “changes” table in the database.

For example, a row storing metadata for a track would contain the track title, track artist and track price among its columns. Each row would also have a column denoting whether that metadata was being inserted, updated or deleted from the catalogue.
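For illustration, such a change row might be modelled in C# along the lines below; the class, property and column names here are hypothetical rather than 7digital’s actual schema.

// Hypothetical shape of a row from the "changes" table (not the real 7digital schema).
public enum ChangeOperation { Insert, Update, Delete }

public class TrackChange
{
    public long RowId { get; set; }            // monotonically increasing position in the changes table
    public int TrackId { get; set; }           // the unique 7digital ID of the track
    public ChangeOperation Operation { get; set; }
    public string Title { get; set; }
    public string Artist { get; set; }
    public decimal Price { get; set; }
}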

Solr acts as an HTTP wrapper around a Lucene index, so creating an index of the data was done via JSON posts to the Solr server. One piece of metadata was read from the relational database, mapped into a JSON document and then posted via HTTP to the Solr server. Solr would then add this document to its Lucene index, which could then be searched. We refer to this entire process as indexing.
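As a minimal sketch of what a single post might look like, assuming a Solr core named “tracks”, a much-simplified document shape and Solr’s JSON update handler (7digital’s real schema and endpoints will differ):

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Illustrative only: posts one JSON document to Solr's update handler.
public static class SolrPostExample
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task PostDocumentAsync(string solrBaseUrl)
    {
        // A simplified document; it only becomes searchable after a later commit.
        var json = @"[{ ""id"": ""track-12345"", ""title"": ""Some Track"", ""artist"": ""Some Artist"" }]";
        var content = new StringContent(json, Encoding.UTF8, "application/json");

        var response = await Client.PostAsync(solrBaseUrl + "/tracks/update", content);
        response.EnsureSuccessStatusCode();
    }
}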

The Problem

With the old code base, indexing this data often took hours. This meant that end to end testing in the UAT environment involved lengthy multi hour “code, deploy, run tests” cycles. Typically a change would be made early in the morning so the results could be inspected by lunch time. One could look at the log files to check on progress, but that quickly became awkward.

The Goal

In early summer 2013 we rewrote the legacy code which performed this Extract, Transform and Load process from the relational database to the Solr server. It was no longer fit for purpose and was causing data inconsistencies which had come about in part from the long feedback cycle between making a code change and seeing the test result.

The goal was to improve the metadata indexing throughput to Solr whilst ensuring data correctness. After writing extensive unit and integration tests to verify data consistency, we turned to performance tuning in our acceptance tests. We wanted to index as many documents to Solr as quickly as possible. Log files were available, but it wasn’t straightforward to parse meaning from them when looking for patterns across a time span.

This arrangement was clearly suboptimal, so we decided to incorporate real-time monitoring. We were already using StatsD to log, and Graphite to visualise, the web responses on our internal API endpoints, and realised we could put the same tools to use here.

After improving the SQL queries, we discovered that the largest factor affecting indexing throughput was how many documents we posted between each commit.

Each document contains a change, which could be a delete, an insert or an update of an existing piece of metadata (e.g. a price change of a track). In our scenario, Lucene treats an update and an insert as the same thing; any document indexed with an ID matching an existing one will overwrite the original. Changes could therefore be boiled down to either a delete or an “upsert”.

After posting a certain number of documents, a commit command is sent, whereby the index commits the recent changes to disk. Once committed, the new documents are in the index and are searchable. While the commit phase is happening, no new documents can be posted to the Solr server. Depending on the size of the index and the number of documents, the commit phase can take many seconds, or even minutes.

Lucene works most effectively when only deletes are committed in one batch and only upserts are committed in another. This is much faster than interleaving deletes and upserts within the same period between commits. Commits also come with a minimum time to complete, during which no new documents can be posted.
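A rough sketch of that batching strategy, reusing the hypothetical TrackChange type from earlier, might look like the following; the ISolrClient interface and its method names are assumptions for illustration, not 7digital’s actual code.

using System.Collections.Generic;
using System.Linq;

// Assumed abstraction over the HTTP calls to Solr; not the real 7digital client.
public interface ISolrClient
{
    void PostDeletes(IEnumerable<int> trackIds);        // e.g. posts {"delete":{"id":"..."}} commands
    void PostDocuments(IEnumerable<TrackChange> docs);   // full documents; matching IDs overwrite
    void Commit();                                       // flushes recent changes to the Lucene index
}

public class ChangeIndexer
{
    private readonly ISolrClient _solr;

    public ChangeIndexer(ISolrClient solr)
    {
        _solr = solr;
    }

    // Index one range of changes: all deletes in one commit, all upserts in another.
    public void IndexRange(IReadOnlyList<TrackChange> changes)
    {
        var deletes = changes.Where(c => c.Operation == ChangeOperation.Delete).ToList();
        var upserts = changes.Where(c => c.Operation != ChangeOperation.Delete).ToList();

        if (deletes.Any())
        {
            _solr.PostDeletes(deletes.Select(c => c.TrackId));
            _solr.Commit();
        }

        if (upserts.Any())
        {
            _solr.PostDocuments(upserts);
            _solr.Commit();
        }
    }
}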

The number of rows read mapped directly to the number of documents indexed.

Using the StatsD C# client we timed how long it took to read from the relational database, post the documents to Solr, and then commit them, and we sent these timings to a StatsD server. We then used Graphite to visualise this data.
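To illustrate the mechanism, a timing can be captured with a Stopwatch and sent over UDP using the StatsD line protocol; in practice the StatsD C# client library takes care of this, so the hand-rolled sender and metric names below are purely illustrative.

using System.Net.Sockets;
using System.Text;

// Minimal StatsD timer sender, shown only to illustrate the mechanism.
public class SimpleStatsdSender
{
    private readonly UdpClient _udp;

    public SimpleStatsdSender(string host, int port = 8125)
    {
        _udp = new UdpClient(host, port);
    }

    // StatsD timer line protocol: "<metric name>:<milliseconds>|ms"
    public void Timing(string metricName, long milliseconds)
    {
        var payload = Encoding.ASCII.GetBytes(metricName + ":" + milliseconds + "|ms");
        _udp.Send(payload, payload.Length);
    }
}

// Usage sketch (the metric name and the solr client are hypothetical):
// var statsd = new SimpleStatsdSender("statsd.example.internal");
// var stopwatch = Stopwatch.StartNew();
// solr.Commit();
// statsd.Timing("indexer.solr.commit", stopwatch.ElapsedMilliseconds);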

The graph below shows:

time taken to read the metadata changes denoting deleted tracks/releases from the relational database, within the given range of rows
time taken to post the delete documents to Solr
time taken for Solr to enact the commit (during which period no new documents may be posted)

time taken to read the metadata changes denoting the upserted tracks/releases from the relational database, within the given range of rows
time taken to post the new documents to Solr
time taken for Solr to enact the commit (which prevents any new documents from being posted)

Chart showing indexing timings

As we can see, committing updates takes up to 72 seconds and makes up by far the greatest proportion of the time.

Large ranges would make the SQL query component take a long time, occasionally timing out and needing to be repeated. Additionally, commits would take longer.

Using smaller range sizes resulted in more frequent posts and commits to Solr, but commits seemed to have a minimum duration, so there came a point where very small ranges meant the commit times became the largest component of the indexing process and document throughput dropped.

How to measure indexing throughput?

We added some more StatsD logging, which shows the indexing throughput over the same time period as the chart above.

The chart below shows the row ID on the y axis; each row represents a change of metadata. The blue line shows the newest change in the relational database. The green line shows the latest row ID of the changes that have been posted to Solr. They follow each other with a small delay; a measure of indexing throughput is how closely the green line follows the blue.

Indexing throughput

The right of this graph showed us that there were periods of time when no updates were written into the relational database, but when there were writes, the indexer could index them into Solr in less than two minutes.

In the C# code that did the Extract, Transform and Load, we added additional StatsD metrics, which reported the maximum rowID of the changes table in the relational database and the latest rowID that had been posted to Solr. This graph is still used in production to monitor the application in real time. It lets us observe when many metadata changes are made (for instance, when new content is added), diagnose any issues with the database we read from, and see how long it may take for the content to become available in the Solr servers.
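Those two values can be reported as StatsD gauges, as sketched below; the metric names are hypothetical, and the UdpClient is assumed to have been constructed with the StatsD host and port.

using System.Net.Sockets;
using System.Text;

// Illustrative gauge reporting; graphing both gauges on one chart gives the blue/green pair above.
public static class IndexingLagMetrics
{
    public static void Report(UdpClient statsd, long maxRowIdInDatabase, long lastRowIdPostedToSolr)
    {
        SendGauge(statsd, "indexer.changes.max_row_id", maxRowIdInDatabase);
        SendGauge(statsd, "indexer.changes.last_indexed_row_id", lastRowIdPostedToSolr);
    }

    // StatsD gauge line protocol: "<metric name>:<value>|g"
    private static void SendGauge(UdpClient statsd, string name, long value)
    {
        var payload = Encoding.ASCII.GetBytes(name + ":" + value + "|g");
        statsd.Send(payload, payload.Length);
    }
}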

The Result

By being able to monitor how many new changes were being written to the database, and how quickly the indexer could commit them to Solr, we discovered that beyond a certain rate, increasing throughput would have no noticeable effect, since there was no benefit in attempting to read more rows per hour than were being created. This level of monitoring allowed us to decide effectively on the direction to take when developing the product. The monitoring also gave us the unexpected benefit of visibility into a business critical database process which was previously unmonitored: the rate of changes being recorded was now known.

Ultimately, in order to maximise end to end indexing throughput, we found a sweet spot of 100k documents, and by using the metrics we only optimised where we needed to, thus saving valuable development time.


7 things I’ve learnt in 7 months at 7digital

In January this year I started working at 7Digital, the leading global open music platform, in the Search & Content Discovery Team.

My role as a developer was quite a change from my previous jobs and I have relished the challenge of something new.

I’ve learnt not only new technologies but also better methodologies. Here are the 7:

1. Monitoring

Real time application monitoring with New Relic. Install the New Relic agent on the server you wish to monitor and, with a little configuration, your web app will start logging a wealth of metrics such as web request and response times, database query times and error rates. The visualisation of this data is very slick too.

StatsD & Graphite, which allow monitoring of any custom metric and present it in easy to read graphs. 7digital take a similar philosophy to Etsy and favour monitoring production applications over trawling through log files. StatsD has a C# client library which we make extensive use of. Written by fellow 7digital developer Goncalo Pereira, it allows us to monitor any metric by sending UDP packets to a StatsD server. Unlike New Relic, metrics must be explicitly defined within the application by writing code that calls the library. This lets us monitor the areas that New Relic doesn’t, along with very specific events. Being able to see Zabbix stats is often a benefit too.

2. Continuous Delivery

TeamCity and continuous delivery make deploying to production a simple case of clicking a few buttons on the TeamCity project page, which allows 7digital to deploy to live dozens of times a day. Coupled with the extensive monitoring that we have, live deploys are low drama affairs.

3. Linux

Use of Raspberry Pis for displaying the Graphite monitoring graphs encouraged me to have a tinker with a Pi at home too. 7digital now use Pis, affixed to monitors hanging from the ceiling around the office, to display all their production statistics.

I now work with Linux servers in production. I’d played with Linux in my own time at home, but now that our team is responsible for ensuring performance of the Solr infrastructure, I’ve learnt an awful lot.

4. Solr

Solr is an HTTP accessible search platform powered by Lucene. 7digital’s search and catalogue API endpoints are powered by Solr servers. A large part of my day to day activities relate to improving the search performance of the 7digital API, understanding how best to insert documents into the Lucene index and understanding what makes a good schema. My colleague James Atherton gave a presentation at the Lucene/Solr Revolution conference in San Diego earlier this year, discussing what was learnt in the process of migrating from a SQL Server text search system to a scalable Solr infrastructure. As part of managing the infrastructure, I’ve just begun to learn about configuration management, using CFEngine to manage server configuration in an automated fashion.

5. RESTful APIs

7digital’s API is a RESTful one. Each endpoint on the public API routes to an internal API which is owned by a development team. OpenRasta is the web framework used in my team, which was quite different from working with ASP.Net MVC powered websites.

6. Methodologies inspired by The Goal, and The Phoenix Project

The Goal: A Process of Ongoing Improvement and The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win are books which everyone at 7digital has been encouraged to read. Both are novels about organisations struggling to improve their engineering processes, which succeed once they focus on finding the bottlenecks that are slowing them down. The Goal was originally written in the 1980s and focussed on a manufacturing plant’s processes, which correlate surprisingly well with software delivery processes. The Phoenix Project is essentially a modern rewrite of The Goal, but focussed on an IT department within a larger organisation. Both books vividly describe the problems many developers, sysadmins and managers have experienced first-hand. I liken them to Clean Code, but for organisations.

7. Team

Working with a smart team who can deliver results is a great experience. Compared to my previous workplaces, there’s a more collaborative approach to problem solving across the teams. The level of expertise within the company is inspiring and I always learn something new each time I informally chat with a member of another team.

I’ve been impressed so far and I look forward to what the future projects will bring!


My Ruby on Rails photo site hosted on Heroku

In a previous post on the topic of deploying web applications onto the Azure cloud, I compared that experience to deploying Ruby on Rails web apps onto Heroku.

I realised that I’ve not mentioned my experiences with coding in Ruby. My motivation to learn Ruby on Rails was triggered by the fact that I was developing a rather large ASP.Net MVC web app in my day job, and according to one of the books on my reading list, Rails was the inspiration for Microsoft’s framework.

I’d returned from a holiday in Russia and I wanted to share the photos with my friends in a novel fashion. Why not create my own website to share them and learn Ruby on Rails in the process? As a result, I created a few sites using RoR, this one being the public clone of my photos website. The rest of this post covers how I launched these sites.

After a little googling, I decided I would do my Ruby development on a Linux based OS as opposed to Windows, since I would be able to keep up my Linux knowledge. Vim was to be my text editor of choice.

To get a grasp of the Ruby language I started working on the Ruby Koans exercises, and once I felt sufficiently competent I moved on to understanding how it weaves into Rails to produce such an accomplished and popular web framework. Michael Hartl’s Tutorial is the perfect place to start.

I dabbled with deploying a Ruby on Rails web app on Phusion Passenger without much luck, so I then decided to take the simplest route to hosting and go with Heroku. I’d decided quite some time ago that I would learn Git; now was the time to take the plunge!

Along my path to deploying the web site, I went for the option of using a mature open source photo site called Balder. Doing this meant I could get my web site launched in a reasonable time frame and also avoid reinventing the wheel.

Overall I felt it was a positive learning experience. I started to learn a new language and framework, which allowed me to appreciate MVC.Net and see where Ruby isn’t so great in comparison. For example, Ruby’s concept of Active Record may be useful for rapid development, but it lacks the strengths of Fluent NHibernate. Some have argued that Active Record is itself an anti-pattern.

I became much more familiar with the Linux terminal, and I’m happy to see that Windows PowerShell has analogues to the Linux commands I’ve committed to muscle memory. I’m also a big Git fan and I’m comfortable using it at the command line.

With my understanding of Git I’ve come to use GitHub too, and all the great ideas that flow from it.

It was a worthwhile endeavour, even just for the insight gained. You can visit the fruits of my labours and see for yourself: http://publicpics.herokuapp.com/

