RideLondon 2016

On 31 July I participated in the 2016 Prudential RideLondon cycling event, a closed-road, 100-mile course along the streets of London and the hills of Surrey.

Pall Mall

I completed the course in 4 hours 57 minutes, 53 minutes faster than my time last year, so I was very happy with my improved performance.

The Mall after the finish line

I set out to raise money for Haematology Cancer Care, a charity operating in UCL Hospital in Euston, London.

Recovering in Green Park after finishing the ride on the Mall

Through my fundraiser page I raised more than £800 for Haematology Cancer Care.

Box Hill

Here are my cycle stats with all the juicy ride data:

I couldn’t have done it without the great team of fellow cyclists working together in a mini peloton!

The Finishing Team!



Improving search results at 7digital

Developing the search & catalogue infrastructure for the 7digital API

Technological Objectives

The biggest problem with the old search platform was speed: in January 2014 the average response time for a track search was 4,000 milliseconds. In addition to being slow, the search results were often wrong, out of date, or returned errors. Customer feedback was that the user experience was poor, and they were often irritated by the constant stream of error messages.

The track metadata was stored in a search index that was 660 GB in size and contained 660 million documents, which is extremely large compared with most search indexes. Various tweaks were made to the JVM and memory settings, but these produced no permanent improvement. An extensive investigation of the search platform revealed that the schema in production was indexing fields, such as track price, that were never actually searched on. A prototype was created with a much smaller search index of around 10 GB, a clear improvement on the original 660 GB.

Technological Advancements

This new, smaller search index brought a number of technological advances:

relying on the ~/track/details endpoint meant we always returned current results and were 100% consistent with the rest of the API, which eliminated the catalogue inconsistency problem;

we could build a brand new index within an hour, keeping the data up to date;

much faster average response times for ~/track/search, with the average reduced by around 88%, from roughly 2,600 ms to roughly 350 ms;

no more deleted documents bloating the index, thus reducing the search space;

longer document cache and filter cache durations, which were only purged at the next full index every 12 hours, improving performance;

quicker to reflect catalogue updates, as the track details were served from the SQL database, which could be updated very rapidly.

Technological Uncertainties

Would the existing hardware cope with current levels of traffic under the new architecture?
Would the new architecture provide consistent responses across different endpoints, i.e. the search, catalogue and chart endpoints?

Innovations

We created a public Git repository on GitHub; anyone can submit a pull request to add new synonyms for our search platform. If we accept a synonym, the change takes effect on our search API within 12 hours of acceptance. The repository is here: https://github.com/7digital/synonym-list

Some labels deliberately publish tribute, sound-alike and karaoke tracks with names very similar to popular tracks, in the hope that some customers will purchase them by mistake. These tracks are then ingested into our platform, and 7digital’s contracts with those labels mean we are obliged to make them available. At the same time, consumers of our search services complain that the karaoke and sound-alike artists appear in the search results above the genuine artists, mostly because of the repeated keywords in their track and release titles.

To satisfy both parties, we decided to override the default Lucene search behaviour and exclude tracks, releases and artists containing certain words in their titles, unless the user specifically entered those words as part of the search term. For example, searching for “We are the champions” now returns the tracks by the band Queen, which is what customers expect. To achieve this we tweaked the search algorithm so that, by default, all searches purposefully exclude tracks with the text “tribute to” anywhere in their textual description, whether in the track title, track version name, release title, release version name or artist name.
The results look like this: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20queen

Prior to the change, all tribute acts would appear in the search results above tracks by the band Queen. To allow tribute acts to still be found, the exclusion rule will not apply if you include the term “tribute to” in your search terms, as evidenced by the results here: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20tribute%20to%20queen
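The conditional-exclusion rule can be sketched in a few lines. The snippet below is a hypothetical Python illustration expressed as an Elasticsearch-style boolean query; 7digital’s actual change was made inside its Lucene query-building code, and the field name, function name and excluded-phrase list (beyond “tribute to”) are invented for the example.

```python
# Hypothetical sketch of the exclusion rule, not 7digital's actual implementation.
EXCLUDED_PHRASES = ["tribute to"]  # the real list may contain more phrases

def build_track_query(user_query: str) -> dict:
    """Build a boolean query that hides sound-alike content unless the user
    explicitly asked for it by including the phrase in their search terms."""
    must = [{"match": {"all_text": user_query}}]
    must_not = [
        {"match_phrase": {"all_text": phrase}}
        for phrase in EXCLUDED_PHRASES
        if phrase not in user_query.lower()  # rule is lifted if the user typed the phrase
    ]
    return {"bool": {"must": must, "must_not": must_not}}

# "we are the champions"            -> tribute tracks excluded
# "we are the champions tribute to" -> tribute tracks allowed back in
```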

Other music labels send 7digital a sound-alike recording of a popular track and name it so that its release title and track title duplicate the title of a well-known track. This meant that searching for “Umbrella” by Rihanna would surface the sound-alike recordings alongside, or even above, the genuine track.

Search: a deliberately dumbed-down similarity in Lucene. Lucene is a capable search engine that specialises in fast full-text searches; however, the documents it is designed to search across work best when they are paragraph-length natural prose, such as newspaper articles. The documents that 7digital adds to Lucene are models of the metadata of a music track in our catalogue, along the lines of the example below.
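To illustrate the shape of these documents (the field names and values here are invented for the example, not taken from the production schema), a track document might look something like this:

```python
# Hypothetical track document, for illustration only.
track_document = {
    "track_title": "Thriller",
    "track_version": "Album Version",
    "release_title": "Thriller",
    "release_version": None,
    "artist_name": "Michael Jackson",
}
```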

The standard Lucene implementation gives documents containing repeated occurrences of a term a higher-scoring match than those containing a single occurrence. This means that for the search term “Michael”, a result such as “The Best of Michael Jackson” by “Michael Jackson” will score higher than “Thriller” by “Michael Jackson”, because the term appears twice in the first document but only once in the second.

In terms of matching text values this makes sense, but for a music search we want to factor in the popularity of our releases, based on sales and streams of their tracks.

Ignoring popularity leads to a poor user experience: the “Best of Michael Jackson” release is ranked as the first result, despite being much less popular than “Thriller”, which is ranked lower in the search results.

This was achieved by modifying Lucene’s term-frequency weighting in the similarity algorithm.
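The core idea is that repeated terms should not inflate a document’s score. The sketch below is a toy Python illustration of that idea, not the actual change, which would involve overriding the term-frequency method in a Lucene Similarity subclass in Java; the scoring here is deliberately simplified and ignores Lucene’s other factors (inverse document frequency, length normalisation and so on).

```python
import math

def default_tf(term_frequency: int) -> float:
    # Lucene's classic similarity uses sqrt(term frequency),
    # so repeated terms keep adding to the score.
    return math.sqrt(term_frequency)

def flattened_tf(term_frequency: int) -> float:
    # The "dumbed-down" variant: any number of occurrences counts the same,
    # so repeated keywords no longer push a document up the ranking.
    return 1.0 if term_frequency > 0 else 0.0

def toy_score(query_term: str, document_text: str, tf=default_tf) -> float:
    # Extremely simplified scoring: just the term-frequency factor.
    occurrences = document_text.lower().split().count(query_term.lower())
    return tf(occurrences)

best_of = "The Best of Michael Jackson Michael Jackson"   # title + artist
thriller = "Thriller Michael Jackson"

print(toy_score("michael", best_of))                    # ~1.41: wins purely by repetition
print(toy_score("michael", thriller))                   # 1.0
print(toy_score("michael", best_of, tf=flattened_tf))   # 1.0: repetition ignored
print(toy_score("michael", thriller, tf=flattened_tf))  # 1.0
```

With term frequency flattened, the text score no longer rewards repeated keywords, which leaves room for a popularity signal to decide the ordering between otherwise equal matches.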


Solving problems of scale using Kafka

At 7digital I was in a team tasked with solving a problem created by taking on a large client capable of pushing the 7digital API to its limits. The client had many users and expected their numbers to increase exponentially. While 7digital’s streaming infrastructure could scale very well, the client also wanted to send the logs of those streams back to 7digital via the API. This log data would be proportional to the number of users, and 7digital had no facility for accepting such data from a client, so this was a new problem to solve.

We needed to build a web API exposing an endpoint through which a 7digital client could send large amounts of JSON-formatted data, and to generate periodic reports from that data. The expected volume was thought to be much higher than the infrastructure in the London data centre could support, and scaling up the data centre to meet the traffic requirements would have been slow, costly and difficult. Building the API in AWS and transporting the data back to the data centre asynchronously was deemed the best approach.

Kafka was used to decouple the AWS-hosted web service accepting the incoming data from the London database storing it, acting as a message bus to transfer the data from an AWS region back to the London data centre. Kafka was already operational on the 7digital platform at this time, for non-real-time reporting purposes.

Since the application did not need to run in the London data centre, there was no advantage in writing another C# application hosted on the existing Windows web servers running IIS. Given the much faster boot times of Linux EC2 instances and the greater ease of using Docker on Linux, we elected to write the web application in Python, using Docker to speed up development.

The application used the Flask web framework and was deployed to an AWS ECS cluster, along with an NGINX container to proxy requests to the Python API container and a Datadog container to monitor the application.

The API was very simple: once an inbound POST request was validated, the application wrote the JSON to a topic on a Kafka cluster. The topic was later consumed and written into a relational database so that reports could be generated from it. Decoupling the POST requests from the process that writes to the database meant we could avoid locking the database, by consuming the data from the topic at a rate that was sustainable for the database, as sketched below.
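A minimal sketch of that shape of endpoint, using Flask and the kafka-python client, is shown below. The endpoint path, topic name, broker address and validation rule are invented for the example; the actual service had its own contract and configuration.

```python
import json

from flask import Flask, request, jsonify
from kafka import KafkaProducer

app = Flask(__name__)

# Producer serialises each payload to JSON bytes before sending.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],          # placeholder broker address
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)

@app.route("/stream-logs", methods=["POST"])          # hypothetical endpoint name
def accept_stream_logs():
    payload = request.get_json(silent=True)
    # Minimal validation stand-in; the real service validated the client's schema.
    if not isinstance(payload, dict) or "events" not in payload:
        return jsonify(error="invalid payload"), 400

    # Publish and return immediately; the database import happens asynchronously
    # from a consumer running in the London data centre.
    producer.send("stream-logs", payload)             # hypothetical topic name
    return jsonify(status="accepted"), 202
```

Acknowledging the request as soon as the message is on the topic keeps the endpoint fast regardless of how quickly the database can absorb the data.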

Since reports were only generated once a day, covering the data received during the previous day, a backlog of data on the Kafka topic was acceptable; there was no requirement for near-real-time data.
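The consuming side can be sketched in the same spirit. This is again an illustrative example with kafka-python and invented topic, group and table names, using SQLite as a stand-in for the real relational database: it reads messages from the topic and writes them to the database at its own pace.

```python
import json
import sqlite3   # stand-in for the real relational database

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stream-logs",                                    # hypothetical topic name
    bootstrap_servers=["kafka-broker:9092"],          # placeholder broker address
    group_id="stream-log-importer",                   # hypothetical consumer group
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

db = sqlite3.connect("reports.db")
db.execute("CREATE TABLE IF NOT EXISTS stream_events (payload TEXT)")

# Consume at whatever rate the database can sustain; a backlog on the topic
# is fine because reports only cover the previous day's data.
for message in consumer:
    db.execute(
        "INSERT INTO stream_events (payload) VALUES (?)",
        (json.dumps(message.value),),
    )
    db.commit()
```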

In line with the usual techniques of software development at 7digital, we used tests to drive the design of this feature and were able to achieve continuous delivery. By creating the build pipeline early on, we could build the product in small increments and deploy them frequently.

A makefile run on a developer machine built a Docker container running the Python web app. We then used the Python test framework unittest to run unit, integration and smoke tests for the application. The integration tests checked that the app could write to Kafka topics, and the smoke tests were end-to-end tests run after a deploy to UAT or production to verify a successful deployment.
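As an illustration of the smoke-test style (the real tests are not shown here, so the endpoint, payload and environment variable names are invented), such a test might simply exercise the deployed endpoint and assert on the response:

```python
import json
import os
import unittest
import urllib.request

class SmokeTest(unittest.TestCase):
    """Hypothetical post-deploy smoke test: hit the deployed API and
    check that it accepts a well-formed payload."""

    def test_endpoint_accepts_valid_payload(self):
        base_url = os.environ.get("API_BASE_URL", "http://localhost:5000")  # invented variable
        req = urllib.request.Request(
            f"{base_url}/stream-logs",
            data=json.dumps({"events": []}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as response:
            self.assertEqual(response.getcode(), 202)

if __name__ == "__main__":
    unittest.main()
```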

We completed the project successfully and the web application handled the inbound traffic from the client very well. Since it was hosted on a cluster of EC2 instances, we could scale up both the number of instances running our application and their resources. The database was able to cope with the import of the voluminous user data too. The project served as a good example of how to develop a scalable web API that communicates with a database located in a data centre; it was 7digital’s first application capable of doing so and remains in use today.

 
