What I’ve learnt at Skyscanner in my first three years

I joined Skyscanner in January 2018 as a Data Engineer in the Data tribe. It’s been a good experience and a massive learning curve. The coronavirus pandemic hit just as I felt I was getting into the swing of things, but it has also given me an opportunity to look back.

Here’s a little summary of what I learnt in my first three years:

  • Operational work matters: reducing toil, addressing errors and keeping on-call rotations sane are all important, particularly during a period of high staff turnover (the aviation sector shut down during the peaks of the pandemic).
  • Protobuf schema design is hard. Protobuf is used to define the schemas of all events sent to the Data platform, and subsequently how they are stored as tables in the Hive Metastore for later analytical querying. Whilst a field can be deprecated, it cannot be deleted, so schema design must ensure backward compatibility and anticipate future needs.
  • Running a Kafka cluster on EC2 instances is a major undertaking. Avoid doing so where possible and use a managed service such as AWS MSK or Kinesis instead.
  • Streaming data is not always necessary. Use batch ETL processes where you can, since they are repeatable (see the sketch after this list).
  • A metadata catalogue of datasets is key for data governance and data quality. Being able to identify the producers and data lineage of a table is very important.
  • I learnt how to build and run software that is Sarbanes-Oxley compliant, as well as how to implement GDPR Subject Access Request compliance and apply an appropriate data retention period.
  • Cloud cost monitoring in AWS is something engineers need to be aware of to keep costs under control. This is where I learnt about CloudHealth.
  • Good documentation is a key skill. A clear README should document how to run, test, build and deploy the repo. This will speed up onboarding new joiners.
  • Pairing remains an important skill for spreading knowledge in a team and completing projects faster.
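
On the batch point above, here is a minimal sketch of the kind of repeatable batch ETL job I mean: re-running it for the same date overwrites the same output partition, so backfills and reprocessing are safe. The paths, field names and date partitioning are illustrative assumptions, not anything Skyscanner-specific.

```python
"""A repeatable batch ETL job: re-running it for the same date produces
the same output partition. All paths and field names are illustrative."""
import csv
import json
from pathlib import Path


def run_etl(run_date: str, in_dir: Path, out_dir: Path) -> Path:
    # Extract: read the raw newline-delimited JSON event files for one date.
    raw_events = []
    for path in sorted((in_dir / run_date).glob("*.json")):
        with path.open() as f:
            raw_events.extend(json.loads(line) for line in f if line.strip())

    # Transform: keep only the fields the downstream table needs.
    rows = [
        {"event_id": e["id"], "user_id": e["user"], "event_type": e["type"]}
        for e in raw_events
        if "type" in e
    ]

    # Load: overwrite the whole output partition so re-runs are idempotent.
    out_path = out_dir / f"date={run_date}" / "events.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["event_id", "user_id", "event_type"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path


if __name__ == "__main__":
    run_etl("2021-01-01", Path("raw_events"), Path("warehouse/events"))
```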

Books worth reading:

Designing Data-Intensive Applications by Martin Kleppmann is excellent and well worth your time.

I would say that Web Operations by John Allspaw gave a good overview of how to run highly scalable web services. It gave me a good intellectual framework for understanding how to use Service Level Agreements, Service Level Indicators and Service Level Objectives to run a highly available service.
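
As a rough illustration of how these concepts fit together (my own example, not one from the book), here is a small sketch that computes an availability SLI from request counts and compares it against an SLO to see how much error budget remains. The SLO target, the numbers and the 30-day window are assumptions.

```python
"""Sketch of an availability SLI measured against an SLO.
The target and request counts are illustrative assumptions."""

SLO_TARGET = 0.999  # e.g. "99.9% of requests succeed over a 30-day window"


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    return successful_requests / total_requests if total_requests else 1.0


def error_budget_remaining(sli: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    allowed_failure = 1.0 - slo   # e.g. 0.1% of requests may fail
    actual_failure = 1.0 - sli
    if allowed_failure == 0.0:
        return 0.0                # a 100% SLO leaves no budget at all
    return 1.0 - (actual_failure / allowed_failure)


if __name__ == "__main__":
    sli = availability_sli(successful_requests=9_995_000, total_requests=10_000_000)
    print(f"SLI: {sli:.4%}, error budget remaining: {error_budget_remaining(sli):.1%}")
```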

The book also gives examples of good practice when running a large web operation, such as documentation, retrospectives and blame-free incident debriefs, and ensuring runbooks exist and are up to date so that the people responding to incidents have all the relevant information available to them.

Addressing technical debt is always going to be a balance between new feature work and removing blockers to future changes. I found Michael Feathers’ book, “Working Effectively with Legacy Code”, very insightful on how to manage it.

There’s another project in the works, so here’s to a better year in 2022!


New Role: Data Engineer at Skyscanner

I was delighted to start a new role as a Data Engineer at Skyscanner in London this January. I’m looking forward to learning how to run data pipelines at Internet-economy scale. This will definitely be Big Data! 😀


R&D Kafka Catalogue Cloud write-up

Avoiding the move of a monolithic database into the cloud

Music labels send 7digital not just the audio recordings, but also the data pertaining to the audio, such as artwork, track listings, performing artists, release dates, prices, and the specific rights to stream or download the music in various territories around the world. Approximately 250,000 tracks per week are received by 7digital and added to its catalogue, which is stored in a database. This process is called ingestion.

At the outset of the project there was a single database storing catalogue and sales data, used for multiple, unrelated purposes. New albums sent to 7digital by music labels are written to this database, along with the licensing rules governing who can access the music.

Slow queries cause the web applications that use the database to time out and fail, which returns errors to end users. Changing the database structure would help to resolve these errors and failures; however, it would necessitate rewriting nearly every other web application that 7digital owns, since the web applications are tightly coupled to the database. The database is also very large and uses proprietary, licensed technology that cannot easily be moved to data centres around the world.

By separating the catalogue data from the other data in our system, it would become possible not only to write a much more efficient database schema that failed less often, but also to move this database onto a cloud provider’s platform. With the database in the cloud, we could build part of 7digital’s Web API platform with the same cloud provider and therefore deploy our platform nearer to our customers in Asia.

Creating a separate catalogue database in London and moving the applications to AWS in Asia might help solve the problems with concurrent reads and writes to the database, but it would not help reduce the latency experienced by customers in Asia.

The key issue, then, is how to transport the relevant catalogue data from the ingestion process in London into an AWS region.
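
For illustration, here is a minimal sketch of the kind of producer the London ingestion process might use to publish catalogue updates onto a Kafka topic, for consumers running in an AWS region to pick up. It assumes the kafka-python client; the broker address, topic name and message shape are my own illustrative assumptions.

```python
"""Sketch: publish catalogue updates from the London ingestion process onto
a Kafka topic. Broker address, topic and message shape are assumptions."""
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.internal:9092"],  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_release_update(release: dict) -> None:
    # Key by release ID so updates to the same release stay in order
    # within a partition.
    producer.send(
        "catalogue-updates",
        key=str(release["release_id"]).encode("utf-8"),
        value=release,
    )


publish_release_update(
    {"release_id": 12345, "title": "Example Album", "territories": ["GB", "JP"]}
)
producer.flush()  # block until the messages have actually been sent
```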

Moving the ingestion process itself into AWS is a far larger piece of work and would not, by itself, produce performance improvements for customers in Asia, so it was decided not to move it out of the London data centre in 2015.

Failures of the project

During 2015 we did not complete the final link in the chain: the application that would read messages from the Kafka service and persist their contents into the AWS database. We also could not deploy a fully fledged version of the London-based Gateway API service, as it was too complex; instead we made a naive implementation of this gateway using nginx.
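
For context, here is a minimal sketch of what such a persister might have looked like: a Kafka consumer that upserts each catalogue update into a relational database in AWS. The library choices (kafka-python, psycopg2), the topic name and the table schema are assumptions for illustration, not the actual design.

```python
"""Sketch of a 'Catalogue Persister': consume catalogue updates from Kafka
and upsert them into a relational database running in AWS. Libraries, topic
and table are illustrative assumptions."""
import json

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "catalogue-updates",
    bootstrap_servers=["kafka.internal:9092"],
    group_id="catalogue-persister",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
conn = psycopg2.connect("dbname=catalogue host=db.aws.internal")

for message in consumer:
    release = message.value
    with conn, conn.cursor() as cur:
        # Upsert so reprocessing the same message is harmless (idempotent).
        cur.execute(
            """
            INSERT INTO releases (release_id, title)
            VALUES (%s, %s)
            ON CONFLICT (release_id) DO UPDATE SET title = EXCLUDED.title
            """,
            (release["release_id"], release["title"]),
        )
```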

The Catalogue Persister service was eventually abandoned. An instance of Kafka was built in late 2015, and the Catalogue consumer was due to be started in early 2016.
