How we turned a forced data migration into a well-oiled, efficient reports delivery machine

Big changes can come with hidden opportunities

A common pain point for software and data engineers is reviewing their backlog every day without having the time to handle it, because production issues constantly take priority.
To stay ahead of the curve, companies strive to adopt newer technologies, which usually comes with heavy technological changes: infrastructure migrations, new platform implementations or even internal organizational changes.
This is where data and dev teams must plan carefully how to leverage that opportunity and create an improved operational environment.

In this article, I will review how such a change affected SimilarWeb over the past year, and how we (the Data Solutions team) used it to build a cleaner, better and more stable analytics infrastructure.

About SimilarWeb:

SimilarWeb is recognized as the industry standard in market intelligence, with the world’s largest database of digital behavior. We provide companies with global, multi-device market intelligence to understand, track and grow their digital market share.

The Data Solutions team:
Reports as a service – offering and product evolution

The Data Solutions team at SimilarWeb is responsible for delivering custom reports based on SimilarWeb’s rich datasets. Our products are designed to answer all sorts of tricky business questions and consist of three main report types:

  1. Data Enrichment Report – Enriching domains/apps with SimilarWeb’s data.
  2. Prospecting Report – Building targeted lead lists for sales, biz dev and partnerships teams based on dynamic metrics.
  3. Conversion Reports – The golden metric that indicates the business’ overall performance; our team provides conversion rates ranging from a broad site level to the detailed keyword level.

Migrating to AWS

During 2018, the company’s entire data infrastructure was migrated to AWS – a very significant move that affected the entire R&D organization, across all teams.

As part of the migration of our data infrastructure to AWS, the company started using a platform for interactive analytics called Databricks. We were lucky enough to be the first team to use Databricks as our production environment. Our main task was migrating the entire reports operation to be executed within the Databricks platform, which can then connect to SimilarWeb’s Data Lake stored on AWS S3.

Leveraging the company’s migration for our own good

We saw the coming change as a big opportunity to make our ongoing reports delivery operations more efficient, scalable and maintainable.

WHAT we wanted to do was pretty obvious to us – creating a one-stop-shop delivery service with a clear input and output system, allowing the analyst to easily generate reports from a few input parameters and queries – and letting the system do the rest.
In other words, we wanted to get rid of all the redundant components involved in the old process and execute the entire operation on a single platform.

The following flow chart provides a bird’s-eye view of our work process, before and after the migration:

[Flow chart: Hadoop, Rundeck, Git, Databricks, AWS]

So… HOW did we make the WHAT happen? Here’s a breakdown of the changes we made at each step (Code Management, Report Scheduling and Report Execution):

  1. Code Management:

[Image: Code Management – Git + Docker]

Before the change, we used to create the reports locally, then push them to a Docker container using Git.

Today, Databricks allows us to work simultaneously on the same codebase in the form of Python notebooks (similar to Jupyter), monitor revision history and execute the reports, all in one place.

  2. Report Scheduling:

[Image: Report Scheduling – Rundeck]

Instead of creating scheduled jobs on Rundeck, our previous scheduling system (which spun up a temporary Docker container per execution), we transitioned to scheduling our jobs directly in Databricks.

Running multiple notebooks in one go using Databricks’ Job Scheduler

Databricks’ Job Scheduler feature allows the user to schedule one notebook at a time, set JARs or configure a spark-submit command.
So we used Databricks’ dbutils library, which lets us execute as many notebooks as needed from a single notebook; this way, we can efficiently manage all of the reports we need to deliver from one scheduled notebook.

The list of reports we deliver on an ongoing basis is iterated over with a few simple commands, executing the notebooks one by one:
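For illustration, a minimal version of that orchestration cell could look like the sketch below; the notebook paths are placeholders rather than our actual report names:

```python
# Runs each report notebook in sequence from the orchestrating notebook.
# dbutils is available out of the box inside any Databricks Python notebook.
report_notebooks = [
    "/Reports/data_enrichment_weekly",   # placeholder paths
    "/Reports/prospecting_monthly",
    "/Reports/conversion_monthly",
]

for notebook_path in report_notebooks:
    # dbutils.notebook.run returns whatever the child notebook passes to dbutils.notebook.exit()
    result = dbutils.notebook.run(notebook_path, 3600)  # 1-hour timeout per report
    print(notebook_path, "finished with:", result)
```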

The job that runs that notebook can be configured to run on a daily, weekly or monthly basis, with email alerts, a number of retries, a timeout and more:

[Image: Databricks Job Scheduler configuration]
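For reference, the same settings can also be expressed programmatically; the following is a hedged sketch against the Databricks Jobs API 2.0, where the workspace URL, token, cluster ID, notebook path, schedule and email address are all placeholders:

```python
# Hedged sketch: creating the scheduled job through the Databricks Jobs API 2.0.
import requests

job_payload = {
    "name": "monthly-reports-orchestrator",
    "existing_cluster_id": "1234-567890-abcde123",        # placeholder cluster id
    "notebook_task": {"notebook_path": "/Reports/orchestrator"},
    "schedule": {
        "quartz_cron_expression": "0 0 6 1 * ?",          # 06:00 on the 1st of every month
        "timezone_id": "UTC",
    },
    "max_retries": 2,
    "timeout_seconds": 4 * 3600,
    "email_notifications": {"on_failure": ["data-solutions@example.com"]},
}

response = requests.post(
    "https://<databricks-instance>/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_payload,
)
print(response.json())  # returns the new job_id on success
```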

  3. Report Execution:

[Image: Report Execution – queries on Hadoop]

Before the migration, we used to execute a bash script within the Docker container; it ran a Python script that took an .hql file (with one or more SQL queries) as input and performed the following three steps:

  • Query execution
  • Saving the output to S3
  • Sending an email with a download link to the file location on S3

So… here comes the cool part!
We created a different class for each of the goals above, so instead of three separate serial actions (and file formats!), everything now happens within a single notebook.
In practice, the user inserts a few report parameters:
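As an illustration, such a parameters cell could be built with Databricks widgets; the field names and values below are hypothetical, not our exact schema:

```python
# Hypothetical report parameters, exposed as widgets at the top of the notebook.
dbutils.widgets.text("report_name", "acme_conversion_2019_06")
dbutils.widgets.text("account_id", "12345")
dbutils.widgets.text("recipient_email", "analyst@example.com")
dbutils.widgets.dropdown("report_type", "conversion", ["enrichment", "prospecting", "conversion"])

report_params = {
    "report_name": dbutils.widgets.get("report_name"),
    "account_id": dbutils.widgets.get("account_id"),
    "recipient_email": dbutils.widgets.get("recipient_email"),
    "report_type": dbutils.widgets.get("report_type"),
}
```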

Next, the user inputs the queries to be sent through Athena’s API (the sheet name is used to name the corresponding tab in the Excel output):
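A simplified version of that input might look like this, where each dictionary key becomes an Excel tab and the table names and SQL are placeholders:

```python
# Illustrative query definition: {Excel tab name: SQL to run on Athena}.
queries = {
    "Top Domains": """
        SELECT domain, visits
        FROM analytics.traffic_monthly        -- placeholder table
        WHERE month = '2019-06'
        ORDER BY visits DESC
        LIMIT 100
    """,
    "Conversion by Keyword": """
        SELECT keyword, conversion_rate
        FROM analytics.conversions_monthly    -- placeholder table
        WHERE month = '2019-06'
    """,
}
```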

Importing the classes in Databricks is as easy as running an external notebook:
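In Databricks, the %run magic executes another notebook in the current context, which makes the classes it defines available in the calling notebook. Each %run needs its own cell, and the paths below are illustrative:

```python
%run ./classes/AthenaRunnerClass
%run ./classes/ExcelGeneratorClass
%run ./classes/EmailClass
```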



What does each class do?

  • AthenaRunnerClass – Sends the queries to the Athena service, which quickly returns the results as Pandas dataframes and also helps save costs (a minimal sketch of such a class appears right after this list). Here is more on How to reduce Databricks costs by optimizing queries using Amazon’s Athena.
  • ExcelGeneratorClass – Iterates over the returned Pandas dataframes and formats each one (query output) into a tab of the deliverable Excel file, adding a cover and a legend based on the user’s input and the report type. This class also produces a configuration JSON file that allows our SaaS platform to link the report to the paying account and enable downloading it directly from the platform. Finally, both files are uploaded to the account’s reports folder in an S3 bucket.
  • EmailClass – Sends an alert once the entire notebook has finished, notifying the team/customer that the report is ready and available.
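To make the first step concrete, here is a minimal sketch of what a class like AthenaRunnerClass could look like, built on boto3’s Athena client; the interface and parameter names are assumptions for illustration, not our production code:

```python
# Minimal, illustrative sketch of an Athena runner; not SimilarWeb's production code.
# Assumes the cluster's IAM role and default region provide access to Athena and S3.
import time

import boto3
import pandas as pd


class AthenaRunnerClass:
    """Submits SQL queries to Athena and returns the results as Pandas dataframes."""

    def __init__(self, database, output_location):
        # output_location is an s3:// prefix where Athena writes its result files
        self.client = boto3.client("athena")
        self.database = database
        self.output_location = output_location

    def run(self, queries):
        """queries: {sheet name: SQL string} -> {sheet name: DataFrame}."""
        frames = {}
        for sheet_name, sql in queries.items():
            execution = self.client.start_query_execution(
                QueryString=sql,
                QueryExecutionContext={"Database": self.database},
                ResultConfiguration={"OutputLocation": self.output_location},
            )
            query_id = execution["QueryExecutionId"]

            # Poll Athena until the query reaches a terminal state
            while True:
                status = self.client.get_query_execution(QueryExecutionId=query_id)
                state = status["QueryExecution"]["Status"]["State"]
                if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                    break
                time.sleep(2)

            if state != "SUCCEEDED":
                raise RuntimeError("Query for '%s' ended in state %s" % (sheet_name, state))

            # Athena writes the result as <query_id>.csv under the output location;
            # reading s3:// paths with pandas requires the s3fs package on the cluster.
            frames[sheet_name] = pd.read_csv(
                self.output_location.rstrip("/") + "/" + query_id + ".csv"
            )
        return frames
```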

Execution:

The above flow works by executing each class separately and passing its returned variable on to the next one:
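A condensed version of that execution cell might read as follows; the ExcelGeneratorClass and EmailClass interfaces shown here are assumptions, kept consistent with the sketch above:

```python
# Illustrative end-to-end run; class interfaces and names below are assumptions.
athena = AthenaRunnerClass(
    database="similarweb_datalake",                         # placeholder database name
    output_location="s3://example-bucket/athena-results/",  # placeholder S3 prefix
)
dataframes = athena.run(queries)            # {sheet name: Pandas DataFrame}

excel = ExcelGeneratorClass(report_params)
report_s3_path = excel.build(dataframes)    # writes the .xlsx + config JSON to S3

EmailClass().notify(report_params["recipient_email"], report_s3_path)
```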

Summary

Infrastructure changes and migrations can bring – along with hours of planning, trial and error, and headaches – big opportunities to create efficient, easy-to-use automations that enable the scaling of an operation or a product.

They also come with many opportunities to learn new tools, which in our case raised the team’s technical knowledge to a much higher level.

Planning in advance how to cope with future challenges, as well as reducing the amount of work and troubleshooting per task, allowed us to create a well-oiled delivery machine.

About the author:

Nimrod Fisher is a Senior Data Analyst in the Investor Solutions business at SimilarWeb. He works relentlessly to provide the most insightful and valuable data solutions in order to help customers gain a clear understanding of the digital world.