Motivation and Introduction
If you work for a software organization, there is a good chance that you have piles of data lying around. Interacting with this data may be interesting to your users, but it’s unstructured, unreachable, and maybe querying it takes too long.
In this post we will discuss some principles that will allow you to set up a user-facing near real-time query engine over your data with relatively little effort using AWS Athena. Setting up a working POC with AWS Athena will only take a few days and will be fairly inexpensive.
AWS Athena is an interactive serverless query service that allows you to directly access data stored in S3. Under the hood it utilizes a variant of Presto so you can use standard SQL syntax in your queries. Indeed, Presto as-a-service over S3 is one way of looking at it, (AWS does in fact has a separate offering for Presto on EMR ). Although when Athena was first released it was mostly intended to be used internally by organizations as a BI tool, we are going to use Athena directly as a serving layer for your users in this post.
Athena’s ease of use and unique pricing makes it a convenient big data playground. Athena is serverless and employs the AWS pay-as-you-go model, so you only pay for the amount of data scanned.
Perhaps one day you may decide that you do want to spin up your own Hadoop cluster, and that it makes financial sense, but that is a decision for another post.
From Big Data to Small-Optimized Big Data
The first step towards an efficient user-facing query system is flattening (May also be referred to as denormalization) all relevant data, and discarding the rows and columns that are known to be irrelevant.
It’s easy to see why throwing away unnecessary data is good for performance( I will go as far as to say that if your data looks something like a long tail distribution, in the sense that a small percentage of rows makes a big percentage of hits, it may be worth in some cases to discard a part of the tail in favor of better performance and greater savings.) but to understand why flat data is better, we need to consider how data is stored in the kind of columnar database that we will use.
In a regular database records are stored in rows (a row here means the bytes on the disc that are being read together in a single operation), and the indexes are sorted while the data is not. The index is mapped to the primary key which is the id of the row. This optimizes retrieval of records.
In a columnar database, columns are stored in rows, and the data is sorted while the indexes are not. The data is the primary key which is mapped to the index. This optimizes set-wide operations such as scanning. But, in case the column you want to search on is not available in the record and you need a join operation, this property no longer holds, as there is no way of scanning efficiently. In short, if you want to search over a certain column, you need to have it available in every record, hence the flat table.
Although denormalizing the data might inflate it, since we’re optimizing performance at the expense of storage, we will be happy to pay this price. Storage is relatively cheap, and this tradeoff we make here is one we will make often.
The above is also true in case your data continuously flows in. Consider the possibility of recording it twice; once for posterity, and once in an optimized form as described above.
Let’s consider an example of a flight search engine, only instead of looking for cheap flights in the future we will be looking at historical data. Our imaginary database is comprised of many files representing several tables that contain information on every commercial flight in existence. Every day at 00:00 we get new raw data files with information on the flights made on the previous day. We would like to build a system that allows users to answer questions such as “What flights were made on the 20th of April, 2010, from London to Paris?”
Obviously the raw data we would start with is immense. Luckily, some columns such as `max_altitude`, are of little interest to most users, so we can throw those out. In addition, we make the decision to exclude all domestic flights because those are less interesting to our users. Hopefully, after discarding irrelevant rows and columns we will be left with something more manageable.
Our made-up database has two tables within it: the first is a very big table with flight data and the second is a smaller table that isn’t updated often with airport data. Since we want to allow users to search for flight origin and destination not only by airport codes that we have in the flight table, but also using the city names we have in the airport table, we will have to join some columns to each row from this table as well. It will add some data redundancy, but allow for very fast queries on those fields.
How to Store the Small-Optimized Big Data
We’ve touched on some of the changes you’ll need to make to the structure of your data before you can query it efficiently. However, before you can spill it all in an S3 bucket of your choice there are two more things to consider: schema and format.
As mentioned, you will want to use a column-oriented storage format. If you don’t have many nested data types, you should go with ORC format. Otherwise, consider using Parquet instead.
As to your schema, Athena supports Hive partitions meaning you can further optimize scans. A Partition is a mapping from a range of values on a certain column to a physical location of the relevant data. In a sense, a partition is a virtual column as it’s not part of the table, but can be queried nonetheless. The classic example is partitioning by date when analyzing time-related data such as logs. When scanning for logs from a certain day, it is sufficient to only read the files in the partition for that day. What it means in practice for Athena is that your partitions are going to be physical folders in your S3 bucket. It is up to you to decide how to partition the data, based on how you wish your users interact with your data.
If we go back to our example, I expect users will be interested to search for flights by date, i.e. every query will include a specific day. In that case I can partition my table on an ISO date and my bucket would look as follows:
Contrary, I might expect users to want to look at flights by airline, and then my partitions would look something like this:
You can combine several partitions together, and for even greater efficiency combine partitions with ORC indexing. Because in Athena you pay for the data you scan, partition pruning and indexing are key to keeping costs down as well as improving performance.
The set of results the user now has is filtered and potentially sorted according to some default order, but still may be too big to be displayed all at once. Unless the results are immediately discarded, the user might want to be able to paginate, filter and sort the results. While it is possible to query the big data engine each time for each operation, it makes sense to copy the results aside to a “user-space” and perform all operations on this much smaller batch of data. Again, this improves query speed and cost.
The user-space data could live in S3 and be accessed via Athena, similarly to the original data, but this time in a completely separate set of tables. Or if, for example, you expect the set of results to be small enough it perhaps makes more sense to have those query results reside in a classic relational database, or a key-value store such as redis. These show much better performance, and depend on your specific use-case and personal taste.
To tie this all together, we will need a back-end service in a framework of your choice that will receive requests from users, produce SQL, interact with Athena, and send back the results. This service may be logically split into a dedicated service for querying the big data and creating cached results or user-space data, and a service for querying user-space data. The first is only concerned with translating the user’s request made in your syntax of choice to an efficient query that will get the right set of results to the user-space. The second service handles pagination, filtering, and sorting.
SimilarWeb Case Study
Allowing our users to query our data interactively in any way they want is a longstanding goal. One of the first major steps in that direction was building a product that allows users to get the top ‘X’ amount of websites according to a custom set of filters and criteria. For example, a user might choose to see the top 20,000 US websites in the Health industry, ranked by number of visits, with filters available for most of the metrics we have and even regex on.
The need originated from recurring requests by users to generate custom reports according to specific criteria. Generation of those reports would consume a lot of time and require manual labor.
We started from building an optimized flat table of the relevant data by removing websites that were too small to be relevant in any query, as well as removing columns that would never be used for filtering or display. We also added computed columns that combined data from several other metrics. The resulting Parquet files would reside in S3 and were partitioned by month and country, the parameters that every query has to pass.
This ETL was written as a Python notebook in Databricks and was initially executed manually, but would later be part of a larger workflow that is managed in Airflow.
We wrote a simple Flask service that would accept requests from the user facing system and queried Athena. This service would save the results in a separate S3 path / Athena table that was partitioned by user ID. The same service would also query Athena for the user-space data and handle pagination, filtering, and sorting.
The resulting product was a powerful near real-time user-facing query engine. Queries that would usually take a long time to complete and would be impossible to execute in real-time would now take seconds.
In a future post we might go in-depth and explore more technical details and ways to make such systems robust and production-ready.
about the author
Alex (32) is a backend engineer from Tel-Aviv and has been with SimilarWeb for over a year. Currently, he works on improving the Pro platform and writes about himself in the third person.