Introduction
At Adevinta Spain, we want to get the most out of the data we manage. To achieve this goal, and given the huge amounts of data to be stored, we implemented storage systems based on Lakehouses and Data Warehouses.
Once the storage systems were in place we faced another problem: Data Engineers, Data Analysts, Data Scientists and other users want to explore and experiment with data located in the production environments in an easy way. We, on the other hand, want to isolate the production environments from the users so that they cannot damage the production processes running there.
In this article we will explain the solution that we have implemented to resolve this problem: protecting the production environments while still allowing users to use the data located there.
Components for processing and consuming data
In this diagram we show all the basic components for processing and consuming data. Other diagrams may have the same components with different names — but the functionality is the same.
Let’s focus on the storage layer where we can find the Lakehouse and Data Warehouse.
Storage layer
In the diagram you can see three different elements:
- Lakehouse: at Adevinta Spain we use AWS S3.
- Data Warehouse: following our AWS platform choice, we use AWS Redshift.
- Metastore: in our case we are using Hive Metastore (we could also have used Glue).
Current situation (environment isolation)
Our production environment must be isolated from other environments because we do not want to break production processes. This results in different, isolated environments containing the same processing and storage components, so notebooks in the sandbox environment cannot damage data located in the integration or production environments.
The problem with data
The problem with having different, isolated environments is that the amount of data available in pre-production environments will probably be much smaller than what is generated in production.
Data Analysts, Data Scientists, Data Engineers, Machine Learning Engineers, and anyone working with big data, will need to work with enough data to develop and test their pipelines. Ideally, the development environment should offer the same amount of data that they would find in the production environment.
So now, we face the following problem: users want to be able to work with huge amounts of data in an easy and fast way. However, we want them to work in isolated environments because we do not want them to break anything in the production environment.
The solution
We need to provide a solution where applications like notebooks running in isolated environments, such as integration, can access production storage.
Lakehouse, AWS S3
When a notebook (for example a Databricks notebook) queries a table, the first thing it does is ask the metastore where the data is physically located. Once the metastore responds, the notebook reads the data from that path in AWS S3, using the permissions granted by its IAM role.
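As a minimal sketch of that flow (the database, table and S3 path below are illustrative, not our real names):

```sql
-- Hypothetical notebook query; events_db.page_views is an illustrative table.
-- 1. The notebook asks the metastore for the table's physical location,
--    e.g. s3://lakehouse-bucket/events/page_views/.
-- 2. It then reads the files from that S3 path, which only succeeds if the
--    notebook's IAM role allows reading that bucket and prefix.
SELECT event_date, COUNT(*) AS events
FROM events_db.page_views
GROUP BY event_date;
```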
Lakehouse, sharing data with Waggle Dance
Waggle Dance is a request routing Hive metastore proxy that allows tables to be concurrently accessed across multiple Hive deployments.
In short, Waggle Dance provides a unified endpoint with which you can describe, query and join tables that may exist in multiple distinct Hive deployments. Such deployments may exist in disparate regions, accounts, or clouds (security and network permitting).
Find out more here.
The process breaks down like this:
- A notebook located in the integration environment requests a table
- Waggle Dance, also located in the integration environment, checks its configuration and decides whether the relevant metastore is located in the production or the integration environment
- The table request is routed to the correct metastore according to Waggle Dance's decision
For example, the routing configuration can be based on a database name prefix; in the sketch below, the pro_ prefix. When a database is referenced with this prefix, the data is retrieved from the production environment instead of the integration one.
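Here is a minimal sketch of what that looks like from a notebook in the integration environment (database and table names are illustrative, and we assume the integration IAM role only has read access to the production bucket):

```sql
-- No prefix: Waggle Dance routes the request to the integration metastore.
SELECT COUNT(*) FROM sales.orders;

-- pro_ prefix: Waggle Dance routes the request to the production metastore,
-- so the query reads data stored in the production Lakehouse.
SELECT COUNT(*) FROM pro_sales.orders;

-- Tables resolved by different metastores can even be joined in one query.
SELECT o.order_id, c.country
FROM sales.orders o
JOIN pro_sales.customers c ON o.customer_id = c.customer_id;
```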
Data Warehouse on AWS Redshift
The Adevinta Spain Data Warehouse is implemented on top of the newer AWS Redshift RA3 node type. What makes RA3 different from older Redshift node types is that compute and storage are separated.
Data Warehouse, sharing data
AWS Redshift RA3 includes a feature called Data Sharing. Using Data Sharing, we can access data located on other Redshift clusters, even in different accounts or environments, with read-only permissions.
Data Sharing provides instant, granular and high-performance access without copying data or data movement. You can query live data constantly across all consumers on different RA3 clusters in the same AWS account, in a different AWS account or in a different AWS Region. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.
You can read more about Amazon Redshift RA3 here:
- https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/
- https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/
Data Sharing
With Data Sharing, we can configure the AWS Redshift cluster in the integration environment to access the storage of the AWS Redshift cluster located in the production environment.
You can find more information about how this works here.
Data Sharing, implementation
The following steps show how to enable Data Sharing between the integration and production AWS Redshift clusters:
1. In the production AWS Redshift RA3 cluster, run the following statements ('INTEGRATION' stands for the AWS account ID of the integration environment):
- CREATE DATASHARE example_sharing;
- GRANT USAGE ON DATASHARE example_sharing TO ACCOUNT 'INTEGRATION';
- ALTER DATASHARE example_sharing ADD SCHEMA schema;
- ALTER DATASHARE example_sharing SET INCLUDENEW = TRUE FOR SCHEMA schema;
2. In the integration AWS Redshift RA3 cluster, run the following statements ('PRODUCTION' stands for the AWS account ID of the production environment):
- CREATE DATABASE example_pro FROM DATASHARE example_sharing OF ACCOUNT 'PRODUCTION';
- CREATE EXTERNAL SCHEMA IF NOT EXISTS pro_schema FROM REDSHIFT DATABASE 'example_pro' SCHEMA 'schema';
- GRANT USAGE ON SCHEMA pro_schema TO <the user or group that should query the shared data>;
With the above configuration, when using the pro_ prefix in the integration environment, we will be accessing data located in production. This access is read only, so we cannot modify production data in any way.
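A brief usage sketch from the integration cluster (the table name is illustrative):

```sql
-- Read-only access to production data through the shared external schema.
SELECT COUNT(*) FROM pro_schema.orders;

-- Writes against shared objects are rejected, because access granted through
-- a datashare is read only:
-- DELETE FROM pro_schema.orders;   -- fails with a permission error
```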
Conclusion
In this article we have covered how to solve the following problems in a Lakehouse and Data Warehouse implemented on AWS S3 and Redshift:
- Users (Data Engineers, Data Analysts, Data Scientists, etc.) need to work in pre-production environments with the same volume of data as in production.
- We want to have different and isolated environments: integration, production etc.
- Users need to work with the data in the easiest possible way, without compromising production systems or data.
At Adevinta Spain, we have implemented these and other solutions in our Lakehouse and Data Warehouse with the primary objective of enhancing the user experience and reducing the effort needed to successfully create data products. However, we don't intend to stop here; our goal is to provide our users with the best possible experience. To achieve this, we plan to introduce additional solutions in areas such as cost control, access management, and more.