Integrating Apache Kyuubi with Elastic MapReduce (EMR)

Background

Adevinta, a leading online classifieds group, has been running its big data platform on AWS for years. (Click this link to find more articles from Adevinta Data and ML.) Recently, the data product and platform engineering team started to evaluate the possibility of providing a reliable endpoint for various types of workloads to facilitate the use of Spark SQL. This endpoint could then potentially be connected from different client applications (Python notebooks, Tableau, Hue, DBT etc.) for a fully integrated user experience.

A High Level Illustration of a Multi-Purpose Query Endpoint on AWS

Apache Kyuubi immediately caught our attention because it is capable of making Apache Spark, as well as other data processing frameworks, become a Lakehouse entry point. After some preliminary investigation and proof-of-concept testing, we came up with the idea of combining Kyuubi with Elastic MapReduce (EMR) so that we could quickly bootstrap a modern, reliable yet performant gateway for executing Spark SQL on AWS.

One proven way to unlock EMR Serverless as the backend engine

We reviewed three ways for Apache Kyuubi to integrate with all existing EMR types (EC2, EKS and Serverless). This article will help you understand what happens under the hood while Kyuubi engines bootstrap themselves.

To make life even easier, a comprehensive summary of the pros and cons among the three different EMR integration methods is included at the end of this article. I hope I can shed some light on the situation for you if you are also considering integrating Apache Kyuubi with your cloud based big data platform to construct a highly responsive Lakehouse entrypoint service with low overhead.

What is Apache Kyuubi?

Kyuubi is an enterprise-grade SQL gateway designed to facilitate interactive visual analytics on large-scale data. Powered by popular computing frameworks like Apache Spark, Apache Flink and Trino, it offers efficient querying capabilities. Users can access Kyuubi through JDBC/ODBC, enabling them to run SQL queries directly or through BI tools. Kyuubi promotes resource sharing and delivering rapid responses. It seamlessly integrates with various sources, including traditional data warehouses like Apache Hive/HDFS and modern Lakehouse standards such as Apache Iceberg, Apache Hudi and Delta Lake.

In addition, Kyuubi’s multi-catalogue meta APIs offer a centralised view of all data sources, simplifying exploration and fostering innovation. With support for ANSI standard SQL syntax, Kyuubi enables querying disparate data sources through a single entry point while maintaining robust authentication and authorization mechanisms for data security. Read more articles about Apache Kyuubi on Medium.

What is Elastic MapReduce?

AWS EMR, which stands for Amazon Web Services Elastic MapReduce, is a fully managed big data processing service provided by Amazon. It enables businesses to process large amounts of data quickly and efficiently using popular open-source frameworks such as Apache Spark, Apache Hadoop and Presto/Trino. With EMR, organisations can easily launch and scale clusters, allowing them to analyse and derive insights from vast datasets.

EMR handles all the underlying infrastructure management, including provisioning, monitoring and automatic scaling, so users can focus on analysing their data and extracting valuable information. It offers a flexible and cost-effective solution for businesses of all sizes, enabling them to tackle complex data processing tasks with speed and ease. (See our previous Medium post on building a cost-efficient EMR environment.)

Ways to Integrate Kyuubi with EMR

It is worth noting that this article is based on Kyuubi 1.7.1 and EMR 6.12.0.

Server Side Deployment of Kyuubi on AWS

The easiest way to deploy Kyuubi is using the provided helm chart to quickly pull up the pods and services with minimum frustration. Alternatively you can do it the hard way (if you prefer customising) using a handy image maker to build your own Kyuubi image from an EMR public image in a few minutes.

On the Kyuubi server side, both high availability and load balancing are natively supported if an external metadata service (such as Zookeeper or etcd) is specified in the configuration. For simplicity, we placed both the Kyuubi server and Zookeeper on the same EKS cluster so they can be connected internally through domain name resolution. For clients to connect (Kyuubi service discovery), you can either use the EKS load balancers or integrate your own logic by leveraging the information stored in Zookeeper to select the instance to start a session with.

The Tiered Structure of a Working Kyuubi Deployment with Spark

In general, a working Kyuubi environment is a multi-tiered deployment. Aside from the end-user, namely the client applications, the Kyuubi service (the frontend) proxies the engine service (the backend) so that it adds features to the original protocol for executing queries and fetching result data with the engine backend.

The Different Types of EMR to Work With

Option 1: EMR on EKS

As the name suggests, Amazon EMR on EKS is a deployment option that allows you to run open-source big data frameworks on Amazon EKS. This approach enhances resource utilisation and streamlines infrastructure management across multiple Availability Zones because you can run your EMR-based applications alongside other types of applications on the same EKS cluster.

Configuration Highlights

The EMR on EKS approach is the most straightforward way to start, especially when you reuse the same EKS cluster where you deployed your Kyuubi services (Kyuubi server + Zookeeper).

First you must set up IAM role for service accounts for your EKS cluster so that your EKS service account is bound to an IAM role with the corresponding access to other AWS services (like S3, Glue etc.). Also don’t forget to set up role and rolebinding in your EKS namespace so the service account has the essential permissions to get and modify pods and services when a Spark engine is requested.

The following Spark configurations should be present in your Kyuubi server’s config paths (either in the Kyuubi or Spark config)

spark.master = k8s://https://kubernetes.default.svc.cluster.local:443 (may differ across cloud providers )
spark.submit.deployMode = cluster (or client)
spark.kubernetes.namespace = <EKS namespace to run your spark workloads>
spark.kubernetes.authenticate.driver.serviceAccountName = <service account for the driver pod>
spark.kubernetes.authenticate.executor.serviceAccountName = <service account for the executors>

In addition to the minimum configuration, you may also include other commonly added Spark configurations to

Adjust the size of driver and executors
Allow dynamic allocation
Specify the ARN of the role to access S3 and Glue
Enable Spark eventlog and specify storage location

Moreover, if you want your EKS cluster to scale automatically, Karpenter might be your next step to build a highly elastic computing cluster to react to actual demand.

Cluster mode vs. Client mode

There are two primary deployment modes in Spark, client mode and cluster mode. It is critical to understand the difference while integrating with Kyuubi because you will need to utilise that knowledge to set the network correctly to allow different components to communicate with each other.

In client mode, the Spark driver programme runs on the machine where the Spark application is submitted from. The driver submits the jobs to the cluster manager, which allocates resources to run the application on the cluster. It is particularly useful for interactive and debugging scenarios where you want to see application output in real-time. While in cluster mode, the Spark driver programme runs on one of the nodes within the cluster (in contrast to the client machine). The client machine submits the Spark application to the cluster manager, which takes care of allocating resources and scheduling tasks on the cluster.

A comparison of Kyuubi with EMR on EKS (Cluster mode vs. Client Mode)

Since Apache Kyuubi is shipped with a well-built kyuubi-spark-sql-engine runtime, both client or cluster mode are supported to serve an interactive SQL interface (as opposed to the Spark SQL CLI). However, I would recommend using cluster mode for production deployments as it would reduce the risk of server pods being killed and it allows the driver programme to be placed on the more scalable cluster instead of merely inside the server pods.

Option 2: EMR on EC2 (YARN)

EMR on EC2 employs YARN (Yet Another Resource Negotiator) as the resource manager to centrally handle cluster resources for multiple data-processing workloads (including Spark). Therefore, EMR on EC2 would be suitable for those who have working EMR clusters and want to further reuse these clusters’ capacity for serving Kyuubi.

Cluster mode vs. Client mode

Similar to EMR on EKS, there is also an opportunity to choose between cluster mode and client mode on YARN. The diagram illustrates the difference between cluster mode and client mode when multiple Spark applications are running on the same EMR on EC2 cluster. Here we also recommend using cluster mode so that both the lifecycle and resource consumption of the Spark driver program can be fully isolated from the Kyuubi server.

Configuration Highlights

Different from the previously mentioned EMR on EKS mode, deploying a Kyuubi server against EMR on an EC2 cluster requires more effort on the networking side to start a Kyuubi engine. This time, you will need to configure your VPC network to ensure that the traffic is routable from the Kyuubi server to the EMR cluster and more importantly you have to let the traffic from the Spark driver back to the Zookeeper (even trickier when in cluster mode) to successfully register an engine.

In addition to the network settings, the Kyuubi server also needs to know how to start an application on a YARN cluster. This requires the Kyuubi image creator to copy/mount several Hadoop related configuration files (i.e. yarn-site.xml, hadoop-site.xml, core-site.xml etc.) from the target EMR cluster. And finally in the Spark configuration file to set

spark.master = yarn

to allow the Kyuubi server to submit the application to the YARN cluster.

Option 3: EMR Serverless

Here comes the most exciting part! Amazon EMR Serverless introduces a brand-new deployment paradigm for big data platforms. EMR Serverless provides a more simplified runtime environment for analytics applications. It helps you utilise leading open-source frameworks like Apache Spark and Hive without the hassle of configuring, optimising, securing and managing clusters for running these frameworks.

One of the key advantages of EMR Serverless is its ability to prevent resource over- or under-provisioning for data processing jobs. The runtime environment automatically assesses the required resources for processing jobs, provisions CPUs and memory accordingly, and releases them automatically once the jobs are completed. The customers are only charged when computing resources are allocated. For scenarios where applications require near-instantaneous responses, such as interactive data analysis, you can also pre-allocate the necessary resources to accelerate responses. EMR Serverless is particularly powerful for customers who prioritise the ease of operation for applications built on open-source frameworks. Specifically for Apache Kyuubi users, it is even more enticing to integrate with this serverless offering of EMR to truly modernise the deployment.

Configuration Highlights

An Illustration of Kyuubi Server to Connect EMR Serverless Applications

After studying how EMR serverless deploys applications and its networking features, I found it absolutely possible (and exciting) to work as an engine choice for Apache Kyuubi. However, there are two prerequisites to satisfy before they can work together. Like integrating with a YARN cluster, you again need to configure the networking so that the Kyuubi server and Spark applications on the serverless infrastructure can talk to each other. You can co-locate your EMR serverless application with your Kyuubi server (i.e the EKS cluster) in the same VPC network. Then you whitelist two-way traffic by setting up the respective security group policies along with an internal load balancer to expose your Zookeeper instance for engine registration (very important!).

However, you will need to do some serious Scala programming this time to let Kyuubi submit the application according to the EMR serverless way. I suggest implementing your own EMR serverless process builder by extending this trait:

org.apache.kyuubi.engine.ProcBuilder

This will trigger a backend process that calls the corresponding AWS SDK or CLI commands to spin up the Spark job within the AWS managed zone.

Here are a few tips once you have successfully rewritten your Kyuubi to boost EMR serverless applications.

Consider building your own EMR image to bring in other desired features
Adding some pre-initialised capacity to your EMR Serverless application allows your Spark engine to start in an amazingly fast manner
Configure logging so you will have a centralised place for all application history

Summary

The following table shows the pros and cons of different Kyuubi + EMR deployment options side-by-side so you can easily compare one to the other. I have also added my personal preference for each deployment option based on their relative complexity and management overhead. My rating is purely subjective however and could change as components evolve over time.

The Pros and Cons of different Kyuubi + EMR combinations

Last but not least, Apache Kyuubi is an extremely versatile open-source framework and its scope is far beyond just providing a reliable endpoint for operating Spark SQL. For other exciting features such as its connectivity to different engines and data access control plugins, I highly recommend you explore the project repo and web page.

Integrating Apache Kyuubi with Elastic MapReduce (EMR)

Background

What is Apache Kyuubi?

What is Elastic MapReduce?

Ways to Integrate Kyuubi with EMR

Server Side Deployment of Kyuubi on AWS

The Different Types of EMR to Work With

Summary

Related techblogs

A pragmatic evaluation of software engineering AI tooling

OPA memory usage considerations and lessons from our transition to Kyverno

Make data migration easy with Debezium and Apache Kafka