
From Lakehouse architecture to data mesh

Evolving towards data product creation and automated governance in Adevinta Spain

Over the past three years, the Data Platform team in Adevinta Spain (CompaaS), which I have had the pleasure of leading, has undergone a profound transformation of its data infrastructure. This process has been based on the implementation of a lakehouse architecture using Databricks, as well as the development of several key initiatives such as data contracts and data product frameworks. These initiatives have transformed the platform into an environment that not only adopts the principles of data mesh but also acts as a true catalyst for creating data products.

Data mesh vision in Adevinta: domain-driven data ownership and federated computational governance working across the different marketplace verticals (real estate, jobs, motor, generalist and cross)

We must place ourselves in the context of a platform that matured into a multimodal, multisite and multidomain architecture some time ago. It is a unique platform used by different teams to cater to various profiles. This structure fits perfectly into the data mesh paradigm. The four pillars of data mesh, as shown in the above image, align with the vision of a common but distributed platform, providing a unique work ecosystem for different domains and granting them greater autonomy. 

This mesh has become the link that unifies all our domains into a single platform. However, transforming it into a mesh was not only about seeing the fit but also finding transformative initiatives that would make it happen.

Transforming from a lakehouse architecture to data mesh

The CompaaS team started from an existing layered architecture, which we adapted and evolved to comply with Medallion architecture. This initial architecture already had a layer-based division, but we made a significant effort to transform it into a structure that not only complied with a lakehouse but also served as a foundation for implementing data mesh principles. Thus, each layer became a specialised workspace for different types of data products, facilitating greater autonomy and specialisation within the platform.

Layers in the lakehouse, following the medallion architecture

This effort allowed us to adapt the lakehouse to the principles of data mesh, representing a significant shift in how we managed data and data products. Each layer was redefined to serve as a focal point for different functions: 

  • The Bronze layer as a space for source-aligned data products
  • The Silver layer as a space for aggregated data products
  • The Gold layer as a space for consumer-aligned data products 

The diagram above shows the evolution of the Medallion architecture to a structure that adheres to data mesh principles.

These layers do not need to be sequential. Instead, they are interconnected, forming a mesh structure that allows greater flexibility and efficiency in data usage. Below, you can see the various initiatives that have been fundamental to this transformation and how each one helps make the lakehouse platform a robust foundation for the different types of data products and specific domains within our company.

Key transformative initiatives

The transformative initiatives described below have been fundamental in converting Adevinta Spain’s lakehouse architecture into a distributed infrastructure aligned with data mesh principles. These initiatives have enabled the redefinition of each layer of the platform for better governance, efficiency and team autonomy, fostering the creation of data products in a decentralised and agile manner.

Layer details with each of the suites populating them: following the discovery framework, the bronze layer focuses on source-aligned data products, the silver layer on aggregated data products and the gold layer on consumer-aligned data products.

The key initiatives include: 

  • The data ingestion suite with the implementation of data contracts to convert the bronze layer into a container of source-aligned data products
  • The domain marketplace to transform the silver layer into aggregated data products
  • The data consuming suite along with the discovery framework to convert the gold layer into a consumer-aligned data products zone

From bronze layer to source-aligned data products zone thanks to data contracts 

To build a proper bronze layer, we created a declarative framework that lets users define data contracts, which the ingestion process then uses to load data. Driven by these contracts, the platform ingests events and runs quality checks on them, creating a raw, event-oriented bronze layer with individualised events.
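As a rough illustration, a contract in a declarative framework of this kind could look something like the sketch below. The field names and structure are hypothetical, not Adevinta's actual contract format; they simply show how a producer can declare the schema, PII classification and quality rules that drive ingestion into the bronze layer.

    # Hypothetical sketch of a declarative data contract; the format and names
    # are illustrative, not the real framework's syntax.
    ad_published_contract = {
        "event": "ad_published",                 # event ingested into the bronze layer
        "owner": "real-estate-domain-team",      # producer who owns the source data
        "schema": {
            "ad_id":        {"type": "string",    "required": True},
            "user_id":      {"type": "string",    "required": True, "pii": True},  # flagged for GDPR handling
            "price":        {"type": "double",    "required": False},
            "published_at": {"type": "timestamp", "required": True},
        },
        "quality_checks": [
            {"rule": "not_null", "columns": ["ad_id", "published_at"]},
            {"rule": "range",    "column": "price", "min": 0},
        ],
        "on_failure": "alert",                   # trigger errors and alerts in real time
    }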

Bronze layer detail, moving data from landing to history

This layer provides the foundation for source-aligned data products, ensuring traceability and quality from their origin. This approach facilitates the creation of data products with high consistency and a clear mechanism for monitoring and quality verification.

The implementation of data contracts as a declarative data-entry framework for the lakehouse is described in detail in this article. These data contracts have been fundamental in ensuring the quality and consistency of data, facilitating interoperability and reuse between different teams. As seen in the diagram, incoming events, objects or other data are consumed and automatically controlled based on the information and instructions in the data contract. In real time, the contract can determine whether the data is correct, trigger errors and alerts, and classify the received data for PII control, automating GDPR compliance.

Data contracts consumer diagram

The article details how data contracts provide a structured means to define explicit agreements between data producers and consumers, establishing specifications regarding data quality, structure and usage conditions. This has been crucial in ensuring data management meets quality and traceability standards from the source, guaranteeing consistent and reliable information for all teams.
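As a minimal sketch of how such an agreement can be enforced at ingestion time (the function name and logic are illustrative, not the production code), a contract like the one above could drive a check in which required fields are validated, PII columns are tagged for GDPR handling and non-compliant events raise an alert:

    # Illustrative sketch of contract enforcement during ingestion; names and
    # structure are hypothetical.
    def ingest_event(event: dict, contract: dict) -> dict:
        for column, spec in contract["schema"].items():
            if spec.get("required") and event.get(column) is None:
                # In a real pipeline this would trigger an error and an alert to the producer
                raise ValueError(f"Contract violation: missing required field '{column}'")

        # Columns declared as PII are classified so downstream layers can apply GDPR controls
        pii_columns = [col for col, spec in contract["schema"].items() if spec.get("pii")]
        return {"event": event, "pii_columns": pii_columns, "status": "accepted"}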

Producers, consumers and facilitators relationships; they are all interconnected and the information flows in circles in both directions

The implementation of this technique has allowed us to change data ingestion processes and workflows. We have resolved the classic bottleneck of third-party integrators who don’t know the data well enough to integrate it properly and can’t scale due to the large number of integration sources. 

Because the framework is declarative and doesn’t require code to integrate new data into the lakehouse, it democratises access and ensures that each source has someone capable of creating the contract. That person becomes a direct owner of the incoming data, and the producer interacts directly with the consumer. This allows them to negotiate, agree and discuss the best data to extract from a source and integrate into the lakehouse. This distribution of tasks shifts the integration effort to each source, allowing integration to scale in a way that centralised approaches could not. These two effects allow analytical and quality work to start even before data integration, which determines the quality of the raw material for all subsequent analytics. This structure therefore allows for greater confidence in data quality, improving interoperability between domains and facilitating data reuse – essential elements for successfully implementing a data mesh architecture.

From silver layer to aggregated data products zone thanks to semantic one big tables (or big entities)

We transformed the silver layer into an aggregated data products area to later convert it into a semantic layer, where data aggregation is based on domain semantics rather than specific use cases. This approach makes data classification and organisation follow the business model, so that each aggregation corresponds to a specific domain entity. We adopted the concept of one big table, or big entities, to ensure that aggregations align with the entities of the business domain.

Silver layer details, moving from bronze to silver

These one big tables are expansive. They can contain multiple columns and be quite sparse, as not all columns are always populated. Each table contains all the objects of a specific domain in a single structure, aggregating multiple source-aligned data products, which are more granular, into a domain entity. In a distributed architecture, this approach optimises data access by avoiding the need for complex joins as all relevant information for a domain is found in a single table. This provides greater flexibility in data management while also posing additional privacy and governance challenges as the table grows.
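To make the access pattern concrete, here is a hedged sketch (with hypothetical table and column names, assuming a Spark/Databricks environment) of the difference: a domain question that would otherwise require several joins over granular bronze tables can be answered from a single wide silver entity.

    # Hedged sketch with hypothetical table names.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Without a big entity: stitch the domain together from granular, source-aligned tables
    granular = spark.sql("""
        SELECT u.user_id, v.page_url, c.clicked_item
        FROM bronze.users u
        JOIN bronze.page_views v ON u.user_id = v.user_id
        JOIN bronze.clicks c ON u.user_id = c.user_id
    """)

    # With a big entity: one wide, sparse table per domain concept, no joins needed
    behaviour = spark.sql("SELECT user_id, page_url, clicked_item FROM silver.behaviour_entity")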

Aggregating data from source to big entity

The use of one big table greatly facilitates data discovery and reuse. The integration of the semantic layer as an intermediary between complex data structures and end users allows the use of terminology aligned with the business, making it easier for both technical and non-technical users. This semantic layer helps improve decision-making and ensures effective governance, providing a single source of truth for the entire organisation.

Regarding the concept of the semantic layer, the creation of big entities has enabled the development of a business ontology that maps major domain concepts with data modelling. Instead of a low-granularity ontology, which would correspond to the more specific objects of the bronze layer, we work with a medium or high-level ontology. This represents major business concepts aligned with the organisational structure of Adevinta. Each one big table maps one of these domain concepts, providing a clear and coherent view of the most relevant entities for the organisation, such as user, client, user behaviour, buyer, seller, publications etc.

Modeling and domains

This combination of one big table and the semantic layer forms a data marketplace oriented towards domains within the organisation. This facilitates data sharing and availability across domains, providing a structured environment for creating new data products and promoting decentralised governance. The data marketplace also acts as a catalyst for innovation, offering a framework in which teams can experiment in a safe and controlled environment. This functionality is inspired by an online marketplace, where teams can easily find and reuse available data products, promoting transparency and accountability within Adevinta.

To create big entities, we developed a declarative framework that allows these tables to be defined and populated in a similar way to data contracts. This domain mapping framework specifies, in a configuration file, which tables and attributes from the bronze layer (source-aligned data products) belong to which domain and how they are integrated. For example, all tracking object tables for behaviour in applications and websites have a configuration file that maps them to a behaviour entity, where all the attributes defining behaviour tracking can be found in a single table.
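A hypothetical sketch of such a configuration (the format and names are illustrative, not the framework's real syntax) could map bronze tracking tables to the behaviour entity like this:

    # Hypothetical domain-mapping configuration for the behaviour entity:
    # which bronze tables and attributes feed it and how they are keyed.
    behaviour_entity_mapping = {
        "entity": "behaviour",
        "target_table": "silver.behaviour_entity",
        "key": "user_id",
        "sources": [
            {"table": "bronze.web_tracking",  "attributes": ["page_url", "session_id", "event_time"]},
            {"table": "bronze.app_tracking",  "attributes": ["screen_name", "session_id", "event_time"]},
            {"table": "bronze.search_events", "attributes": ["query", "filters_used"]},
        ],
        "quality_rules": [
            {"rule": "not_null", "columns": ["user_id", "event_time"]},
        ],
    }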

Market consumer diagram

This framework also automates governance aspects, with automatic alerts to validate quality rules and other checks that incoming data must meet. Additionally, it allows for the automation of maintenance, indexing and table optimisation tasks, making its use more efficient, secure and cost-effective. This article describes in detail the optimisation strategies applied and the benefits to performance and security.
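As an example of the kind of maintenance that can be automated on Delta tables in Databricks, a scheduled job could periodically run compaction and cleanup commands such as the ones below (the table names, clustering columns and schedule are assumptions for illustration):

    # Hedged sketch of automated table maintenance on Delta tables.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    for table in ["silver.behaviour_entity", "silver.client_entity"]:
        # Compact small files and co-locate data that is frequently filtered by user_id
        spark.sql(f"OPTIMIZE {table} ZORDER BY (user_id)")
        # Remove data files no longer referenced by the table (default retention applies)
        spark.sql(f"VACUUM {table}")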

This architecture creates a cohesive data ecosystem that facilitates access and usability at different user levels. Users can make data-driven decisions independently and efficiently, within a governed environment aligned with Adevinta’s objectives.

From gold layer to consumer-aligned data products zone thanks to the consuming suite

The gold layer is the most used and diverse because it needs to support all analytical use cases, from BI to AI. For this reason, we transformed it into a hosting layer for consumer-aligned data products by implementing a suite of frameworks known as the consuming suite. This suite creates an agile environment with strong governance, covering both the production and the prototyping of data products.

Instead of a workspace for analytics assets, we have provided a platform for creating data products with all of their attributes (DAUTNIVS): discoverable, addressable, understandable, trustworthy, natively accessible, interoperable, valuable (on its own) and secure.

Self serve platform vs. data products platform

The consuming suite is composed of several frameworks divided into two main categories: data product manufacturing and industrialisation frameworks, and prototyping frameworks.

Frameworks for data product manufacturing and industrialisation

We have three key frameworks to create data products efficiently and with automated governance: Poseidon, Hermes and Transform It Yourself (TIY):

  • Poseidon: Extraction and construction of distributed pipelines based on Databricks. It includes integrated governance, support and quality components.
  • Hermes: Creation of non-distributed pipelines. It’s ideal for cases where processing needs do not require a distributed architecture.
  • Transform It Yourself (TIY): Designed to work with data stored in data warehouses, TIY allows teams to create transformation pipelines using SQL/DBT. This facilitates governance and guided data transformation.

Gold layer detail, moving from silver to make the data consumable

The different frameworks have the capabilities and permissions to move data between layers. The key advantage of providing complete frameworks, rather than just tools and configurations, is that these frameworks already incorporate governance rules defining which data is accessible and by whom, along with user identification. This allows governance as code, automating the enforcement of governance policies.
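As a hedged sketch of what governance as code can look like in practice (the field names and the example policy are hypothetical, not the frameworks' real configuration), a pipeline declared through one of these frameworks carries its governance metadata, and the platform enforces policies on it automatically at deployment time:

    # Illustrative sketch of "governance as code"; names and the policy are hypothetical.
    pipeline_definition = {
        "name": "weekly_seller_kpis",
        "framework": "TIY",                        # SQL/DBT-based transformation
        "owner": "motor-analytics-team",
        "reads_from": ["silver.seller_entity"],
        "writes_to": "gold.weekly_seller_kpis",
        "access_groups": ["motor-analytics", "bi-consumers"],
        "contains_pii": False,
    }

    def enforce_governance(definition: dict) -> None:
        # Example policy: outputs containing PII cannot be exposed to general BI consumers
        if definition["contains_pii"] and "bi-consumers" in definition["access_groups"]:
            raise PermissionError(f"{definition['name']}: PII outputs cannot be shared with bi-consumers")

    enforce_governance(pipeline_definition)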

Additionally, these frameworks are adapted to different user experiences and profiles, providing flexibility that ensures different working methods yield a consistent result. When deployed to production, the final output has the same appearance and operation, with unified monitoring, support and troubleshooting. This allows the same support team to maintain various applications with economies of scale, increasing the speed of production deployments while maintaining security and reducing incidents.

The articles “How we moved from local scripts and spreadsheets shared by email to Data Products” and “Lakehouse and Warehouse: sharing data between environments” show how these frameworks have democratised the use of applications that were previously considered Shadow IT and brought them into production conditions. Additionally, they explain how the consuming suite acts as a bridge between the lakehouse and the data warehouse, facilitating interoperability between these two curated analytics work areas.

Framework for data product prototyping

The prototyping framework within the consuming suite, known as Atenea, is designed to facilitate the creation of data product prototypes with a relaxed focus on quality standards, allowing for great agility during the testing phase. Unlike production environments, Atenea offers flexibility for high-speed data exploration, accessing all system layers (bronze, silver and gold), including both productive and non-productive data. Prototype results can be persisted, but always in a pre-production or playground environment, without directly impacting production. This environment allows projects to be scalable and operational temporarily, helping to validate ideas without requiring all data to be fully industrialised. This is ideal for analytical scenarios such as dashboards, Machine Learning (ML), or Artificial Intelligence (AI), where ideas can be tested quickly and efficiently.

One of the key advantages of Atenea is that, if the prototypes use existing productised assets and data, the transition to production will be smoother. However, if the data used is not sufficiently industrialised, the framework still allows for proof of concept, but additional time will be required to bring the prototype to production.

Prototyping in the data products platform looking at all the layers of the architecture

Atenea can access all layers of the system, from the initial data sources through to the lakehouse and data warehouse. It stores its results in the gold layer (consumer-aligned data products). This allows Atenea to offer a centralised interface for data discovery and exploration.

Atenea is primarily based on Databricks technology, using Notebooks and Workflows that enhance user experience and democratise access to data prototyping for everyone. While in production the company only works with pipelines and jobs, the prototyping environment with Databricks provides a highly accessible and flexible notebook interface, which is particularly useful for the quick creation and validation of prototypes. This integration with Databricks makes the framework easily usable for teams with varying technical abilities, thereby fostering the creation of innovative data products without high entry barriers.
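As a rough sketch of an Atenea-style prototyping cell (schema and table names are illustrative assumptions), a notebook can read across layers, explore quickly and persist results only to a playground area:

    # Hedged sketch of a prototyping notebook cell; table names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    ads = spark.table("bronze.ad_events")                # raw, event-oriented data
    behaviour = spark.table("silver.behaviour_entity")   # aggregated domain entity

    prototype = (
        ads.join(behaviour, "user_id")
           .groupBy("user_id")
           .agg(F.count("*").alias("interactions"))
    )

    # Prototype results are persisted to a pre-production/playground area, never to production
    prototype.write.mode("overwrite").saveAsTable("playground.user_interaction_prototype")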

The impact of Atenea has been particularly notable when it is necessary to accelerate development or test ideas before investing in the industrialisation of specific data that may not be used recurrently. This framework allows for one-time (“one-shot”) analysis or rapid hypothesis validation, thus speeding up the initial development phases. Although Atenea’s design is temporary (projects cannot be permanent) and its support is limited (it lacks production-level support and cannot write directly to production), these restrictions ensure that prototypes demonstrating value undergo an appropriate industrialisation and production process. This allows them to scale, reduce costs and increase reuse.

Conclusion: Automation of governance and scalability

The automation of governance has been one of the central pillars of this transformation. The Data Platform team at Adevinta Spain (CompaaS) has worked to ensure that quality, security and compliance policies are applied automatically, reducing the need for manual interventions and minimising the risk of human errors.

Integrated diagram of the whole data products platform including Atenea, all the layers and their corresponding suites and the data product lab and builders

In the diagram, all the layers and data flows of the data platform are represented, as well as the initiatives that have been key in this transformation, such as the implementation of a lakehouse and the subsequent incorporation of data mesh concepts. This has allowed the lakehouse to function as a “hosting” environment for data products, accompanied by a complete set of tools to build them in their various phases: ingestion, exploitation, prototyping and production. Together, these initiatives provide a robust proposal for implementing data mesh, combining a lakehouse, a data warehouse and a set of tools that facilitate a decentralised, data product-oriented implementation.

Data movements and transformations through the data products platform

One of the most important innovations of this architecture is the ability to transform the sequential data flows of the medallion architecture of the lakehouse into a mesh structure with nodes and connections distributed in all directions. Instead of viewing data flows as a unidirectional process, we see them as a mesh in which each area has a specialisation. For example, there is a data input specialised for ingestion, a curated data layer for domain aggregation and data flows focused on consuming and building specific use cases. 

This combination of different flows utilises standard tools like the lakehouse and data warehouse, enabling the construction of a data mesh implementation that supports the four essential pillars: Data as a product, federated automated governance, self-serve platform and domain-driven data ownership architecture.

This article further details how the initiatives implemented in this platform have facilitated the evolution from a Data platform to a data products platform.

Data platform evolution through its users: from no platform to data platform to data products platform

In conclusion, we observe the differences between having no platform, having technology without governance and having a data products platform. In the scenario without a platform, the company faces a large amount of Shadow IT and a lack of control. With a centralised platform, technical divergence is avoided, but there may still be shortcomings in flexibility and team autonomy. Finally, with the data products platform, not only is technical divergence avoided, but processes, working methods and culture are standardised.

This journey has shown that it’s not only necessary to provide good technology and service, but that the data platform must also be treated as a complete and integral product. The analytical workspace becomes a common, scalable, democratic, secure, economical and high-quality environment, enabling the company to innovate and grow in its data management in a coherent and sustainable manner.
