How we moved from local scripts and spreadsheets shared by email to Data Products - Part 1

Adevinta is a global online classifieds platform with a presence in over 10 countries worldwide, giving our users a secure way to carry out a substantial volume of transactions (2.5 billion monthly visits worldwide). In Spain, we proudly stand as the largest online classifieds group, where 1 out of 2 internet users connects to our platforms daily. Consequently, we are entrusted with managing a vast amount of data (with an input of more than 2 TB of data and 43 million events per day), bearing a significant responsibility to handle this data in a secure, responsible and sustainable manner.

We view our data platform not merely as a service but also as a product that acts as the crucial bridge between business strategy and technological execution.

“A product mindset is a strategic and customer-centric approach to problem-solving and innovation that prioritises continuous learning, iteration and delivering value to users to achieve long-term success.”

Our product vision for CompaaS (the Data Platform team in Adevinta Spain) is centred on empowering every Adevintan in Spain with scalable access to high-quality data through a Common Platform and Data as a Product. It aims to maximise the value generated from data throughout the entire value chain, promoting autonomous value extraction and expediting the generation of data products through robust quality processes and governance automation.

To realise this vision, we concentrate on maximising the value of data across the entire data value chain, with data products as the minimal unit. We prioritise reusability, security and scalability as the primary drivers for maximising data value.

Supporting this product vision and goals is one of our foundational pillars which we term “self-serve data product builders.” These are frameworks comprising infrastructure, computational governance and services. Our frameworks provide users with a unified, scalable, understandable and trusted means of generating valuable data products. By offering tools and processes to our Data Analysts, Data Scientists and data enthusiasts, we enable them to focus on extracting value while leaving operational aspects to our data product builders.

Our journey in discovering these data product builders with users begins with quick wins, guided by a clear product mindset. One notable quick win involved identifying scripts that were running locally and also replacing manual processes which had low reusability potential and trust. This approach ensures that our users can concentrate on value extraction and leave behind tasks that can be more efficiently handled through automation.

Building our new data product always begins with hands-on collaboration with end users, learning from them at every stage from conception through planning, implementation and subsequent iterations. Rather than viewing problem-solving as a checklist of user requirements, we advocate for a collaborative approach where we work together towards a common goal of addressing identified outcomes. We believe in co-creating solutions that align with the company’s strategy, maximising the value of data throughout the entire data value chain.

Where we came from

As stated before, we spent a lot of time collaborating with users so that we could build our data products to match their needs and help the company extract value from data. We found that some Analysts and Data Scientists were using a workflow involving local environments, and together (both the data platform team and the users themselves) we started planning a way to improve that.

In the existing workflow, the user acquires data from various sources or even generates it locally to serve as input for their script. Because it is processed in a local environment, this input data is generally no more than a hundred MB in size.

Users manually execute the scripts, and the resulting output data, typically in CSV or Excel format, is either analysed by the users themselves or used as input for subsequent processes. In some cases, users also share the output data via email with others so they can perform similar analyses.

This approach has several disadvantages, some more apparent than others. Here are the ones we found:

Versioning

Without a centralised repository such as GitHub, tracking changes and accessing previous script versions becomes challenging. Identifying errors and reverting to the last working version may also be difficult in this setup, requiring much manual work.

Scheduling

Since the script is executed in a local environment upon manual request, the entire responsibility for its timely execution falls on the owner. If the owner is dealing with an urgent matter, arrives late to work, or encounters any other issue, whether private or work-related, the process won’t run at the intended time, potentially leading to delays in critical business decisions.

Knowledge silos

Storing the “production” version of the script on a local machine creates a challenge in terms of knowledge sharing. Additionally, sharing the script by mail creates an alternative “production” version that may diverge from the original, potentially resulting in two distinct sets of output data for the same input data and script.

Furthermore, if a similar process is developed, it will be built independently and from scratch rather than leveraging the knowledge shared within the company, leading to redundant effort, rework and a lack of consistency in processes.

The same difficulty applies to reusability. Since the script is stored on a local computer, nobody else in the company can realistically reuse it, even though reusability should be one of the main goals of a data product platform.

Security

Sending data via mail poses inherent risks and raises concerns about confidentiality. There is no way to ensure the email is addressed to the correct person, for instance. Moreover, emails are susceptible to interception, potentially leading to data leakage, a breach of trust and of our duty as custodians of our customers’ data, or even a violation of data privacy regulations. It also reduces the traceability we need to honour our data protection commitment to clients and users.

Scalability and maintenance

Adopting this workflow as the standard approach for Adevinta would see us managing and maintaining a growing number of locally executed scripts and Excel files. We would also have to rely on manual processes for data governance, which, besides being dangerously prone to errors, would become increasingly complex and time-consuming.

Governance and compliance

Relying on local processes and data makes it impossible to implement computational governance. You simply cannot define policies and processes in an automated way to ensure the company’s regulations are being fulfilled without a manual process involved. This also means that you cannot free the user from heavy and repetitive tasks such as asking for permission to read input data or to write data to the proper output path.

Accuracy

As stated in the security section, sending a file by mail is prone to errors. There is no way, other than manually checking, to ensure the data sent is the most recent, or to ensure the addressee is reading the proper version, especially considering the typical look of our Downloads folder.

Accidental or even intentional modifications to the data can occur, undermining accuracy. In scenarios where multiple individuals or teams need to review the same data, this can result in varying interpretations, potentially leading to users working with slightly different versions of the data.

Data reliability

Similar to the susceptibility of the output data, input data presents risks to its consistency. When working in a local environment, data can be modified in error, rendering the whole process invalid from the very beginning. Even if the data is not manipulated at all, there are no guarantees that the input is clean enough for downstream processes and models to run smoothly or to give a trustworthy result.
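To make the idea of "clean enough" input concrete, here is a minimal sketch of the kind of automated check that a local workflow lacks. It is only an illustration: the column names are hypothetical and the rules (required columns present, no empty values) are assumptions, not the checks our platform actually runs.

```python
import csv
import io

# Hypothetical column names, used only for illustration.
REQUIRED_COLUMNS = ("ad_id", "category", "price")


def validate_input(csv_text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        # Without the expected columns, row-level checks are meaningless.
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # start=2 because the first line of the file is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in required:
            # Short rows yield None for absent fields; treat them as empty.
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty value in '{col}'")
    return problems
```

An empty list means the input passed these basic checks. A scheduled process could refuse to run when the list is not empty, instead of silently producing an untrustworthy result.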

What we wanted to achieve

Our vision for the Data Analyst workflow revolves around providing tools and defining processes so Data Analysts can work in an improved way. We wanted not only to avoid errors or make the daily work easy for the users but also to embed the company’s culture, introducing some engineering mindset into the flow such as automation, testing and scheduling. Ultimately, we wanted to build a workflow that requires as little manual intervention as possible.

We also wanted to focus on reliability, building repeatable, scalable and safe processes through continuous integration and continuous deployment. The process should also be governed by introducing computational governance practices, i.e. governance by design. These are automated processes and flows that allow governance without human interaction, and are the only way of addressing many issues such as security, privacy or data cataloguing before any breach, problem or even redundancy can happen.

We need to establish a process so the users can work seamlessly in an efficient manner, reducing risks of redundancy and data inconsistency.

So, to summarise, we wanted to go from using scripts and Excel to a Data Product. This is why we called this part of our platform the Data Product Builder.

To achieve this goal, we focused on the following specific points:

Script versioned in a GitHub repository managed by the Data Analysts team.
Scheduling through an Airflow DAG.
Input data must be stored in Adevinta Spain’s data lake to ensure we maintain our data contract guarantees.
Output data also needs to be stored in Adevinta Spain’s data lake.
Both input and output data (like all data in the data lake) must be accessed through governed processes and tools.
Integrate CI/CD practices, creating an automated and testable flow that can be evaluated in a test environment before being promoted to production.
Provide a test environment as easy to work with as the local environment, so both script and input data can be tested quickly.
Provide a way so the process can be promoted to production seamlessly.

If we want to succeed, we should provide tools with computational governance embedded, ensuring that governance practices are introduced from the very beginning in an automated way, without requiring the user to go through manual or costly processes, requests, Jira tickets and so on.
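As a rough illustration of what embedded governance can look like, the goals listed above could be expressed as an automated check that runs before a process is promoted. This is a minimal sketch under stated assumptions: `DataProductSpec`, its fields and the data lake prefix are all hypothetical names invented for this example, not part of our platform.

```python
from dataclasses import dataclass

# Hypothetical data lake prefix; the real location is not given in this post.
DATA_LAKE_PREFIX = "s3://example-data-lake/"


@dataclass(frozen=True)
class DataProductSpec:
    """Illustrative descriptor for a data product (all fields are assumptions)."""
    repository: str     # e.g. the URL of the GitHub repository
    schedule: str       # e.g. a cron expression used by the scheduler
    input_paths: tuple  # locations the script reads from
    output_path: str    # location the script writes to


def policy_violations(spec):
    """Return the list of policies the spec breaks; empty means compliant."""
    violations = []
    if not spec.repository:
        violations.append("script must be versioned in a repository")
    if not spec.schedule:
        violations.append("process must have a schedule")
    for path in spec.input_paths:
        if not path.startswith(DATA_LAKE_PREFIX):
            violations.append(f"input outside the data lake: {path}")
    if not spec.output_path.startswith(DATA_LAKE_PREFIX):
        violations.append(f"output outside the data lake: {spec.output_path}")
    return violations
```

A check like this could run in CI, so a process that reads from a laptop path or has no schedule never reaches production and nobody has to review it by hand.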

Challenges

Culture change

The primary challenge to change lies in navigating and even pushing a cultural shift. Currently, the existing processes are effective for both users and the company. Users can work and fulfil their responsibilities within the company, which is operating successfully. In light of this, a question arises: Why change?

Presenting the need for change could potentially be interpreted as a critique, leading users to resist the intended change. So first of all, we need to work hand-in-hand with our users.

Product mindset: it’s not about technology, it’s about a common goal

Code migration

As part of the transition from the local environment to an automated setup in the Adevinta Spain Data Platform, which uses Databricks, certain code adjustments will be necessary. In the current scenario, users read and write data directly using Python. However, in a Databricks environment, this process is more efficient if executed through Spark.

While the changes required are not big and affect only the input/output sections, they must be addressed. This implies that the migration will not be as seamless as we wanted.
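One way to keep those changes confined to the input/output sections is to separate the core logic from the reading and writing code. The sketch below is an assumption about how such a script might be organised, not our actual implementation: the function names, the `price` column and the table names are hypothetical, and collecting rows to the driver is only reasonable for the small inputs described earlier.

```python
def add_price_band(rows):
    """Core logic: plain Python, unchanged by the migration (hypothetical example)."""
    return [
        {**row, "price_band": "high" if float(row["price"]) >= 1000 else "low"}
        for row in rows
    ]


def read_rows_with_spark(table_name):
    """Replaces the local file read; assumes a Spark-enabled environment."""
    # Imported lazily so the core logic can still be run and tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return [row.asDict() for row in spark.read.table(table_name).collect()]


def write_rows_with_spark(rows, table_name):
    """Replaces the local CSV/Excel write; assumes a Spark-enabled environment."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.createDataFrame(rows).write.mode("overwrite").saveAsTable(table_name)


def run(source_table, target_table,
        read=read_rows_with_spark, write=write_rows_with_spark):
    """Wire input, logic and output together; I/O functions are swappable."""
    write(add_price_band(read(source_table)), target_table)
```

Because `read` and `write` are passed in as parameters, the same `run` function can be exercised with in-memory stand-ins during development and with the Spark versions once scheduled.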

Data migration

Considering that the entire process is set to operate in a production environment with automatic scheduling and minimal human interaction, it’s imperative that both input and output datasets reside in our data lake. That means that the processes that create input data need to be planned, built and scheduled. This is the only way to ensure data integrity from the very beginning. Unlike a straightforward script migration, the whole process needs to be migrated, so Data Engineers will play a crucial role in helping analysts and scientists to architect and implement the pipelines that feed input data.

Distributed vs non-distributed

Another issue has emerged as we address the challenges and complexities of such a cultural change and process enhancements. While using Spark to read data enables users to handle vast amounts of data, the specific process targeted in this project typically involves working with smaller input datasets.

On the one hand, users can read a substantial table as input and apply filters to load only the smaller subset relevant to their processing requirements. This flexibility is an advantage, especially given that the data is stored in the data lake, as discussed earlier. On the other hand, the typical user profile for this workflow does not align with the general usage of Spark. As a team, we therefore need to shift our focus towards training users and building their confidence, so they can use Spark effectively and overcome any apprehensions or challenges they encounter.

Developing

The users of this workflow are accustomed to swiftly developing and testing in their local environments, both in the sense of the script code itself and in cleaning or fixing input data. To successfully integrate them into our Data Product Builder Platform it is imperative that we provide users with equivalent capabilities.

People involved

Finally, it’s crucial to consider the diverse needs of all individuals and teams involved. When working to build a tool or framework, we tend to focus on the developer, in this case, the Analysts that develop the script, who are the main users. We also talked about Data Engineers, who will play a pivotal role in constructing data pipelines to schedule and produce the input data needed.

We also need to consider a broader spectrum of users, including those responsible for analysing and processing the output data. Their workflows are also bound to undergo significant changes, moving from receiving data via mail to reading it directly from the Adevinta Spain Data Platform. Recognising and accommodating the varied requirements of all user roles is essential for successfully adapting the entire workflow.

Learn how we tackle these challenges in part two.