In part I we talked about how some analysts and scientists were working in their local environments, creating potential problems with duplicated work, data protection and general inefficiency. In this article, we will talk about how we implemented a solution so users can move from their old, local workflow to creating a Data Product with integrated governance, security and CI/CD.
Building & Testing Environment
In the previous workflow, users could develop and test their scripts quickly. We need to provide the same capability if we want them to migrate to the new workflow, while at the same time moving them away from local environments entirely.
The work environment we will use is a Databricks Notebook, which is the standard environment at Adevinta Spain. We also have our own product called Data Product Lab, which allows users to quickly test their developments in a safe and secure environment with real data. We do this through:
- Data Access: Using a tool called Waggle Dance, we enable users to read production data without the risk of writing to it. This tool lets users add a prefix that routes their request to the production Hive Metastore (see the sketch after this list). Eliminating the need to download data from production ensures both security and efficiency.
- Access Control: Users can only access the tables their team has access to. This applies both to the development environment and to production, if they choose to use it.
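To make this concrete, here is a minimal sketch of what reading production data from a development notebook could look like with this mechanism. The `prod_` prefix and the database and table names are hypothetical placeholders, not Adevinta's actual naming convention:

```python
# Hypothetical example: reading a production table from the development
# environment. Waggle Dance routes the prefixed database to the production
# Hive Metastore, so the data never has to be downloaded or copied.

# Inside a Databricks notebook `spark` is already available; this import is
# only needed when running the sketch elsewhere.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "prod_" is a made-up prefix standing in for whatever prefix routes
# requests to the production Hive Metastore.
ads = spark.table("prod_marketplace.ads")

# Reads work as usual; writes against the prefixed database are not allowed.
ads.filter("publication_date >= '2024-01-01'").show()
```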
However, as mentioned earlier, users often create their own data as input for their scripts. When this happens, data won’t always be available in either the development or production environment. To address this, users can upload a CSV or other data format to an S3 folder and use it as input for their development phase. They cannot upload to the production environment this way.
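Along the same lines, a small sketch of using an uploaded file as input during development; the bucket and path are made up for illustration, and reading `s3://` paths with pandas assumes s3fs is available on the cluster:

```python
# Hypothetical example: using a CSV uploaded to an S3 folder as input during
# development. The bucket and path below are placeholders.
import pandas as pd

input_path = "s3://my-team-dev-bucket/hermes-inputs/campaign_targets.csv"

# In a Databricks notebook this could equally be read with Spark
# (spark.read.csv); pandas is enough for the small datasets Hermes targets.
targets = pd.read_csv(input_path)

print(targets.head())
```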
Additionally, the Data Product Lab also allows users to load libraries if required.
Versioning
After thoroughly testing the development in the Data Product Lab, the user is required to download the script and store it in a repository. We’ve opted for a GitHub repository, consistent with our company coding policies. GitHub is a widely recognised platform which provides not only versioning, storage and continuous integration capabilities but also an intuitive user interface.
Considering that our users will be working with git, and recognising that many of them may not be familiar with it, we invested some time training them in basic git commands and workflows such as creating branches, opening pull requests, merging, committing and pushing.
In the first versions of the new system, users were instructed to raise a Jira request to the CompaaS team so that we could create the repository on their behalf. After further development, users have the autonomy to create the repository themselves using a predefined template.
Using this template, the user can effortlessly create a ready-to-go repository containing the following scripts and folders:
| Folder/File | Purpose |
|---|---|
| travis.yml | A file defining and executing the deployment steps |
| build_tools | Folder storing scripts used in the build process |
| build_tools/deploy.sh | Script responsible for uploading scripts and libraries to S3. It is called from travis.yml |
| build_tools/test.sh | Script that runs the tests if present. It is also called from travis.yml |
| helloworld | A sample project allowing users to replicate the folder structure for both code and tests for their actual scripts |
| helloworld/helloworld | Folder containing the actual script |
| helloworld/pyproject.toml | File specifying dependencies and other project metadata (creator, etc.). An important piece of metadata here is the script version |
| helloworld/tests | Folder with the unit tests |
| helloworld/tests/test_helloworld.py | The actual script containing the tests |
Once the repository is created, the user simply copies the helloworld project folder structure, inserting their script and development tests, and declaring any dependencies in the pyproject.toml file.
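For illustration, a minimal sketch of what a user project following the helloworld layout might contain; the project name, script and test below are hypothetical examples, not part of the template itself:

```python
# myproject/myproject/aggregate_sales.py -- a hypothetical user script that
# mirrors the helloworld layout (names are illustrative only).
import pandas as pd


def monthly_totals(sales: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a small sales dataset into monthly totals."""
    sales = sales.assign(month=pd.to_datetime(sales["date"]).dt.to_period("M"))
    return sales.groupby("month", as_index=False)["amount"].sum()


# myproject/tests/test_aggregate_sales.py -- the matching unit test, which
# build_tools/test.sh would pick up if tests are present (in the real test
# file the function would be imported from the package).
def test_monthly_totals():
    sales = pd.DataFrame(
        {"date": ["2024-01-10", "2024-01-20", "2024-02-01"],
         "amount": [10.0, 5.0, 7.5]}
    )
    totals = monthly_totals(sales)
    assert totals["amount"].tolist() == [15.0, 7.5]
```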
After everything is set in the repository, the user can commit and push to a feature branch. Travis will then deploy both the script and the wheel file with the dependencies (if required) to an S3 bucket in the development environment. Given that the source branch is a feature branch used for development, both artifacts will be named with a feature suffix adhering to the following pattern:
Users must also create a pull request to the master branch. The responsibility for reviewing and merging the pull request lies with the user team. Upon review and successful merge, Travis will deploy the artifacts to an S3 bucket in both development and production environments.
If Travis detects that a new version is being released, another artifact with ‘latest’ as the version number will be deployed along with the specific version:
Each artifact’s name includes a version suffix, giving users full control over which version to use. For instance, they may choose to conduct a full run in development before transitioning to the newer version in production. This process can be facilitated through the scheduling solution we discuss below.
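As a purely illustrative sketch of the idea (not the actual pattern used by deploy.sh), artifact names could be derived from the branch and version along these lines:

```python
# Hypothetical illustration of how versioned artifact names can be derived.
# The key layout and suffixes are made up; the real pattern lives in deploy.sh.
def artifact_keys(project: str, version: str, branch: str) -> list[str]:
    if branch == "master":
        # A merge to master publishes the explicit version plus a 'latest' copy.
        versions = [version, "latest"]
    else:
        # Feature branches get a suffix so they never collide with releases.
        versions = [f"{version}-{branch}"]
    return [f"{project}/{project}-{v}.whl" for v in versions]


print(artifact_keys("helloworld", "1.2.0", "my-feature"))
# ['helloworld/helloworld-1.2.0-my-feature.whl']
print(artifact_keys("helloworld", "1.2.0", "master"))
# ['helloworld/helloworld-1.2.0.whl', 'helloworld/helloworld-latest.whl']
```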
Scheduling
The final component required for users to run their scripts in production is the scheduling solution. At Adevinta Spain we have opted for Airflow, a well-known scheduling solution. An Airflow DAG (Directed Acyclic Graph) is a Python file that specifies several steps in a given order. Using Python proves advantageous as it aligns with the language proficiency of Adevinta’s Data Analysts.
If the user’s team is new to Airflow, a new repository is created for them automatically. Similarly, if they are also new to Hermes, we create the variables Hermes requires in that repository.
To enhance usability, we have developed a custom operator known as HermesOperator. It spares users from repetitive tasks such as creating the cluster to run the process, managing access control and loading variables. It accepts the following parameters as input (an example DAG using it is sketched after the table):
| Parameter | Usage |
|---|---|
| team: str | Team name: the team that is running the process. This team will be able to see the logs. |
| project_name: str | Project name. |
| script_name=None | Optional script name. A project may contain several scripts, so we need a way to differentiate them; it usually matches the project name, hence it is optional. |
| script_version='latest' | Version of the script. Some users didn’t want fine-grained control of the version number, so the default value is 'latest'. |
| library_version=None | Same as above, in case the script needs some dependencies. |
| parameters=[] | Parameters that the script may need. |
| use_data_sharing=True | Indicates whether the script will use the capability to read production data in the development environment. Applies only in development. |
| view_access_control_list=[] | In case the user wants any other team to be able to view the logs; this is common when analysts need assistance from the data engineers. |
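To show how these parameters come together, here is a minimal sketch of a DAG using the HermesOperator. The import path, the scheduling arguments and the exact tag format are assumptions for illustration; only the parameter names come from the table above.

```python
# A minimal, illustrative DAG using the HermesOperator.
from datetime import datetime

from airflow import DAG
from hermes.operators import HermesOperator  # hypothetical import path

with DAG(
    dag_id="monthly_sales_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
    tags=["compaas-service:hermes"],  # assumed representation of the required tag
) as dag:

    run_report = HermesOperator(
        task_id="run_monthly_sales_report",
        team="analytics",                      # team running the process; it can see the logs
        project_name="monthly_sales_report",
        script_version="latest",               # default; pin a version for full control
        parameters=["--month", "{{ ds }}"],
        use_data_sharing=True,                 # read production data while in development
        view_access_control_list=["data-engineering"],  # extra teams allowed to view logs
    )
```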
Once the DAG is developed it can be merged to the master branch of the Airflow repository. Then it can be deployed to the development and production environments through Jenkins.
The DAG can also be tagged so it can be found quickly in the Airflow console. It should always contain a compaas-service tag with the value ‘hermes’, as in the sketch above.
Deployment Lifecycle
The full lifecycle can be seen in the diagram below.
Cluster management
As illustrated, users can now build and execute their scripts in an automated and remote environment, but there are additional processes occurring behind the scenes to facilitate this. The HermesOperator, for instance, initiates the creation of a job cluster.
A cluster policy is used for its creation, establishing baseline cluster properties. Each user team’s Airflow repository stores the cluster policy ID as a variable, created by the CompaaS team when a new team enrols to use Hermes. The HermesOperator retrieves this policy ID, so users never need to be aware of or manage this information directly.
This policy dictates the type of cluster that users can create. Since Hermes is built for small data scenarios, clusters are consistently configured with a single worker, allowing them to scale up to two workers if needed. Additionally, the cluster policy defines an allowed instance profile, enabling us to enforce the use of Hermes within each user group’s dedicated Databricks workspace.
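As an illustration, here is a sketch of the kind of job-cluster specification that could be derived from the policy. Every value below is a placeholder, since the real baseline is dictated by the cluster policy managed by the CompaaS team.

```python
# A sketch of the kind of job-cluster specification the HermesOperator could
# build from the stored policy ID. All values are placeholders.
def build_job_cluster_spec(policy_id: str, instance_profile_arn: str) -> dict:
    return {
        "policy_id": policy_id,                 # baseline properties enforced by the policy
        "spark_version": "13.3.x-scala2.12",    # placeholder runtime version
        "node_type_id": "m5.large",             # small nodes: Hermes targets small data
        "autoscale": {"min_workers": 1, "max_workers": 2},
        "aws_attributes": {
            # The allowed instance profile ties the cluster to the team's
            # workspace and to the S3 permissions described below.
            "instance_profile_arn": instance_profile_arn,
        },
    }
```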
Data Accuracy and governance
As described, Hermes serves as Adevinta Spain’s framework for creating Data Products with non-distributed data. By accommodating small input datasets and models, and by bringing together infrastructure, computational governance and services, we make it easy for our users to get started.
In our data journey, several other frameworks exist, both for ingestion and consumption. Hermes falls within the latter category. A visual representation of the layers in Adevinta Spain’s LakeHouse is presented in the diagram below.
As stated earlier, Hermes is situated within the Data Consuming Suite. Hermes exclusively permits writing data in the Products Zone. This zone utilises S3 and stands out as the most autonomous layer within the LakeHouse. Each data team can design their data products here using Hermes or other frameworks within the Data Consuming Suite.
Data can be used as input from various layers, with the preferred path being from the primary market, the layer immediately before the target layer (Products). However, data can also be sourced from the very same Products Zone, or even from Dataland, a layer serving the same goal as Products but implemented with a different technology (Redshift as opposed to Products’ S3).
These rules are enforced through the cluster policy, which is only accessible to the appropriate instance profile and assumes the necessary roles to restrict permissions. When running in the development environment, processes built with Hermes can read from both the development and production environments using the proper prefixes, thanks to the Waggle Dance mechanism explained earlier; this helps when building and testing the product. In the production environment, data can only be read from production itself.
Monitoring
Users can monitor their processes independently at the scheduling level using Airflow’s user interface, where they can check for any failed tasks and review the corresponding error messages. Each task using the HermesOperator will show a link routing to the corresponding Databricks logs.
Following the link, users have the option to inspect detailed logs at cluster level for more in-depth information. Access to these detailed logs is restricted to users belonging to the appropriate group, which is set when calling the HermesOperator.
Thanks to the tag in the Airflow DAG, we have the capability to build a Datadog Dashboard enabling the monitoring of user engagement.
There is also a dashboard for cost observability, this time built in Grafana, that allows users to filter by team, month, day and so on in order to get their costs and properly calculate the value generated by the data products built with Hermes.
Technologies involved
This chapter outlines the range of technologies involved in our Data Products environment. Even though the user only sees a couple of them (and that is the point: to hide the complexity), we need several other technologies to build the solution. This is also why the Data Platform team at Adevinta Spain is composed of people with varied profiles and backgrounds.
As shown in the image below, we needed technologies such as Terraform or Ansible to build the infrastructure where the scripts run. Others include IAM to enforce security or, on the data engineering side, Apache Spark for unified analytics.
This technological diversity underscores the invaluable efforts put forth by the entire Data Platform Team (Ismael Arab, Joel Llacer, Roger Escuder, Enric Martínez, Jaime González, Javier Carravilla, Christian Herra, Marta Díaz, Marc Planamugà and myself).
Conclusion: Hermes as a Data Product Builder
At its core, Hermes stands as a bridge between Python script development and operational deployment, facilitating streamlined processes and amplifying productivity, while enforcing governance criteria through computational governance.
One of Hermes’ distinguishing features lies in its keen focus on cost observability and monitoring, essential pillars in the realm of data operations. Hermes offers visibility into resource consumption and cost, as well as into system performance and user engagement.
The process of building Hermes encapsulates a transformative journey involving cultural shifts, code migration, and the dynamic requirements of diverse stakeholders. We spent a lot of time thinking, analysing, building and testing, providing infrastructure, and implementing computational governance and services for users. But we invested far more time talking to users, both to understand their needs and to accompany them in their journey to migrate to Hermes, gathering their feedback along the way.
Our shift from a local environment to an automated system within the Adevinta Spain Data Platform marks our commitment to efficiency and scalability, with reusability and a strong goal of building reliable and trustworthy data products. This would have been impossible without the collaborative energy between data engineers, analysts and other users.
Finally, it’s crucial to emphasise the importance of focusing on people and processes over technology. That’s why we collaboratively developed this Data Product Builder with a number of data scientists (Marina Palma, Araceli Morales, Mateu Busquets & Miquel Escobar), ensuring that the tool serves the needs of both users and processes seamlessly, gathering their feedback and improving the product as we went. We are immensely grateful for their invaluable contributions and expertise throughout this journey. Our aim is to make it easier for users to extract value from data. If they can focus on that task, without worrying about technical or repetitive chores, that will be our big win.
Success truly hinges on a human-centric approach.