In this article, we explore the evolution of the Data Engineer role and how the emergence of Large Language Models (LLMs) has introduced a ‘new’, artificial end-user of data. How is the work of Data Engineers changing with the rise of LLMs?
Beyond catering to traditional end-users, Data Engineers now find themselves addressing the growing demand for data from artificial entities, especially LLMs. We will analyse the essential characteristics that these models require to unleash their full potential and examine how Data Engineers play a crucial role in shaping and fulfilling these demands.
At Adevinta, several teams are building LLM solutions in which Data Engineers define the data architecture and processes. For example, our marketplace Milanuncios is developing a copilot to increase the productivity of its data solutions, and Leboncoin is creating a general-purpose internal chatbot.
First and foremost, let’s discuss how the role of the Data Engineer has evolved over the years. Initially, the Data Engineer focused on building and maintaining data infrastructures, ensuring that data was accessible and processable for human end-users. However, as the demand for advanced analytics and the ability to process large volumes of data grew exponentially, the Data Engineer transformed into a strategic figure. The goal of the Data Engineer has been not only to provide a data architecture but also to structure the data through processes that make it quickly accessible, interpretable and valuable to a variety of end-users. Traditionally, these end-users were analysts, data scientists and decision-makers. A classic example would be creating data pipelines for a sales analytics team, enabling them to gain valuable insights into product or regional performance.
However, in this era of Large Language Models (LLMs), a new user has entered the scene: the language models themselves. Models like GPT-4, Claude, Llama, Mixtral, Dolly, Grok and Gemini not only consume data but also use it to generate coherent and contextually relevant responses. Data Engineers now face the challenge of providing data that is understandable not only for humans but also for these artificial entities.
Adapting the Data Engineer’s role now involves considering specific features such as massive data volumes, text quality and contextual diversity. Without these, LLMs cannot fully leverage their capacity to understand and generate high-quality content. In summary, the Data Engineer not only builds for human end-users but also orchestrates the flow of data to meet the increasingly sophisticated demands of LLMs. This data flow can be divided into three parts:
- The training or construction of an LLM
- Fine-tuning the model
- Incorporating processes that add relevant information at query time, known as Retrieval Augmented Generation (RAG)
Each of these stages has its own complexity and techniques related to the data architecture, the amount of data to leverage and other factors, and Data Engineers are already working on all of them.
Let’s start with the Retrieval Augmented Generation stage, which is the most widely applied in companies today. The RAG technique involves adding extra information to the query that we want the LLM to address. For example, this could include specific data about a person or company: ‘new’ data that has not been used to train the model, such as recently generated information. If we relate this to the concept of ETL (Extract, Transform and Load) that has persisted from the birth of Business Intelligence (BI) to the current world of Big Data, we can now speak of LTE (Load, Transform and Embed). This is the process a Data Engineer orchestrates to make data available to a Large Language Model.
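To make the idea concrete, here is a minimal, self-contained sketch of the ‘augmentation’ step in Python; the fact and the question are hypothetical placeholders, not real data:

```python
# A minimal sketch of the core RAG idea: enrich the prompt with 'new' data
# the model never saw during training. The fact below is a hypothetical placeholder.
recent_facts = [
    "Milanuncios rolled out its data copilot to internal teams this quarter.",
]
question = "What did Milanuncios roll out recently?"

context = "\n".join(recent_facts)
prompt = (
    f"Answer the question using only this context:\n{context}\n\n"
    f"Question: {question}"
)
# 'prompt' is what gets sent to the LLM instead of the bare question.
print(prompt)
```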
The RAG process may seem complex at first, but it is actually very similar to what Data Engineers have already been working on. Let’s examine each phase one by one, drawing comparisons with ETL processes:
Load:
Firstly, we have the Load phase, which essentially involves extracting data from different sources such as PDFs, websites, databases and others. This is equivalent to the traditional ‘Extract’ process.
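As a sketch rather than a definitive implementation, this is how the Load phase might look with LangChain’s document loaders; the file path and URL are placeholders, and the imports assume the langchain-community package:

```python
# Load: extract raw documents from heterogeneous sources.
# Requires: pip install langchain-community pypdf beautifulsoup4
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

pdf_docs = PyPDFLoader("data/annual_report.pdf").load()             # placeholder path
web_docs = WebBaseLoader("https://example.com/help-center").load()  # placeholder URL

documents = pdf_docs + web_docs
# Each Document carries its text in page_content plus metadata about the source.
print(len(documents), documents[0].metadata)
```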
Transform:
Next is Transform, which is very similar to the transformations commonly performed in ETLs, but with some additional techniques such as text splitting. Text splitting involves dividing the source data into chunks. Chunking is necessary because many current LLMs cannot handle an overly long context due to token limits and other constraints.
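Continuing the sketch, chunking can be done with LangChain’s RecursiveCharacterTextSplitter; the chunk sizes below are illustrative values, not recommendations:

```python
# Transform: split documents into chunks that fit within an LLM's context window.
# Requires: pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk (illustrative value)
    chunk_overlap=100,  # overlap so context isn't lost at chunk boundaries
)
chunks = splitter.split_documents(documents)  # 'documents' from the Load step
```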
Embed:
Finally, the Embed phase is similar to the ‘Load’ phase in ETLs, with the difference that the data is loaded in the form of vectors, or embeddings. The RAG pipeline uses these vectors to search for relevant content quickly and efficiently before passing it to the Large Language Model. Currently, these embeddings can be stored in different data sources such as Chroma, Pinecone, Delta Lake, Postgres, Redis, Milvus, Weaviate and Faiss. Some of these are purpose-built vector databases, while others have added extensions to support vectors (for example, pgvector for Postgres).
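To close the loop, here is a sketch of the Embed phase using Chroma as the vector store and OpenAI embeddings; any of the stores listed above could be swapped in, and the package names reflect the LangChain integrations available at the time of writing:

```python
# Embed: turn the chunks into vectors and persist them in a vector store.
# Requires: pip install langchain-community langchain-openai chromadb
# Assumes OPENAI_API_KEY is set in the environment.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vector_store = Chroma.from_documents(
    documents=chunks,                 # 'chunks' from the Transform step
    embedding=OpenAIEmbeddings(),     # any embedding model could be used instead
    persist_directory="./vector_db",  # placeholder location
)

# At query time, the pipeline retrieves the most similar chunks and injects
# them into the prompt, as sketched earlier in the article.
results = vector_store.similarity_search("How did sales perform last quarter?", k=3)
```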
In this blog, we have looked at how Data Engineers are collaborating across various domains associated with foundational models. Whether crafting Retrieval Augmented Generation processes, establishing data workflows for fine-tuning, or supplying the data that fuels generative AI models, their contributions are vital. To gain a deeper understanding of these new responsibilities for Data Engineers, Part II of this blog will look at some specific examples. We will explore these instances using LangChain, a library born from the rise of Large Language Models (LLMs) that is rapidly gaining popularity among developers.