Our team, formerly dedicated to enhancing user experiences through personalised recommendations on our e-commerce platforms, has recently shifted focus. We’ve delved into generative AI and spent the last six months developing a conversational search tool for Leboncoin, our second-hand marketplace in France. Leboncoin is one of the many platforms operating under our company Adevinta.
This conversational search tool aims to enhance how users interact with our marketplace, to offer a more streamlined and user-friendly experience. We’re using conversational search technology to simplify the process of finding products and connecting with sellers, making it more accessible for our users.
1. What is Conversational Search?
Conversational search can be considered a way to enhance the user experience by allowing natural language interactions with software agents or virtual assistants to retrieve information, perform tasks or access services. It leverages the capabilities of software agents to understand and respond to user queries in a conversational style, making information retrieval more intuitive and user-friendly.
In the field of Natural Language Processing, significant advancements have recently emerged, primarily in the realm of Large Language Models (LLMs). These LLMs possess remarkable capabilities, allowing them to comprehend user inputs and respond in a highly knowledgeable way, like a human expert. Leveraging LLMs for the natural language interactions of a conversational search assistant is a compelling and promising idea.
2. What are Large Language Models?
Large Language Models generate text using a process called “autoregressive language modelling”. Figure 1 shows this workflow along with a sample of text generation. Here’s a simplified explanation of how the process works:
Input: The LLM needs a prompt or an initial piece of text to begin with. This could be a sentence, a question or just a few words to get the conversation started with the LLM.
Tokenisation: The LLM breaks down the input text into smaller units called “tokens.” Tokens can be words, subwords or even individual characters, depending on the model’s configuration.
Predicting the next word: Once the input is tokenised, the LLM’s job is to predict the next token or word in the sequence. It does this based on the patterns and associations it has learned from its training data, which consists of a massive amount of text from the internet and other sources.
Sampling: The LLM doesn’t have a single “correct” answer. Instead, it generates multiple possible next tokens and assigns a probability to each one. These probabilities determine the likelihood of each token being the next word.
Instead of always picking the most likely next word, it might sometimes choose a less likely word to introduce randomness and creativity into the output. This is controlled by parameters like “temperature,” where higher values make the output more random, and lower values make it more focused.
The LLM generates text one token at a time, continually predicting the next token based on the previous ones. It repeats this process until it reaches a predefined stopping point, which could be a set number of words, a sentence or until you decide to stop it.
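To make the sampling step above concrete, here is a minimal sketch of temperature-controlled sampling over a toy next-token distribution. This is an illustration of the idea, not the internals of any particular LLM; the token scores are made up.

import math
import random

def sample_next_token(token_scores, temperature=1.0):
    # Scale the raw scores: low temperature sharpens the distribution,
    # high temperature flattens it and makes rarer tokens more likely.
    scaled = {tok: score / temperature for tok, score in token_scores.items()}
    # Softmax: turn the scaled scores into probabilities.
    max_score = max(scaled.values())
    exp_scores = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exp_scores.values())
    probs = {tok: e / total for tok, e in exp_scores.items()}
    # Draw one token according to those probabilities.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Toy next-token scores after the prompt "I need a family ..."
print(sample_next_token({"car": 5.2, "home": 3.1, "holiday": 1.4}, temperature=0.2))

With a low temperature the model almost always picks “car”, while higher values let the less likely tokens appear more often.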
3. Overview of The Developed System
As shown in Figure 2, a user engages the assistant with a request by entering text input. The assistant, powered by a Large Language Model (LLM), turns this input into a text prompt for the LLM, which generates a response (answering questions, giving recommendations, etc.). The assistant then forwards the LLM’s response to the user.
To tailor the search, the assistant extracts the user’s preferences and other search-related values from the conversation, again using the LLM. These preferences are structured as a JSON object, which the assistant then translates into parameters that can be used by the Search API.
With the search criteria set, the assistant queries the Product Search service, specifically looking for items that match the user’s preferences. Upon retrieving the results from the Product Search, the assistant presents them to the user. The assistant, with the intelligence of the LLM, acts as a bridge between the user’s desires and the digital world of product databases, providing a personalised shopping experience.
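As a rough sketch of a single assistant turn under these steps, the flow might look like the following. The helper objects and function names (llm.generate_reply, llm.extract_preferences, search_api.search) are hypothetical placeholders, not our production interfaces.

def handle_user_turn(user_message, chat_history, llm, search_api):
    # One assistant turn: reply to the user, extract search preferences, fetch matching products.
    chat_history.append({"role": "user", "content": user_message})
    # The LLM answers the user in natural language.
    reply = llm.generate_reply(chat_history)
    # The LLM condenses the conversation into structured preferences (JSON),
    # e.g. {"category": "cars", "max_price": 10000}.
    preferences = llm.extract_preferences(chat_history)
    # The assistant maps that JSON onto Search API parameters and queries the product index.
    products = search_api.search(**preferences)
    chat_history.append({"role": "assistant", "content": reply})
    return reply, products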
4. Details of Assistant-LLM Interactions
In our design, interactions with the LLM are crucial (Steps 2 and 3 in Figure 2). These interactions play a significant role, greatly impacting the overall user experience. However, we need to be mindful of two aspects: how we use the LLM and how often we use it.
The quality of responses primarily depends on how we employ the LLM. We must use it thoughtfully and effectively to ensure that the responses it provides are of high quality.
Moreover, how frequently we use the LLM has a direct impact on two important things: cost and how quickly users can interact with it. If we use the LLM too often, it can become costly due to increased resource usage. It might also slow down user interactions as the LLM processes requests.
We chose OpenAI’s GPT-3.5 Turbo (gpt-3.5-turbo) as our preferred option for the LLM because of its cost efficiency, versatility and easy integration. From this point in the article onwards, most of the technical details are specific to this model.
4.1. GPT-3.5 and System Prompt
The system prompt in the context of OpenAI’s GPT refers to the initial instructions or guidelines that are set for the AI model. This prompt essentially dictates the model’s behaviour, objectives and the boundaries within which it operates. Not all LLMs have the same kind of system prompt.
For our conversational search assistant, response quality relies mostly on the system prompt. Below is an example of an OpenAI API call made with the official Python client; you can see how the messages list holds the chat history between the user and the assistant. The messages list always begins with the system role and system prompt. In our design, we need different system prompts for different tasks.
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "I need a family car less than 10K euros"},
{"role": "assistant", "content": "Ford Focus is a good option for that."},
{"role": "user", "content": "Is there a diesel model?"}
4.2. LLM Response Generation for User Input
In our conversations, we utilise a system prompt to narrow down the subjects that GPT can discuss. It essentially steers GPT’s responses towards the types of products available on our marketplace, Leboncoin. This way, we can keep our discussions centred around helping users find the right products within the marketplace’s offerings.
It’s important to mention that the earlier version, gpt-3.5-turbo-0613, encountered issues in adhering to the system prompt. However, with the introduction of the new version, gpt-3.5-turbo-1106, the system prompt now holds more influence over the generated text, and GPT seldom loses its direction or exhibits jailbreaking tendencies.
Another important consideration is that for every LLM text generation in response to the user’s latest input, we must also include the chat history. This step is essential to ensure that GPT-3.5 or any other LLM can generate text in alignment with the ongoing conversation context. Limiting the chat history is a recommended practice to control costs effectively. Typically, using only the last 5–10 interactions as the history length should suffice for the majority of situations.
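A minimal sketch of this practice, assuming the chat history is kept as a list of OpenAI-style message dictionaries (the constant and function name are illustrative):

MAX_HISTORY_MESSAGES = 10  # roughly the last 5 user/assistant exchanges

def build_messages(system_prompt, chat_history, user_input):
    # Keep only the most recent part of the history to control token usage and cost.
    recent_history = chat_history[-MAX_HISTORY_MESSAGES:]
    return (
        [{"role": "system", "content": system_prompt}]
        + recent_history
        + [{"role": "user", "content": user_input}]
    )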
4.3. Extracting User Preferences and Other Keywords for Product Searching
To display the products available on the marketplace to the user, we need to extract user preferences in a structured format based on interactions with the conversational search assistant. We have chosen JSON as the format for this purpose, and we use GPT-3.5 to generate the JSON output. It’s important to note that the JSON generated by GPT-3.5 should be compatible with the schema of our product search API, so that the output integrates seamlessly with product searching.
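As an illustration, an exchange about a family car could be condensed into something like the snippet below. The attribute names are hypothetical and only mirror the spirit of the search filters, not the actual schema of our product search API.

# Hypothetical example of preferences extracted from a conversation about a family car,
# shaped to mirror the product search filters (attribute names are illustrative only).
extracted_preferences = {
    "category": "cars",
    "max_price": 10000,
    "fuel": "diesel",
    "keywords": ["family car"]
}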
During the early stages of our development, we utilised Guardrails to achieve this goal. Quoting their description: “Guardrails is a Python package that lets a user add structure, type, and quality guarantees to the outputs of large language models (LLMs)”.
We’ve exclusively utilised the generated system prompt along with the JSON definition provided by Guardrails, without tapping into the library’s other features.
However, a challenge we encountered with the Guardrails-generated prompt was its complexity and excessive length, which led to latency and quality issues. To address this concern, we took advantage of the fine-tuning feature offered by the OpenAI API and fine-tuned GPT-3.5. This adjustment allowed us to mitigate the latency problem effectively.
4.4. Fine Tuning GPT-3.5 for JSON Generation Task
Our fine-tuning process involves creating training data in a mostly random manner, similar to how synthetic data is generated (Figure 3). Given that we already know the attributes and potential values for the desired JSON output (which mirrors our search filters and category values), we generate random JSON objects as the expected output for the training data.
To create the input text for fine-tuning, we prompt GPT-3.5 to generate sentences that resemble those from real users, including attributes and values found in the expected JSON. We can also generate Assistant responses with GPT-3.5. However, we must exercise caution to ensure that the generated text doesn’t introduce new attributes or values that are not in the expected JSON. To evaluate the model’s performance, we generate test data in the same way.
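For illustration, a single synthetic training example in OpenAI’s chat fine-tuning format could look like the following. The system prompt wording, the user text and the attribute names are made up for this sketch.

# One synthetic training example in the chat fine-tuning format
# (stored as one JSON object per line in a .jsonl file).
training_example = {
    "messages": [
        {"role": "system",
         "content": "Extract the search filters from the conversation and return them as JSON."},
        {"role": "user",
         "content": "User: I need a family car less than 10K euros, ideally diesel."},
        {"role": "assistant",
         "content": '{"category": "cars", "max_price": 10000, "fuel": "diesel"}'}
    ]
}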
Our training process closely follows the steps outlined in the official documentation, using the default training parameter values.
When employing the fine-tuned model, we utilise a concise system prompt that instructs the model to extract information from the provided text and format it into JSON.
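A minimal sketch of calling the fine-tuned model with such a concise system prompt, reusing the client object from the earlier example; the fine-tuned model identifier is a placeholder.

conversation_text = "User: I need a family car less than 10K euros, ideally diesel."

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:pref-extractor:abc123",  # placeholder fine-tuned model name
    messages=[
        {"role": "system",
         "content": "Extract the search information from the provided text and format it as JSON."},
        {"role": "user", "content": conversation_text}
    ]
)
extracted_json = response.choices[0].message.content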
4.5. Using Official JSON Support of GPT-3.5
Before version 1106 of GPT-3.5, official support for generating JSON output was unavailable, which is why Guardrails-generated system prompts were valuable. With the introduction of version 1106, GPT-3.5 now guarantees valid JSON outputs, and we took the opportunity to use this feature for our “bikes” category. To make use of it, we set the response_format parameter to {"type": "json_object"}. We also discovered that setting the temperature to a very low value, such as 0.1, contributed to more consistent JSON outputs. To ensure the system prompt aligns with our needs, we included the JSON definition within it.
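A minimal sketch of such a call, reusing the client object from the earlier example; the JSON definition inside the system prompt is abbreviated and illustrative, not our full prompt.

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    temperature=0.1,                              # low temperature for more consistent JSON
    response_format={"type": "json_object"},      # official JSON mode
    messages=[
        {"role": "system",
         "content": "Extract the bike search filters and return only JSON. "
                    "Allowed attributes: size (one of S, M, L, XL), max_price, keywords."},
        {"role": "user", "content": "I need a big size bike under 300 euros"}
    ]
)
bike_filters = response.choices[0].message.content  # a valid JSON string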
One notable advantage of using this method for JSON generation, as opposed to relying solely on a fine-tuned model, is that we can keep the broader inference capabilities of GPT-3.5. These capabilities are often constrained in fine-tuned models due to their limited training data.
For instance, if a user inquires, “I need a big size bike,” even if the Assistant’s response doesn’t explicitly mention the size value, GPT-3.5 can still interpret “big size bike” based on the valid size values list (S, M, L, XL) provided in the system prompt. It can intelligently deduce that the user likely requires a size from the L or XL options, demonstrating the flexibility and contextual understanding that GPT-3.5 offers.
The quality of responses from a fine-tuned model is primarily determined by the training data. However, in the case of system prompt-guided JSON generation, response quality hinges on the effectiveness of the provided prompt. It’s important to note that to fix issues in the fine-tuned model’s responses, retraining is typically required, whereas this prompt-guided approach offers more flexibility and adjustment without the need for retraining.
One important point to remember is that generating testing data randomly as described above continues to be valuable for evaluating prompt-guided JSON generation.
4.6. Evaluating JSON Outputs: A Naive Approach
To evaluate the fine-tuned or prompt-guided model for JSON generation, we use a very naive approach based on the differences between the expected and the generated JSON objects. We use the DeepDiff library to compare JSON objects and display their differences. DeepDiff reports different types of differences, such as ‘added’, ‘removed’ or ‘changed’ items, and counts each type, so we can compare model performances. For future work, we plan to improve this approach.
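A minimal sketch of this comparison with DeepDiff; the expected and generated objects are toy examples.

from deepdiff import DeepDiff

expected = {"category": "cars", "max_price": 10000, "fuel": "diesel"}
generated = {"category": "cars", "max_price": 9000}

diff = DeepDiff(expected, generated)
# Typical output keys: 'dictionary_item_removed' (missing attributes),
# 'values_changed' (wrong values), 'dictionary_item_added' (invented attributes).
print(diff)

# Counting the reported differences per type gives a rough score for comparing models.
error_count = sum(len(changes) for changes in diff.values())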
5. Category Prediction
Given the diverse product categories within the Leboncoin marketplace, it’s essential to determine the user’s intended product category. This understanding allows us to employ customised prompts or fine-tuned models, ultimately enhancing the user experience.
One straightforward approach is to make a separate call and request the LLM to select a category that matches the user’s input, and then proceed accordingly. However, this method introduces an additional call for each user interaction, leading to increased latency.
A more efficient solution involves incorporating the category identification task into the system prompt for GPT-3.5. This way, GPT-3.5 can simultaneously provide both the category information and the actual response to the user’s input. To implement this multitasking approach effectively, we instruct GPT-3.5 to generate a JSON object containing two attributes: “category” and “response,” clearly defining these attributes in the system prompt. For the category attribute, it should pick a value from the predefined category list in the system prompt. For the response attribute, it should generate the actual response to the user input. This streamlined approach ensures a smoother user experience while minimising latency.
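A sketch of what such a combined system prompt and call might look like; the category list and prompt wording are illustrative, and enabling JSON mode here is an assumption made for this example rather than a description of our exact setup.

CATEGORY_AND_RESPONSE_PROMPT = (
    "You are a shopping assistant for the Leboncoin marketplace. "
    "Reply with a JSON object containing exactly two attributes: "
    '"category": one of [cars, bikes, furniture, electronics, other], '
    '"response": your actual answer to the user\'s last message. '
    "Return only the JSON object."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": CATEGORY_AND_RESPONSE_PROMPT},
        {"role": "user", "content": "I need a family car less than 10K euros"}
    ]
)
# Expected shape: {"category": "cars", "response": "..."}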
6. Final Notes
OpenAI’s new Assistant API can access a variety of tools, whether hosted by OpenAI or created/hosted by users, including code interpreters and knowledge retrieval systems. Additionally, Assistants can utilise persistent threads, which simplify AI application development by storing message history and managing it when it becomes too long for the model’s context. Furthermore, they can work with files in various formats, both in the creation process and during conversations, including creating and referencing files like images and spreadsheets while using tools.
Apart from the API still being in its beta stage (January 2024), there is one significant issue for us: all messages are stored on OpenAI’s servers. While they can be deleted, we have legal concerns that make us uncomfortable with this aspect of the tool, which is why we chose not to pursue this option.
If you’re interested in learning more about this project, check out this article from our team: Transforming E-commerce with a Conversational Search Assistant Powered by Large Language Models
Special thanks go out to my teammates Narendra Parigi, Anton Lashin, Andrii Myhal, Dmitry Ershov, and Bongani Shongwe.