When customers shop in online marketplaces, recommending the right complementary items can inspire them to make additional purchases. Until recently, we were only showing customers similar items to the one they searched for — which meant we were missing out on an opportunity. In this article, we explain the steps we went through to introduce recommendations for complementary items in Adevinta’s marketplaces.
An introduction to the Adevinta marketplaces project
With an increasing number of products available in our marketplaces, tailoring recommendations to customers is important for ensuring customer satisfaction. Most e-commerce companies focus on providing similar and related item recommendations (content-based, collaborative, or hybrid) to customers, based on their actions, such as likes, views and clicks. We realised we could improve customer engagement and matchmaking further by providing complementary item recommendations alongside similar item recommendations. For example, if a customer is looking for a tennis racket (the seed item), one could suggest accessories such as tennis balls or a tennis bag (complementary items).
This would not only increase CTR (click-through rate), a major metric used by e-commerce companies to measure the success of recommendation engines, but also improve the overall experience for customers.
What we set out to achieve
By changing our focus to recommend complementary items, we wanted to:
- Encourage users to purchase items together
- Improve the user experience by diversifying recommendations
- Divert traffic to other categories
This work would also allow us to expand an email campaign that personalises complementary recommendations based on customers’ previous purchases. After that, we wanted to use complementary recommendations as inspirational recommendations, particularly in the homepage feed.
A complementary product is one that adds value to another or that cannot be used without the other.
How do we know if two things are complementary? In practice, customers frequently buy complementary products together, so data from previous purchases could be used to learn these complementary relationships. However, co-purchase data is typically only available for a subset of items in a catalogue. As a result, most modelling approaches rely on customer interaction data (co-views, co-purchases), but due to the scarcity of these interactions, they cannot handle cold-start or low-engagement items adequately.
As we take you through our journey, we set out the approach we took to create labelled data, and the method we used to differentiate complementary items from similar items. We also demonstrate how Adevinta uses GCP (Google Cloud Platform) to create production-ready pipelines to serve complementary items to customers.
How we tackled labelling
Data labelling is an important step in this challenge. Similar items are easy to find in a candidate list because we expect them to be close in the latent space, but complementary items are not. It’s a time-consuming and expensive task to go through all of the items in the catalogue and label them at the item level. To avoid this, we used taxonomy categories to establish relationships between items (ads) at the leaf level.
As shown in Figure 2, the marketplace has three logical levels of taxonomy grouping categories: L1 category, bucket (L1 1/2), and L2 category.
We leveraged L2 categories to label the relation between items and made the following initial observations:
- The complementary relation structure is sparse: not all categories are logically complementary to all other categories.
- Relations are treated as symmetric, so we only need to label half of the category pairs.
- Data limitations: there were not enough interactions to enable all L2 categories as seeds.
As a result, we exploited the taxonomy structure and applied a few simplifications to make labelling manageable. We derived the category combinations from user interactions. Because we didn’t have purchase information, we used lead interactions (signals of interest short of a purchase) such as ExternalUrl, favourite, phonereveal, bid/send and message/send. The following labelling scheme summarises the process (a pandas sketch of the counting steps follows the list).
- Collect user interactions (leads), sort them by timestamp and remove repeated items.
- Cut each user’s interactions into 60-minute sessions.
- For each pair of consecutive events within a session, count the occurrences of the category pair L2cat_{i} → L2cat_{i+1}.
- Take the top N pairs of L2cat_{i} → L2cat_{i+1}, or all pairs whose normalised count exceeds 0.01.
- One can partially automate labelling after selecting the top N pairs. That is, when L2cat_{i} = L2cat_{i+1}, it implies a similar (S) relationship.
- The remaining category pairs are manually assigned the labels similar (S), complementary (C), or negative (N).
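To make the scheme concrete, here is a minimal pandas sketch of the counting steps, assuming a hypothetical interaction log with user_id, timestamp, item_id and l2_category columns (the column names, file path and N = 100 are illustrative, not our production schema):

```python
import pandas as pd

# Hypothetical lead-interaction log: one row per event with user_id, timestamp,
# item_id and the item's L2 category (column names are illustrative).
events = pd.read_parquet("lead_interactions.parquet")

# 1. Sort by user and time, drop repeated items per user.
events = (events.sort_values(["user_id", "timestamp"])
                .drop_duplicates(subset=["user_id", "item_id"]))

# 2. Cut each user's interactions into 60-minute sessions.
new_session = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=60)
events["session_id"] = new_session.groupby(events["user_id"]).cumsum()

# 3. For each pair of consecutive events in a session, count L2cat_i -> L2cat_{i+1}.
events["next_cat"] = events.groupby(["user_id", "session_id"])["l2_category"].shift(-1)
pair_counts = (events.dropna(subset=["next_cat"])
                     .groupby(["l2_category", "next_cat"]).size()
                     .rename("count").reset_index())

# 4. Keep the top N pairs, plus any pair above 1% of the normalised count.
pair_counts["norm"] = pair_counts["count"] / pair_counts["count"].sum()
top_n = pair_counts.nlargest(100, "count")          # N = 100 is illustrative
frequent = pair_counts[pair_counts["norm"] > 0.01]
candidate_pairs = pd.concat([top_n, frequent]).drop_duplicates()
```

The resulting candidate pairs are what we then label, partially automatically and partially by hand, as described in the last two steps above.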
Dataset preparation
Once we’d completed labelling, we had a relationship between each pair of ads, based on their L2 categories. Next, we needed to collect these relationships across all users in the marketplace and generate quadruplets, as shown below.
- Take all seed items that have complementary items, i.e. with relation = C.
- Using seed, join on pairs with relation = S.
- Again, using seed, join on pairs with relation = N.
The resulting dataset was highly unbalanced. Because users visit certain seed categories more frequently than others, the most frequently visited seeds have far more examples than minority seeds. To rectify this, we balanced the dataset by seed category using both elimination (removing a few seed categories) and under-sampling.
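As a rough illustration, the joins and the balancing step could be sketched in pandas as follows. The column names (seed_id, other_id, seed_l2, relation) and the per-category cap are assumptions for the example; in practice these steps run on our data platform (BigQuery/Dataproc, described later) rather than on a single machine.

```python
import pandas as pd

# Labelled ad pairs from the previous step: seed_id, other_id, seed_l2, relation (S/C/N).
pairs = pd.read_parquet("labelled_pairs.parquet")  # illustrative path

comp = pairs[pairs["relation"] == "C"].rename(columns={"other_id": "complementary_id"})
sim  = pairs[pairs["relation"] == "S"].rename(columns={"other_id": "similar_id"})
neg  = pairs[pairs["relation"] == "N"].rename(columns={"other_id": "negative_id"})

# Seeds that have at least one complementary item, joined with their S and N pairs.
quadruplets = (comp[["seed_id", "seed_l2", "complementary_id"]]
               .merge(sim[["seed_id", "similar_id"]], on="seed_id")
               .merge(neg[["seed_id", "negative_id"]], on="seed_id"))

# Balance by seed category: drop a few over-represented categories, under-sample the rest.
quadruplets = quadruplets[~quadruplets["seed_l2"].isin(["category_to_drop"])]  # elimination
per_category_cap = 50_000                                                      # illustrative cap
quadruplets = (quadruplets.groupby("seed_l2", group_keys=False)
                          .apply(lambda g: g.sample(min(len(g), per_category_cap),
                                                    random_state=0)))
```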
The modelling stage
The learning procedure was based on a Walmart Labs article. We first encoded the item titles in the quadruplets obtained in the previous step with the Universal Sentence Encoder, then fed the resulting 512-dimensional vectors into the network. The loss function and neural network remain almost unchanged from the original article, and items were fed into the network using the parameters it defines.
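As an illustration of the encoding step, the publicly available Universal Sentence Encoder on TF-Hub can be loaded and applied like this (the exact model version we used may differ):

```python
import tensorflow_hub as hub

# Load the public Universal Sentence Encoder; it maps text to 512-dimensional vectors.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

titles = ["tennis racket wilson pro staff", "tube of 4 tennis balls", "wooden bed frame"]
title_vectors = use(titles)  # shape (3, 512); these vectors are the network's input
```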
Figure 6a depicts the anticipated representation space. We were trying to encode items so that their representations were concentrated in the right circles. The blue circle contains items that are similar — the blue top and red top are variations of the same entity and can be interchanged. The green circle contains items that are complementary to those in the blue circle, while the white circle contains negative items. The loss function is designed in such a way that when items are outside of their corresponding circles, the model is penalised.
We used the same ranking accuracy metric as the original paper: a score of 1 is given when all three item types fall within their appropriate margins, and 0 otherwise. We trained the model with the SGD optimiser (learning rate 0.001), layer sizes of 256 and 128, and margins of 0.2/0.7/1.2 for similar/complementary/negative items, and achieved an accuracy of 0.88. The distance histograms in Figures 7a and 7b below show that each item type clusters around its own margin.
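To show how the objective fits together, below is a minimal TensorFlow sketch of a hinge-style quadruplet loss and the ranking-accuracy metric using the margins quoted above. The exact formulation (including whether embeddings are L2-normalised) follows the Walmart Labs article, so treat the details here as assumptions rather than our production code.

```python
import tensorflow as tf

M_SIM, M_COMP, M_NEG = 0.2, 0.7, 1.2  # margins for similar / complementary / negative items

# Shared tower applied to the 512-d Universal Sentence Encoder vectors.
tower = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1)),  # normalisation assumed
])

def distances(seed, similar, complementary, negative):
    e_seed, e_sim, e_comp, e_neg = (tower(x) for x in (seed, similar, complementary, negative))
    return (tf.norm(e_seed - e_sim, axis=-1),
            tf.norm(e_seed - e_comp, axis=-1),
            tf.norm(e_seed - e_neg, axis=-1))

def quadruplet_loss(seed, similar, complementary, negative):
    d_sim, d_comp, d_neg = distances(seed, similar, complementary, negative)
    # Penalise items that fall outside their target circles: similar inside M_SIM,
    # complementary inside M_COMP, negatives pushed beyond M_NEG.
    loss = tf.nn.relu(d_sim - M_SIM) + tf.nn.relu(d_comp - M_COMP) + tf.nn.relu(M_NEG - d_neg)
    return tf.reduce_mean(loss)

def ranking_accuracy(seed, similar, complementary, negative):
    # Score 1 when all three item types sit within their appropriate margins, 0 otherwise.
    d_sim, d_comp, d_neg = distances(seed, similar, complementary, negative)
    ok = (d_sim <= M_SIM) & (d_comp <= M_COMP) & (d_neg >= M_NEG)
    return tf.reduce_mean(tf.cast(ok, tf.float32))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)  # as in our training run
```

If the embeddings are L2-normalised as in this sketch, pairwise distances stay in [0, 2], which makes the 0.2/0.7/1.2 margins a natural partition of the space.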
Moving on to inference
So, we now have a model that embeds ad titles such that similar items are represented close to the seed item, complementary items are also close to the seed but not as close as similar items, and negative items are represented much further away from both similar and complementary items.
In this step, we created a list of candidate items and the top k complementary recommendations for each item on the list. This complementary matrix is used later in the pipeline to create personalised complementary item recommendations for the email campaign. However, because the marketplace’s item base is so large, we used filters to select the most appropriate items.
How the filters worked
1 We chose all items that have been active in the last 90 days and have had at least one view in the last 15 days.
2 Following that, we created embeddings for each item using the previously trained model.
3 We used the Hierarchical Navigable Small World (HNSW) technique to index and query embeddings (see the sketch after this list). In HNSW, data points are organised in layers: the top layer contains the fewest points, and those points are highly connected. A search query first finds the closest points in the top layer, then descends one layer at a time, searching for nearer neighbours starting from the entry point found in the layer above. The actual closest data points to the query are found in the deepest layer.
4 To select only complementary items, we reused the labels created at the L2 category level.
5 To diversify the recommendations, a basic rule-based ranking mechanism was used.
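To illustrate step 3, indexing and querying the item embeddings could look like this with the hnswlib library (the library choice and parameters are ours for illustration; any HNSW implementation works the same way):

```python
import numpy as np
import hnswlib

dim = 128  # dimensionality of the trained item embeddings
embeddings = np.random.rand(10_000, dim).astype(np.float32)  # stand-in for real embeddings
item_ids = np.arange(len(embeddings))

# Build the HNSW index: a layered graph whose sparse top layer is highly connected.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
index.add_items(embeddings, item_ids)

# ef controls the accuracy/speed trade-off at query time; retrieve k neighbours per seed.
index.set_ef(100)
neighbour_ids, dists = index.knn_query(embeddings[:5], k=10)
```

The returned neighbours are then restricted to complementary L2 categories (step 4) and re-ranked for diversity (step 5).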
Figures 10a and 10b show two examples of recommendations. The seed item in one is a running shoe, and the complementary recommendations are sport tops and track pants. The seed item in the other is a mattress, with bed frames and a cot as complementary recommendations.
Creating the pipeline
We used Vertex AI to train and serve models on Google Cloud Platform (GCP). To establish a standard in coding and tooling, the marketplace data science enablement team implemented the framework depicted in Figure 11.
As a result, we processed the data with BigQuery and Dataproc, trained the model with Vertex AI, and performed inference with Dataproc. Figure 12 depicts all of the steps discussed in the preceding sections. The pipeline is divided into two sections: the training pipeline, which includes labelling, data preparation and training, and the inference pipeline, where we implemented the inference part. We also implemented a real-time pipeline on a locally hosted system, which we will soon migrate to GCP.
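To give a flavour of how such a pipeline is wired together for Vertex AI, here is a heavily simplified Kubeflow Pipelines (kfp v2) sketch; the component bodies, images and bucket paths are hypothetical placeholders for the BigQuery/Dataproc and training steps described above.

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def prepare_quadruplets(source_table: str) -> str:
    # Placeholder: label interactions and build balanced quadruplets (BigQuery/Dataproc in practice).
    return "gs://example-bucket/quadruplets"  # hypothetical output location

@dsl.component(base_image="python:3.10")
def train_quadruplet_model(dataset_uri: str) -> str:
    # Placeholder: train the quadruplet network and export the model artifact.
    return "gs://example-bucket/model"        # hypothetical model location

@dsl.pipeline(name="complementary-items-training")
def training_pipeline(source_table: str = "project.dataset.lead_interactions"):
    data = prepare_quadruplets(source_table=source_table)
    train_quadruplet_model(dataset_uri=data.output)

# Compile to a job spec that can be submitted as a Vertex AI PipelineJob.
compiler.Compiler().compile(training_pipeline, package_path="training_pipeline.json")
```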
In conclusion — a success!
Our approach worked well. By connecting the language model and the quadruplet network, we’re able to generate similar and complementary representations of ads and use these ad embeddings to generate recommendations. This method also allows for the search of similar and complementary items for a given seed item. We currently only use title text to generate embeddings, but we could broaden the approach by including other information such as description, image, price and brand.
What’s next? We have more improvements planned. The current approach ignores asymmetric relationships between items. For example, if a customer is looking for a TV, it is appropriate to recommend a TV cabinet; however, if the customer is looking for a TV cabinet, they probably already have a TV, so recommending a TV as a complementary item is redundant. By factoring in asymmetric relationships, we can make our recommendations even more relevant. In addition, we’re looking into ways to automate the labelling process. We’ll continue making improvements once the results of the current A/B testing are available.
Do you have any comments on our methodology? Or tips on further improvements? Please get in touch.