Multi-Task Learning in Recommender Systems: Challenges and Recent Breakthroughs
Positive and negative transfer, task grouping, hard parameter sharing, MMoE, auxiliary learning, and gradient scaling
While multi-task learning is well established in computer vision and natural language processing, its use in recommender systems is still relatively new and therefore not as well understood. In this post, we’ll take a deep dive into some of the most important design considerations and recent research breakthroughs behind modern multi-task ranking models. We’ll cover:
why we need multi-task ranking models in the first place,
positive and negative transfer: the key challenge in multi-task learners,
hard parameter sharing and soft parameter sharing with MMoE,
auxiliary learning: adding new tasks for the sole purpose of improving the main task, and
gradient scaling: how to make auxiliary learning even better.
Let’s get started.
Why multi-task recommender systems?
The key advantage of multi-task recommender systems is their ability to solve for multiple business objectives at the same time. For example,
in a video recommender system we may want to optimize for clicks, but also for watch times, likes, shares, comments, or other forms of user engagement.
in an e-commerce recommender system, we may want to optimize for purchase, but also for click and add-to-list (an indicator for future purchases).
In such a situation, a single multi-task model is not only computationally cheaper than multiple single-task models, it can also have better predictive accuracy per task when different tasks help each other, as we’ll see later.
Even in cases where we only want to predict one event, such as “purchase” in an e-commerce recommender system, we can still add additional tasks with the sole purpose of improving performance on the main task. We call these additional tasks “auxiliary tasks”, and this form of learning “auxiliary learning”. In the e-commerce example, it may make sense to also learn “add-to-cart” as well as “add-to-list” along with “purchase”, given that all of these events are closely related to each other: they indicate shopping intent.
Positive transfer, negative transfer, and task grouping
At a high level, predicting a second task can either help with the first task or do the opposite: make the prediction of the first task worse. We call the former case “positive transfer” and the latter “negative transfer”. The challenge in multi-task learning is then to only learn tasks together that have positive transfer on each other, and avoid negative transfer, which can degrade model performance.
A key question in multi-task learning is then which tasks “get along well” together, that is, create positive and not negative transfer. In many cases, we can make reasonable guesses with domain expertise. We’ve already seen an example above: “purchase” and “add-to-cart” both indicate shopping intent, and therefore should work well together in a multi-task learner (and in fact, they do, as we’ll see later).
However, if the number of tasks becomes large, we may need to determine algorithmically which tasks should be learned together, and which should be learned separately. Notably, this is an NP-hard problem because the number of possible task groupings scales exponentially with the number of tasks!
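To get a feel for how quickly this search space explodes, here’s a small sketch (my own illustration, not from any paper) that counts the number of ways to partition n tasks into disjoint groups, which is the Bell number:

```python
from math import comb

def bell(n: int) -> int:
    """Number of ways to partition n tasks into disjoint, non-empty groups."""
    # Bell numbers via the recurrence B(m+1) = sum_k C(m, k) * B(k), with B(0) = 1.
    b = [1]
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]

for n in (2, 5, 10, 20):
    print(f"{n} tasks -> {bell(n):,} possible groupings")
```

With just 10 tasks there are already over 100,000 possible groupings, and with 20 tasks more than 10^13, so exhaustive search is clearly off the table.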
Finding scalable solutions to the task grouping problem is therefore not trivial, and requires us to make certain approximations. The 2019 paper “Which Tasks Should Be Learned Together in Multi-task Learning?” discusses several such approximations in the context of multi-task computer vision models; however, that discussion is beyond the scope of this article. For our purposes, let’s simply consider the case where the set of tasks is fixed, and ask how we can best learn them together.
Hard parameter sharing and MMoE
A simple way to build a multi-task neural network is a technique known as “hard parameter sharing”, or “shared bottom”, where we combine a shared bottom module with task-specific top modules, one for each task. With this architecture, the bottom module learns patterns that are task-generic, while the top modules learn patterns that are task-specific.
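As a concrete illustration, here’s a minimal shared-bottom model in PyTorch. The layer sizes and the number of tasks are placeholder choices for the sketch, not values from any particular production system:

```python
import torch
import torch.nn as nn

class SharedBottomModel(nn.Module):
    """Hard parameter sharing: one shared bottom, one small head ("tower") per task."""

    def __init__(self, input_dim: int = 128, bottom_dim: int = 64, num_tasks: int = 3):
        super().__init__()
        # Shared bottom: learns task-generic patterns.
        self.bottom = nn.Sequential(
            nn.Linear(input_dim, bottom_dim),
            nn.ReLU(),
        )
        # Task-specific towers: learn task-specific patterns on top.
        self.towers = nn.ModuleList(
            [nn.Linear(bottom_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        shared = self.bottom(x)
        # One logit per task, e.g. click, add-to-cart, purchase.
        return [tower(shared) for tower in self.towers]
```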
The output of the multi-task learner is then a list of predictions, one per task, which we can combine into a final loss as

L = sum_k w_k * loss(p_k, y_k),

where p_k are the predictions, y_k are the labels, and w_k are the task-specific weights controlling the relative importance of each task, all indexed by task k. Minimizing L with stochastic gradient descent then optimizes the model on all tasks jointly.
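In code, this loss could look as follows, assuming binary tasks (click vs. no click, purchase vs. no purchase) trained with cross-entropy:

```python
import torch.nn.functional as F

def multi_task_loss(predictions, labels, task_weights):
    """L = sum over tasks k of w_k * loss(p_k, y_k)."""
    return sum(
        w * F.binary_cross_entropy_with_logits(p, y)
        for p, y, w in zip(predictions, labels, task_weights)
    )
```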
The disadvantage of hard parameter sharing is that the split between shared and dedicated modeling capacity is hard-coded ahead of time and identical for all tasks. In practice this may not be ideal: for example, if a task creates negative transfer on all other tasks, we’d prefer to give that task more dedicated and less shared modeling capacity. Capacity allocation, in other words, should be dynamic, not static, depending on the tasks that are being learned.
Enter MMoE, short for “Multi-gate Mixtures of Experts”, introduced in a 2018 paper from Google. This is an extension of Mixtures of Experts (which I covered here) to multi-task learners: instead of one gate, we use N gates, one for each task.
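Here’s what that key idea could look like in a minimal PyTorch sketch; the number and size of the experts are placeholder choices, and the task-specific towers from the paper appear here in simplified form:

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Multi-gate Mixture of Experts: shared experts, one softmax gate per task."""

    def __init__(self, input_dim=128, expert_dim=64, num_experts=4, num_tasks=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)]
        )
        # One gate per task: each task learns its own mixture over the shared experts.
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in range(num_tasks)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)  # (B, E, 1)
            mixed = (weights * expert_outputs).sum(dim=1)           # (B, D)
            outputs.append(tower(mixed))
        return outputs
```

Because each task has its own gate, tasks that transfer well can share experts heavily, while tasks that don’t can concentrate their gate weights on mostly separate experts.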
The authors show that MMoE (red curve in the bottom plot above) outperforms both hard parameter sharing (green) and MoE (blue) on a synthetic dataset as well as on Google’s production data. The plot also shows that the gap between MoE and MMoE widens as task correlation decreases: the less correlated the tasks, the more we benefit from MMoE. This is because capacity allocation in MMoE is more flexible, which allows the model to allocate its neurons in a way that minimizes negative transfer.
(How exactly experts are allocated in MMoE is still an open research question. I discussed expert allocation in MoE here).
Auxiliary learning and gradient scaling
In many recommendation problems, the predictive performance on the main task can be improved via joint learning of auxiliary tasks even if these additional tasks aren’t actually needed at serving time. For example,
when trying to predict conversion rates, predictive performance improves when jointly learning the auxiliary task of predicting click-through rates (Ma et al 2018),
when trying to predict user ratings, predictive performance improves when jointly learning the auxiliary task of predicting item metadata such as genre and tags (Bansal et al 2016),
when trying to predict reading duration in a newsfeed, predictive performance improves when jointly learning the auxiliary task of predicting click-through rates (Zhao et al 2021),
just to name a few. In all of these cases, the purpose of the auxiliary task is not to use the prediction at inference time, but instead solely to boost predictive performance on the main task we’re trying to learn.
Auxiliary learning works because the auxiliary tasks add gradient signal that helps the model find a better minimum in the parameter space, and this extra signal is particularly useful when the gradient signal from the main task is sparse. For example, conversions (purchases) are much rarer than clicks, so we can expect the gradient signal from click-through prediction to be richer, supplementing the sparse signal from conversion prediction.
However, the gradients in auxiliary learning can be highly imbalanced, such that the auxiliary gradients either dominate the learning or barely matter at all. The authors of the MetaBalance algorithm, proposed in a 2022 paper, set out to solve exactly this problem. The key idea in MetaBalance is to scale the auxiliary gradients so that their magnitude matches that of the main-task gradients.
Formally:

g_aux <-- g_aux * r * ||g_main|| / ||g_aux||,

where g_aux is the gradient from the auxiliary task, g_main is the gradient from the main task, and r is a hyper-parameter which the authors determine empirically (r = 0.7 is found to work best for the problems studied in the paper).
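As a simplified sketch, the scaling step could be implemented like this. Note that the actual MetaBalance algorithm is more involved (it operates on the gradients of the shared parameters and uses moving averages of the gradient magnitudes); this stripped-down version just rescales a single auxiliary gradient tensor:

```python
import torch

def scale_auxiliary_gradient(g_aux: torch.Tensor,
                             g_main: torch.Tensor,
                             r: float = 0.7,
                             eps: float = 1e-12) -> torch.Tensor:
    """Rescale the auxiliary gradient to r times the main gradient's magnitude."""
    # g_aux <-- g_aux * r * ||g_main|| / ||g_aux||
    return g_aux * r * g_main.norm() / (g_aux.norm() + eps)
```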
On an e-commerce shopping dataset where the target task is “purchase” and the auxiliary tasks are “click”, “add-to-cart”, and “add-to-list”, MetaBalance improved NDCG@10 from 0.82 to 0.99, a 17% improvement relative to the standard shared-bottom architecture with no gradient scaling.
Summary
Let’s recap:
In modern recommender systems we often need to learn multiple tasks together, such as “click”, “add-to-cart”, and “purchase” in an e-commerce recommender system - hence the need for multi-task neural networks.
Not all tasks can be learned well together. Tasks can help each other, creating positive transfer, or fight with each other, creating negative transfer. Figuring out which tasks to learn together is an NP-hard problem and still an active research domain.
Hard parameter sharing (aka “shared bottom”) is the simplest and most common way to solve multi-task learning: it combines a shared bottom module with task-specific modules to allow the model to learn both shared patterns and task-specific patterns.
MMoE (multi-gate mixture of experts) is a way to allocate modeling capacity in a multi-task model dynamically. The key idea is to introduce N gates, one for each task. This has been shown to mitigate negative transfer and hence improve overall model performance.
In auxiliary learning, we add additional tasks with the sole purpose of improving performance on the main task. Algorithms such as MetaBalance apply additional scaling on top of the auxiliary gradients to boost performance on the main task even more.
And this is just the tip of the iceberg. There are plenty of open questions left in this domain: How to craft good auxiliary tasks? What’s a good ratio of auxiliary to main tasks? How exactly are experts being allocated in MMoE? What’s the best number of experts given the number of tasks?
Watch this space. Multi-task learning in recommender systems is still far from being a solved problem.
Reader question of the week
This question relates to MoE, which I covered in depth here. The question is:
How would MoE work in a computer vision problem? Like for an image classifier, would some experts learn dogs and others cats? Would the balancer then be able to figure out which one to use?
The famous MoE applications so far have been in NLP: Google's Switch Transformer and (reportedly) OpenAI's GPT-4 both leverage sparse MoE to scale up modeling capacity (with constant computational complexity!). That said, there's no reason not to use MoE to scale up vision models as well, and the same principles should apply.
For example, let's consider the simplified case of a classifier that learns to separate dog images from cat images. We could solve such a problem with a model like AlexNet, ResNet, or whatever is the latest available computer vision model at the time. Let's say it's a ResNet model for now.
With MoE, we could add multiple ResNet module "experts" in parallel and combine with a hard gate that routes each training example to a single expert. That way, the different experts could learn to solve the classification problem on different domains of the data: for example, some experts could specialize in daytime images and some in nighttime images. Or some could specialize in certain breeds of dogs.
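As a toy sketch of that idea, here’s a hard-gated (top-1) MoE layer in PyTorch, with small linear stubs standing in for the full ResNet experts:

```python
import torch
import torch.nn as nn

class HardGatedMoE(nn.Module):
    """Top-1 routing: each example is classified by exactly one expert."""

    def __init__(self, input_dim=256, num_classes=2, num_experts=4):
        super().__init__()
        # Linear stand-ins for full ResNet experts.
        self.experts = nn.ModuleList(
            [nn.Linear(input_dim, num_classes) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        expert_idx = self.gate(x).argmax(dim=-1)               # (B,) chosen expert
        logits = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C)
        # Select each example's routed expert output.
        return logits[torch.arange(x.size(0)), expert_idx]
```

(For clarity, the sketch runs every expert and then selects one output; a real sparse implementation would only execute the routed expert, and would need something like noisy top-k routing with a load-balancing loss to train the gate, since argmax alone provides no gradient.)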
Recently, and inspired by the success of MoE in NLP, Shen et al 2023 explored MoE as a way to scale up Vision-Language models, which are used for applications such as answering questions given an image. Similar to the NLP use cases, they find that different experts specialize in different image regions, such as one expert specializing in eyes, another in letters and words, and so on.
While MoE in NLP is well established (Switch Transformer, GPT-4, etc.), it’s still in its early days in computer vision. I’d expect significant progress in this direction over the next few years.