(Some) Biases in Recommender Systems You Should Know
The quest to build unbiased ranking models from biased data
Recommender systems are among the most rapidly evolving ML applications in the industry today, and much of the research of recent years has gone into better neural architectures, better multi-task optimization, and better user-history modeling.
However, even the best such models are still plagued by various biases: shortcuts the model learns to minimize its loss during training but that result in counter-productive behavior that we never intended in the first place. Biases are difficult to detect and fix because doing so requires us to look beyond predictive accuracy alone.
This week, let’s talk about 5 of them:
clickbait bias,
duration bias,
position bias,
popularity bias, and
single-interest bias.
Let’s get started.
1 — Clickbait bias
Wherever there’s a consumer-facing platform (be it video, news, or any other form of entertainment), there’s bound to be clickbait: sensational or misleading headlines or video thumbnails designed to grab a user’s attention and entice them to click, without providing any real value. “You won’t believe what happened next!”
If we train a ranking model using clicks as positives, naturally that model will be biased in favor of clickbait. This is bad, because such a model, once deployed into production, would promote even more clickbait to users, and therefore amplify the damage it does.
One solution for de-biasing ranking models from clickbait, proposed by Covington et al (2016) in the context of YouTube video recommendations, is to replace the standard logistic regression formulation (impressions with clicks as positives, impressions without clicks as negatives) with weighted logistic regression, where positive training examples (impressions with clicks) are weighted by their watch time and negative training examples (impressions without clicks) receive unit weight.
Mathematically, it can be shown that such a weighted logistic regression model learns odds that approximate the expected watch time of a video: roughly speaking, the learned odds equal the total watch time divided by the number of un-clicked impressions, which is close to the expected watch time as long as the click probability is small. At serving time, videos are ranked by their predicted odds, so that videos with long expected watch times are placed first and clickbait (with the lowest expected watch times) last.
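To make this concrete, here is a minimal sketch of watch-time-weighted logistic regression using scikit-learn. The features, click labels, and watch times below are made-up stand-ins for real impression logs, not the YouTube setup itself:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in impression logs: features, click labels, and watch times (seconds).
X = rng.random((1000, 16))
clicked = rng.integers(0, 2, 1000)
watch_time = clicked * rng.exponential(120.0, 1000)

# Covington-style weighting: positives weighted by their watch time,
# negatives (no click, zero watch time) get unit weight.
sample_weight = np.where(clicked == 1, watch_time, 1.0)

model = LogisticRegression(max_iter=1000)
model.fit(X, clicked, sample_weight=sample_weight)

# At serving time, rank candidates by their predicted odds p / (1 - p),
# which under this weighting scheme approximate the expected watch time.
p = model.predict_proba(X)[:, 1]
odds = p / (1.0 - p)
ranking = np.argsort(-odds)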
Covington et al don’t share detailed experimental results, but they do say that weighted logistic regression performs “much better” than predicting clicks directly in their production model.
2 — Duration bias
Weighted logistic regression works well for solving the clickbait problem, but it introduces a new problem: duration bias. Simply put, longer videos tend to accumulate more watch time, not necessarily because they're more relevant, but simply because they're longer.
Think about a video catalog that contains 10-second short-form videos along with 2-hour long-form videos. A watch time of 10 seconds means something completely different in the two cases: it’s a strong positive signal in the former, and a weak positive (perhaps even a negative) signal in the latter. Yet, the Covington approach would not be able to distinguish between these two cases, and would bias the model in favor of long-form videos (which generate longer watch times simply because they’re longer).
A solution to duration bias, proposed by Zhan et al (2022) from KuaiShou, is quantile-based watch-time prediction.
The key idea is to bucket all videos into duration quantiles, and then bucket all watch times within a duration bucket into quantiles as well. For example, with 10 quantiles, such an assignment could look like this:
(training example 1)
video duration = 120min --> video quantile 10
watch duration = 10s --> watch quantile 1
(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10
...
By translating all time intervals into quantiles, the model can understand that 10s is “high” in the latter example but “low” in the former, or so the authors hypothesize. At training time, we provide the model with the video's duration quantile and task it with predicting the watch quantile. At inference time, we simply rank all videos by their predicted watch quantile (or the watch time it maps back to), which is now de-confounded from the video duration itself.
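As a rough illustration of the bucketing step (not the KuaiShou implementation), here is how duration quantiles and per-duration-bucket watch quantiles could be computed from logged data with pandas; the column names and distributions are made up:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in watch logs: video duration and watch time, both in seconds.
logs = pd.DataFrame({
    "video_duration": rng.lognormal(mean=4.0, sigma=1.5, size=10_000),
    "watch_time": rng.exponential(scale=60.0, size=10_000),
})

NUM_BUCKETS = 10

# Step 1: bucket videos into duration quantiles (1 = shortest, 10 = longest).
logs["duration_bucket"] = pd.qcut(logs["video_duration"], q=NUM_BUCKETS, labels=False) + 1

# Step 2: within each duration bucket, bucket watch times into quantiles.
# The watch quantile becomes the training label instead of the raw watch time.
logs["watch_quantile"] = logs.groupby("duration_bucket")["watch_time"].transform(
    lambda s: pd.qcut(s, q=NUM_BUCKETS, labels=False) + 1
)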
And indeed, this approach appears to work. Using A/B testing, the authors report
0.5% improvements in total watch time compared to weighted logistic regression (the idea from Covington et al), and
0.75% improvements in total watch time compared to predicting watch time directly.
The results show that removing duration bias can be a powerful approach on platforms that serve both long-form and short-form videos. Perhaps counter-intuitively, removing bias in favor of long videos in fact improves overall user watch times.
3 — Position bias
Position bias simply means that users engage with the first thing they see, whether it is particularly relevant to them or not. The model predictions therefore become a self-fulfilling prophecy, but this is not what we really want: we want to predict what users want, and not make them want what we predict.
Joachims et al (2016) show that we can de-bias our model by weighting each training sample by the inverse of its estimated position bias, giving more weight to samples with low bias and less weight to samples with high bias. Intuitively, this makes sense: a click on the first-ranked item (with high position bias) is probably less informative than a click on the 10th-ranked item (with low position bias). In order to create the weights, we therefore need an estimate of position bias as a function of position — how do we get that?
One way is result randomization: for a small subset of the serving population, simply re-rank the top N items randomly, and then measure the change in engagements as a function of rank within that population. This works, but it’s costly, as we’ll end up showing random recommendations to a subset of the user population.
A more economical way is intervention harvesting, proposed by Agarwal et al (2018) in the context of full-text document search, and in parallel by Aslanyan et al (2019) in the context of e-commerce search. The key idea is that the logged engagement data in a mature ranking system already contains the ranks produced by multiple different ranking models, for example from historic A/B tests or simply from different versions of the production model rolled out over time. This historic diversity creates an inherent randomness in ranks, which we can “harvest” to estimate position bias without any costly interventions.
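Once we have position-bias (propensity) estimates per rank, from either randomization or intervention harvesting, the inverse weighting itself is straightforward. A minimal sketch with made-up propensity values and ranks:

import numpy as np

# Hypothetical examination propensities by rank (rank 1 is examined most often).
# In practice these come from result randomization or intervention harvesting.
propensity = np.array([1.00, 0.62, 0.45, 0.35, 0.28, 0.24, 0.20, 0.18, 0.16, 0.15])

# Ranks at which the logged clicks happened (1-indexed).
clicked_ranks = np.array([1, 1, 3, 7, 2, 10])

# Inverse-propensity weights: clicks at low-bias (deep) positions count more.
sample_weight = 1.0 / propensity[clicked_ranks - 1]

# These weights can then be passed to any ranker that accepts per-example
# weights, e.g. model.fit(X, y, sample_weight=sample_weight) in scikit-learn.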
Lastly, there’s an even simpler idea, namely Google’s “Rule 36”. They suggest simply adding the rank itself as yet another feature when training the model, and then setting that feature to a default value (such as -1) at inference time. The intuition is that, by providing all the information to the model upfront, it will implicitly learn both an engagement model and a position-bias model under the hood. The important thing is to keep the position feature separate from all the other features (because we don’t want to cross them), for example by using a two-tower architecture.
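For illustration only (all names and sizes below are ours, not Google’s implementation), here is one way the “position as a feature” idea could look in PyTorch: a relevance tower for the content features plus a separate position-bias term, with a reserved “unknown” position playing the role of the default value at serving time:

import torch
import torch.nn as nn

NUM_POSITIONS = 50                  # positions 0..49 observed during training
UNKNOWN_POSITION = NUM_POSITIONS    # reserved index used at inference (the "-1" default)

class RankerWithPositionBias(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        # Relevance tower: sees only the content/user features.
        self.relevance = nn.Sequential(
            nn.Linear(num_features, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        # Position tower: one learned bias per position, never crossed
        # with the other features.
        self.position_bias = nn.Embedding(NUM_POSITIONS + 1, 1)

    def forward(self, features, position):
        return (self.relevance(features) + self.position_bias(position)).squeeze(-1)

model = RankerWithPositionBias(num_features=32)

# Training: feed the position at which each impression was actually shown.
train_logits = model(torch.randn(8, 32), torch.randint(0, NUM_POSITIONS, (8,)))

# Inference: feed the reserved default position for every candidate, so that
# only the relevance tower drives the ranking.
scores = model(torch.randn(8, 32), torch.full((8,), UNKNOWN_POSITION))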
4 — Popularity bias
Popularity bias refers to the tendency of the model to favor items that are popular overall (because they’ve been rated by more users), rather than items that are actually high-quality or relevant for a particular user. This can lead to a distorted ranking in which less popular or niche items that might be a better fit for the user’s preferences are not given adequate consideration.
Yi et al (2019) from Google propose a simple but effective algorithmic tweak to de-bias a video recommendation model from popularity bias. During model training, they replace the logits in their logistic regression layer as follows:
logit(u,v) <-- logit(u,v) - log(P(v))
where
logit(u,v) is the logit function (i.e., the log-odds) for user u engaging with video v, and
log(P(v)) is the log-frequency of video v.
Of course, the right-hand side is equivalent to:
log[ odds(u,v)/P(v) ]
In other words, they simply normalize the predicted odds for a user/video pair by the video popularity P(v). Extremely high odds from popular videos count as much as moderately high odds from not-so-popular videos. In online A/B tests, the authors find a 0.37% improvement in overall user engagements with the de-biased ranking model.
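For illustration, the correction itself is just subtracting the item’s log-popularity from its raw logit during training. A minimal sketch with a made-up helper and made-up counts:

import numpy as np

def popularity_corrected_logits(logits, impression_counts):
    # Empirical popularity P(v): how often each video appears in the logs.
    p_v = impression_counts / impression_counts.sum()
    # Subtracting log P(v) is equivalent to normalizing the odds by popularity.
    return logits - np.log(p_v)

raw_logits = np.array([2.1, 1.4, 0.3])       # model outputs for three videos
counts = np.array([90_000, 5_000, 100])      # very popular ... very niche
print(popularity_corrected_logits(raw_logits, counts))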
5 — Single-interest bias
Suppose you watch mostly drama movies, but sometimes you like to watch a comedy, and from time to time a documentary. You have multiple interests, yet a ranking model trained with the single objective of maximizing watch time would over-emphasize drama movies because that’s what you’re most likely to engage with. This is single-interest bias, the failure of a model to understand that users inherently have multiple interests and preferences.
In order to remove single-interest bias, a ranking model needs to be calibrated. Calibration simply means that, if you watch drama movies 80% of the time, then the model’s top 100 recommendations should in fact include around 80 drama movies: if they include more, the model would be over-calibrated, and if they include fewer, it would be under-calibrated.
Netflix’s Harald Steck (2018) demonstrates the benefits of model calibration with a simple post-processing technique called Platt scaling, which maps predicted probabilities to actual probabilities using a scaled sigmoid function whose parameters are learned on a hold-out set. He presents experimental results demonstrating that the method improves the calibration of Netflix recommendations, which he quantifies with KL-divergence scores. The resulting movie recommendations are more diverse — in fact, as diverse as the actual user preferences — and lead to improved overall watch times.
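A minimal sketch of Platt scaling as described here: fit a one-feature logistic regression (a scaled sigmoid) that maps raw model scores to observed outcomes on a hold-out set, then use it to calibrate new predictions. The data below is synthetic and the setup is an assumption, not the Netflix pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hold-out set: raw (uncalibrated) model scores and the observed labels.
raw_scores = rng.random(5_000)
labels = (rng.random(5_000) < raw_scores ** 2).astype(int)   # scores are over-confident

# Platt scaling: learn sigmoid(a * score + b), i.e. a one-feature
# logistic regression on top of the raw scores.
platt = LogisticRegression()
platt.fit(raw_scores.reshape(-1, 1), labels)

# Calibrated probabilities for fresh predictions.
new_scores = np.array([[0.2], [0.5], [0.9]])
calibrated = platt.predict_proba(new_scores)[:, 1]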
Summary
Clickbait bias means that the model is biased in favor of clickbait content, and can be fixed with techniques such as weighted logistic regression.
Duration bias means that the model is biased in favor of long videos (and against short videos). One way to fix it is to use quantile-based watch-time prediction instead of predicting watch times directly.
Position bias means that users are more likely to click on the first thing they see, whether it’s relevant or not. We can fix it either by weighting each training example by the inverse of its estimated position bias (if we have such an estimate), or by using the position directly as a feature in the model.
Popularity bias means that the model is biased in favor of popular content instead of the unique interests of a particular user. One way to fix it is to normalize the predicted odds by item popularity, i.e. subtract log-popularity from the logits.
Single-interest bias means that the model fails to learn multiple user interests at the same time and instead over-exploits the most prevalent user interest. This can be fixed by calibrating the prediction scores, for example using Platt scaling.
It’s not enough to simply assume that ranking models are neutral or objective: they’ll always reflect the biases that exist in the data they’re trained on. De-biasing is far from being a solved problem, and as recommender systems continue to evolve, we can expect new biases to emerge. Coming up with innovative ways to detect, quantify, and alleviate these biases remains one of the most important research domains in the industry today.