How Long Is Long? Duration Bias in Short-Form Video Recommendation
In the absence of clicks, watch time has become the new main proxy for video relevance - which introduces new biases that require new solutions
The recommender systems behind modern, immersive, short-form video feeds (the TikTok-pioneered applications where you swipe directly to the next full-screen video instead of selecting a particular video from a ranked list) have to solve a unique problem: the absence of user clicks.
This is a problem because a click used to be a good proxy for relevance: if a user selected a particular video out of a list of thumbnails shown to them at the same time, we can be relatively confident that the video was indeed a good match. We just need to account for clickbait bias, which can be done with techniques such as weighted logistic regression, i.e. weighting positive training examples by watch time and negative ones with unit weights.
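To make that weighting scheme concrete, here's a minimal sketch using scikit-learn; the feature matrix, labels, and watch times are made up for illustration, and a production system would of course use a far richer model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical impression log: X holds impression features, `clicked` is the
# binary label, and `watch_time` is the dwell time (seconds) after a click.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
clicked = rng.integers(0, 2, size=1000)
watch_time = np.where(clicked == 1, rng.exponential(30, size=1000), 0.0)

# Weighted logistic regression: positives are weighted by their watch time,
# negatives get unit weight, so a long, engaged view counts for more than a
# clickbait-style click that's abandoned after a second.
sample_weight = np.where(clicked == 1, watch_time, 1.0)

model = LogisticRegression()
model.fit(X, clicked, sample_weight=sample_weight)
```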
In contrast, in modern short-form video recommendation, watch time — how long a user engages with a particular piece of content — has replaced the click as the primary proxy for relevance. The longer the watch time, the more likely the video was a good match for the user, or so the theory goes.
Which introduces a new problem: how long users watch a given piece of content doesn't only depend on how well that content matches their unique interests, but also on how long the content itself is. Longer videos usually generate longer watch times simply because it takes users longer to figure out whether the content is for them or not. A system optimized for watch time alone will therefore be biased in favor of longer videos. This is a problem not because longer videos are inherently worse, but because video length becomes a confounder for video relevance. We want to recommend what's most relevant, regardless of duration.
In this post, let’s dive into some of the recent research around the relatively new problem of debiasing short-form video recommendation from duration bias.
D2Q (Kuaishou, 2022)
D2Q, short for “Duration-Debiased Quantile-based watch time prediction”, was introduced in a 2022 paper from Kuaishou. The paper starts with a recognition and qualitative exploration of the problem of duration bias in their own short-form video recommendation system. The plot below shows how the distribution of impressed video durations changed over the course of 11 months. We see clearly that the system has shifted from shorter to longer content.
This, the authors argue, is not a consequence of shifting user preference but instead of the duration bias: their models start favoring longer videos not because they’re better but simply because they generate longer watch times, the metric they’re trying to optimize.
The key idea in their D2Q algorithm is to not use any actual durations in seconds in the model, but instead rely on quantization. In particular, we bucket all videos into duration quantiles, and then bucket all watch times within a duration bucket into their own quantiles as well. For example, with 10 quantiles, such an assignment could look as follows:
(training example 1)
video duration = 100s --> video quantile 10
watch duration = 10s --> watch quantile 2 (low)
(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10 (high)
(training example 3)
video duration = 100s --> video quantile 10
watch duration = 100s --> watch quantile 10 (high)
...
It is worth pausing for a moment to ask why exactly this approach removes duration bias: it's because the model never gets to see actual watch durations in seconds, just watch duration quantiles. In such a quantization scheme, the labels for training examples 2 and 3 are identical (both are “high”), even though their actual watch times in seconds aren't - and that's precisely what we want. In contrast, the watch times in training examples 1 and 2 are identical, but they mean different things: the former is low (because the video itself is long), and the latter is high (because the video itself is short).
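Here's a rough sketch of what this two-step quantization could look like in pandas, on synthetic data; the column names, toy data, and exact bucket count are illustrative assumptions rather than Kuaishou's actual pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
duration = rng.uniform(5, 120, n)                      # video length in seconds
watch = np.minimum(rng.exponential(20, n), duration)   # observed watch time

df = pd.DataFrame({"duration": duration, "watch_time": watch})

NUM_BUCKETS = 10  # the paper finds ~20-30 buckets work best on their data

# Step 1: bucket videos into equal-sized duration quantiles.
df["duration_bucket"] = pd.qcut(df["duration"], q=NUM_BUCKETS, labels=False)

# Step 2: within each duration bucket, replace the raw watch time (seconds)
# by its quantile rank; this rank is what the model is trained to predict.
df["watch_quantile"] = df.groupby("duration_bucket")["watch_time"].rank(pct=True)
```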
At training time, we provide the model with the video duration and other inputs such as user id, video id, video category, etc., and task it with predicting the watch time quantile within the duration bucket the video falls into.
The authors consider 3 architectures:
Naive-D2Q combines shared feature embedding layers with dedicated MLPs for each duration bucket. The objective in each MLP is binary classification: does the video fall into this watch time bucket or not? The downside of this design is its computational complexity: for a large number of buckets k, the model grows undesirably large, which the authors deem “not practical in real production systems”.
D2Q uses a single, shared MLP to process the inputs for all duration buckets. In order for this to work, the learning objective needs to be a regression problem instead of binary classification: what's the predicted watch time bucket (from 1 to 10) for this video? The advantage of this architecture is that information learned from different video duration buckets is shared by design.
Res-D2Q is a variation of D2Q in which we pass the duration feature itself through a dedicated shallow tower instead of processing it along with all of the other features. Think of this as a special case of two-tower debiasing, where the second tower takes as inputs just the duration of the video.
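To make the idea more tangible, here's a minimal PyTorch sketch of a Res-D2Q-style model; the layer sizes, the additive combination of the two towers, and all names are my own illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class ResD2QSketch(nn.Module):
    """Shared MLP over all features plus a shallow tower that only sees the
    video duration, in the spirit of Res-D2Q."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.main_tower = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Shallow tower: its only input is the (normalized) video duration.
        self.duration_tower = nn.Sequential(
            nn.Linear(1, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, features: torch.Tensor, duration: torch.Tensor) -> torch.Tensor:
        # The duration tower's output is added to the main tower's output, so
        # it can absorb the duration-dependent part of the watch-quantile label.
        return self.main_tower(features) + self.duration_tower(duration)

model = ResD2QSketch(feature_dim=64)
pred = model(torch.randn(32, 64), torch.rand(32, 1))  # predicted watch quantile
```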
The authors compare the latter two architectures against two baselines:
VR: value regression, i.e. predicting watch time directly,
WLR: weighted logistic regression with watch time weights for positives and unit weights for negatives. The binary label in this case is whether or not watch time for the video exceeds the 60th percentile of all watch times.
The authors evaluate the performance of the different algorithms using XAUC, an extension of AUC to dense values. For a pair of training samples, we score 1 if the two predicted watch times are in the same order as the ground truth and 0 otherwise. XAUC is then the average score over all pairs of examples, with a perfect model scoring 1.0 and a random model scoring 0.5.
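A naive, quadratic-time implementation of XAUC as described above could look like this; how ties in the predictions should be handled isn't something I've verified against the paper, so treat that detail as an assumption.

```python
import itertools

def xauc(y_true, y_pred):
    """Fraction of example pairs whose predicted watch times are ordered the
    same way as the ground-truth watch times."""
    correct, total = 0, 0
    for (t_i, p_i), (t_j, p_j) in itertools.combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # skip pairs with identical ground truth
        total += 1
        correct += int((t_i > t_j) == (p_i > p_j))
    return correct / total if total else float("nan")

# Perfectly ordered predictions score 1.0; random predictions hover around 0.5.
print(xauc([10, 30, 60], [0.2, 0.5, 0.9]))  # -> 1.0
```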
The plot below summarizes the results for D2Q (blue) and Res-D2Q (yellow) as a function of number of quantiles used.
Both D2Q variants significantly outperform the baseline models in terms of XAUC. This is expected because WLR and VR suffer from duration bias, that is, for longer videos their predicted watch times are biased high. Res-D2Q works even better than D2Q, demonstrating the effectiveness of the additional shallow tower for processing video duration itself.
Lastly, the performance of D2Q depends on the number of quantiles: too few, and the debiasing is not as effective because a single duration group is too diverse. Too many, and the duration groups become too small, increasing the risk of overfitting. The sweet spot for this particular dataset appears to be between 20 and 30 duration quantiles.
The authors also conducted online A/B tests in Kuaishou's production system, finding:
0.57% improvements in total watch time compared to WLR, and
0.75% improvements in total watch time compared to VR.
If you think about it, the results are perhaps a bit counter-intuitive at first. We’ve removed a bias in favor of long videos, yet the total watch time is increasing instead of decreasing - how can this be? The explanation is that even though the recommended videos in the candidate arm of the A/B test are shorter, users watch more of them, which more than compensates for their shorter duration, resulting in longer watch times overall.
DVR (WeChat, 2022)
DVR, short for Debiased Video Recommendation, came out just a few weeks after D2Q. Like D2Q, the key idea in the DVR algorithm is to replace biased watch times with unbiased proxies for watch times. However, unlike D2Q, DVR transforms watch times not into quantiles but into standardized scores: how far a watch time deviates from the typical watch time for videos of that length.
Here’s how it works: first, we divide all videos into buckets separated by 1 second: one bucket for all videos 0-1s long, one for all videos 1-2 seconds long, and so on. Then, we measure the mean and standard deviation of watch times within each bucket. Now we can transform watch time into watch time gain (WTG), defined as
WTG = (w - μ) / σ

where w is the observed watch time, and μ and σ are the mean and standard deviation of watch times within the video's duration bucket. In other words, WTG standardizes the watch time: it measures how unusual a given watch time is, relative to typical watch times for videos of that duration.
In practice, this transformation could look as follows:
(training example 1)
video duration = 100s --> μ=70s, σ=10s
watch duration = 60s --> WTG=-1 (low)
(training example 2)
video duration = 100s --> μ=70s, σ=10s
watch duration = 90s --> WTG=2 (high)
...
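In code, the WTG transformation could look roughly like this on synthetic data; the 1-second bucketing follows the description above, while the toy data and column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
duration = rng.uniform(1, 120, n)                              # video length in seconds
watch = np.minimum(rng.exponential(0.5 * duration), duration)  # observed watch time

df = pd.DataFrame({"duration": duration, "watch_time": watch})

# 1-second-wide duration buckets: [0-1s), [1-2s), ...
df["bucket"] = df["duration"].astype(int)

# Mean and standard deviation of watch time within each bucket.
stats = df.groupby("bucket")["watch_time"].agg(mu="mean", sigma="std")
df = df.join(stats, on="bucket")

# Watch time gain: how many standard deviations above or below the typical
# watch time for videos of this duration.
df["wtg"] = (df["watch_time"] - df["mu"]) / df["sigma"]
```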
The plot below shows watch time as a function of video duration on WeChat Channels data: longer videos generate longer average watch times, but also larger variance in watch times.
In experiments, the authors show that DVR improves predictive performance across a number of backbone recommender architectures. For example, using the DeepFM backbone, the authors report an improvement in WTG@10 (the ground-truth WTG averaged over the top 10 recommended items) of 12.3% over the naive baseline of simply predicting watch times directly.
It would have been nice to see an A/B test showing how well DVR performs in production, as well as a comparison against competing algorithms such as Kuaishou’s D2Q, but unfortunately the paper doesn’t include these datapoints — perhaps because the authors were under publication pressure given the release of D2Q.
D2Co (Kuaishou, 2023)
Fast forward a year, and Kuaishou strikes back with D2Co, short for “Debiased and De-noised watch time Correction”. The key idea behind this work is that observed watch times are not just biased in favor of longer videos, but also noisy due to the fact that users may blindly trust the recommendations shown to them: even a completely irrelevant video may get several seconds of watch time until the user realizes that the algorithm has made a mistake.
One interesting finding that supports this idea is that the watch time distribution for videos of a fixed length appears to be bimodal, with the first mode peaking near 0 and the second mode peaking near the video duration, corresponding to uninterested and interested users, respectively. The width of that first peak is evidence for the noisy watching hypothesis: users that aren’t interested don’t know right away that they aren’t interested. It can take them several seconds to figure out that the video isn’t for them.
Formally, D2Co decomposes the distribution of observed watch times as

Pr(W = w | x, d) = Pr(W = w | d, R = 1) · Pr(R = 1 | x) + Pr(W = w | d, R = 0) · Pr(R = 0 | x)

where
x is the set of features (user id, video id, user engagement histories, dense features, etc.),
w is the watch time,
d is the video duration,
R denotes whether the candidate video in the training data is relevant (1) or irrelevant (0). In the context of this work, the authors define R as a video having a watch time of at least 18 seconds or the entire duration of the video, whichever is shorter.
The key idea behind this equation is to explicitly model the noise term (with R=0) and the bias term (with R=1). Then, if we assume that Pr(W=w|d,R=1) and Pr(W=w|d,R=0) are two Gaussian distributions, we can estimate them with a two-component Gaussian Mixture model fit using the EM algorithm. Mathematically, this means that watch times for videos users find relevant follow one Gaussian distribution, while watch times for videos users do not find relevant follow another.
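As a sketch of this step, here's how one could fit such a two-component mixture to the watch times of a single duration bucket with scikit-learn; the toy data and the interpretation of the two components as "noise" and "interest" are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy watch times for one duration bucket (say, ~60s videos): a "noise" mode
# near zero (users who swipe away after a few seconds) and an "interest" mode
# near the full video duration.
noise = np.abs(rng.normal(loc=3, scale=2, size=700))
interest = rng.normal(loc=55, scale=8, size=300)
watch_times = np.concatenate([noise, interest]).reshape(-1, 1)

# Fit a two-component Gaussian mixture via EM, one component per mode.
gmm = GaussianMixture(n_components=2, random_state=0).fit(watch_times)

print(gmm.means_.ravel())   # approximate centers of the two modes
print(gmm.weights_)         # estimated share of views in each mode
# gmm.predict_proba(watch_times) gives, per impression, posterior membership
# probabilities for the two components.
```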
Ultimately, D2Co beats both DVR and D2Q on two different production datasets and across three different recommender backbones: FM, DeepFM, and AutoInt. For example, on a Kuaishou dataset, D2Co beats D2Q by 0.25% and DVR by 0.18% in GAUC (grouped AUC, i.e. the weighted average of AUC over all duration groups, where the weights are the group sizes).
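For reference, here's a minimal sketch of GAUC as defined above; it assumes binary relevance labels and per-example duration-group ids, both of which are hypothetical simplifications here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(y_true, y_score, groups):
    """Grouped AUC: AUC computed per group (here, per duration group),
    averaged with weights proportional to group size."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    weighted_sum, total = 0.0, 0
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            continue  # AUC is undefined for single-class groups
        weighted_sum += mask.sum() * roc_auc_score(y_true[mask], y_score[mask])
        total += mask.sum()
    return weighted_sum / total if total else float("nan")
```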
D2Co was another big win for Kuaishou, and showed for the first time the importance of modeling both the bias and the noise term in watch time prediction.
Summary
In the absence of clicks, watch time has become the new main proxy for video relevance. However, watch times are biased by video duration: longer videos tend to generate longer watch times, whether or not they're relevant to the user.
D2Q debiases the model by replacing watch times with watch time quantiles that are computed per video duration quantile. In A/B tests, the authors find 0.75% improvement in total watch time compared to predicting watch times directly.
DVR debiases the model by replacing watch times with watch time gain, i.e. the coefficient of variation of watch time within a video duration bucket. The authors report an improvement in WTG@10 of 12.3% over the naive baseline of simply predicting watch times directly.
D2Co, one of the latest debiasing approaches, explicitly models both the duration bias term and the noise term, the latter of which is caused by users not immediately swiping away an irrelevant video. The resulting Gaussian Mixture model outperforms both D2Q and DVR on two different production datasets.
Duration bias in short-form video recommendation is still a relatively new and underexplored problem, and we're just beginning to understand its importance. At a more fundamental level, this is not just an algorithmic problem but also a psychological one: if humans acted perfectly rationally — swiping away irrelevant videos immediately, say — this would be a non-issue.
Ultimately, then, studies such as the ones we’ve seen here are not only telling us something about algorithms - but also about ourselves.