TrueSkill
TrueSkill is a skill-based ranking system developed by Microsoft for use with video game matchmaking on the Xbox network. Unlike the popular Elo rating system, which was initially designed for chess, TrueSkill is designed to support games with more than two players. In 2018, Microsoft published details about an extended version of TrueSkill, named TrueSkill2.
It is based on a Thurstonian model with a Gaussian score distribution. It does not satisfy Luce's Choice Axiom.
History and use
TrueSkill was developed by researchers at Microsoft Research as a Bayesian skill rating system designed for online games in which players often compete in teams and matches may involve more than two players or teams. In their first published description of the system, Ralf Herbrich, Tom Minka and Thore Graepel evaluated the method using match data collected by Bungie Studios during the beta test of Halo 2 and described its operation in the Xbox Live service.Christopher Bishop later wrote that TrueSkill was deployed on Xbox Live in 2005 and has operated continuously since then, processing millions of game outcomes per day. Herbrich and colleagues reported that, as of September 2005, Xbox Live had over 2 million subscribed users and that the Xbox 360 Live service used TrueSkill for automatic player rating and matchmaking, processing hundreds of thousands of games per day.
In a survey of ranking systems for games and assessments, Educational Testing Service researchers described TrueSkill as a generalization of Elo-style ratings for multiplayer and team competitions and noted its use in Microsoft's Xbox online games. Later research from Microsoft proposed revisions and extensions to the original model, including TrueSkill2.
Statistical model
In the model described by Herbrich, Minka and Graepel, each player's skill is treated as an unobserved quantity, and the system maintains a probability distribution over that skill. The prior over player skills is assumed to factorise across players, with each individual skill modelled as a Gaussian distribution with mean and variance representing the system's uncertainty about that player's skill.Match outcomes are modelled as arising from noisy performances. For a given game, each player generates a latent performance value that is normally distributed around their skill, with a fixed performance variance parameter :
The parameter controls how strongly a single game's result is expected to vary around underlying skill.
For games with teams, the model assumes an additive team-performance structure: the latent performance of team is the sum of the performances of its members,, where is the set of players assigned to that team. A match result is represented as a ranking of the teams, and the likelihood of the observed ranking is defined in terms of inequalities between the corresponding team-performance variables.
Draws are incorporated by introducing a draw margin. For two adjacent teams in the ranking, a win is modelled by requiring the higher-ranked team to exceed the lower-ranked team by more than, while a draw is modelled by requiring their performance difference to lie within. Herbrich and colleagues related to an assumed or empirically estimated draw probability, allowing the draw rate to be calibrated for a given game mode.
Inference and rating updates
TrueSkill represents the joint model of skills, performances and observed outcomes as a factor graph. Inference is then framed as the computation of approximate single-variable marginals by message passing using the sum-product algorithm. For a single match, most of the messages in the graph can be represented as one-dimensional Gaussians, which makes the update computationally efficient.A key difficulty is that the outcome constraints introduce non-Gaussian terms. In the factor graph, these appear as comparison factors that impose inequalities on performance differences. The corresponding exact messages are non-Gaussian; the TrueSkill paper applies expectation propagation to approximate these messages by moment matching, replacing the intractable truncated-Gaussian forms with Gaussian approximations that have the same mean and variance. Because the comparison messages are approximate, the algorithm iterates message updates along paths connecting the affected variables until the approximate marginals stabilise.
After inference for a match, each player's updated marginal remains Gaussian and can be summarised by an updated mean and standard deviation for use in later matches. The original system also includes a dynamics variance parameter to model skill changes over time between games.
Parameterisation and rating display
TrueSkill maintains a Gaussian belief distribution for each player's skill, but practical deployments must also choose a numerical scale and values for model parameters such as the performance noise and the rate at which skills are allowed to drift over time. Herbrich and colleagues reported that Xbox Live used an initial prior scale with and, corresponding to a prior in which negative skills are unlikely.In the same deployment description, the variance of a single-game performance was set relative to the prior uncertainty as, and the between-game dynamics variance was set to.
For player-facing displays such as leaderboards, Xbox Live displayed a conservative estimate rather than the posterior mean alone. The TrueSkill paper described displaying a player's skill as the 1% lower quantile of the belief distribution,. Herbrich and colleagues wrote that this choice was intended to ensure that top positions in leaderboards were held by players who were both highly rated and estimated with high certainty, and noted that the prior implied an initial displayed value of.
Worked example
The following worked example illustrates how TrueSkill updates ratings after a three-player free-for-all match, using the model and win-update equations described by Herbrich, Minka and Graepel. In TrueSkill, each player has a skill belief distribution, where is the mean skill estimate and is the uncertainty.After a match, the system combines the prior belief with the match result to obtain a new belief distribution. The posterior mean is the mean of that updated distribution, i.e., the system's best single-number estimate of the player's skill given the evidence so far.
In the original Xbox Live scale described in the TrueSkill paper, a common choice for the performance noise is. This example uses that value and assumes no draws.
Displayed rating
In some deployments, a conservative skill estimate is displayed rather than the posterior mean. The TrueSkill paper describes displaying the 1% lower quantile of the belief distribution, which for a Gaussian distribution is approximately. In plain terms, this is a value that the system expects the player's true skill to exceed about 99% of the time, given the current uncertainty. As a result, a player with a high mean but large uncertainty will have a lower displayed rating than a similarly rated player whose skill is estimated more confidently.Setup
Three players enter a match with the following prior ratings:| Player | Displayed | ||
| Alice | 30 | 6 | 12 |
| Ben | 25 | 7 | 4 |
| Chloe | 20 | 8 | −4 |
The game finishes with Chloe in first place, Alice in second place, and Ben in third place.
How a three-player result is represented
For a strict ranking with no ties, the factor-graph formulation represents the result using adjacent comparison constraints. There is no separate "Chloe beats Ben" factor in this minimal representation because it is implied by the ranking: if Chloe is above Alice and Alice is above Ben, then Chloe is above Ben.In the full TrueSkill algorithm, message passing in the factor graph uses both adjacent constraints together to update all three players in a single inference problem. The step-by-step calculations below apply the standard two-player win update to each adjacent comparison in sequence to make the mechanics transparent; Chloe's advantage over Ben is still reflected through the chain, because Chloe's win pushes Alice down, and Alice's win pushes Ben down.
Win update equations (two players)
For a winner and loser, defineLet and be the standard normal probability density and cumulative distribution functions. For a win, define
The mean updates are
and the uncertainty decreases according to
Step 1: Chloe beats Alice
Using Chloe as winner and Alice as loser, with :- ,
- ,
- Chloe: increases to ; decreases to
- Alice: decreases to ; decreases to
Step 2: Alice beats Ben
The second comparison uses Alice's updated rating from step 1, with Ben unchanged so far. Alice beats Ben :- ,
- ,
- Alice: increases to ; decreases to
- Ben: decreases to ; decreases to
Ratings before and after the match
In this example, Chloe's first-place finish requires her performance to exceed Alice's, and Alice's to exceed Ben's, so the inferred skill distributions shift accordingly. Chloe's large increase in reflects that her win over a substantially higher-rated opponent was unexpected under the prior ratings, while the decreases in reflect that the match provides additional information about the players' relative skills.TrueSkill2
In 2018, Microsoft researchers published TrueSkill2 as an extension of the original TrueSkill model that incorporates additional information available in some online games, beyond the final win–loss ordering. The paper describes using signals such as player experience, squad membership, individual statistics, quitting behaviour, and performance in other game modes, with the aim of improving the accuracy of inferred skills for matchmaking.The authors described TrueSkill2 as retaining the same interface as classic TrueSkill, a single-number skill rating intended to remain compatible with existing matchmaking systems, while changing how skills are inferred from historical data. They described automatic parameter estimation from batches of historical matches, and two operating modes: an online mode that propagates skills forward in time and a batch mode that infers parameters and skills over all times.
Compared with classic TrueSkill, the paper lists model changes including: incorporating individual player statistics alongside team win/loss; treating mid-game quitting as a surrender for rating purposes; borrowing information across game modes by modelling skills as correlated between modes; and modelling skill evolution as biased towards improvement, especially in a player's early matches in a mode. It adds an explicit squad effect, modelling players who queue together as performing better than they would individually.
On match data from Halo 5, the authors reported that TrueSkill2 improved prediction of historical match outcomes, giving 68% accuracy on their evaluation compared with 52% for the original TrueSkill system.
Related models and extensions
TrueSkill is part of a broader family of statistical rating systems used to infer relative strength from match outcomes. The Elo rating system, originally developed for chess, models win probabilities from rating differences and updates ratings after games based on the difference between observed and expected results.Glickman later proposed dynamic paired-comparison models that incorporate uncertainty in a player's estimated strength and allow that uncertainty to change over time, motivated in part by limitations of Elo-style updates for large competitive populations. In a survey of ranking systems used in games and assessments, ETS researchers described TrueSkill as an Elo-style generalisation for multiplayer and team competitions that maintains uncertainty estimates alongside point ratings.
Some related probabilistic models instead treat rankings as samples from a distribution that satisfies Luce's choice axiom. Guiver and Snelson note that a Thurstonian model with independent score variables induces a Plackett–Luce distribution if and only if the scores follow a Gumbel distribution, and they contrast this with TrueSkill, which uses Gaussian score variables. They state that TrueSkill's Gaussian-score model does not satisfy Luce's choice axiom, while noting its use in large-scale online rating systems.
Several research extensions build on TrueSkill's probabilistic formulation. Dangauthier, Herbrich, Minka and Graepel proposed TrueSkill Through Time, which replaces the original filtering-style updates with smoothing over a time series of player skills, allowing retrospective estimation of skill trajectories while retaining uncertainty estimates and an explicit draw model.