How Accurate is an 8Realms Rating?

In short, it depends.

The 8Realms system uses a sample of matches against players of varying skill levels to establish a player’s ranking, based on Elo ratings. While this system is widely used because of its ability to dynamically adjust based on player performance and relative skill levels, it is important to understand that an Elo rating is not considered ‘true’ or reliable until a player has participated in a significant number of games.

This post explores the mathematical foundations behind this threshold, illustrating why at least 25 games are necessary for an 8Realms rating to be deemed ‘true’.

The Importance of Sample Size

One of the critical factors affecting the reliability of Elo ratings is the number of games played. When a player has played only a few games, their rating is highly volatile and can swing dramatically with each result. This volatility occurs because the system has limited information about the player’s true skill level.

Mathematically, the volatility in rating adjustments can be illustrated using the concept of variance. The variance of a player’s rating after n games can be expressed as

Variance = (d² / n)

where d represents the average deviation from the expected outcome. When n is small, as it is in a player’s first few games, the variance is high, which means the rating is a much less precise estimate of skill. Think of this as the difference between measuring something with a ruler, which has lots of markings (a high n), and measuring the same thing with your own foot, which you know is roughly a foot long (a much lower n). If you measure the length of a car with your foot, you will probably get roughly the right answer, but your confidence in that measurement will be lower.
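To put rough numbers on this, here is a small Python sketch that tabulates the formula above. It is an illustration only: d = 200 rating points is an assumed figure picked for readability, not an 8Realms parameter, and the function name rating_variance is hypothetical.

```python
import math


def rating_variance(d: float, n: int) -> float:
    """Variance of a rating after n games, using Variance = d^2 / n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return d ** 2 / n


if __name__ == "__main__":
    d = 200.0  # assumed average deviation, for illustration only
    for n in (1, 5, 10, 25, 50):
        variance = rating_variance(d, n)
        # The square root of the variance is the typical error in rating points.
        spread = math.sqrt(variance)
        print(f"n={n:>2}  variance={variance:>8.1f}  spread={spread:>6.1f}")
```

Because the spread is the square root of the variance, it shrinks with the square root of n: going from 1 game to 25 cuts it by a factor of five, while going from 25 to 50 only trims it by a further 30% or so.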

The 25-Game Threshold

The threshold of 25 games is derived from empirical observations and statistical analysis within the chess community and other domains which already use Elo ratings. Here’s a simplified breakdown:

Initial Phase (0-10 games): During this phase, the player’s rating is highly unstable. A few good or bad results can cause significant shifts, leading to an unreliable estimate of their skill level.
Transitional Phase (10-25 games): The rating begins to stabilise as more data points (game results) are collected. Each additional game reduces the variance, making the rating progressively more reliable.
Stabilised Phase (25+ games): By the time a player has participated in about 25 games, their rating reaches a level of stability where individual game results cause smaller adjustments. The variance is significantly lower, indicating a more accurate representation of the player’s true skill level.

To understand how 8Realms accounts for this, consider two players, Alarielle and Belakor, each with an initial rating of 1500. Suppose Alarielle wins against Belakor in their first game. With a high K-factor (commonly 40 for new players), the rating change for Alarielle would be significant:

Expected score for Alarielle (E_A) = 0.5

Actual score for Alarielle (S_A) = 1

New rating for Alarielle (R’_A) = 1500 + 40(1 – 0.5) = 1520

New rating for Belakor (R’_B) = 1500 – 40(1 – 0.5) = 1480
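The arithmetic above can be reproduced with a minimal Python sketch of the standard Elo update. The logistic expected-score formula with a 400-point scale is the conventional chess form; whether 8Realms uses exactly this scale, and the same K for both players, is an assumption here, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 40.0):
    """Return (new_a, new_b) after one game; score_a is 1, 0.5 or 0.

    Assumes both players share the same K-factor.
    """
    e_a = expected_score(rating_a, rating_b)
    change = k * (score_a - e_a)
    # B's score and expectation mirror A's, so B moves by the opposite amount.
    return rating_a + change, rating_b - change


if __name__ == "__main__":
    # Alarielle (1500) beats Belakor (1500) with K = 40.
    new_a, new_b = update_ratings(1500, 1500, 1)
    print(new_a, new_b)
```

With two equally rated players the expected score is exactly 0.5, so the winner gains K/2 and the loser drops by the same amount.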

After just one game, each rating has moved by 20 points, opening a 40-point gap between two players who started level. If Alarielle continues to play and accumulate results, the changes will gradually diminish in magnitude as more data points provide a clearer picture of their skill. By the 25th game, the rating change per game is much smaller, reflecting the stabilising effect of a larger sample size.
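One way to see the per-game changes shrink is to follow the average case. The Python sketch below is an illustration under stated assumptions rather than the 8Realms implementation: it supposes a hypothetical player whose actual strength is 1700 but who starts at 1500, keeps K fixed at 40, pits them against 1500-rated opponents every game, and replaces each random result with its long-run average score. The function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def average_changes(start: float, true_strength: float, opponent: float,
                    games: int, k: float = 40.0) -> list:
    """Per-game rating changes when every result equals the average score."""
    rating = start
    # Average score the player actually achieves, set by their real strength.
    average_score = expected_score(true_strength, opponent)
    changes = []
    for _ in range(games):
        # The update compares the achieved score with what the current rating predicts.
        change = k * (average_score - expected_score(rating, opponent))
        rating += change
        changes.append(change)
    return changes


if __name__ == "__main__":
    changes = average_changes(1500, 1700, 1500, games=25)
    for game in (1, 10, 25):
        print(f"game {game:>2}: average change {changes[game - 1]:+.1f}")
```

As the rating climbs towards the player's actual strength, the gap between achieved and predicted score narrows, so the average adjustment fades. Individual wins and losses still move the rating up and down around this trend; the sketch only shows the systematic drift.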

Conclusion

8Realms Ratings are a powerful tool for assessing player skill, but below a 25-game threshold the variance is too high for the rating to be considered ‘true’. As players reach and surpass the 25-game mark, their ratings stabilise and provide a more reliable measure of their performance. This threshold is a crucial consideration for players, TOs and stats nerds who want to understand how 8Realms ratings reflect player skill in the UK AOS competitive scene.