Keeping Elo alive: Evaluating and improving measurement properties of learning systems based on Elo ratings
The Elo rating system, which originates from competitive chess, has been widely utilised in large-scale online educational applications, where it is used for on-the-fly estimation of ability, item calibration, and adaptivity. In this paper, we critically analyse the shortcomings of the Elo rating system in an educational context, shedding light on its measurement properties and on when these may fall short of precisely and reliably capturing student abilities and item difficulties. In a simulation study, we examine the asymptotic properties of the Elo rating system. Our results show that Elo ratings are generally not unbiased and that their variances are context-dependent. Furthermore, in scenarios where items are selected adaptively based on the current ratings and the item difficulties are updated alongside the student abilities, the variance of the ratings across items and students artificially increases over time and, as a result, the ratings do not converge. We propose a solution to this problem that uses two parallel chains of ratings, which removes the dependence of item selection on the current errors in the ratings.
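To make the setting concrete, the following is a minimal sketch of how an Elo-style update is commonly applied to a student rating and an item rating after a single response. It is an illustration under stated assumptions, not the paper's implementation: the logistic (Rasch-type) expected score, the step size `k`, the starting ratings of zero, and the helper names `expected_score`, `elo_update`, and `simulate` are all hypothetical choices made here. Items are drawn at random rather than adaptively, and the two-chain remedy proposed in the abstract is not implemented, since the abstract does not specify its details.

```python
import math
import random


def expected_score(theta, beta):
    """Probability of a correct response under an assumed logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def elo_update(theta, beta, correct, k=0.3):
    """Return the updated (student rating, item rating) after one response.

    `correct` is 1 for a correct answer and 0 otherwise; `k` is an assumed
    constant step size.
    """
    surprise = correct - expected_score(theta, beta)
    # The student moves towards the observed outcome, the item moves the
    # opposite way, so the sum of the two ratings is unchanged.
    return theta + k * surprise, beta - k * surprise


def simulate(true_theta, true_betas, n_responses, k=0.3, seed=0):
    """Simulate one student answering randomly selected items.

    Responses are generated from the assumed logistic model using the true
    parameters; the ratings start at zero and are updated after each response.
    """
    rng = random.Random(seed)
    theta_hat = 0.0
    beta_hats = [0.0] * len(true_betas)
    for _ in range(n_responses):
        j = rng.randrange(len(true_betas))  # random, non-adaptive selection
        p_true = expected_score(true_theta, true_betas[j])
        correct = 1 if rng.random() < p_true else 0
        theta_hat, beta_hats[j] = elo_update(theta_hat, beta_hats[j], correct, k)
    return theta_hat, beta_hats


if __name__ == "__main__":
    theta_hat, beta_hats = simulate(1.0, [-1.0, 0.0, 1.0], n_responses=500)
    print(theta_hat, beta_hats)
```

Because every update adds to one rating exactly what it subtracts from the other, the ratings in this sketch are only identified up to their common sum; they fluctuate around a stationary distribution rather than settling on a fixed value, which is why the asymptotic bias and variance discussed in the abstract are the natural quantities to study.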

