
Keeping Elo alive: Evaluating and improving measurement properties of learning systems based on Elo ratings

The Elo rating system, which originates from competitive chess, has been widely used in large-scale online educational applications for on-the-fly estimation of ability, item calibration, and adaptivity. In this paper, we critically analyse the shortcomings of the Elo rating system in an educational context, shedding light on its measurement properties and on when these may fall short in precisely and reliably capturing student abilities and item difficulties. In a simulation study, we examine the asymptotic properties of the Elo rating system. Our results show that Elo ratings are generally not unbiased and that their variances are context-dependent. Furthermore, when items are selected adaptively based on the current ratings and the item difficulties are updated alongside the student abilities, the variance of the ratings across items and students artificially increases over time and, as a result, the ratings do not converge. We propose a solution to this problem: two parallel chains of ratings, which remove the dependence of item selection on the current errors in the ratings.
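For readers unfamiliar with how Elo ratings are applied to students and items, the following is a minimal sketch of one common formulation, not necessarily the exact variant studied in the paper. It assumes a logistic (Rasch-type) expected score and a constant step size; the names `expected_correct`, `elo_update`, `simulate`, and the value `k=0.3` are illustrative assumptions. Items are drawn at random here, so the sketch does not reproduce the adaptive item selection or the two parallel chains discussed above.

```python
import math
import random


def expected_correct(theta, beta):
    """Assumed logistic probability that a student with rating theta
    answers an item with difficulty rating beta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def elo_update(theta, beta, correct, k=0.3):
    """One Elo step: move both ratings by k * (observed - expected).

    correct is 1 for a correct response and 0 otherwise.
    k is a hypothetical constant step size.
    """
    delta = k * (correct - expected_correct(theta, beta))
    # The student rating rises and the item rating falls by the same amount.
    return theta + delta, beta - delta


def simulate(true_thetas, true_betas, n_responses, k=0.3, seed=0):
    """Simulate responses with random (non-adaptive) student-item pairing
    and return the final student and item ratings, all started at 0."""
    rng = random.Random(seed)
    thetas = [0.0] * len(true_thetas)
    betas = [0.0] * len(true_betas)
    for _ in range(n_responses):
        s = rng.randrange(len(true_thetas))
        i = rng.randrange(len(true_betas))
        # Response generated from the true (simulated) parameters.
        p_true = expected_correct(true_thetas[s], true_betas[i])
        correct = 1 if rng.random() < p_true else 0
        thetas[s], betas[i] = elo_update(thetas[s], betas[i], correct, k)
    return thetas, betas
```

Because every update adds to the student rating exactly what it subtracts from the item rating, the sum of all ratings stays constant in this sketch. Replacing the random choice of `i` with a choice that depends on the current ratings would give the adaptive setting in which the abstract reports non-convergence.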

Center for Open Science

Maria Bolsinova, Bence Gergely, Matthieu J. S. Brinkhuis

2024

