Large Language Models Encode Radiation Oncology Domain Knowledge: Performance on the American College of Radiology Standardized Examination

Introduction: The integration of large language models (LLMs) into medical education represents a significant paradigm shift, offering transformative potential in how medical knowledge is accessed and assimilated. However, these models have not yet been systematically trained or validated on complex subspecialty medical examinations. This study explores the performance of seven major LLMs in radiation oncology.

Materials and Methods: The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Examination (TXIT) was used to evaluate the performance of seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. The ACR provided publicly available national scoring for this examination. The examination comprised 300 questions across four major domains: clinical, biology, physics, and statistics. The examination was submitted to each LLM through its application programming interface (API). LLM-generated answers were analyzed by domain and compared with radiation oncology trainee performance. The total costs of token inputs and outputs were aggregated and analyzed.

Results: LLMs showed varied performance, with OpenAI's GPT-4-turbo leading at 74.2% correct answers and all three Llama-2 models underperforming (26.2% to 43.3% correct). LLMs generally excelled in the statistics domain (93.0–100%) but were less effective in clinical areas (37.0–68.0%), with the exception of GPT-4-turbo, which performed comparably (68.0%) to upper-level radiation oncology trainees (PGY4–5, 64.1–68.3%) and better than lower-level trainees (PGY2–3, 51.6–61.6%). Notably, GPT-4-turbo demonstrated a 7.0% improvement in the clinical domain over its predecessor, GPT-4. LLMs scored lowest in the gastrointestinal, genitourinary, and gynecology disease sites and highest in bone and soft tissue, central nervous system, and head and neck. Overall costs of LLM inputs and outputs were modest at $2.63 across all seven models.

Conclusion: GPT-4-turbo demonstrates clinical accuracy comparable to that of upper-level trainees and superior to that of lower-level trainees. Score discrepancies across disease site domains may be due to data availability, the complexity of medical conditions, and the quality and quantity of training data sets. Future research will need to evaluate the performance of models that are fine-tuned for clinical oncology. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, necessitating expert oversight in their application in medical education and practice.
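The per-domain scoring and cost aggregation described in the Materials and Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the record layout, the toy questions, the model names, and the per-1,000-token prices are all hypothetical placeholders chosen for illustration, and no real examination content or vendor pricing is used.

```python
from collections import defaultdict


def accuracy_by_domain(questions, answers):
    """Return {domain: percent correct} for one model's answers.

    `questions` is a list of dicts with hypothetical keys "id", "domain",
    and "key" (the correct option). `answers` maps question id to the
    option the model chose; a missing answer counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["domain"]] += 1
        if answers.get(q["id"]) == q["key"]:
            correct[q["domain"]] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}


def total_cost(usage, price_per_1k):
    """Sum input and output token costs (in dollars) across models.

    `usage` maps model name to (input_tokens, output_tokens);
    `price_per_1k` maps model name to (input_price, output_price)
    per 1,000 tokens. All values here are assumed, not real prices.
    """
    cost = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        p_in, p_out = price_per_1k[model]
        cost += tokens_in / 1000 * p_in + tokens_out / 1000 * p_out
    return cost


# Toy data: four made-up questions and one model's made-up answers.
questions = [
    {"id": 1, "domain": "clinical", "key": "A"},
    {"id": 2, "domain": "clinical", "key": "C"},
    {"id": 3, "domain": "statistics", "key": "B"},
    {"id": 4, "domain": "physics", "key": "D"},
]
answers = {1: "A", 2: "B", 3: "B", 4: "D"}
scores = accuracy_by_domain(questions, answers)

# Toy token usage and placeholder prices for two hypothetical models.
usage = {"model-a": (2000, 500), "model-b": (1000, 1000)}
prices = {"model-a": (0.01, 0.03), "model-b": (0.001, 0.002)}
cost = total_cost(usage, prices)
```

Running `accuracy_by_domain` once per model yields a table that can be compared domain by domain with the published trainee scores, and `total_cost` reduces the per-model token counts to a single dollar figure of the kind reported in the Results.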

SAGE Publications

Nikhil G. Thaker, Navid Redjal, Arturo Loaiza-Bonilla, David Penberthy, Tim Showalter, Ajay Choudhri, Shirnett Williamson, Gautam Thaker, Chirag Shah, Matthew C. Ward, Mihir Thaker, Michael Arcaro

AI in Precision Oncology

2024

