Developing artificial intelligence tools for institutional review board pre-review: A pilot study on ChatGPT’s accuracy and reproducibility
Abstract

This pilot study is the first phase of a broader project aimed at developing an explainable artificial intelligence (AI) tool to support the ethical evaluation of Japanese-language clinical research documents. The tool is explicitly not intended to assist document drafting. We assessed the baseline performance of generative AI—Generative Pre-trained Transformer (GPT)-4 and GPT-4o—in analyzing clinical research protocols and informed consent forms (ICFs). The goal was to determine whether these models could accurately and consistently extract ethically relevant information, including the research objectives and background, research design, and participant-related risks and benefits. First, we compared the performance of GPT-4 and GPT-4o using custom agents developed via OpenAI’s Custom GPT functionality (hereafter “GPTs”). Then, using GPT-4o alone, we compared outputs generated by GPTs optimized with customized Japanese prompts to those generated by standard prompts. GPT-4o achieved 80% agreement in extracting research objectives and background and 100% agreement in extracting research design, and both models demonstrated high reproducibility across ten trials. GPTs with customized prompts produced more accurate and consistent outputs than those with standard prompts. This study suggests the potential utility of generative AI in pre-institutional review board (IRB) review tasks; it also provides foundational data for future validation and standardization efforts involving retrieval-augmented generation and fine-tuning. Importantly, this tool is intended not to automate ethical review but rather to support IRB decision-making. Limitations include the absence of gold-standard reference data, reliance on a single evaluator, the lack of convergence and inter-rater reliability analysis, and the inability of AI to substitute for in-person elements such as site visits.

Author summary

In this pilot study, we examined whether ChatGPT could accurately and consistently extract key elements from clinical research protocols and informed consent forms written in Japanese. These elements include the research objectives and background, research design, and participant-related risks and benefits, all essential for ethical review. This study is part of a larger project aiming to develop an explainable AI tool to assist institutional review board (IRB) members in evaluating clinical research documents, but not to automate ethical review. As a prerequisite for implementing advanced technologies such as document retrieval and task-specific adaptation, this study evaluated the basic capabilities of current large language models. We compared standard and customized prompts for GPT-4 and GPT-4o and found that GPT-4o, when guided by customized prompts, produced more accurate and consistent outputs. However, study limitations remain, including the absence of gold-standard reference data, reliance on a single evaluator, and the lack of convergence and inter-rater reliability analysis. This study offers important early insights into how generative AI could assist the ethical evaluation of clinical research documents written in Japanese.