
Are Large Language Models Effective in Clinical Trial Design? A Study on Baseline Feature Generation

Abstract

In clinical trial design, baseline feature selection is a crucial task for characterizing study cohorts and ensuring accurate study outcomes. Large Language Models (LLMs) show promise in automating this process by analyzing trial data and identifying key features. To assess the capabilities of LLMs in generating appropriate baseline features for clinical trials, we create two datasets: CT-Repo, which contains baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and CT-Pub, a curated subset of 100 clinical trials with more detailed baseline features extracted from published studies. In this paper, we consider the GPT-4o and LLaMa3-70B-Instruct models in three configurations: zero-shot, three-shot with a fixed set of examples, and three-shot with an adaptive set of examples selected via a Retrieval-Augmented Generation (RAG) approach. We evaluate model performance on baseline feature generation using the LLM-as-a-Judge framework, and we further validate this evaluation on the CT-Pub dataset against assessments from human clinical-trial experts. The results indicate that the RAG-based three-shot approach significantly improves performance by providing relevant, context-specific examples. This study marks an important initial step toward using LLMs for the robust design of clinical trials and observational studies.
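As a concrete illustration of the RAG-based three-shot configuration described above, the following Python sketch retrieves the three trials most similar to a query trial and assembles them into a few-shot prompt. The paper does not specify its retrieval implementation; the TF-IDF retriever and the function names (select_examples, build_prompt) are illustrative assumptions, not the authors' actual pipeline.

# Minimal sketch of RAG-based three-shot example selection.
# TF-IDF cosine similarity stands in for whatever embedding
# model the study actually used; all names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(query_trial: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus trials most similar to the query trial."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [query_trial])
    # Last row is the query; compare it against every corpus trial.
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = sims.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(query_trial: str, examples: list[str]) -> str:
    """Assemble a three-shot prompt from the retrieved example trials."""
    shots = "\n\n".join(
        f"Example trial:\n{ex}\nBaseline features: <features>" for ex in examples
    )
    return (
        f"{shots}\n\nNow list appropriate baseline features for this trial:\n"
        f"{query_trial}"
    )

In practice, a dense embedding model could replace the TF-IDF retriever without changing the prompt-assembly step, which is what makes the adaptive examples context-specific for each query trial.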

Related Work