Can language models boost the power of randomized experiments without statistical bias?
Introduction to Language Models and RCTs
Imagine unlocking hidden layers of insight within clinical trials—not from more participants, but by harnessing the rich narratives, conversations, and contextual data often overlooked. Randomized controlled trials (RCTs) have long stood as the gold standard for establishing causality in medicine and behavioral sciences, precisely because their randomized design minimizes confounding and bias. Yet, despite their rigor, RCTs face persistent limitations: high costs restrict sample sizes, and crucial information embedded in unstructured data—such as clinical notes, participant feedback, or transcripts—remains vastly underutilized. This is where the transformative potential of advanced artificial intelligence, specifically large language models (LLMs), enters the scene.
LLMs, trained on diverse and massive datasets, excel at interpreting unstructured text, capturing subtle semantic relationships and domain knowledge that traditional statistical methods typically miss. By integrating these AI-driven insights into RCT analyses, we can enrich causal inference without compromising the stringent standards of statistical validity. The Causal Analysis leveraging Language Models (CALM) framework embodies this vision. CALM treats predictions from LLMs as auxiliary prognostic signals and strategically calibrates them to correct potential biases inherent in AI outputs. This approach not only enhances precision but also maintains consistency, making it a promising solution to overcome the constraints of conventional trial analyses.
The opportunities are substantial. Unstructured data have been shown to be prognostic of outcomes in RCTs across oncology, mental health, dementia, and healthcare delivery [45], often exceeding the predictive power of traditional structured variables. By tapping into this wealth of information through CALM, researchers can improve statistical power, detect treatment effect heterogeneity with greater confidence, and ultimately tailor interventions more effectively. But this integration is not without challenges: LLMs can introduce bias, and naively plugging AI predictions into an analysis can erode the validity of inference. CALM navigates these pitfalls with a mathematically principled calibration and weighting strategy that adapts selectively to where LLM predictions are most reliable.
As we embark on this exploration, the coming sections will dissect the technical foundations of CALM, demonstrate its robust theoretical guarantees, and showcase its empirical prowess via simulation and case studies. The path forged by CALM suggests a future where RCTs are not only rigorous but also richer in insight—propelled by the synergy of human design and artificial intelligence. To understand the full scope and promise, dive into the foundational concepts ahead and see how language models can revolutionize causal analysis in experimental research.
CALM Framework Overview
At the heart of enhancing randomized controlled trials with artificial intelligence lies the CALM framework—an innovative methodology that integrates large language model (LLM) predictions into causal effect estimation while rigorously preserving statistical validity. Unlike conventional approaches that either ignore unstructured data or incorporate it naively, CALM leverages LLMs’ ability to extract nuanced signals from free-text, voice transcripts, or other rich modalities. This enables researchers to tap into prognostic information often buried within unstructured covariates, dramatically refining the precision of treatment effect estimates.
The mechanics of CALM hinge on two primary modes of LLM utilization: zero-shot and few-shot learning. In the zero-shot scenario, LLMs generate predictions based solely on structured and unstructured covariate inputs, relying entirely on their extensive pretraining without any trial-specific examples. This approach benefits from the broad prior knowledge embedded in the model and yields immediate counterfactual predictions that do not rely on any trial data. Conversely, few-shot learning introduces a small set of demonstrative examples from the trial data into the LLM prompts. This subtle but powerful adaptation tailors the model's predictions toward the specific trial context, often improving predictive accuracy and alignment with observed outcome patterns. However, few-shot predictions are inherently correlated across subjects and sensitive to the choice of examples, challenges that CALM addresses through a novel resampling and aggregation strategy to restore statistical independence and robustness.
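To make the two prediction modes concrete, here is a minimal sketch of how they might be implemented. The prompt templates, the `query_llm` helper, and the choice to average over resampled demonstration sets are hypothetical placeholders, not the paper's actual implementation.

```python
import random
from statistics import mean

def query_llm(prompt: str) -> float:
    """Hypothetical wrapper around an LLM API call that returns a numeric
    outcome prediction parsed from the model's response."""
    raise NotImplementedError  # plug in your own LLM client here

def zero_shot_prediction(covariates: str, arm: str) -> float:
    # Zero-shot: the prompt contains only the subject's covariates and the
    # arm of interest; no trial examples are shown to the model.
    prompt = (
        f"Subject profile:\n{covariates}\n"
        f"Predict the outcome if this subject receives: {arm}."
    )
    return query_llm(prompt)

def few_shot_prediction(covariates: str, arm: str,
                        trial_examples: list[tuple[str, str, float]],
                        n_demos: int = 5, n_resamples: int = 10,
                        seed: int = 0) -> float:
    # Few-shot: prepend a small set of (profile, arm, outcome) demonstrations
    # drawn from the trial. Because the prediction depends on which
    # demonstrations were drawn, the demonstration set is resampled several
    # times and the predictions are aggregated, reducing sensitivity to any
    # single draw.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_resamples):
        demos = rng.sample(trial_examples, n_demos)
        demo_text = "\n".join(
            f"Profile: {x}\nArm: {t}\nObserved outcome: {y}" for x, t, y in demos
        )
        prompt = (
            f"Example subjects from this trial:\n{demo_text}\n\n"
            f"Subject profile:\n{covariates}\n"
            f"Predict the outcome if this subject receives: {arm}."
        )
        preds.append(query_llm(prompt))
    return mean(preds)
```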
A foundational example of CALM’s application is estimating the mean potential outcome 𝔼[Y(t)] under each treatment arm in randomized experiments. CALM treats LLM predictions as auxiliary prognostic variables but calibrates them via residualization and heterogeneous weighting steps, ensuring that any bias from LLMs does not compromise consistency or valid inference. This adaptive calibration selectively borrows strength from AI signals only where they enhance estimation, recognizing that informativeness varies across subject subgroups or covariate strata. Such an approach yields efficiency gains over classical estimators like augmented inverse probability weighting, as demonstrated in both theoretical derivations and simulation studies based on the BRIGHTEN trial.
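To fix ideas, here is a schematic of how a calibrated LLM signal can enter an AIPW-type estimator; the exact residualization and weighting used by CALM may differ, so ĝ_t, r̂_t, and ω̂_t below are illustrative placeholders rather than the paper's definitions. Writing e_t(X) for the known randomization probability of arm t and m̂_t(X) for an outcome model, the classical AIPW estimator of 𝔼[Y(t)] is

$$
\hat{\mu}_t^{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{m}_t(X_i) + \frac{\mathbf{1}\{T_i = t\}}{e_t(X_i)}\bigl(Y_i - \hat{m}_t(X_i)\bigr)\right].
$$

A CALM-style refinement replaces m̂_t with an augmented outcome model that borrows from the LLM prediction f̂_t(X, U) built from structured covariates X and unstructured covariates U,

$$
\tilde{m}_t(X, U) = \hat{g}_t(X) + \hat{\omega}_t(X)\,\bigl(\hat{f}_t(X, U) - \hat{r}_t(X)\bigr),
$$

so that where the calibration weight ω̂_t(X) is near zero the estimator collapses back to the classical form, and where it is large the LLM signal absorbs residual variation and tightens the estimate.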
Common objections to CALM’s implementation often center around the potential introduction of bias through imperfect AI predictions or the complexity of incorporating these models without violating statistical assumptions. Yet, CALM is explicitly designed to be robust to biased predictions. Through carefully constructed estimating equations, cross-fitting, and sample splitting, it guarantees consistency and valid confidence intervals. Moreover, by leveraging the inherent capacity of LLMs to process unstructured data—including free-text responses, interview transcripts, or multimodal inputs—CALM unlocks sources of prognostic information that classical methods cannot harness efficiently.
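The cross-fitting idea behind these guarantees can be sketched very simply (a generic illustration, not CALM's exact procedure): nuisance models are trained on folds that exclude subject i, so subject i's own outcome never enters the fitted values used in its estimating-equation term.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cross_fit_outcome_model(X, Y, T, arm, n_splits=5, seed=0):
    """Cross-fitted outcome predictions m_hat_t(X_i): for each fold, the
    model for arm `arm` is trained only on subjects in the *other* folds who
    received that arm, then evaluated on the held-out fold.
    Assumes each set of training folds contains subjects assigned to `arm`."""
    m_hat = np.zeros(len(Y))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, eval_idx in kf.split(X):
        mask = (T[train_idx] == arm)                     # training subjects in this arm
        model = LinearRegression().fit(X[train_idx][mask], Y[train_idx][mask])
        m_hat[eval_idx] = model.predict(X[eval_idx])     # held-out predictions only
    return m_hat
```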
Metrics underscore these advantages vividly. For example, the calibration weight function within CALM quantifies the strength of association between LLM predictions and observed outcomes, revealing heterogeneity in predictive relevance across covariates. Empirical results show that where this alignment is strong, CALM achieves substantial variance reductions—sometimes exceeding 30%—compared to benchmarks. This selective augmentation not only sharpens average treatment effect estimates but also boosts power for detecting heterogeneous effects within subpopulations.
In sum, the CALM framework represents a principled, practical, and robust methodology for integrating language model predictions into the statistical analysis of randomized trials. Its design thoughtfully balances leveraging complex AI insights with the rigor required for causal inference, setting the stage for the detailed estimation strategies and theoretical properties that follow.
Estimation of Causal Parameters Using CALM
Building upon the foundation of estimating mean potential outcomes, CALM extends its power to handle key causal parameters that matter most in randomized controlled trials: the Average Treatment Effect (ATE) and the Conditional Average Treatment Effect (CATE). These parameters provide overall and personalized measures of treatment impact, respectively, enabling researchers not only to quantify average benefits but also to uncover meaningful treatment heterogeneity across subgroups.
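In the potential-outcomes notation used above, with a binary treatment indicator T ∈ {0, 1}, these two estimands are

$$
\mathrm{ATE} = \mathbb{E}\bigl[Y(1) - Y(0)\bigr],
\qquad
\mathrm{CATE}(x) = \mathbb{E}\bigl[Y(1) - Y(0) \mid X = x\bigr],
$$

so the ATE summarizes the average benefit across the trial population, while the CATE describes how that benefit varies with the covariate profile x.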
To illustrate CALM's practical utility, consider the BRIGHTEN study, a large-scale, smartphone-based RCT evaluating digital interventions for depression. By integrating structured demographic and clinical variables with rich unstructured textual data from participants' baseline surveys and app-use narratives, CALM leverages LLM-generated counterfactual predictions to reveal nuanced variations in treatment effects. For instance, CALM flagged a significant positive treatment effect among female Hispanic participants, an effect that more conventional analyses failed to detect because of limited power and underuse of the unstructured data. This example highlights how CALM adaptively calibrates the prognostic strength of language model predictions to boost estimation precision where it truly matters.
The core of CALM’s approach to estimating ATE and CATE lies in its statistical calibration mechanism. Unlike traditional estimators that treat covariates homogeneously, CALM constructs heterogeneous calibration weights that quantify how informative the LLM predictions are across different covariate strata. These weights modulate the influence of AI-generated signals, ensuring that noisy or biased predictions do not compromise consistency. Mathematical guarantees underpin this design: under mild assumptions, CALM estimators remain consistent, asymptotically normal, and achieve strictly lower variance than classical augmented inverse probability weighting (AIPW) estimators whenever the LLM predictions correlate with observed outcomes. Further, sample-splitting and cross-validation techniques embedded in CALM relax stringent conditions often required by semiparametric inference, paving the way for flexible machine learning methods to estimate nuisance functions without sacrificing validity.
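As a purely illustrative picture of these calibration weights (using discrete covariate strata where the paper's description allows smoother tools such as kernel regression), the weight in a stratum can be thought of as the regression slope of observed outcomes on LLM predictions within that stratum:

```python
import numpy as np

def stratum_calibration_weights(f_pred, y_obs, strata):
    """Illustrative calibration weights: for each stratum s, the OLS slope of
    observed outcomes on LLM predictions among subjects in that stratum.
    (A real implementation would estimate these per arm and on held-out
    folds; this sketch ignores those details.)"""
    weights = {}
    for s in np.unique(strata):
        idx = strata == s
        f_s, y_s = f_pred[idx], y_obs[idx]
        var_f = np.var(f_s)
        # Slope = Cov(Y, f) / Var(f); zero if the prediction is constant here.
        weights[s] = 0.0 if var_f == 0 else np.cov(f_s, y_s, ddof=0)[0, 1] / var_f
    return weights
```

Strata where the slope is near zero contribute little LLM signal, which is how the estimator avoids importing bias from regions where the predictions are unreliable.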
Translating theory into practice, adopting CALM involves several clear steps:
- Data preparation: Combine structured and unstructured covariates, embedding textual data using pretrained sentence transformers or equivalent models (see the embedding sketch after this list).
- LLM counterfactual prediction: Generate zero-shot or few-shot predictions depending on the availability and quality of demonstrative examples.
- Heterogeneous calibration weight estimation: Use robust regression or kernel smoothing to learn where LLM predictions are most reliable.
- Final estimator construction: Incorporate calibrated, residualized LLM predictions into the AIPW framework to form the CALM estimator.
- Statistical inference: Apply cross-fitting and asymptotic variance estimators to obtain valid confidence intervals.
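To illustrate just the data-preparation step, the snippet below embeds free-text baseline responses with a pretrained sentence transformer and concatenates them with structured covariates. The model name, example texts, and covariate layout are placeholders, not choices made in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical inputs: free-text baseline responses and a row-aligned matrix
# of structured covariates (e.g., age, baseline depression score).
texts = [
    "I have trouble sleeping and little interest in activities.",
    "My mood has been low most days this month.",
]
structured = np.array([[34, 14.0],
                       [52, 9.5]])

# Any pretrained sentence encoder works here; all-MiniLM-L6-v2 is a small,
# commonly used default, not necessarily what the CALM authors used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_embeddings = encoder.encode(texts)          # shape: (n_subjects, embed_dim)

# Combined covariate matrix passed to downstream nuisance models
# and to the LLM prompt-construction step.
X = np.hstack([structured, text_embeddings])
```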