Probabilistic medical predictions of large language models.

Abstract

Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability estimation methods, and further research for clinical use of LLMs.
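The implicit approach described in the abstract — deriving a probability from the likelihood the model assigns to the correct label token, rather than asking the model to write a number — can be sketched as follows. This is an illustrative example with made-up logits and label strings, not the paper's implementation; real usage would read next-token logits from an actual LLM.

```python
import math

def implicit_label_probability(logits, label_tokens):
    """Softmax over hypothetical next-token logits, then sum the
    probability mass assigned to the tokens that spell the label."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return sum(exps[tok] for tok in label_tokens if tok in exps) / z

# Hypothetical next-token logits after a prompt like
# "Does this patient have condition X? Answer Yes or No:"
logits = {"Yes": 2.0, "No": 0.5, "yes": 1.0, "Maybe": -1.0}

# Pool surface variants of the positive label ("Yes", "yes").
p_yes = implicit_label_probability(logits, {"Yes", "yes"})
```

An explicit probability, by contrast, would be parsed from generated text (e.g. the model writing "85%"), which is where the abstract reports weaker discrimination, precision, and recall.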

Year of Publication
2024
Journal
npj Digital Medicine
Volume
7
Issue
1
Pages
367
Date Published
12/2024
ISSN
2398-6352
DOI
10.1038/s41746-024-01366-4
PubMed ID
39702641