Vogl, B. (2024). LLM calibration: A dual approach of post-processing and pre-processing calibration techniques in large language models for medical question answering [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.118886
E194 - Institute of Information Systems Engineering
Date (published):
2024
Number of Pages:
52
Keywords:
LLM; Calibration; Chain-of-Thought; Medical Questioning; Diagnosis; LLM Confidence
Abstract:
This thesis investigates the performance of Large Language Models (LLMs) in answering medical multiple-choice questions and explores strategies to enhance their accuracy, confidence estimation, and calibration. Specifically, we analyze the capabilities of GPT-3.5 and Cohere on the MedMCQA dataset, focusing on prompting techniques, revision strategies, and post-processing calibration methods. Our goals include assessing the efficacy of Chain of Thought (CoT) prompting, examining the relationship between model confidence and correctness, and evaluating post-processing calibration techniques such as Platt Scaling, Beta Calibration, and Isotonic Regression.

Findings reveal GPT-3.5's superior accuracy compared to Cohere in medical question answering. However, CoT prompting did not significantly improve model performance, suggesting its limited effectiveness in this context. Model confidence correlated with answer accuracy, but discrepancies between predicted and actual performance underscored the importance of robust calibration methods. Revision strategies marginally improved accuracy, with models adjusting their responses when prompted to reconsider. Post-processing calibration techniques, particularly Isotonic Regression, significantly improved the alignment between predicted probabilities and actual outcomes, enhancing model reliability.
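To illustrate the kind of post-processing calibration the abstract describes, the following is a minimal sketch of Isotonic Regression applied to model confidence scores, using scikit-learn. The variable names and the synthetic confidence/correctness data are illustrative assumptions, not code or data from the thesis; in practice the inputs would be the confidences a model reports on a held-out set of questions, paired with whether each answer was correct.

```python
# Minimal sketch: calibrating LLM confidence scores with Isotonic Regression.
# The confidences and labels below are synthetic placeholders, not thesis data.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Raw confidences reported by the model, and whether each answer was correct.
raw_conf = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50])
correct  = np.array([1,    1,    0,    1,    0,    1,    0,    0])

# Fit a monotone mapping from raw confidence to empirical accuracy.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_conf, correct)

# Apply the learned mapping to new confidence scores.
new_conf = np.array([0.92, 0.65])
print("calibrated confidences:", iso.predict(new_conf))

# Brier score before and after calibration (lower is better calibrated).
print("Brier before:", brier_score_loss(correct, raw_conf))
print("Brier after: ", brier_score_loss(correct, iso.predict(raw_conf)))
```

Unlike Platt Scaling, which fits a parametric sigmoid, Isotonic Regression learns an arbitrary monotone mapping, which is why it can correct more irregular miscalibration patterns given enough calibration data.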