Sadka, Abdul and Ahmed, Hosameldin (2026). Auditing Demographic Bias in Mistral: An Open-Source LLM’s Diagnostic Performance on the MedQA Benchmark. IEEE Access, 14.
Abstract
The application of large language models (LLMs) within clinical decision-support frameworks is receiving growing research attention, yet their fairness and demographic robustness remain insufficiently understood. This study introduces MedQA-Demog, a purpose-built, label-invariant extension of the MedQA-USMLE benchmark, designed to enable systematic auditing of demographic bias in medical reasoning models. Using a deterministic augmentation framework, we generated 4,659 question-answer items that incorporated counterfactual variations in gender, race/ethnicity, and age, and validated them through automated integrity and balance checks. We evaluated the Mistral 7B-Instruct model under stochastic (temperature = 0.7) and deterministic (temperature = 0.0) inference settings via the Ollama local environment, applying Wilson 95% confidence intervals, χ²/z-tests, McNemar’s paired analysis, and Cohen’s h effect sizes to quantify fairness. Across all demographic variants, diagnostic accuracy remained consistent (Δ < 0.04; p > 0.05), and all performance gaps fell within Minimal or Low Bias thresholds. Confusion-matrix and prediction-balance analyses revealed no systematic over- or under-prediction patterns, while power analysis confirmed that observed fluctuations were below the minimum detectable effect (≈ 0.057). A stratified robustness analysis further confirmed that these fairness patterns persist across question difficulty levels and are not an artefact of uniformly limited performance. These findings demonstrate that open-weight, instruction-tuned LLMs can maintain demographic stability in clinical reasoning when evaluated through reproducible, controlled pipelines. This framework provides a practical foundation for bias evaluation in open clinical LLMs, supporting their ethical integration into digital health tools and clinical decision-support systems.
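The two headline fairness statistics named in the abstract, the Wilson 95% confidence interval on subgroup accuracy and Cohen's h for the gap between two subgroups, can be sketched as follows. This is an illustrative sketch only, not the authors' evaluation code; the function names and the sample counts are assumptions made for the example.

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a proportion correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the gap between two accuracies."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Illustrative numbers only (not the paper's reported results):
lo, hi = wilson_ci(correct=250, n=500)   # subgroup accuracy 0.50 on 500 items
h = cohens_h(0.50, 0.46)                 # an accuracy gap of 0.04 between subgroups
print(f"Wilson 95% CI: ({lo:.3f}, {hi:.3f}); Cohen's h = {h:.3f}")
```

Under the conventional thresholds (h < 0.2 is a small effect), a gap of 0.04 in accuracy around 0.5 yields h ≈ 0.08, which is consistent with the "Minimal or Low Bias" classification the abstract describes.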
| Publication DOI: | https://doi.org/10.1109/ACCESS.2026.3656396 |
|---|---|
| Divisions: | College of Engineering & Physical Sciences > Aston Digital Futures Institute; College of Engineering & Physical Sciences; Aston University (General) |
| Additional Information: | (c) 2026 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ |
| Uncontrolled Keywords: | Large language models (LLMs); demographic bias; fairness auditing; medical question answering; MedQA benchmark; Mistral 7B-Instruct; open-weight models; Ollama; Wilson confidence interval; statistical bias evaluation; digital health; ethical AI. |
| Publication ISSN: | 2169-3536 |
| Last Modified: | 28 Jan 2026 08:53 |
| Date Deposited: | 22 Jan 2026 12:01 |
| Full Text Link: | |
| Related URLs: | https://ieeexpl ... cument/11359144 (Publisher URL) |
| PURE Output Type: | Article |
| Published Date: | 2026-01-20 |
| Published Online Date: | 2026-01-20 |
| Accepted Date: | 2026-01-15 |
| Authors: | Sadka, Abdul (0000-0002-9825-5911); Ahmed, Hosameldin (0000-0002-8523-1099) |