Sadka, Abdul and Ahmed, Hosameldin (2026). Auditing Demographic Bias in Mistral: An Open-Source LLM’s Diagnostic Performance on the MedQA Benchmark. IEEE Access, 14.
Abstract
The application of large language models (LLMs) within clinical decision-support frameworks is receiving growing research attention, yet their fairness and demographic robustness remain insufficiently understood. This study introduces MedQA-Demog, a purpose-built, label-invariant extension of the MedQA-USMLE benchmark, designed to enable systematic auditing of demographic bias in medical reasoning models. Using a deterministic augmentation framework, we generated 4,659 question-answer items that incorporated counterfactual variations in gender, race/ethnicity, and age, and validated them through automated integrity and balance checks. We evaluated the Mistral 7B-Instruct model under stochastic (temperature = 0.7) and deterministic (temperature = 0.0) inference settings via the Ollama local environment, applying Wilson 95% confidence intervals, χ²/z-tests, McNemar’s paired analysis, and Cohen’s h effect sizes to quantify fairness. Across all demographic variants, diagnostic accuracy remained consistent (Δ < 0.04; p > 0.05), and all performance gaps fell within Minimal or Low Bias thresholds. Confusion-matrix and prediction-balance analyses revealed no systematic over- or under-prediction patterns, while power analysis confirmed that observed fluctuations were below the minimum detectable effect (≈ 0.057). A stratified robustness analysis further confirmed that these fairness patterns persist across question difficulty levels and are not an artefact of uniformly limited performance. These findings demonstrate that open-weight, instruction-tuned LLMs can maintain demographic stability in clinical reasoning when evaluated through reproducible, controlled pipelines. This framework provides a practical foundation for bias evaluation in open clinical LLMs, supporting their ethical integration into digital health tools and clinical decision-support systems.
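The two headline fairness statistics named in the abstract, the Wilson 95% confidence interval on subgroup accuracy and Cohen's h for the gap between two subgroups, can be sketched as follows. This is an illustrative sketch only, not the authors' evaluation code; the function names and the sample counts are assumptions made for the example.

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a proportion correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the gap between two accuracies."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Illustrative numbers only (not the paper's reported results):
lo, hi = wilson_ci(correct=250, n=500)   # subgroup accuracy 0.50 on 500 items
h = cohens_h(0.50, 0.46)                 # an accuracy gap of 0.04 between subgroups
print(f"Wilson 95% CI: ({lo:.3f}, {hi:.3f}); Cohen's h = {h:.3f}")
```

Under the conventional thresholds (h < 0.2 is a small effect), a gap of 0.04 in accuracy around 0.5 yields h ≈ 0.08, which is consistent with the "Minimal or Low Bias" classification the abstract describes.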
| Publication DOI: | https://doi.org/10.1109/ACCESS.2026.3656396 |
|---|---|
| Divisions: | College of Engineering & Physical Sciences > Aston Digital Futures Institute; College of Engineering & Physical Sciences; Aston University (General) |
| Additional Information: | (c) 2026 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ |
| Uncontrolled Keywords: | Large language models (LLMs); demographic bias; fairness auditing; medical question answering; MedQA benchmark; Mistral 7B-Instruct; open-weight models; Ollama; Wilson confidence interval; statistical bias evaluation; digital health; ethical AI. |
| Publication ISSN: | 2169-3536 |
| Last Modified: | 28 Jan 2026 08:53 |
| Date Deposited: | 22 Jan 2026 12:01 |
| Full Text Link: | |
| Related URLs: | https://ieeexpl ... cument/11359144 (Publisher URL) |
| PURE Output Type: | Article |
| Published Date: | 2026-01-20 |
| Published Online Date: | 2026-01-20 |
| Accepted Date: | 2026-01-15 |
| Authors: | Sadka, Abdul (0000-0002-9825-5911); Ahmed, Hosameldin (0000-0002-8523-1099) |