Auditing Demographic Bias in Mistral: An Open-Source LLM's Diagnostic Performance on the MedQA Benchmark

Abstract

The application of large language models (LLMs) within clinical decision-support frameworks is receiving growing research attention, yet their fairness and demographic robustness remain insufficiently understood. This study introduces MedQA-Demog, a purpose-built, label-invariant extension of the MedQA-USMLE benchmark, designed to enable systematic auditing of demographic bias in medical reasoning models. Using a deterministic augmentation framework, we generated 4,659 question-answer items that incorporated counterfactual variations in gender, race/ethnicity, and age, and validated them through automated integrity and balance checks. We evaluated the Mistral 7B-Instruct model under stochastic (temperature = 0.7) and deterministic (temperature = 0.0) inference settings via the Ollama local environment, applying Wilson's 95% confidence intervals, χ²/z-tests, McNemar's paired analysis, and Cohen's h effect sizes to quantify fairness. Across all demographic variants, diagnostic accuracy remained consistent (Δ < 0.04; p > 0.05), and all performance gaps fell within Minimal or Low Bias thresholds. Confusion-matrix and prediction-balance analyses revealed no systematic over- or under-prediction patterns, while power analysis confirmed that observed fluctuations were below the minimum detectable effect (≈ 0.057). A stratified robustness analysis further confirms that these fairness patterns persist across question difficulty levels and are not an artefact of uniformly limited performance. These findings demonstrate that open-weight, instruction-tuned LLMs can maintain demographic stability in clinical reasoning when evaluated through reproducible, controlled pipelines. This framework provides a practical foundation for bias evaluation in open clinical LLMs, supporting their ethical integration into digital health tools and clinical decision-support systems.
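For readers unfamiliar with the statistics named above, the following is a minimal, illustrative Python sketch (not the authors' code) of two of them: the Wilson score interval used to bound each subgroup's accuracy, and Cohen's h effect size used to compare accuracy between two demographic variants. Function names and the example counts are the writer's own.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a proportion.

    successes: number of correct answers; n: number of items;
    z: normal critical value (1.96 for a 95% interval).
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size between two proportions via arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical example: two subgroups answering 100 items each.
lo, hi = wilson_ci(60, 100)      # interval around 0.60 accuracy
h = cohens_h(0.60, 0.58)          # small |h| indicates a minimal gap
```

Under common conventions, |h| below roughly 0.2 is treated as a small effect, which is the regime the abstract's "Minimal or Low Bias" findings describe.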

Publication DOI: https://doi.org/10.1109/ACCESS.2026.3656396
Divisions: College of Engineering & Physical Sciences > Aston Digital Futures Institute
College of Engineering & Physical Sciences
Aston University (General)
Additional Information: (c) 2026 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Uncontrolled Keywords: Large language models (LLMs); demographic bias; fairness auditing; medical question answering; MedQA benchmark; Mistral 7B-Instruct; open-weight models; Ollama; Wilson confidence interval; statistical bias evaluation; digital health; ethical AI.
Publication ISSN: 2169-3536
Last Modified: 28 Jan 2026 08:53
Date Deposited: 22 Jan 2026 12:01
Full Text Link:
Related URLs: https://ieeexpl ... cument/11359144 (Publisher URL)
PURE Output Type: Article
Published Date: 2026-01-20
Published Online Date: 2026-01-20
Accepted Date: 2026-01-15
Authors: Sadka, Abdul (ORCID Profile 0000-0002-9825-5911)
Ahmed, Hosameldin (ORCID Profile 0000-0002-8523-1099)

Available Versions: Accepted Version; Published Version (Creative Commons Attribution)
