A Filter-Based Feature Selection Framework to Detect Phishing URLs Using Stacking Ensemble Machine Learning

Bari, Nimra; Saleem, Tahir; Shah, Munam; Algarni, Abdulmohsen; Patel, Asma and Ullah, Insaf (2025). A Filter-Based Feature Selection Framework to Detect Phishing URLs Using Stacking Ensemble Machine Learning. Computer Modeling in Engineering and Sciences, 145 (1), pp. 1167-1187.

Abstract

Phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers, passwords, and usernames. Several anti-phishing solutions exist, such as heuristic detection, visual similarity detection, blacklists and whitelists, and machine learning (ML). However, phishing attempts remain a problem, and establishing an effective anti-phishing strategy is a work in progress. Furthermore, while most anti-phishing solutions achieve high accuracy on a given dataset, they suffer from an increased number of false positives and are ineffective against zero-hour attacks. With a high False Positive Rate (FPR), phishing sites are treated as genuine, which can cause people who visit them to lose a lot of money. Feature selection is critical when developing phishing detection strategies. Good feature selection helps improve accuracy, whereas duplicate features increase noise in the dataset and reduce the accuracy of the algorithm. Therefore, a combination of filter-based feature selection methods is proposed to detect phishing attacks, including constant feature removal, duplicate feature removal, quasi-constant feature removal, correlated feature removal, mutual information extraction, and Analysis of Variance (ANOVA) testing. The technique has been tested with different ML classifiers: Random Forest, Artificial Neural Network (ANN), AdaBoost, Extreme Gradient Boosting (XGBoost), Logistic Regression, Decision Trees, Gradient Boosting Classifiers, Support Vector Machine (SVM), and two types of ensemble models, stacking and majority voting, to achieve a low false positive rate. The stacking ensemble classifier (gradient boosting, random forest, support vector machine) achieves 1.31% FPR and 98.17% accuracy on Dataset 1, 3.47% FPR and 96.47% accuracy on Dataset 2, and 2.81% FPR and 97.61% accuracy on Dataset 3.
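
The filter stages named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the feature names, the toy data, the thresholds (`quasi`, `corr`, `top_k`) and the choice of the positive class are all assumptions made for illustration. The ANOVA step and the stacking classifier are omitted, and quasi-constant removal is approximated by the share of the most frequent value in a column.

```python
import math
from collections import Counter


def dominant_share(values):
    """Fraction of rows taken by the most frequent value in a column."""
    return Counter(values).most_common(1)[0][1] / len(values)


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete columns."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def filter_features(columns, labels, quasi=0.85, corr=0.9, top_k=2):
    """Run the filter stages in order; return the surviving feature names."""
    # Constant and quasi-constant removal: a constant column has a dominant
    # share of exactly 1.0, so a single threshold covers both cases.
    kept = [f for f in columns if dominant_share(columns[f]) < quasi]
    # Duplicate removal: keep only the first of any identical columns.
    seen, unique = set(), []
    for f in kept:
        key = tuple(columns[f])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    # Correlated removal: drop a feature that is strongly correlated
    # (in absolute value) with a feature that was already kept.
    selected = []
    for f in unique:
        if all(abs(pearson(columns[f], columns[g])) <= corr for g in selected):
            selected.append(f)
    # Rank the survivors by mutual information with the class label.
    ranked = sorted(selected,
                    key=lambda f: mutual_information(columns[f], labels),
                    reverse=True)
    return ranked[:top_k]


def fpr_and_accuracy(y_true, y_pred, positive=1):
    """FPR = FP / (FP + TN); which class is 'positive' is an assumption."""
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return fp / (fp + tn), correct / len(y_true)


# Hypothetical toy data: eight URLs, binary features, binary labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "const":       [1, 1, 1, 1, 1, 1, 1, 1],  # constant
    "rare":        [0, 0, 0, 0, 0, 0, 0, 1],  # quasi-constant
    "has_ip":      [1, 1, 1, 1, 0, 0, 0, 0],
    "has_ip_copy": [1, 1, 1, 1, 0, 0, 0, 0],  # duplicate of has_ip
    "no_ip":       [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly anti-correlated
    "long_url":    [1, 1, 0, 1, 0, 1, 0, 0],
    "noise":       [1, 0, 1, 0, 1, 0, 1, 0],  # carries no label information
}
selected = filter_features(columns, labels)
print(selected)  # ['has_ip', 'long_url']
```

On real data the thresholds would need tuning per dataset, and continuous features would need a different mutual information estimator than the discrete count-based one used here.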

Publication DOI:	https://doi.org/10.32604/cmes.2025.070311
Divisions:	College of Business and Social Sciences; College of Business and Social Sciences > Aston Business School > Cyber Security Innovation (CSI) Research Centre; College of Business and Social Sciences > Aston Business School; Aston University (General)
Funding Information:	This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/21/46) and in part by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, under Grant KFU253116.
Additional Information:	Copyright © 2025 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Publication ISSN:	1526-1506
Data Access Statement:	The datasets used in this study are publicly available from the following sources:
• Dataset 1: Sourced from Mohammad et al. (2012) and available at the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Phishing+Websites (accessed on 21 August 2025).
• Dataset 2: Sourced from Buber (2019). The data was collected from PhishTank and OpenPhish, which are publicly accessible platforms: https://www.phishtank.com/ (accessed on 21 August 2025) and https://openphish.com/ (accessed on 21 August 2025).
• Dataset 3: Sourced from Hannousse (2021). This dataset is a benchmark for machine learning-based phishing detection and is available for research purposes. DOI: 10.1016/j.engappai.2021.104347 (accessed on 21 August 2025).
Last Modified:	20 Mar 2026 08:08
Date Deposited:	10 Feb 2026 10:42
Related URLs:	https://www.tec ... ES/v145n1/64339 (Publisher URL)
PURE Output Type:	Article
Published Date:	2025-10-30
Accepted Date:	2025-08-22
Authors:	Bari, Nimra; Saleem, Tahir; Shah, Munam; Algarni, Abdulmohsen; Patel, Asma (ORCID: 0000-0003-1636-5955); Ullah, Insaf

Version: Published Version

License: Creative Commons Attribution