Introduction

Despite notable advancements in medical technology and improved prognostic outcomes, cardiovascular diseases (CVDs) remain among the leading causes of mortality worldwide. Cardiac disease refers to a broad spectrum of conditions that affect the structure and function of the heart [1]. It remains a major global health concern with substantial clinical and economic implications. Early diagnosis and prompt medical intervention play a crucial role in decreasing mortality associated with conditions such as myocardial infarction and sudden cardiac death [2].

The diverse range of modifiable and non-modifiable risk factors linked to cardiac disease is primarily grounded in empirical clinical evidence. These factors are generally categorized into social determinants such as educational attainment, health literacy, and access to healthcare, as well as biological and behavioral determinants including age, lifestyle habits, comorbidities, and individual behaviors [3-5].

CVD remains the foremost cause of mortality globally. Its prevalence and associated death rates have risen significantly over recent decades, from 271 million cases and 12.1 million deaths in 1990 to 523 million cases and 18.1 million deaths by 2019. Projections indicate that CVD-related deaths may reach 23.6 million by 2030 [6, 7].

Furthermore, projections from 2010 estimated that the global cost of treating CVDs would rise from 863 billion USD to approximately 1.044 trillion USD by 2030. In response, the application of Artificial Intelligence (AI) in screening, diagnosis, and treatment has been increasingly recognized as a promising strategy to reduce both the prevalence and economic burden of CVDs [8].

AI is a rapidly evolving field that has become embedded in many aspects of modern life. The rapid advancement of computational technologies and the growing availability of digital data have significantly contributed to the progress of AI. Over the past decade, machine learning techniques have seen widespread adoption across numerous medical disciplines, including cardiology [9, 10]. Cardiology is a key area for AI application, as treatment decisions rely heavily on digital patient data and various diagnostic tests. AI algorithms excel at processing and analyzing large volumes of clinical information with a speed and accuracy that surpass human capabilities [11]. Machine Learning (ML) presents promising opportunities to enhance the screening and prevention of CVDs. As AI continues to integrate into medicine, it is expected to significantly augment doctors' capabilities, improving efficiency and productivity. Among cardiac conditions, arrhythmias involve disturbances in cardiac rhythm that manifest in diverse clinical forms. Advances in diagnostic technologies have considerably improved the detection of arrhythmias [12, 13].

The Electrocardiogram (ECG) is widely regarded as the most straightforward and dependable method for diagnosing these conditions [14]. Clinicians can extract key parameters, including the P wave, QRS complex, ST segment, T wave, and U wave, from standard 12-lead ECG recordings to identify irregularities. Yet, manually reviewing long-duration ECG recordings for subtle abnormalities remains time-consuming and prone to observer variability [15]. Automatic detection of arrhythmias from ECG recordings is essential for accurate clinical diagnosis and treatment. ML techniques have proven to be valuable in this area, facilitating the automated and precise identification of cardiac arrhythmias [9].

Unlike prior studies that typically apply feature selection in a single phase or evaluate only a limited number of models, this research proposes a systematic multi-stage feature selection framework based on Correlation-based Feature Selection (CFS), with fold-based analysis to capture the most consistent and informative ECG parameters. The study also offers a comprehensive comparison of classifier performance across three progressive feature selection scenarios (no selection, best 50 features, and elite 27 features), thereby establishing a reproducible and fine-tuned pipeline for arrhythmia classification that is both interpretable and performance-driven.

Although the global burden of CVDs continues to rise, arrhythmias constitute a clinically important subgroup due to their diagnostic complexity and potential for severe outcomes. The complexity of arrhythmia, influenced by numerous factors, creates challenges in analyzing extensive data and can lead to confusion in diagnosis and treatment. At the same time, the integration of AI and ML across various clinical applications in cardiology, including screening, diagnosis, and treatment, has garnered considerable attention. Since most cardiac arrhythmia data are electronically recorded and influenced by multiple factors, this study employed the CFS method to identify the most relevant features for diagnosing arrhythmia. Across three feature selection scenarios, multiple ML models were developed, analyzed, and rigorously evaluated.

Material and Methods

The data set

This computational, retrospective, machine learning–based analytical study utilized the publicly available Cardiac Arrhythmia dataset from the UCI Machine Learning Repository [16]. The dataset contains ECG recordings from 452 patients, each represented by 279 attributes, of which 206 are continuous and 73 are categorical or binary. These features represent cardiac electrical activity across 12 standard ECG leads, including time intervals (QRS duration, PR interval), amplitude values of P, Q, R, S, and T waves, and derived attributes such as QRSTA and RR interval variability.

The dataset includes 16 classification categories, with Class 01 representing normal ECG findings and Classes 02 to 15 corresponding to specific arrhythmia types, such as atrial flutter, ventricular tachycardia, left bundle branch block, and supraventricular arrhythmias. Class 16 includes patients whose ECG recordings could not be reliably assigned to a specific category. Each class label is determined based on the interpretation of ECG signals by expert cardiologists, following clinical diagnostic standards. However, due to class imbalance and the diagnostic complexity of arrhythmia types, many studies (including ours) consolidate the multiclass output into a binary classification framework, distinguishing normal (Class 01) from abnormal/arrhythmic (Classes 02–16) ECGs. This binary approach facilitates early screening and simplifies performance evaluation while maintaining clinical relevance. Table 1 lists the arrhythmia types and their class labels.
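
The consolidation into two classes can be illustrated with a minimal Python sketch. The per-class counts are copied from Table 1; the function and variable names (to_binary, binary_counts, CLASS_COUNTS) are hypothetical and are not part of the original analysis pipeline.

```python
# Instance counts per class label, copied from Table 1.
CLASS_COUNTS = {
    1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
    9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22,
}


def to_binary(label):
    """Map a 16-class label to 0 (normal, Class 01) or 1 (abnormal, Classes 02-16)."""
    return 0 if label == 1 else 1


def binary_counts(class_counts):
    """Aggregate per-class instance counts into normal/abnormal totals."""
    totals = {0: 0, 1: 0}
    for label, n in class_counts.items():
        totals[to_binary(label)] += n
    return totals


totals = binary_counts(CLASS_COUNTS)
n_total = sum(totals.values())
normal_share = 100 * totals[0] / n_total  # percentage of normal ECGs
```

Under this mapping the totals are 245 normal and 207 abnormal recordings out of 452.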

Feature selection method

Feature selection, also known as variable or attribute selection, differs from dimensionality reduction, which reduces dimensionality by generating new feature combinations [17]. The main goal of feature selection is to identify and eliminate unnecessary, irrelevant, and redundant features in order to enhance classification accuracy.

Table 1: Description of types of arrhythmias and class labels

Number of instances	Class Name	Class Label
245	Normal ECG	01
44	Ischemic changes	02
15	Old anterior myocardial infarction	03
15	Old inferior myocardial infarction	04
13	Sinus tachycardia	05
25	Sinus bradycardia	06
3	Ventricular Premature Contraction	07
2	Supraventricular Premature Contraction	08
9	Left bundle branch block	09
50	Right bundle branch block	10
0	1-degree Atrioventricular block	11
0	2-degree AV block	12
0	3-degree AV block	13
4	Left ventricular hypertrophy	14
5	Atrial Fibrillation or Flutter	15
22	Others	16

Abbreviations: ECG, Electrocardiogram; MI, Myocardial Infarction; VPC, Ventricular Premature Contraction; SVC, Supraventricular Premature Contraction; LBBB, Left Bundle Branch Block; RBBB, Right Bundle Branch Block; AV, Atrioventricular; LVH, Left Ventricular Hypertrophy.

Feature selection techniques are generally classified into three main categories based on their approach and how they interact with model development: Filter Methods, Wrapper Methods, and Embedded Methods [18]. Each of these methods presents distinct benefits and is applicable to different situations.

In this study, we selected the CFS method with Best-First Search (BFS) due to its proven efficiency in reducing feature redundancy while preserving high predictive value. CFS evaluates feature subsets based on both relevance to the target class and inter-feature correlations, making it well-suited for biomedical datasets where many attributes may be collinear or weakly informative. Moreover, BFS offers a practical heuristic for exploring the search space without exhaustive computation, enabling scalability for high-dimensional datasets.

Rationale for the multi-phase CFS procedure

Our three-phase CFS approach was designed to ensure comprehensive feature exploration by iteratively removing highly correlated features and re-evaluating the remaining ones. This method allows for the identification of both globally dominant and locally nuanced ECG characteristics. By removing the most influential features from the first and second steps, we force the algorithm to identify additional meaningful features that might have been initially overshadowed by highly correlated or dominant attributes. Moreover, each phase provides an opportunity to observe feature selection stability across folds, ensuring robust feature selection.

The CFS algorithm evaluates feature subsets based on Pearson correlation between features and the class, with a merit threshold set to accept subsets that improve predictive correlation while reducing inter-feature redundancy. The BFS was configured with a forward search direction, allowing up to 5 consecutive non-improving nodes before termination, and a cache size of 5 to balance exploration and computation time.
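
To make the trade-off between relevance and redundancy concrete, the following Python sketch computes a subset merit score. It assumes the standard CFS merit formula, k * r_cf / sqrt(k + k(k-1) * r_ff), with absolute Pearson correlations averaged over feature-class pairs (r_cf) and feature-feature pairs (r_ff). The function names and the toy data are hypothetical, and the sketch is not the implementation used in this study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(columns, target):
    """Merit of a feature subset: high class correlation, low inter-feature correlation."""
    k = len(columns)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): f1 and f3 are equally informative but uncorrelated.
target = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 0, 1, 1, 1]
f3 = [0, 0, 1, 0, 1, 0, 1, 1]

merit_single = cfs_merit([f1], target)
merit_duplicate = cfs_merit([f1, list(f1)], target)  # adding a redundant copy
merit_complementary = cfs_merit([f1, f3], target)    # adding a non-redundant feature
```

In this toy example, adding an exact copy of a feature leaves the merit unchanged, whereas adding an equally relevant but uncorrelated feature raises it, which is the behavior the search exploits when it prefers non-redundant subsets.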

Selection criteria for elite features

The elite features were selected from the initial 50 best features using two criteria applied in sequence: fold consistency and CFS subset evaluation. We ranked the 50 features by their selection frequency across 10-fold cross-validation, and only features selected in at least 3 folds were retained. We then re-applied CFS to this filtered set, which yielded 27 features that maximized the subset merit score (which rewards correlation with the class and penalizes inter-feature correlation). This dual-criteria approach ensured both stability (fold consistency) and optimal subset relevance (CFS merit).

Learning algorithms and approach assumption

Upon consolidation into a binary framework, the dataset comprised 245 normal ECGs (Class 01) and 207 abnormal ECGs (Classes 02–16), a moderately imbalanced class distribution (54.2% normal vs. 45.8% abnormal). Model performance was evaluated using stratified 10-fold cross-validation to ensure robustness and fair comparison across classifiers; stratification preserves the original class distribution across training and test folds. No resampling was performed on test data. Performance metrics were averaged across all folds to ensure stability and reliability.
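
As an illustration of how stratification preserves the class ratio, the sketch below assigns the 452 instances to 10 folds class by class. It is a simplified stand-in for the cross-validation routine of the toolkit used in the study; the function name stratified_folds, the seed, and the round-robin assignment are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[(pos + offset) % k].append(idx)
        offset += len(idxs)
    return folds


# 245 normal (0) and 207 abnormal (1) instances, as in the binary dataset.
labels = [0] * 245 + [1] * 207
folds = stratified_folds(labels, k=10)
normal_per_fold = [sum(1 for i in f if labels[i] == 0) for f in folds]
abnormal_per_fold = [sum(1 for i in f if labels[i] == 1) for f in folds]
```

Each fold then holds 24 or 25 normal and 20 or 21 abnormal recordings, so every test fold stays close to the overall 54.2% / 45.8% split.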

The rationale for selecting a diverse set of classifiers lies in their complementary strengths. Random forest was chosen for its resilience to overfitting and strong performance on tabular biomedical data. LogitBoost was selected for its boosting mechanism that can handle complex decision boundaries. Bayes Net was included for its probabilistic interpretability, which is particularly useful in medical settings. Additional classifiers such as Support Vector Machine (SVM), decision tree, K-nearest neighbors (KNN), and ensemble methods (voting, AdaBoost, bagging, and stacking) were evaluated to ensure a comprehensive comparison across different algorithm families: probabilistic, tree-based, kernel-based, and ensemble techniques.

We compared the performance of several classifiers, including Bayes net, SVM, random forest (with 10 and 40 trees), LogitBoost (with decision stump), voting, AdaBoost (with random forest), and bagging (with random forest). Additionally, two versions of the voting classifier were developed: Vote-1, which included Bayes net, SVM, random forest (40 trees), and LogitBoost (decision stump), and Vote-2, which included Bayes net, random forest (40 trees), and LogitBoost (decision stump).
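
The combination step of a voting classifier can be sketched as follows. The combination rule used for Vote-1 and Vote-2 is not stated here, so this example assumes simple majority voting; the function name majority_vote, the tie-breaking rule, and all prediction vectors are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote.

    Ties are resolved in favor of the earliest-listed classifier whose label
    is among the tied labels (an assumption made for this sketch).
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        winner = next(v for v in votes if counts[v] == top)
        combined.append(winner)
    return combined


# Hypothetical predictions for four ECGs (0 = normal, 1 = abnormal).
bayes_net = [1, 0, 1, 0]
svm = [0, 0, 0, 1]
random_forest = [1, 1, 0, 0]
logitboost = [0, 1, 1, 0]

vote_1 = majority_vote([bayes_net, svm, random_forest, logitboost])  # four members
vote_2 = majority_vote([bayes_net, random_forest, logitboost])       # three members
```

With three members and two classes, a Vote-2 style ensemble cannot produce ties, whereas a four-member ensemble such as Vote-1 needs an explicit tie-breaking rule under majority voting.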

Brief and comparative overview of ML algorithms

In this study, a wide range of ML classifiers was used to evaluate the impact of feature selection on the classification of cardiac arrhythmias. These algorithms were selected based on their popularity in biomedical domains and their diverse methodological foundations.

Tree-based classifiers such as decision tree, J48 tree, random tree, and random forest are well-known for their interpretability and ability to model non-linear relationships. Random forest offers strong generalization performance through ensemble averaging but may be less transparent than single-tree models like J48.

Boosting methods (AdaBoost, LogitBoost) aim to improve classification accuracy by iteratively correcting the errors of weak learners such as decision stumps or random trees. While boosting can significantly improve performance on imbalanced or noisy datasets (as is common in ECG data), it may risk overfitting if not properly tuned.

Bagging-based methods (Bagging with random forest, Naïve Bayes, reduced error pruning (REP) tree, and SVM) enhance stability and reduce variance by training multiple models on different subsets of the data. These are particularly beneficial when base learners are prone to overfitting.

Bayesian classifiers, including Naïve Bayes, kernel Naïve Bayes, and Bayes net, provide probabilistic outputs and are computationally efficient. Naïve Bayes assumes feature independence, which may not hold in ECG data, whereas Bayes Net captures dependencies between variables but requires more complex structure learning.

Instance-based algorithms like KNN and K* rely on distance metrics to classify new observations. These can capture complex decision boundaries but are sensitive to irrelevant features and data scaling, hence their performance typically improves after feature selection.

SVM is effective in high-dimensional spaces and is robust to overfitting, especially with proper kernel selection. However, it can be computationally intensive and sensitive to parameter settings, particularly in multiclass scenarios like arrhythmia classification. Rule-based classifiers such as Rule Induction and decision table generate human-readable rules, offering interpretability that is valuable in clinical decision-making. Hoeffding tree, a stream-based decision tree model, and Stacking ensembles (with different base learners including SVM, random forest, and ZeroR) were also employed to explore performance diversity. While stacking has the potential to leverage complementary strengths of base models, its success is highly dependent on the choice and interaction of these components. In this study, the poor performance of the stacking ensembles likely reflects the limited sample size and overfitting in the meta-learners.

Performance evaluation

Feature selection protocol

Our multi-phase feature selection process involved two main stages:

Stage 1: Identification of 50 best features (Three-phase CFS procedure)

We implemented a three-phase CFS protocol to identify the 50 most relevant features:

1. Phase 1: Apply CFS with BFS to the full dataset; retain top 20 features (selected in ≥8 folds).

2. Phase 2: Remove Phase 1 features; re-run CFS on remaining 259 features; retain next 20 features.

3. Phase 3: Remove all 40 previously selected features; run CFS on remaining 239 features; retain top 10 features.
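
The three phases above can be sketched as a generic remove-and-reselect loop. In this minimal Python example, rank_fn is a hypothetical placeholder for the CFS with BFS step ranked by fold frequency, and toy_rank is an arbitrary deterministic scorer used only to make the example runnable; neither reflects the actual selections reported in this study.

```python
def multi_phase_selection(features, rank_fn, quotas=(20, 20, 10)):
    """Run successive selection phases, removing chosen features after each phase.

    features: iterable of feature identifiers (here, attribute numbers 1..279).
    rank_fn:  callable returning the candidate features ordered best-first.
    quotas:   number of features to keep in each phase.
    """
    remaining = list(features)
    phases, pool_sizes = [], []
    for quota in quotas:
        pool_sizes.append(len(remaining))  # size of the candidate pool in this phase
        chosen = rank_fn(remaining)[:quota]
        phases.append(chosen)
        chosen_set = set(chosen)
        remaining = [f for f in remaining if f not in chosen_set]
    return phases, pool_sizes


def toy_rank(candidates):
    """Arbitrary deterministic ranking (placeholder for CFS + BFS)."""
    return sorted(candidates, key=lambda f: ((f * 37) % 101, f), reverse=True)


phases, pool_sizes = multi_phase_selection(range(1, 280), toy_rank)
best_50 = [f for phase in phases for f in phase]
```

The candidate pools shrink from 279 to 259 to 239 attributes, matching the phase descriptions, and the three phases together yield 50 distinct features.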

Stage 2: Refinement to 27 elite features

From the 50 best features, we applied a two-stage refinement:

1. Fold consistency filtering: Retain only features selected in ≥3 folds during 10-fold cross-validation.

2. CFS subset optimization: Re-apply CFS to this filtered set, yielding 27 elite features that maximized the subset merit score.
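
The fold consistency filter in step 1 amounts to counting how often each feature is selected across folds. The sketch below assumes the per-fold selections are available as lists of attribute names; the five hypothetical folds and the function names are illustrative only (the study used 10 folds).

```python
from collections import Counter


def fold_frequency(selected_per_fold):
    """Count in how many folds each feature was selected."""
    counts = Counter()
    for fold_selection in selected_per_fold:
        counts.update(set(fold_selection))  # count each feature once per fold
    return counts


def consistent_features(selected_per_fold, min_folds=3):
    """Keep features selected in at least min_folds folds, most frequent first."""
    counts = fold_frequency(selected_per_fold)
    kept = [f for f, c in counts.items() if c >= min_folds]
    return sorted(kept, key=lambda f: (-counts[f], f))


# Hypothetical per-fold selections using attribute names from the dataset.
selections = [
    ["QRSduration", "Heartrate", "chV2_Qwave"],
    ["QRSduration", "Heartrate"],
    ["QRSduration", "chV2_Qwave", "chDII_Qwave"],
    ["QRSduration", "Heartrate", "chV2_Qwave"],
    ["QRSduration", "Heartrate", "chDII_Qwave"],
]
kept = consistent_features(selections, min_folds=3)
```

Here chDII_Qwave appears in only two of the hypothetical folds and is dropped, while the other three features pass the threshold and would be handed to the final CFS step.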

Classification Scenarios

We evaluated classifier performance across three progressively refined feature sets:

Scenario 1: No feature selection (Baseline)

Classification was performed using all 279 original features, serving as the baseline performance.

Scenario 2: 50 Best features

Classification was performed using the 50 features identified through the three-phase CFS procedure.

Scenario 3: 27 Elite features

Classification was performed using the refined set of 27 elite features.

Following feature selection, several machine learning algorithms, including decision tree, random forest, KNN, SVM, random tree, AdaBoost, bagging, and stacking, were applied to compare and evaluate the performance of each scenario.

Results

Our primary objective was to identify the attribute evaluator that yielded the highest classification accuracy across the tested models. The outcomes of the experiment are summarized in Table 2, which provides a comparative assessment of the performance of the evaluated methods.

Table 3: Top features selected through CFS methodology

Selection Step	Number of Folds (%)	Attribute	Attribute Number
Top 20 features selected in the first step	10 (100)	QRS duration	5
	10 (100)	Tinterval	8
	10 (100)	T	11
	10 (100)	Heartrate	15
	10 (100)	chDIII_Qwave	40
	10 (100)	chAVF_Qwave	76
	10 (100)	chV1_intrinsicReflections	93
	10 (100)	chV2_RPwave	103
	10 (100)	chV3_Qwave	112
	10 (100)	chV3_Swave	114
	10 (100)	chAVR_JJwaveAmp	190
	10 (100)	chV1_RPwaveAmp	224
	10 (100)	chV1_QRSA	228
	10 (100)	chV3_TwaveAmp	247
	9 (90)	Q-Tinterval	7
	9 (90)	chV6_QRSTA	279
	9 (90)	chV6_TwaveAmp	277
	8 (80)	chV2_Qwave	100
	8 (80)	chV3_DD_RPwaveExists	121
	8 (80)	chV5_TwaveAmp	267
Top 20 features selected in the second step	10 (100)	chV1_RPwave	91
	10 (100)	chV2_Swave	102
	10 (100)	chDI_TwaveAmp	167
	10 (100)	chDII_TwaveAmp	177
	10 (100)	chV6_JJwaveAmp	270
	10 (100)	chV4_TwaveAmp	257
	10 (100)	chAVR_TwaveAmp	197
	10 (100)	chAVR_QRSTA	199
	9 (90)	Sex	2
	9 (90)	chV4_Swave	126
	9 (90)	chV5_QRSTA	269
	9 (90)	chAVF_QwaveAmp	211
	9 (90)	chV2_RPwaveAmp	234
	9 (90)	chV3_JJwaveAmp	240
	9 (90)	chV3_QwaveAmp	241
	8 (80)	chV3_SwaveAmp	243
	8 (80)	chV3_QRSA	248
	8 (80)	chV3_QRSTA	249
	7 (70)	chDII_QwaveAmp	171
	6 (60)	chDIII_QwaveAmp	181
Top 10 features selected in the third step	10 (100)	chV3_Rwave	113
	9 (90)	chDII_intrinsicReflections	33
	9 (90)	chAVR_DD_RPwaveExists	61
	9 (90)	chAVL_TwaveAmp	207
	9 (90)	chV2_QRSA	238
	9 (90)	chV4_QRSTA	259
	8 (80)	chDII_Qwave	28
	8 (80)	chV5_JJwaveAmp	260
	7 (70)	chV1_DD_RRwaveExists	95
	7 (70)	chV2_QwaveAmp	231

Table 4: Top 27 elite features

Number of folds (%)	Attribute	Attribute Number
10 (100)	QRSduration	5
10 (100)	Tinterval	8
10 (100)	T	11
10 (100)	Heartrate	15
10 (100)	chDIII_Qwave	40
10 (100)	chAVF_Qwave	76
10 (100)	chV1_intrinsicReflections	93
10 (100)	chV2_RPwave	103
10 (100)	chV3_Swave	114
10 (100)	chV1_QRSA	228
9 (90)	chAVR_JJwaveAmp	190
9 (90)	chV1_RPwaveAmp	224
9 (90)	Q-Tinterval	7
9 (90)	chV6_QRSTA	279
9 (90)	chV4_TwaveAmp	257
9 (90)	chV3_Qwave	112
8 (80)	chV3_TwaveAmp	247
8 (80)	chV6_TwaveAmp	277
7 (70)	chV2_Swave	102
6 (60)	chV2_Qwave	100
6 (60)	chV3_QRSA	248
5 (50)	chV3_DD_RPwaveExists	121
5 (50)	chV5_TwaveAmp	267
4 (40)	chDII_QwaveAmp	171
3 (30)	chAVR_QRSTA	199
3 (30)	chAVF_QwaveAmp	211
3 (30)	chV6_JJwaveAmp	270

Subsequently, several ML algorithms were applied to each scenario, and their accuracy results were calculated (Table 5).

Table 5: Classification accuracy of various algorithms for 3 scenarios

Classifier	Without feature selection	Best features	Elite features
Decision tree	69.49	69.73	71.93
Random forest	69.46	73.67	75.66
KNN	60.41	64.61	65.51
Naïve Bayes	62.38	69.91	70.13
Bayes net	69.91	72.78	76.47
Decision stump	59.74	59.74	59.74
Rule induction	59.51	62.60	62.42
Naïve Bayes (kernel)	75.71	75.66	74.67
SVM (Lib)	71.24	70.35	66.17
Hoeffding tree	54.20	54.20	56.78
J48 tree	64.38	67.69	70.58
Random tree	51.14	59.07	66.91
K*	55.97	61.28	57.78
Decision table	65.70	66.81	66.17
AdaBoost (decision stump)	55.53	55.53	55.53
AdaBoost (random forest)	67.92	74.11	77.21
AdaBoost (random tree)	54.42	64.15	59.95
AdaBoost (SVM)	67.37	69.91	69.24
Bagging (REP tree)	72.12	73.00	74.33
Bagging (random tree)	64.60	72.34	73.89
Bagging (random forest)	64.60	72.34	75.88
Bagging (Naïve Bayes)	64.60	69.69	71.01
Bagging (SVM)	71.90	70.57	68.36
LogitBoost (decision stump)	74.55	75.66	80.97
LogitBoost (random tree)	68.80	73.00	73.37
Stacking (ZeroR)	54.20	54.20	54.20
Stacking (random forest)	54.20	54.20	54.20
Stacking (random tree)	54.20	54.20	54.20
Stacking (decision stump)	54.20	54.20	54.20
Stacking (SVM)	54.20	54.20	54.20

Abbreviations: KNN, k-Nearest Neighbors; SVM, Support Vector Machine; Lib, Library; REP, Reduced Error Pruning; ZeroR, Zero Rule.

Across the five classifiers and three feature-selection scenarios, consistent performance gains were observed as the feature set was reduced from the full 279 variables to the Elite 27 features. Feature selection substantially improved classifier performance, as reflected in the comparative results across the three stages. Models trained without feature selection showed only moderate accuracy due to noise and redundancy in the full dataset, while restricting training to the top 50 features yielded noticeable improvements by removing irrelevant attributes. The refined set of 27 elite features further enhanced accuracy, particularly for random forest, Bayes net, and LogitBoost.

According to Table 5, the highest accuracy in the scenario without feature selection was achieved by kernel Naïve Bayes at 75.71%, whereas in the best features scenario the strongest performance was shared by kernel Naïve Bayes and LogitBoost (decision stump) at 75.66%. In the elite features scenario, LogitBoost with decision stump achieved the highest overall performance at 80.97%, alongside a precision of 88.52%, a recall of 81.29%, and an AUC of 0.8084; the confusion matrices of the best classifiers are shown in Fig 1. As shown in Table 6, all five models showed parallel improvements in sensitivity and specificity under the elite 27 configuration, and ensemble methods such as AdaBoost and bagging also benefited substantially from feature reduction. These findings demonstrate that the proposed multi-stage CFS framework yields a compact and discriminative subset of ECG attributes that consistently strengthens classification performance across modeling approaches.

Fig 1: The confusion matrices of the best classifiers

Table 6: Performance of five best models across three feature-selection scenarios

Model	Feature Set	Accuracy (%)	Precision (%)	Recall/Sensitivity (%)	Specificity (%)	F1-Score (%)	ROC-AUC	TN	FP	FN	TP
Random forest	Without FS	69.47	80.71	69.73	68.99	74.82	0.6936	109	49	89	205
Random forest	Best Features	73.67	83.78	73.81	73.42	78.48	0.7361	116	42	77	217
Random forest	Elite Features	75.66	85.11	75.85	75.32	80.22	0.7558	119	39	71	223
Bayes net	Without FS	69.91	81.10	70.07	69.62	75.18	0.6984	110	48	88	206
Bayes net	Best Features	72.79	83.27	72.79	72.78	77.68	0.7279	115	43	80	214
Bayes net	Elite Features	76.55	85.61	76.87	75.95	81.00	0.7641	120	38	68	226
AdaBoost (random forest)	Without FS	67.92	79.68	68.03	67.72	73.39	0.6787	107	51	94	200
AdaBoost (random forest)	Best Features	74.12	84.17	74.15	74.05	78.84	0.7410	117	41	76	218
AdaBoost (random forest)	Elite Features	77.21	86.04	77.55	76.58	81.57	0.7707	121	37	66	228
Bagging (random forest)	Without FS	64.60	77.24	64.63	64.56	70.37	0.6459	102	56	104	190
Bagging (random forest)	Best Features	72.35	82.88	72.45	72.15	77.31	0.7230	114	44	81	213
Bagging (random forest)	Elite Features	75.88	85.17	76.19	75.32	80.43	0.7575	119	39	70	224
LogitBoost (decision stump)	Without FS	74.56	84.29	74.83	74.05	79.28	0.7444	117	41	74	220
LogitBoost (decision stump)	Best Features	75.66	85.11	75.85	75.32	80.22	0.7558	119	39	71	223
LogitBoost (decision stump)	Elite Features	80.97	88.52	81.29	80.38	84.75	0.8084	127	31	55	239
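
The threshold-based metrics in Table 6 follow directly from the confusion-matrix counts. The Python sketch below uses the standard definitions and the counts reported for LogitBoost (decision stump) with elite features; the function name confusion_metrics is hypothetical, and the sketch assumes all denominators are non-zero.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall, specificity, and F1 as percentages."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp),
        "recall": 100 * tp / (tp + fn),          # sensitivity
        "specificity": 100 * tn / (tn + fp),
        "f1": 100 * 2 * tp / (2 * tp + fp + fn),
    }


# Counts from the LogitBoost (decision stump), Elite Features row of Table 6.
logitboost_elite = confusion_metrics(tn=127, fp=31, fn=55, tp=239)
rounded = {name: round(value, 2) for name, value in logitboost_elite.items()}
```

Rounded to two decimals, these reproduce the values in that row: 80.97, 88.52, 81.29, 80.38, and 84.75.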

Given the class imbalance in our dataset, accuracy alone may be misleading. Therefore, we emphasize complementary metrics such as F1-score, ROC-AUC, and PR-AUC, which provide a more robust assessment of model performance in imbalanced classification scenarios.

Fig 2 presents the distribution of the most influential feature categories for the LogitBoost (decision stump) classifier. The visualization identifies global cardiac parameters and V-lead complex measures as the predominant categories, highlighting which ECG-derived characteristics were most critical to the model's decision-making process.

Fig 3 presents a comparative heatmap of feature importance scores across multiple machine learning models. It allows for a direct visual assessment of which features are consistently ranked as important by random forest, Bayes net, AdaBoost (random forest), bagging (random forest), and LogitBoost (decision stump) algorithms, and which are unique to specific model architectures.

Fig 2: The LogitBoost model (decision stump) feature category distribution

Fig 3: The feature importance heatmap across models

Fig 4 details the ranked importance of elite features across the four top-performing classifiers. The analysis reveals that core cardiac interval measures (heart rate, QRS duration, T interval, and Q-T interval) emerged as the most influential and consistent predictors across all algorithms. This cross-model consensus highlights these features as particularly robust, stable drivers of high predictive accuracy irrespective of the underlying model architecture.

Fig 4: The feature importance analysis for elite features across four models

Discussion

Feature selection is a critical step in machine learning for reducing data dimensionality while retaining the most informative attributes. In this study, a multi-step feature selection approach was employed to progressively refine the dataset, ultimately identifying a compact set of elite features that capture both global and localized cardiac phenomena. These features provide a comprehensive representation of the cardiac electrical activity, ensuring robust arrhythmia classification.

In the “Best Mode,” the feature selection process identified the most relevant attributes, including QRS duration, T interval, heart rate, and specific wave amplitudes such as chV3_TwaveAmp. Many of these features were derived from multiple ECG leads (e.g., chDIII_Qwave, chAVF_Qwave, chV3_Swave), enabling the model to capture diverse aspects of cardiac activity. The frequency of feature selection across folds highlighted their stability and predictive relevance. In the “Elite Mode,” the dataset was further refined to 27 critical features, emphasizing their consistent contribution to accurate arrhythmia classification [15, 19].

Doran et al. explored the risk factors for CVD in their research. Key factors identified in their study included elevated blood sugar levels, high cholesterol, a BMI of 25 or higher, hypertension, smoking or exposure to secondhand smoke, insufficient physical activity (PA), excessive use of medications and alcohol, and poor adherence to healthy dietary guidelines [20].

Previous studies have demonstrated the effectiveness of machine learning in predicting cardiovascular outcomes across diverse populations and datasets. Motwani et al. used LogitBoost on approximately 10,000 coronary CT angiography cases, achieving an AUC of 0.79 and outperforming the Framingham risk score [21]. You et al. analyzed a prospective cohort of over 10,000 participants using demographic, lifestyle, and medication data, achieving 87% accuracy with an ensemble ML approach for 10-year risk prediction [22]. Kang et al. applied LASSO regression on 280 myocardial infarction patients, selecting 5 key variables from 46 clinical predictors, achieving a c-statistic of 0.809 [23]. Ma et al. predicted coronary CVD risk in 1,500 diabetic patients using XGBoost, random forest, and logistic regression, with an AUC of 0.701 [24]. To explore patterns associated with the type and prognosis of amyloidosis, Allegra et al. categorized data from 1,394 patients into four groups: AL, ATTRv, ATTRwt, and cases without a definitive amyloidosis diagnosis. These classifications were carried out 24 times across various categories, including demographic factors (such as gender, race, and age), biometric data (like body mass index), cardiovascular risk factors, and more. Subsequently, the data were grouped into seven clusters using unsupervised clustering techniques. The study then compared patient prognosis, survival rates, and other outcomes across these clusters, particularly for those with confirmed amyloid subtypes [25].

Although these studies differ substantially in population size, data modality, and prediction task, they collectively highlight the importance of robust feature selection and ensemble learning in cardiovascular risk modeling.

Compared to these studies, the present work analyzed 452 patients from the UCI Arrhythmia dataset with 279 ECG-based features across 12 leads. Over 30 machine learning models, including boosting and bagging algorithms, were evaluated. By implementing a three-phase feature selection pipeline and focusing on elite feature consistency, this study achieved an accuracy of 80.97%, demonstrating the importance of rigorous feature selection in improving arrhythmia classification. Selected features such as QRS duration, T interval, heart rate, and lead-specific wave amplitudes consistently contributed to model performance, highlighting both physiological relevance and predictive power.

QRS duration and T interval are well-known ECG parameters critical for arrhythmia diagnosis and monitoring, while features such as chAVR_JJwaveAmp, Q-T interval, and chV6_QRSTA showed strong, though slightly less universal, relevance. Other attributes, including chAVF_QwaveAmp, chV4_TwaveAmp, and chV6_TwaveAmp, emphasize the importance of amplitude and waveform characteristics in capturing subtle arrhythmogenic patterns. Overall, the selected features reflect both global and local cardiac phenomena, ensuring comprehensive modeling of the heart’s electrical activity.

These findings highlight that careful feature selection not only improves model accuracy but also enhances interpretability by focusing on physiologically meaningful ECG attributes. The study underscores the potential of multi-stage feature selection frameworks to produce compact, discriminative feature sets that consistently improve machine learning performance across diverse modeling approaches.

Limitations and future directions

One of the primary limitations of this study is the relatively small sample size (452 patients) and the limited number of data points (279 features), which may reduce the generalizability of the findings to the broader population of individuals with cardiac arrhythmias. To enhance future research, it is recommended to utilize larger and more diverse datasets to increase the applicability and robustness of the models. Our binary classification approach, while sacrificing arrhythmia-specific differentiation, offers practical clinical utility for initial screening. In resource-constrained environments or for preliminary automated analysis, identifying any abnormality with high sensitivity (81.29% with elite features) can prioritize cases for expert review. The selected features are broadly indicative of electrical disturbances across multiple arrhythmia types, making them suitable for this first-line detection task. Future work should extend our feature selection framework to hierarchical or multi-class settings to address specific arrhythmia identification while building upon the robust feature subset identified here.

Another limitation lies in the narrow scope of features, with a primary focus on ECG data. This constraint may affect the model’s predictive accuracy. Future studies should expand the feature space by incorporating broader clinical parameters such as patient history, medication use, and comorbidities, as well as contextual factors like social determinants of health, environmental exposures, and lifestyle habits (diet, exercise, smoking). These elements can have a meaningful influence on cardiovascular health and may contribute to more nuanced and accurate predictive models.

Although multiple metrics were reported, model selection was primarily guided by accuracy. Future research should explore a variety of algorithms, evaluation techniques, and performance metrics to obtain a more nuanced understanding of model effectiveness. Moreover, for better validation and generalizability, it is advisable to test the models using independent datasets from external data centers that were not involved in the initial model development. This approach can lead to the creation of more reliable and widely applicable predictive models.
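To make the point about relying on accuracy concrete, the following minimal sketch derives several complementary metrics from a binary confusion matrix. The counts are hypothetical and chosen only to show how accuracy can look acceptable while sensitivity is considerably lower; they are not results from this study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BinaryCounts:
    """Confusion-matrix counts for a binary (normal vs. abnormal) classifier."""
    tp: int  # abnormal cases correctly flagged
    fp: int  # normal cases incorrectly flagged
    tn: int  # normal cases correctly passed
    fn: int  # abnormal cases missed


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(c: BinaryCounts) -> Dict[str, float]:
    """Compute standard metrics from confusion-matrix counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": _ratio(c.tp, c.tp + c.fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Hypothetical counts for an imbalanced test set (50 abnormal, 150 normal).
metrics = screening_metrics(BinaryCounts(tp=30, fp=10, tn=140, fn=20))
```

With these illustrative counts the classifier is correct on 170 of 200 cases, yet it misses 20 of the 50 abnormal cases, which is why sensitivity and balanced accuracy are worth weighing alongside accuracy in a screening setting.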

Conclusion

This study underscores the vital role of feature selection in ML-based classification of arrhythmias and demonstrates a novel stepwise feature selection approach that improves classifier performance while enhancing feature interpretability. The use of cross-fold feature frequency and the clear delineation of "best" and "elite" feature sets provide clinicians and data scientists with deeper insight into the most consistent and diagnostically valuable ECG parameters. Our findings suggest that such a refined and reproducible process can lead to more robust clinical decision support tools and can serve as a methodological reference for future work in cardiac arrhythmia classification. By carefully refining the feature set, the study achieved notable improvements in accuracy, computational efficiency, and interpretability. Future efforts should validate these methods on larger and more heterogeneous datasets, incorporate domain-specific knowledge to improve generalizability, and develop practical, clinician-friendly diagnostic tools. Effective feature selection not only boosts the technical performance of ML models but also strengthens the connection between data-driven approaches and real-world medical practice.

Acknowledgement

The authors gratefully acknowledge the valuable support and collaboration of the experts who contributed to this research.

Authors’ contributions

SAFA: Conceptualization, methodology, data analysis, supervision, writing original draft, review & editing; AA: Conceptualization, methodology, project administration, writing original draft, review & editing.

All authors contributed to the literature review, design, data collection, and drafting of the manuscript, and read and approved the final manuscript.

Conflicts of interest

The authors declare no conflicts of interest regarding the publication of this study.

Ethical Approval

This study utilized de-identified, publicly available data from the UCI Machine Learning Repository. All datasets within this repository have been anonymized to protect participant privacy by removing any personal identifiers prior to public release. As the research involved the analysis of this pre-existing, anonymized data, it was exempt from requiring separate ethical approval from an institutional review board. The study was conducted in accordance with recognized standards for the secondary analysis of public data.

Financial disclosure

No financial interests related to the material of this manuscript have been declared.

References

1. Ha ACT, Doumouras BS, Wang CN, Tranmer J, Lee DS. Prediction of sudden cardiac arrest in the general population: Review of traditional and emerging risk factors. Can J Cardiol. 2022; 38(4): 465-78. PMID: 35041932 DOI: 10.1016/j.cjca.2022.01.007 [PubMed]

2. Ozcan M, Peker S. A classification and regression tree algorithm for heart disease modeling and prediction. Healthcare Analytics. 2023; 3: 100130.

3. Zhong L, Xie B, Wang H-L, Ji X-W. Causal association between remnant cholesterol level and risk of cardiovascular diseases: A bidirectional two sample Mendelian randomization study. Sci Rep. 2024; 14(1): 27038. PMID: 39511362 DOI: 10.1038/s41598-024-78610-0 [PubMed]

4. Lallah PN, Laite C, Bangash AB, Chooah O, Jiang C. The use of artificial intelligence for detecting and predicting atrial arrhythmias post catheter ablation. Rev Cardiovasc Med. 2023; 24(8): 215. PMID: 39076714 DOI: 10.31083/j.rcm2408215 [PubMed]

5. Truong ET, Lyu Y, Ihdayhid AR, Lan NS, Dwivedi G. Beyond clinical factors: Harnessing artificial intelligence and multimodal cardiac imaging to predict atrial fibrillation recurrence post-catheter ablation. J Cardiovasc Dev Dis. 2024; 11(9): 291. PMID: 39330349 DOI: 10.3390/jcdd11090291 [PubMed]

6. Roth GA, Mensah GA, Johnson CO, Addolorato G, Ammirati E, Baddour LM, et al. Global burden of cardiovascular diseases and risk factors, 1990–2019: Update from the GBD 2019 study. J Am Coll Cardiol. 2020; 76(25): 2982-3021. PMID: 33309175 DOI: 10.1016/j.jacc.2020.11.010 [PubMed]

7. Mehta LS, Warnes CA, Bradley E, Burton T, Economy K, Mehran R, et al. Cardiovascular considerations in caring for pregnant patients: A scientific statement from the American Heart Association. Circulation. 2020; 141(23): e884-903. PMID: 32362133 DOI: 10.1161/CIR.0000000000000772 [PubMed]

8. Mela A, Rdzanek E, Poniatowski LA, Jaroszynski J, Furtak-Niczyporuk M, Galazka-Sobotka M, et al. Economic costs of cardiovascular diseases in Poland estimates for 2015–2017 years. Front Pharmacol. 2020; 11: 1231. PMID: 33013357 DOI: 10.3389/fphar.2020.01231 [PubMed]

9. Li X, Cai W, Xu B, Jiang Y, Qi M, Wang M. SEResUTer: A deep learning approach for accurate ECG signal delineation and atrial fibrillation detection. Physiol Meas. 2023; 44(12): 125005. PMID: 37827168 DOI: 10.1088/1361-6579/ad02da [PubMed]

10. Abasi A, Nazari A, Moezy A, Fatemi Aghda SA. Machine learning models for reinjury risk prediction using cardiopulmonary exercise testing (CPET) data: Optimizing athlete recovery. BioData Mining. 2025; 18(1): 16. PMID: 39962522 DOI: 10.1186/s13040-025-00431-2 [PubMed]

11. Quartieri F, Marina-Breysse M, Toribio-Fernandez R, Lizcano C, Pollastrelli A, Paini I, et al. Artificial intelligence cloud platform improves arrhythmia detection from insertable cardiac monitors to 25 cardiac rhythm patterns through multi-label classification. J Electrocardiol. 2023; 81: 4-12. PMID: 37473496 DOI: 10.1016/j.jelectrocard.2023.07.001 [PubMed]

12. Zhang Y, Liu S, He Z, Zhang Y, Wang C. A CNN model for cardiac arrhythmias classification based on individual ECG signals. Cardiovasc Eng Technol. 2022; 13(4): 548-57. PMID: 34981316 DOI: 10.1007/s13239-021-00599-8 [PubMed]

13. Garcha I, Phillips SP. Social bias in artificial intelligence algorithms designed to improve cardiovascular risk assessment relative to the Framingham Risk Score: A protocol for a systematic review. BMJ Open. 2023; 13(5): e067638. PMID: 37258078 DOI: 10.1136/bmjopen-2022-067638 [PubMed]

14. Shi J, Li Z, Liu W, Zhang H, Guo Q, Chang S, et al. Optimized solutions of electrocardiogram lead and segment selection for cardiovascular disease diagnostics. Bioengineering (Basel). 2023; 10(5): 607. PMID: 37237677 DOI: 10.3390/bioengineering10050607 [PubMed]

15. Jekova I, Christov I, Krasteva V. Atrioventricular synchronization for detection of atrial fibrillation and flutter in one to twelve ECG leads using a dense neural network classifier. Sensors (Basel). 2022; 22(16): 6071. PMID: 36015834 DOI: 10.3390/s22166071 [PubMed]

16. Guvenir H, Acer B, Muderrisoglu H, Quinlan R. UCI machine learning repository: Arrhythmia [dataset]. 1997 [cited: 15 Sep 2025]. Available from: https://archive.ics.uci.edu/dataset/5/arrhythmia

17. Irfan S, Anjum N, Althobaiti T, Alotaibi AA, Siddiqui AB, Ramzan N. Heartbeat classification and arrhythmia detection using a multi-model deep-learning technique. Sensors (Basel). 2022; 22(15): 5606. PMID: 35957162 DOI: 10.3390/s22155606 [PubMed]

18. Xiao Q, Lee K, Mokhtar SA, Ismail I, Pauzi AL, Zhang Q, et al. Deep learning-based ECG arrhythmia classification: A systematic review. Applied Sciences (Basel). 2023; 13(8): 4964.

19. Zhang H, Wang X, Liu C, Liu Y, Li P, Yao L, et al. Detection of coronary artery disease using multi-modal feature fusion and hybrid feature selection. Physiol Meas. 2020; 41(11): 115007. PMID: 33080588 DOI: 10.1088/1361-6579/abc323 [PubMed]

20. Doran K, Resnick B. Cardiovascular risk factors of long-term care workers. Workplace Health Saf. 2017; 65(10): 467-77. PMID: 28422575 DOI: 10.1177/2165079917693018 [PubMed]

21. Motwani M, Dey D, Berman DS, Germano G, Achenbach S, Al-Mallah MH, et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: A 5-year multicentre prospective registry analysis. Eur Heart J. 2017; 38(7): 500-7. PMID: 27252451 DOI: 10.1093/eurheartj/ehw188 [PubMed]

22. You J, Guo Y, Kang J-J, Wang H-F, Yang M, Feng J-F, et al. Development of machine learning-based models to predict 10-year risk of cardiovascular disease: A prospective cohort study. Stroke Vasc Neurol. 2023; 8(6): 475-85. PMID: 37105576 DOI: 10.1136/svn-2023-002332 [PubMed]

23. Kang MG, Koo B-K, Tantry US, Kim K, Ahn J-H, Park HW, et al. Association between thrombogenicity indices and coronary microvascular dysfunction in patients with acute myocardial infarction. JACC Basic Transl Sci. 2021; 6(9): 749-61. PMID: 34754989 DOI: 10.1016/j.jacbts.2021.08.007 [PubMed]

24. Ma C-Y, Luo Y-M, Zhang T-Y, Hao Y-D, Xie X-Q, Liu X-W, et al. Predicting coronary heart disease in Chinese diabetics using machine learning. Comput Biol Med. 2024; 169: 107952. PMID: 38194779 DOI: 10.1016/j.compbiomed.2024.107952 [PubMed]

25. Allegra A, Mirabile G, Tonacci A, Genovese S, Pioggia G, Gangemi S. Machine learning approaches in diagnosis, prognosis and treatment selection of cardiac amyloidosis. Int J Mol Sci. 2023; 24(6): 5680. PMID: 36982754 DOI: 10.3390/ijms24065680 [PubMed]

Attribute Evaluator	Search Method	Selection Mode	Classifier: LogitBoost	Classifier: Random Forest	Classifier: Bayes Net
Gain Ratio	Ranker	Cross Validation - 10 fold	70.57	69.02	70.35
Chi Squared	Ranker	Cross Validation - 10 fold	74.55	71.46	71.23
Consistency Subset	BFS	Cross Validation - 10 fold	72.78	70.35	72.56
Filtered Method	Ranker	Cross Validation - 10 fold	72.34	70.35	71.68
Info Gain	Ranker	Cross Validation - 10 fold	72.34	70.35	71.68
OneR	Ranker	Cross Validation - 10 fold	73.23	69.02	70.35
ReliefF	Ranker	Cross Validation - 10 fold	70.79	71.90	68.58
CFS Subset	BFS	Cross Validation - 10 fold	75.44	71.68	75.66