SEU Chronic Illness Prediction Model Discussion

Discuss and describe 7 literature resources relating to the project being carried out; they must be from the past 3 years. (I already have 5 resources and I want 2 more.)


RESEARCH ARTICLE | SEPTEMBER 20 2023
Predictive and comparative analysis for diabetes using
machine learning algorithms 
Poneeswari Jeyamurugan; Saranya Durairaj; Premchand Somasundaram; Chidambaram Subbiah
AIP Conf. Proc. 2831, 020023 (2023)
https://doi.org/10.1063/5.0166388
 
Predictive and Comparative Analysis for Diabetes Using Machine Learning Algorithms
Poneeswari Jeyamurugan (a), Saranya Durairaj (b), Premchand Somasundaram (c), Chidambaram Subbiah (d)
Department of IT, National Engineering College, Kovilpatti, Tamil Nadu, India
a) Corresponding author: poneeswarij2001@gmail.com
b) saranyadurairaj2001123@gmail.com
c) premsk1905@gmail.com
d) chidambaram@nec.edu.in
Index Terms: Machine learning, classification, prediction, support vector machine
INTRODUCTION
As the International Diabetes Federation reported [1], there were 425 million diabetics in the world at that time, and it was further projected that the number will increase to 625 million by 2045 [2]. Diabetes mellitus is a group of endocrine diseases associated with impaired glucose uptake that develops because of the absolute or relative deficiency of the hormone insulin. The disease is characterized by a chronic course and by a disturbance of all types of metabolism. Generally, diabetes is classified into four categories [3]: type 1 diabetes, type 2 diabetes, gestational diabetes mellitus, and specific types of diabetes due to other causes. The two most common forms of the disease are type 1 diabetes (T1D) and type 2 diabetes (T2D). The former is caused by the destruction of the pancreatic beta cells, resulting in insulin deficiency, while the latter is due to the inadequate transport of insulin into cells. Both types can lead to dangerous complications, such as strokes, heart attacks, chronic renal failure, diabetic foot syndrome, neuropathy, encephalopathy, hyperthyroidism, adrenal gland tumors, cirrhosis of the liver, glucagonoma, transient hyperglycemia, and many others. Consequently, the prediction [4] and early detection [5] of diabetes are essential for everyone who is prone to the disease.
At present, several diseases can be diagnosed using artificial intelligence (AI) techniques, and deep neural networks [6] have achieved the best performance on classification problems; in recent years, DNNs have been used for diagnosing various diseases. Without diabetes, the pancreas works well and produces sufficient insulin. When insulin binds to receptors on the surface of a cell, the channel that admits glucose molecules into the cell opens. With T1D, the pancreas progressively stops producing insulin, which disrupts the delivery of glucose to cells. T2D is not caused by the pancreas being unable to produce insulin: there is enough insulin and glucose reaching the cells, but the insulin receptors that allow insulin to act on cells have lost their ability to respond to it. This paper proposes a hybrid neural network with logistic regression, compared against decision tree, naive Bayes, and SVM models representing existing approaches. We carried out a comparative study of these algorithms and achieved higher accuracy for the hybrid algorithm than for the conventional ones.
International Conference on Smart Technologies and Applications (ICSTA 2022)
AIP Conf. Proc. 2831, 020023-1–020023-11; https://doi.org/10.1063/5.0166388
Published by AIP Publishing. 978-0-7354-4651-9/$30.00
Abstract. Diabetes is one of the deadliest diseases in the world. It causes an increase in blood glucose levels due to a lack of insulin in the body, raising the risk of consequences such as stroke and heart disease. Not all forms of diabetes stem from the person being overweight or leading an inactive lifestyle; some are present from childhood. Diabetes cannot be cured, but early prediction and timely treatment can stop the progression and limit the severity of the disorder. In this paper, a machine learning-based approach is proposed for classification, early-stage identification, and prediction, and it can be applied with much success to predicting, preventing, and managing diabetes mellitus. To this end, we apply data analytics with four classifiers, namely Naïve Bayes, SVM, logistic regression, and decision tree, to predict over a large dataset and identify the best result. The combination of these models with an ANN can then be used for feature selection and processing as an optimized predictive model with the best accuracy. During the analysis, it is observed that the hybrid model outperforms the other classifiers in accuracy, and the ANN significantly improves the accuracy of diabetes prediction.
RELATED WORKS
Sushant Ramesh and H. Balaji (2018) proposed to predict patient risk factors and the severity of diabetes using a proprietary dataset. The model uses deep learning in the form of a deep neural network, which helps to apply predictive analytics to the diabetes dataset to obtain optimal results. The existing predictive models are used to predict the severity and risk factors of diabetes based on the processed data. In their case, a feature selection algorithm is first run for the selection step. Second, the deep learning model has a deep neural network that uses a Restricted Boltzmann Machine (RBM) as its basic unit to analyze the data by assigning weights to each part of the network. This deep neural network, coded in Python, helps obtain numeric results on the severity and risk factors of the diabetic patients in the dataset.
EXISTING CLASSIFIERS
SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. In the SVM algorithm, we plot every data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the two classes.
Figure 1. SVM classifier
Support vectors are simply the coordinates of individual observations; the SVM classifier is the frontier (hyperplane/line) that best segregates the two classes.
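The paper does not include code; as an illustration only, a linear SVM can be fitted with scikit-learn (synthetic data stands in here for the Pima features used in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features: 8 inputs, binary outcome
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The learned hyperplane is the frontier that best separates the two classes;
# the training points lying on the margin are the support vectors
clf = SVC(kernel="linear").fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 2))
```

The `kernel="linear"` choice matches the hyperplane picture in Figure 1; non-linear kernels (e.g. RBF) would bend the frontier.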
Raushan Myrzashova and Rui Zheng (2020) proposed not only to predict the future occurrence of diabetes but also to determine the type of disease a person experiences. Given that type 1 and type 2 diabetes differ considerably in their treatment methods, this approach helps provide the correct treatment for the patient. By turning the task into a classification problem, their model is built mainly from the hidden layers of a deep neural network and uses dropout regularization to prevent overfitting. They tuned many parameters and used the binary cross-entropy loss function, obtaining a deep neural network prediction model with high accuracy. Dumpala Shanthi (2018) suggests that patients can learn of their diabetes risk without the assistance of specialists: the patient simply logs in to the site, supplies their attributes (i.e., data collected from labs) as input, and obtains the result without a doctor. Somnath Rakshit and Suvojit Manna (2017) present a two-class neural network to predict the onset of diabetes among women aged at least 21 years and compare the outcomes.
DECISION TREE
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike some other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems. The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior (training) data.
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute. Based on the comparison, we follow the branch corresponding to that value and jump to the next node.
Decision trees classify examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal node giving the classification of the example. Every node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to one of the possible answers to the test case. This process is recursive and is repeated for each subtree rooted at the new node.
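The root-to-leaf sorting described above can be made concrete with a small scikit-learn sketch (synthetic data, not the paper's dataset); `export_text` prints the learned decision rules as the attribute tests at each node:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Each internal node tests one attribute; each branch is one possible answer,
# and each leaf assigns the class label
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(export_text(tree))  # the learned decision rules, root to leaves
```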
NAIVE BAYES
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For instance, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on one another or on the presence of other features, each of these properties independently contributes to the probability that the fruit is an apple, which is why the method is called "naive".
The Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
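The apple example above can be sketched directly with scikit-learn's Gaussian Naive Bayes (the feature values below are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy fruit data: [redness, roundness, diameter_inches]; label 1 = apple
X = np.array([[0.9, 0.8, 3.0], [0.8, 0.9, 3.2], [0.2, 0.1, 1.0], [0.1, 0.2, 0.8]])
y = np.array([1, 1, 0, 0])

# Each feature contributes independently to the class probability
nb = GaussianNB().fit(X, y)
print(nb.predict([[0.85, 0.9, 3.1]]))  # → [1], classified as an apple
```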
PROPOSED APPROACH
First, the diabetes data are collected from the input dataset using the pandas library. We then pre-process the data by dropping null values and perform feature selection by choosing the input features to feed into the SVM, Naïve Bayes, decision tree, and the hybrid algorithm of a neural network with logistic regression. We design a hybrid ensemble machine learning method by combining a neural network and logistic regression through a voting classifier. The extracted features are fed into the machine learning models and the models are trained. After training, we predict diabetes by feeding test data into the best model.
Figure 2. Decision tree classifier
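The authors do not publish their implementation. Assuming the hybrid model is a soft-voting ensemble of a neural network and logistic regression, as the text describes, the pipeline might look like this sketch (synthetic data stands in for the pre-processed Kaggle/PIMA CSV):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the pre-processed PIMA features (8 inputs, binary outcome)
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hybrid ensemble: a neural network and logistic regression joined by a voting classifier
hybrid = VotingClassifier(
    estimators=[
        ("ann", make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression())),
    ],
    voting="soft",  # average the predicted class probabilities
)
hybrid.fit(X_train, y_train)
print(round(hybrid.score(X_test, y_test), 2))
```

The network sizes, scaler, and `voting="soft"` are assumptions; the paper only states that a neural network and logistic regression are combined through a voting classifier.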
DATA COLLECTION
In this work, the diabetes disease dataset is collected from the Kaggle data science website (the PIMA Indian dataset) together with demographic data. The data comprise more than 6000 patient records and include features such as number of pregnancies, glucose level, blood pressure, skin thickness, body mass index, diabetes pedigree function, age, and outcome.
Figure 3. Proposed system architecture
DATA PRE-PROCESSING
MACHINE LEARNING
Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving in accuracy. Machine learning is an important component of the growing field of data science. Using statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision-making within applications and businesses, ideally impacting key growth metrics. As big data continues to expand and grow, the market demand for data scientists will increase, requiring them to assist in identifying the most relevant business questions and, subsequently, the data to answer them.
Deep learning methods such as neural networks generate or extract features from unstructured data, while classical machine learning approaches build highly accurate classification models on top of those features. Thus, using Deep Hybrid Learning (DHL), we can take the benefits of both DL and ML, alleviate the drawbacks of both techniques, and provide more accurate and less computationally expensive solutions.
LOGISTIC REGRESSION
Logistic Regression (LR) is a strategic relapse of the measurable examination strategy used to foresee
information esteem dependent on earlier perceptions of an informational index. Calculated relapse has turned into
a significant instrument in the discipline of machine learning. The methodology permits a calculation being
utilized in a machine learning application to group approaching information dependent on verifiable information.
As more applicable information comes in, the calculation ought to improve at foreseeing groupings inside
informational indexes. Calculated relapse can likewise assume a part in information readiness exercises by
permitting informational collections to be placed into explicitly predefined cans during the concentrate, change,
and burden (ETL) cycle to arrange the data for investigation.
First, the dataset is fetched using the pandas library, and we save the data in a pandas DataFrame. Initially this dataset contains many null values; we drop all of them because our machine learning model cannot process null values.
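The null-dropping step described above is a one-liner in pandas; a minimal sketch (toy rows standing in for the diabetes CSV, which the paper reads with pandas):

```python
import numpy as np
import pandas as pd

# Small stand-in for the diabetes DataFrame with some null values
df = pd.DataFrame({
    "Glucose": [148, np.nan, 183, 89],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Outcome": [1, 0, 1, 0],
})
clean = df.dropna()  # drop every row containing a null value
print(len(df), "->", len(clean))  # 4 -> 2
```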
A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables. For example, logistic regression could be used to predict whether a political candidate will win or lose an election, or whether a high school student will be admitted to a particular college.
The resulting analytical model can take several input measures into account. In the case of college acceptance, the model could consider factors such as the student's grade point average, SAT score, and extracurricular activities. Based on historical data about earlier outcomes involving the same input measures, it then scores new cases on their probability of falling into a particular outcome category.
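The college-acceptance example above can be sketched with scikit-learn (the GPA/SAT numbers below are hypothetical, chosen only to illustrate scoring a new case):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical admissions data: [GPA, SAT/1600]; label 1 = admitted
X = np.array([[3.9, 0.94], [3.7, 0.88], [3.8, 0.91],
              [2.1, 0.55], [2.5, 0.60], [2.3, 0.52]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)
# Score a new case on its probability of falling into the "admitted" category
print(model.predict_proba([[3.6, 0.85]])[0, 1])
```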
ARTIFICIAL NEURAL NETWORK
Artificial neural networks (ANNs), usually simply called neural networks (NNs), are computing systems loosely inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal processes it and can then signal the neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges.
Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is passed on only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers.
Figure 4. Neural network formation
The first layer in our neural net is the input layer, which consists of 339 neurons corresponding to the unique lemmatized words. The hidden layer is sized at roughly two-thirds of the number of output possibilities, i.e., 62 tags (2/3 of 62 ≈ 41); hence the middle hidden layer consists of 41 neurons. The Softmax activation function is used for the output layer, which contains 62 neurons corresponding to the 62 intent classes. The main benefit of using Softmax is the range of its output probabilities, as it helps map the non-normalized output to a probability distribution over the predicted output classes, as seen from equation (2). The range is between 0 and 1, and the sum of all the probabilities equals one:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)    (2)
Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. The training set to be passed to the ANN model is designed so that it is a combination of the encoded distinct lemmatized words and tags, as shown in (1):

Training Set = {encoded(distinct lemmatized list) + encoded(tag list)}    (1)

After specifying the number of layers for the model, we configure the learning process using the compile method, in which we specify an optimizer, a loss function, and a list of metrics.
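The 339-41-62 layer sizes and the softmax output can be sketched numerically; random weights stand in for the trained parameters, and the ReLU hidden activation is an assumption (the paper names only the Softmax output):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Equation (2): exp(z_i) / sum_j exp(z_j), shifted for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

# Input layer: 339 encoded lemmatized words; hidden: 41 neurons; output: 62 tags
W1, W2 = rng.normal(size=(339, 41)), rng.normal(size=(41, 62))
x = rng.integers(0, 2, size=339)   # bag-of-words encoding of one sample

hidden = np.maximum(0, x @ W1)     # ReLU hidden layer (an assumption)
probs = softmax(hidden @ W2)

print(probs.shape, round(float(probs.sum()), 6))  # (62,) 1.0
```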
ENSEMBLE VOTING CLASSIFIER
The Ensemble Voting Classifier is a meta-classifier for combining similar or conceptually different machine learning classifiers for classification via majority or plurality voting. (For simplicity, we will refer to both majority and plurality voting as majority voting.)
Figure 5. Voting classifier
The ensemble vote classifier implements "hard" and "soft" voting. In hard voting, we predict the final class label as the class label that has been predicted most frequently by the classification models. In soft voting, we predict the class labels by averaging the class probabilities (recommended only if the classifiers are well calibrated).
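The hard/soft distinction maps directly onto scikit-learn's `voting` parameter; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
members = [("lr", LogisticRegression()),
           ("dt", DecisionTreeClassifier(random_state=2)),
           ("nb", GaussianNB())]

# Hard voting: majority of the predicted class labels
hard = VotingClassifier(members, voting="hard").fit(X, y)
# Soft voting: average the predicted class probabilities (assumes calibrated members)
soft = VotingClassifier(members, voting="soft").fit(X, y)
print(hard.predict(X[:3]), soft.predict(X[:3]))
```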
Figure 6. ANN training
SIMULATION RESULTS
The dataset comprises more than 6000 patients' clinical records, arranged in 9 columns and over 6000 rows. It was downloaded from the Kaggle website.
Figure 7. Result of the hybrid ANN+LR algorithm
Figure 8. Confusion matrix of ANN+LR
Figure 9. Result of the Decision Tree algorithm
The above diagram represents the testing result of the Decision Tree algorithm. The accuracy score of the Decision Tree is 77%.
Figure 10. Result of SVM
The above diagram represents the testing result of the SVM algorithm. The accuracy score of the SVM algorithm is 75%.
Figure 11. Result of the Naïve Bayes classifier
The above diagram represents the testing result of the Naïve Bayes algorithm. The accuracy score of the Naïve Bayes algorithm is 75%.
TABLE 1. Classification results of different machine learning algorithms

Algorithm       Cohen's kappa   Accuracy
ANN+LR          0.9956          0.9980
Decision Tree   0.2977          0.7141
SVM             0.4569          0.7701
Naïve Bayes     0.4602          0.7597
TABLE 2. Precision, recall, and F1-scores

Algorithm       Precision   Recall   F1-score
ANN+LR          1.0000      0.9943   0.9971
Decision Tree   0.6588      0.3689   0.2977
SVM             0.7375      0.5262   0.4569
Naïve Bayes     0.6680      0.6142   0.6400
A diabetes dataset was processed for this study; outliers were identified and eliminated, and a variety of classification techniques, including ANN+LR, DT, SVM, and Naive Bayes, were applied. The best-performing algorithm for forecasting the occurrence of diabetes was determined by comparing the different cross-validation performance parameters. Table 1 reports accuracy and Cohen's kappa, and Table 2 reports precision, recall, and F-measures; when these metrics are taken into account, ANN+LR achieves the best performance.
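All of the metrics in Tables 1 and 2 are available in scikit-learn; a sketch on a hypothetical fold of true labels and predictions (not the paper's data) shows how each is computed:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions from one cross-validation fold
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

print("accuracy ", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("kappa    ", cohen_kappa_score(y_true, y_pred)) # agreement beyond chance
print("precision", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall   ", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1       ", f1_score(y_true, y_pred))          # harmonic mean of P and R
```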
Figure 12. ROC curve
Fig. 12 is a graphical depiction of the area under the ROC curve (AUROC). The ROC curve is constructed using the true positive rate and false positive rate values.
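The curve's construction from (false positive rate, true positive rate) pairs can be sketched with scikit-learn; the labels and scores below are hypothetical:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.35, 0.4, 0.8, 0.2, 0.7, 0.6, 0.9])

# Sweep the decision threshold to get (FPR, TPR) pairs, then integrate
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUROC =", auc(fpr, tpr))
```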
Figure 13. Performance comparison of algorithms
The above bar graph represents the performance of each algorithm on the test data. Among these algorithms, the hybrid ANN with logistic regression gives the highest accuracy.
Figure 14. Resulting best algorithm
In the above figure, the machine prints that the hybrid algorithm is the best one.
Figure 15. Testing result
Figure 15 shows the live testing result. The client supplies their clinical parameters as a list of feature values separated by spaces. The machine takes the input, analyzes it using the trained data, and then returns a result.
CONCLUSION
In this paper we employed four different algorithms for predicting diabetes disease: the hybrid ensemble of an ANN with logistic regression, Decision Tree, SVM, and Naive Bayes. Among these, the hybrid ensemble of ANN with logistic regression gives the highest accuracy, and we therefore conclude that it is the best model. We used the diabetes disease dataset, which holds more than 6000 patient records, for training purposes. After training, we predict diabetes disease on test data using the hybrid ANN with logistic regression. Our model achieves over 96% accuracy during testing and training. The outcomes predicted by our hybrid algorithm are accurate and stable, and its patterns match those of the existing dataset; hence our model is well trained and able to predict diabetes disease with high reliability.
FUTURE ENHANCEMENT
This is a rich topic to work on and a lot of further work can be done to improve the efficiency of the neural
network in terms of both speed and accuracy. Moreover, using deep learning, AI systems can be created that can
predict the onset of diabetes well before a patient is diagnosed with it. Finally, the same model used in this paper
can also be applied to a variety of other health problems such as heart diseases, different types of cancers, strokes,
respiratory problems and even gall-bladder disease.
REFERENCES
[1] A. L. Hines, M. L. Barrett, H. J. Jiang and C. A. Steiner, "Conditions with the largest number of adult hospital readmissions by payer," HCUP Statistical Brief, 172, 2014.
[2] Agency for Healthcare Research and Quality (AHRQ), "HCUP Nationwide Inpatient Sample (NIS)," 2011. [Online]. Available: http://hcupnet.ahrq.gov/HCUPnet.jsp
[3] "ADA: Economic Costs of Diabetes in the U.S. in 2012," Diabetes Care, 2013.
[4] R. G. Shashank, "Examining the drivers of hospital readmissions of Type-2 Diabetic patients," Oklahoma State University, 2018.
[5] X. Charlie, C. Christina and P. Stephone, "Beating Diabetes: Predicting Early Diabetes Patient Hospital Readmittance to Help Optimize Patient Care," 2018.
[6] X. Yifan and J. Sharma, "Diabetes patient readmission prediction using big data analytic tools," Aug. 2017.
[7] D. Mize, "A Prediction Model for Disease-Specific 30-Day Readmission Following Hospital Discharge," Master's thesis, 2018.
[8] Y. Kumar Jain and S. Kumar Bhandare, "Min Max Normalization Based Data Perturbation Method for Privacy Protection," International Journal of Computer & Communication Technology, vol. 2, no. 8, 2011, pp. 45-50.
[9] M. Arif Wani and Saduf, "Comparative Study of Back Propagation Learning Algorithms for Neural Networks," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 12, 2013, pp. 1151-1156.
[10] L. Deng and D. Yu, "Deep Learning: Methods and Applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, 2014, pp. 197-387.
[11] A. Abarna, B. Amuthavani, V. Varshini and S. Chidambaram, "Prediction of Emergency Admissions in Health Centres using Data Mining," International Journal of Innovative Technology and Exploring Engineering, ISSN 2278-3075, vol. 9, no. 8, Jun. 2020, pp. 664-667, https://doi.org/10.35940/ijitee.h6486.069820
[12] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, 2009, pp. 1-127.
[13] G. Kaur, "Improved J48 Classification Algorithm for the Prediction of Diabetes," International Journal of Computer Applications, vol. 98, no. 22, 2014.
[14] R. Zolfaghari, "Diagnosis of Diabetes in Female Population of Pima Indian Heritage with Ensemble of BP Neural Network and SVM," International Journal of Computational Engineering & Management, vol. 15, no. 4, 2012, pp. 2230-7893.
[15] R. Zolfaghari, "Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus," International Journal of Computational Engineering & Management, vol. 15, no. 4, 2012.
RESEARCH | Open Access
Baghdadi et al., Journal of Big Data (2023) 10:144
https://doi.org/10.1186/s40537-023-00817-1

Advanced Machine Learning Techniques for Cardiovascular Disease Early Detection and Diagnosis

Nadiah A. Baghdadi (1), Sally Mohammed Farghaly Abdelaliem (1,*), Amer Malki (2), Ibrahim Gad (3), Ashraf Ewis (4,5) and Elsayed Atlam (2,3)

*Correspondence: Smfarghaly@pnu.edu.sa
1. Nursing Management and Education Department, College of Nursing, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
2. Computer Science Section, College of Computer Science and Engineering, Taibah University, Yanbu Campus, Al-Madinah 41411, Yanbu 46421, Saudi Arabia
3. Computer Science Department, Faculty of Science, Tanta University, Tanta, Egypt
4. Department of Public Health and Occupational Medicine, Faculty of Medicine, Minia University, El-Minia, Egypt
5. Department of Public Health, Faculty of Health Sciences, AlQunfudah, Umm AlQura University, Meccah, Saudi Arabia
Abstract
The identification and prognosis of the potential for developing Cardiovascular Diseases (CVD) in healthy individuals is a vital aspect of disease management. Accessing the comprehensive health data on CVD currently available within hospital databases holds significant potential for the early detection and diagnosis of CVD, thereby positively impacting disease outcomes. Therefore, the incorporation of machine learning methods holds significant promise in the advancement of clinical practice for the management of Cardiovascular Diseases (CVDs). By providing a means to develop evidence-based clinical guidelines and management algorithms, these techniques can eliminate the need for costly and extensive clinical and laboratory investigations, reducing the associated financial burden on patients and the healthcare system. In order to optimize early prediction and intervention for CVDs, this study proposes the development of novel, robust, effective, and efficient machine learning algorithms, specifically designed for the automatic selection of key features and the detection of early-stage heart disease. The proposed CatBoost model yields an F1-score of about 92.3% and an average accuracy of 90.94%; compared with many other existing state-of-the-art approaches, it successfully maximizes classification performance, achieving higher accuracy and precision.
Keywords: Heart disease, Machine learning, Feature selection, Cardiovascular diseases,
Quality of life, Disease prevention, CVD
Introduction
The heart is the second-most important organ in the human body, after the brain. A failure of the heart eventually results in the failure of the whole body. We are living in the modern era, and the world around us is undergoing significant transformations that have some impact on our day-to-day lives. Heart disease, which is claiming lives around the world, is one of the top five deadly diseases [1]. Because it enables us to take the necessary steps at the right time, forecasting this disease is of the utmost importance.
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://
creativecommons.org/licenses/by/4.0/.
Cardiovascular Diseases (CVD) are a group of heterogeneous diseases that affect the
heart and circulatory system causing a variety of ailments that are typically brought on
by atherosclerosis. Typically, CVD are chronic in nature and progressively manifest over
time without symptoms for long periods of time before becoming advanced and showing up as symptoms of different intensity [2–4]. According to reports from the World
Health Organization (WHO), CVD has been the leading cause of premature death in
the world for decades, and it is expected that by 2030, CVD will be responsible for the
deaths of around 23.6 million people annually.
In addition, the cost of treating cardiovascular disease and its future consequences
and early death, as measured by Disability Adjusted Life Years (“DALYS”), entails a significant economic burden [5–7]. Many factors contribute variably to the development
of cardiovascular disease; these factors can be classed as modifiable and non-modifiable
risk factors [5, 8]. Age, gender, and inherited variables are factors that cannot be modified. However, the other category of concerns, referred to as modifiable risk factors,
comprises fasting blood sugar, high blood pressure, serum cholesterol, smoking, dietary
propensity, obesity, and physical inactivity [9, 10].
Individuals will be able to avoid the development of CVD by identifying modifiable risk factors and attempting to alter lifestyle-related risk factors into healthy ones. Chest discomfort, arm pain, slowness and dizziness, weariness, and perspiration are among the early warning signs of a heart attack [11]. Patients with heart disease do not have symptoms in the early stages of the disease, but they do in later stages, when it is sometimes too late to manage or treat [12–14]. Therefore, despite the difficulty, rapid recognition and prediction of CVD susceptibility in seemingly healthy people is essential for assessing prognosis. For the early diagnosis of CVD, it will be incredibly beneficial and necessary to analyze the significant CVD health information contained in the huge databases of hospital records. Thus, machine learning algorithms and other intelligent-system techniques are beneficial in this field, and their findings are reliable and accurate [15–17].
The field of machine learning enables the identification of concealed patterns and the
establishment of analytical structures, including clustering, classifications, regression,
and correlations, through the integration and application of various techniques, such
as machine learning models, neural networks, and information retrieval [18–20]. Consequently, machine learning techniques have demonstrated great potential to support
clinical decision-making, aid in the development of clinical guidelines and management
algorithms, and promote the establishment of evidence-based clinical practices for the
management of Cardiovascular Diseases (CVDs) [21–27]. Furthermore, the early detection of CVDs using machine learning techniques can reduce the need for extensive and
expensive clinical and laboratory investigations, resulting in a reduction of the financial
burden on both the healthcare system and individuals [28, 29].
Cardiovascular disease is a chronic syndrome that can result in heart failure, a critical condition characterized by impaired heart function, and symptoms such as compromised blood vessel function and infarction of the coronary artery [30]. According
to the American Heart Association (World Health Organization, 2021), cardiovascular
Baghdadi et al. Journal of Big Data (2023) 10:144
diseases are a set of heart and blood vessel abnormalities and one of the main causes
of death worldwide. Accounting for almost 18 million deaths, cardiovascular
disease was responsible for 32% of all deaths worldwide [31]. Heart attacks and
strokes accounted for 85% of these deaths, with 38% occurring in individuals younger than
70. In the treatment and management of cardiovascular disorders, early detection is crucial, and machine learning (ML) can be a useful tool for recognizing a probable heart
disease diagnosis [17, 32].
Heart disease, also known as cardiovascular disease, is a leading cause of death
worldwide. The cardiac muscle is responsible for the circulation of blood around the
body [33]. Although machine learning methods have demonstrated intriguing results in
forecasting certain medical disorders, they have not been applied to the prediction of
individual CVD survival in hypertensive patients utilizing routinely obtained big digital
electronic administrative health data [34]. If a machine learning algorithm can be used to
exploit the large administrative data set, it may be possible to optimize the use of accumulated data sets to support predicting patient outcomes, planning individualized
patient care, monitoring resource utilization, and improving institutional performance.
Comorbidity status, demographic information, laboratory test results, and medication
information would improve prognostic evaluation and direct treatment decisions for
hypertension patients [35].
In this study, we proposed a Gradient Boosting model to predict the existence of cardiovascular disease and to identify the most predictive features based on their Shapley
values. Afterward, a number of Machine Learning and Deep Learning techniques are
used to analyze cardiovascular disease. Below are the main contributions of this study:
• Utilizing cross-validation and split validation, discover a machine learning algorithm
with improved performance that will be applied to the detection of cardiovascular
disease.
• The application of an appropriate feature selection technique can optimize prediction accuracy. Utilizing a robust machine learning algorithm can enhance early prediction of Cardiovascular Disease (CVD) development in its early stages, facilitating
early intervention and promoting the selection of key features to support recovery
algorithms.
• Predicting cardiovascular disease using a broad, state-of-the-art Cardiovascular Diseases dataset.
• Providing reliable advice to health and medical specialists regarding significant
changes in the healthcare sector.
Section 2 of this paper presents related work. Section 3 proposes the methodology. Section 4 describes the experimental evaluation. Section 5 presents the discussion and comparative results. Section 6 focuses on the conclusion and future work.
Related work
Many researchers have examined cardiac disease prediction frameworks that utilize various data mining techniques. They use different datasets and algorithms, report test findings and future work that could be carried out on the framework, and
achieve increasingly productive results. Researchers have completed numerous research attempts
to accomplish efficient techniques and high accuracy in recognizing disorders associated
with the heart.
Pattekari [36] studied creating a model using the Naive Bayesian data mining method. It is a computer program in which the user answers predetermined questions. It pulls hidden information from a dataset and compares user values to a trained
data set. It can provide answers to difficult questions regarding heart disease diagnosis,
allowing medical service providers to make more informed clinical decisions than conventional decision support systems. It also helps reduce treatment expenses by
providing effective treatments.
Tran [37] built an Intelligent System using the Naive Bayes data mining modeling technique. It is a web application in which the user answers pre-programmed questions. It searches a database for hidden information and compares user values to a
trained data set. It can provide answers to difficult questions about cardiac disease diagnosis, allowing healthcare professionals to make more informed clinical decisions than
traditional decision support systems. It also lowers treatment costs by delivering effective care.
Gnaneswar [38] demonstrates the significance of monitoring the heart rate when
cycling. By monitoring their pulse while pedaling, cyclists can manage cycling sessions, such as cycling cadence, to identify the level of activity. By managing their pedaling
exertion, cyclists can avoid overtraining and cardiac failure. The cyclist's pulse can be
used to determine the intensity of an exercise and can be measured using a wearable sensor. Unfortunately, the sensor does not capture all information at regular intervals, such as one second, two seconds, etc. Consequently, a pulse prediction model is needed to fill in the gaps.
Gnaneswar [38] also aims to use a feed-forward neural network to construct a
predictive model for pulse in consideration of cycling cadence. At each second, pulse and
cadence are the inputs, and the output is the predicted pulse for the following second.
Using a feed-forward neural network, the relationship between pulse and cycling cadence
is represented statistically. Mutijarsa [39] discusses the expansion of medical care services based
on these arguments. Numerous breakthroughs in remote communication have been
made in anticipating cardiac sickness. Utilizing data mining (DM) techniques for the
detection and localization of coronary disease is highly useful. In their assessment, a
comparative analysis of multiple single and hybrid data mining algorithms is conducted to determine which algorithm most accurately predicts coronary
disease.
Yeshvendra [40] argues that the use of AI algorithms in the forecasting of various
diseases is growing. This notion is significant and diverse because an
AI algorithm can adopt a perspective comparable to a human's for improving the accuracy of coronary disease prognosis. Patil [41] notes that a proper diagnosis of cardiac
disease is one of the most fundamental biomedical concerns that must be addressed.
Three data mining techniques, support vector machine, naive Bayes, and decision tree, were used to create a decision support system. Tripoliti [42] argues that the identification of diseases with large
prevalence rates, such as Alzheimer's, Parkinson's, diabetes, breast cancer, and coronary disease, is one of the most fundamental biomedical tasks demanding immediate
attention. Gonsalves [43] attempted to forecast coronary CVD using machine learning
and historical medical data. Oikonomou [44] provides an overview of the varieties of
information encountered in chronic disease settings. Using multiple machine learning
methods, they elucidated the extreme value theory in order to better measure chronic
disease severity and risk.
According to Ibrahim [45], machine learning-based systems can be utilized for predicting and diagnosing heart disease. Active learning (AL) methods enhance the accuracy of classification by integrating user-expert system feedback with sparsely labeled
data. Furthermore, Pratiyush et al. [46] explored the role of ensemble classifiers over
the XAI framework in predicting heart disease from CVD datasets. The proposed work
employed a dataset comprising 303 instances and 14 attributes, with categorical, integer,
and real type attribute characteristics, and the classification task was based on classification techniques such as KNN, SVM, naive Bayes, AdaBoost, bagging and LR.
The literature attempted to create strategies for predicting cardiac disease diagnosis.
Because of the high dimensionality of textual input, many traditional machine learning
algorithms fail to incorporate it into the prediction process at the same time [47–53]. As
a result, this paper investigates and develops a set of robust machine learning algorithms
for improving the early prediction of CVD development, allowing for prompt intervention and recovery.
Methodology
This section describes the suggested classification scheme for heart disease instances.
Initially, exploratory analysis is conducted. A comprehensive analysis is undertaken on
both the target and the features, and category variables are converted to numeric values. Various criteria are utilized to compare models under consideration. The outputs
of each model are analyzed, and the optimal model for the problem at hand is selected.
The proposed model is thoroughly examined, and the Optuna library is used to tune
the model hyperparameters and assess the resulting improvement. The suggested
model is divided into three phases: (1) pre-processing, (2) training, and (3) classification, as shown in Fig. 1. In the following sections, the authors examine each of these
components in further depth.
Pre‑processing
Before training the selected models, it is important to address the Cholesterol missing
values that were initially input as 0. To accomplish this, the data is separated into groups
Fig. 1 The main steps of the proposed methodology
based on the presence of a verified cardiac condition, and the mean of each group is used
to fill in the missing values. To assess whether these variables are influential in predicting heart disease based on their Shapley Values, interaction terms were added to the
models to capture any possible correlations between the data elements. SHAP (SHapley
Additive exPlanations) employs game theory to identify the significance of each characteristic and can be used to explain both individual model predictions and aggregated
model results. SHAP determines the magnitude of each predictor’s contribution to the
model’s output by averaging the marginal contributions of each feature over all feasible
feature combinations.
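The averaging of marginal contributions over feature orderings can be made concrete with a small sketch. This is not the authors' implementation (they use the SHAP library on the trained model); it is a brute-force exact Shapley computation over a toy two-feature value function whose payoff numbers are hypothetical:

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average each feature's marginal
    contribution over all orderings of the feature set."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = set()
        for f in order:
            before = value(frozenset(seen))
            seen.add(f)
            contrib[f] += value(frozenset(seen)) - before
    return {f: c / len(orderings) for f, c in contrib.items()}

# Toy value function standing in for a model's output on each feature
# subset (hypothetical numbers, for illustration only).
payoffs = {
    frozenset(): 0.0,
    frozenset({"Oldpeak"}): 0.30,
    frozenset({"MaxHR"}): 0.20,
    frozenset({"Oldpeak", "MaxHR"}): 0.60,
}
phi = shapley_values(["Oldpeak", "MaxHR"], payoffs.__getitem__)
```

The two resulting values sum to the payoff of the full feature set, which is the defining property SHAP exploits when attributing a model's output to its inputs.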
Before doing feature selection using Shapley values, a gradient boosting model containing all variables is trained. The final predictors will be selected from the characteristics with a Shapley value greater than 0.1 that contribute significantly to the model’s
prediction. Then, these predictors will be used to establish the most effective model.
Due to the multicollinearity between the interaction variables, a variety of nonparametric tree-based methods for predicting the risk of CVD are explored to discover the most
accurate method.
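The group-wise mean imputation of the zero-coded Cholesterol values described above can be sketched with pandas. The column names match the dataset, but the six-row frame and its values are illustrative only:

```python
import pandas as pd

# Hypothetical six-row frame mirroring the dataset: a zero Cholesterol
# marks a missing value; HeartDisease is the verified-condition group.
df = pd.DataFrame({
    "Cholesterol": [185, 0, 273, 0, 287, 224],
    "HeartDisease": [0, 0, 0, 1, 1, 1],
})

# Treat 0 as missing, then fill with the mean of each HeartDisease group.
df["Cholesterol"] = df["Cholesterol"].mask(df["Cholesterol"] == 0)
group_mean = df.groupby("HeartDisease")["Cholesterol"].transform("mean")
df["Cholesterol"] = df["Cholesterol"].fillna(group_mean)
```

With these toy numbers, the missing value in the disease-free group is filled with 229.0 and the one in the disease group with 255.5, each computed only from the non-missing values of its own group.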
Training process
The machine learning algorithms are trained after preprocessing and normalizing the datasets. Following the modification of the data, it is randomly split
into a training set and a test set, with 70% of the rows assigned to the training set and
30% to the test set. The k-fold is a common cross-validation method that entails running
a large number of pertinent tests to determine the model's typical accuracy metric. This
technique has existed for quite some time. To examine the proposed strategy, machine learning
techniques such as SVC [54], MultinomialNB [55], K-Neighbor [56], BernoulliNB [55], SGD
[57], Random forest [58], and Decision tree [59] are deployed to obtain the best results.
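The 70/30 random split can be sketched without any library, on stand-in row indices (in practice scikit-learn's train_test_split would typically be used; the seed below is arbitrary):

```python
import random

def split_70_30(rows, seed=42):
    """Shuffle row indices and assign 70% to training, 30% to testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * len(rows))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

rows = list(range(918))   # stand-in for the 918 dataset records
train, test = split_70_30(rows)
```

For 918 rows this yields 642 training and 276 test records, and every record lands in exactly one of the two sets.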
XGBoost (Extreme Gradient Boosting) is a supervised learning method for improving prediction accuracy by combining multiple decision trees. XGBoost iteratively adds
decision trees using gradient boosting, with each subsequent tree attempting to correct
the errors of the previous trees. The final prediction is the weighted sum of all the individual tree predictions. XGBoost’s objective function includes a loss function as well as a
regularization term, which helps to prevent overfitting. The XGBoost objective function
equation is:
Obj(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)    (1)

where l is the loss function, y_i is the true label for example i, ŷ_i^(t−1) is the predicted value
from the previous iteration, f_t(x_i) is the prediction of the t-th tree for example i, and Ω(f_t)
is the regularization term.
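The additive structure behind this objective, ŷ^(t) = ŷ^(t−1) + f_t(x), can be illustrated with a deliberately minimal boosting loop for squared loss, where each "tree" is reduced to the mean of the current residuals. This sketches the iteration only; it omits real trees and the regularization term Ω, so it is not XGBoost itself:

```python
def boost(y, rounds=3, lr=1.0):
    """Minimal additive boosting for squared loss: each round fits a
    'tree' (here just the mean of the residuals) and adds it to the
    running prediction, mirroring y_hat^(t) = y_hat^(t-1) + f_t(x)."""
    preds = [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        f_t = sum(residuals) / len(residuals)   # weakest possible learner
        preds = [pi + lr * f_t for pi in preds]
    return preds

preds = boost([1.0, 2.0, 3.0])
```

Because the constant learner drives the mean residual to zero in one round, the predictions settle at the target mean; real boosting uses trees that also reduce the residual variance.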
AdaBoost (Adaptive Boosting) is another boosting algorithm that also uses decision trees as weak learners. AdaBoost assigns weights to each training example,
with higher weights given to examples that were misclassified by the previous weak
learner. In each subsequent iteration, a new decision tree is trained on the weighted
data, with the weights updated based on the accuracy of the tree. The final prediction
is the weighted sum of the predictions of all the individual trees. The equation for the
prediction function of AdaBoost is:
f(x) = Σ_{t=1}^{T} α_t h_t(x)    (2)

where T is the total number of trees, h_t(x) is the prediction of the t-th tree for input x,
and α_t is the weight assigned to the t-th tree.
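Equation 2's weighted vote can be sketched directly. The threshold stumps and α weights below are hypothetical stand-ins, not values learned by the actual AdaBoost fitting procedure:

```python
def adaboost_predict(x, learners, alphas):
    """Sign of the weighted sum f(x) = sum_t alpha_t * h_t(x), where
    each weak learner h_t returns +1 or -1."""
    score = sum(a * h(x) for h, a in zip(learners, alphas))
    return 1 if score >= 0 else -1

# Hypothetical threshold stumps and weights (not learned by AdaBoost).
learners = [
    lambda x: 1 if x > 0.5 else -1,
    lambda x: 1 if x > 1.5 else -1,
    lambda x: -1,                     # poor learner, given a low weight
]
alphas = [0.9, 0.6, 0.1]
```

For x = 1.0 the first stump outvotes the other two because of its larger weight, which is exactly how AdaBoost lets accurate weak learners dominate the final prediction.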
The Linear Support Vector Classifier (SVC) employs a linear kernel function to
classify data and performs well with large datasets [54]. The Linear SVC has
additional restrictions, such as the choice of penalty norm and loss function. Because
Linear SVC relies on a linear kernel, the kernel cannot
be changed. A Linear SVC handles the data by returning the "best fit"
hyperplane that divides or categorizes it. After the hyperplane is obtained, the features
are fed into the classifier, which predicts which class they belong to.
The Naive Bayes algorithm assigns equal weight to all features or qualities. The
algorithm is efficient because it assumes that one property has no effect on another. According
to Yasin 2020, the Naive Bayes classifier (NBC) is a simple, effective, and well-known
text categorization algorithm. NBC has used the Bayes theorem to classify documents
since the 1950s, and it is theoretically sound. A posterior estimate is used to determine the class using the Naive Bayes classifier. Characteristics, for example, are categorized based on their highest conditional probability.
Bernoulli Naive Bayes is a statistical technique that produces boolean results based
on the presence or absence of required text. The discrete Bernoulli Distribution is fed
into this classifier. When identifying an unwanted keyword or tagging a specific word
type within a text, this type of Naive Bayes classifier is useful. It is also distinct from
the multinomial approach in that it generates binary output such as 1–0, True-False,
or Yes–No. A stochastic system or procedure is one that incorporates randomness.
Stochastic Gradient Descent (SGD) uses a few random data samples rather
than the entire dataset in each iteration. As a consequence, rather than calculating
the sum of the gradients for all instances, each iteration calculates the gradient of
the cost function for a single example. SGD is an iterative method for optimizing a
differentiable or sub-differentiable objective function with suitable smoothness
properties.
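The single-example update can be sketched for logistic regression on one feature; the toy data, learning rate, and iteration count below are illustrative:

```python
import math
import random

def sgd_logistic(data, lr=0.1, epochs=50, seed=0):
    """SGD for logistic regression: each update uses the gradient of
    the loss on a single randomly drawn example, not the sum of the
    gradients over the whole dataset."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)                   # one sample per update
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid prediction
        w -= lr * (p - y) * x                     # gradient for this example only
        b -= lr * (p - y)
    return w, b

# Hypothetical 1-D data: label 1 when x is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = sgd_logistic(data)
```

Each iteration touches one sample, which is what makes SGD cheap per step on large datasets at the cost of a noisier optimization path.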
Decision Tree is a widely known Machine Learning technique in which data is
repeatedly partitioned based on specific parameters. The tree has two traversable
entities: nodes and leaves. Leaves represent decisions or outcomes, whereas decision nodes partition data [59]. Decision trees can be used in combination to solve
problems (ensemble learning). The Random Forest algorithm resolves the overfitting
issues associated with decision tree algorithms. The algorithm is capable of dealing
with regression and classification problems, as well as evaluating a large number of
attributes to determine which ones are most important. Random forests can learn from data without carefully planned data transformations [58].
The K-Nearest Neighbor (K-NN) algorithm classifies new observations based on
their distances from known examples. Based on the majority vote of its neighbors
and a distance function as a measuring tool, the case is designated to the class with
the highest frequency among its k-nearest neighbors. In classification problems,
k-NN returns the class membership. Whereas, in regression problems, it returns the
object’s property value. Whether k-NN is used for classification or regression affects
the output. Because this method relies on distance, normalizing the training data can dramatically improve performance. If the features correspond to
different physical units or scales, standardization can significantly enhance accuracy [56].
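The majority vote over the k nearest neighbours can be sketched as follows; the 2-D points are hypothetical and already share a scale, so the standardization step discussed above is omitted:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points: class "A" clusters near the origin,
# class "B" near (5, 5).
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
```

A query near the origin is assigned "A" and one near (5, 5) is assigned "B"; with unscaled real features, one wide-ranging variable would dominate this distance, which is why standardization matters.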
Classification
The proposed model is based on machine learning with strong generalization capabilities and a high degree of paradigm-specific precision. In this study, we evaluate a
number of machine learning algorithms and establish objectively which one delivers
the best results. A primary purpose of this choice is to
combat the problem of overfitting that occurs in machine learning; the approach
also incorporates the concept of structural risk minimization. The models can separate
classes, particularly in higher-dimensional space, by suggesting a hyperplane
with the largest possible separation. In this stage, labeled data is used as an input, and
the most significant characteristics are extracted using a feature extraction process.
Finally, the optimal model is used to categorize new instances of data.
Experimental evaluation
In the experiments of the study, we utilized Google Colab as the implementation platform for machine learning models. The platform includes a virtual machine that runs
on Google’s servers and gives users access to a Python environment that includes popular data science libraries like TensorFlow, PyTorch, and Scikit-Learn. Google Colab
is a cloud-based Jupyter notebook environment that offers free access to computing
resources such as a virtual machine with 12 GB of RAM and up to 100 GB of hard disk
space. The memory size allocated to the virtual machine is up to 25 GB, and it is also
possible to enable high-RAM options up to 52 GB for large-scale models or data. The
virtual machine runs on Google’s servers and is equipped with NVIDIA Tesla K80 GPU,
enabling us to train deep learning models efficiently. Additionally, Google Colab provides a wide range of preinstalled libraries and tools, making it easy to install and use
the necessary dependencies. The virtual machine is powered by a Linux-based operating system, ensuring that the implementation environment is stable and reliable. Also,
the operating system used by the virtual machine is Linux Ubuntu, which comes preinstalled with various system libraries and tools commonly used in data science projects.
The following subsections discuss the dataset and the results of the machine learning
models.
Data collection
The Heart Condition data utilized in this study is a synthesis of data sets from the
UCI Machine Learning Repository and contains eleven features that can be used to
forecast the existence of heart failure, a prevalent cardiovascular disease that significantly raises the probability of a CV-related mortality [60, 61]. The target variable is
Table 1 A sample of the Heart Failure Dataset

Sex  Age  Chest pain type  Resting BP  Cholesterol  Fasting BS  Resting ECG  Max HR  Exercise angina  Old peak  ST slope  Heart disease
M    41   NAP              152         185          0           Normal       123     N                0.0       Up        0
F    48   ASY              136         224          0           Normal       109     Y                1.5       Flat      1
M    38   ATA              132         273          0           ST           98      N                0.0       Up        0
F    49   NAP              162         182          0           Normal       157     N                1.0       Flat      1
M    53   ATA              142         287          0           Normal       173     N                0.0       Up        0
Table 2 Symptoms, signs and laboratory investigations of the dataset of the heart disease

Variable                       Interpretation
Age                            Patient's age/year
Gender                         Patient's gender, Male/Female
Type of chest pain             i. TA: Typical Angina; ii. ATA: Atypical Angina; iii. NAP: Non-Anginal Pain; iv. ASY: Asymptomatic
Resting blood pressure         Patient's blood pressure/mmHg
Total cholesterol              Patient's cholesterol (mg/dl)
Blood glucose level (fasting)  Patient's fasting blood glucose level: glucose > 120 mg/dL = 1; glucose below 120 mg/dL = 0
ECG at rest                    Electrocardiography (at rest): i. Normal; ii. ST: ST segment and/or T wave abnormality; iii. LVH: probable or definite left ventricular hypertrophy
Heart rate at maximum          Maximum heart rate, heart beats per minute
Angina on exercising           Exercise-associated angina, present/absent
Old peak                       Measure of ST depression
ST_Slope                       Slope of peak exercise: i. Up: up sloping; ii. Flat; iii. Down: down sloping
Table 3 The different datasets used to create the dataset of the heart disease

Dataset                    #Observations
Cleveland                  303
Hungarian                  294
Stalog (Heart) Data Set    270
Long Beach VA              200
Switzerland                123
Total                      1190
Duplicated                 272
Final dataset              918
a binary attribute that indicates a diagnosis of Heart Failure if HeartDisease = 1, as
illustrated in Table 1. Moreover, Table 2 presents the list of variables and the description of the features in the heart disease dataset.
The dataset was created by combining a diverse range of datasets that were previously available independently, and were not combined before [60, 61]. In this dataset,
five heart datasets are combined over 11 common features which makes it the largest
heart disease dataset accessible for research purposes. The specific datasets utilized
in the curation of this composite dataset are shown in Table 3.
The Heart Disease dataset has 918 observations and 12 columns [60, 61]. Table 4
summarizes the main statistics for the numeric features. It is clear that the mean value
of age is 53.51 and the maximum is 77, as shown in Table 4. Similarly, Table 5 presents
Table 4 Summary statistics of numeric variables

       Age    RestingBP  Cholesterol  FastingBS  MaxHR   Oldpeak  HeartDisease
Count  918    918        918          918        918     918      918
Max    77     200        603          1          202     6.20     1
Min    28     0          0            0          60      -2.6     0
Mean   53.51  132.39     198.79       0.23       136.81  0.89     0.55
Std    9.43   18.51      109.38       0.42       25.46   1.06     0.49
25%    47     120        173.25       0          120     0        0
50%    54     130        223          0          138     0.60     1
75%    60     140        267          0          156     1.50     1
Table 5 Summary statistics of categorical variables

        Sex  TypeChestPain  ECGResting  AnginaExercise  ST_Slope
Count   920  920            920         920             920
Unique  2    3              4           2               4
Top     M    ASY            Normal      N               Flat
Freq    735  486            562         557             470
Table 6 The proportion of Heart Disease

Variable        Value   Total patients  Proportion of heart disease
Sex             M       725             90.2%
                F       193             9.8%
ChestPainType   ASY     496             77.2%
                NAP     203             14.2%
                ATA     173             4.7%
                TA      46              3.9%
RestingECG      Normal  552             56.1%
                ST      178             23.0%
                LVH     188             20.9%
ExerciseAngina  Y       371             62.2%
                N       547             37.8%
ST_Slope        Flat    460             75.0%
                Up      395             15.4%
                Down    63              9.6%

Bold numbers mean the highest frequency and percentage.
the statistics of categorical attributes. From this table, the unique values in ChestPainType attribute are 4 and the top is “ASY”.
Table 6 summarizes the main details for the categorical features. It is clear that the
variable Sex has two main values, male (M) and female (F), such that the proportion of
Heart Disease for M is 90.2% and for F is 9.8%. Similarly, Table 6 presents the statistics of the ChestPainType attribute: there are 4 values (ASY, NAP, ATA, and TA), and the
most frequent is ASY at 77.2%.
Exploratory data analysis
Remarkably, the classes in the heart disease attribute are reasonably well-balanced. 508 of the 918 patients who participated in the study have been diagnosed
with heart failure, while 410 have not. Patients with heart disease have a median age of
57, whereas those without heart disease have a typical age of 51. As illustrated in Fig. 2,
around 63% of males have heart disease, whereas approximately 25% of females have
been diagnosed with heart disease. More precisely, a female has a 25.91% probability of having heart disease, while a male has a 63.17% probability.
Figure 3 demonstrates the heart disease ranges for Age, Systolic Blood Pressure, Cholesterol, Heart Rate, and ST Segment Depression. The ages of heart disease patients
fall between 51 and 62, as depicted by the Age boxplot. There are also a few
younger outliers below the lower margin in this category. Individuals free of cardiovascular disease have an age range that is slightly more variable but more evenly distributed, and there are no outliers. The vast majority of patients falling into this category are
quite young, with ages ranging from 43 to 57 [62].
Fig. 2 Prevalence of heart disease among men and women
Fig. 3 The distributions of heart disease for age, systolic blood pressure, cholesterol, heart rate and ST
segment depression
Furthermore, the boxplots between the groups for the Resting Blood Pressure variable are extremely similar. Both have upper and lower outliers, with the vast majority
of patients’ blood pressure falling between 120 and 145 mmHg. As demonstrated in
Fig. 3, the median blood pressure in both groups is roughly 130 mmHg. Also, for the
Cholesterol variable, the distribution of cholesterol appears to be skewed to the right,
particularly among individuals with heart disease, where a substantial number of
observations were reported with cholesterol values of 0. As illustrated in Fig. 3, those
without heart illness have a median heart rate of 150 beats per minute, but those with
heart disease have a median heart rate of 126 beats per minute.
In the case of the ST Segment Depression (OldPeak) variable, there is a difference
between the distributions of the ST segment depression groups. ST depression is more
variable in patients with heart disease, with numerous larger outliers. The majority
of these patients exhibit ST depressions between 0 and 2 mm, with a mean of 1.2
mm. In patients without heart disease, the range is narrower, between 0 and 0.6 mm,
with a median ST depression of 0 mm, however the distribution of this group is more
skewed overall, as illustrated in Fig. 3.
Figure 4 displays the correlation matrix associated with the heart disease dataset.
HeartDisease has the strongest positive link with Oldpeak (correlation = 0.4) and the
strongest negative association with MaxHR (correlation = −0.4), according to the
correlation matrix. Age and MaxHR also have a reasonably strong link, with a correlation of −0.38. As seen in Fig. 4, heart rate tends to decrease as age increases. The results
show a weak correlation between the numerical features and the target variable
based on the matrix. Oldpeak (a depression-related number) correlates positively
with heart disease. Heart disease is negatively correlated with maximal heart rate.
Cholesterol has an interestingly negative association with heart disease.
Fig. 4 The correlation matrix for the Heart Disease dataset
Figure 5 illustrates the correlation between heart disease and category variables.
Nearly 80% of diabetic persons suffer heart problems. Patients with exercise-induced
angina have an even greater incidence of cardiovascular disease, at over 85%. Over
65% of patients diagnosed with cardiac disease had ST-T wave abnormalities in their
resting ECGs, the greatest percentage across the categories. Patients with a Flat or
Declining ST Slope during exercise have the highest frequency of cardiovascular disease, at 82.8% and 77.8%, respectively.
Figure 6 presents details regarding asymptomatic chest pain in the heart disease data. At
almost 77%, the absence of chest pain (asymptomatic) is the most prevalent presentation
in patients with heart disease. In addition, heart disease is roughly nine times more
prevalent in males than in females among patients with a cardiovascular diagnosis. A
Fig. 5 Prevalence of heart disease by resting ECG. (a) Prevalence of Heart Disease in Patients with Diabetes.
(b) Prevalence of Heart Disease in Patients with Exercise Angina. (c) Prevalence of Heart Disease by Resting
ECG. (d) Prevalence of Heart Disease by ST Slope
Fig. 6 Prevalence of chest pain in heart disease data
patient with asymptomatic chest pain (ASY) is approximately six times more likely to
suffer heart disease than a patient with atypical angina chest pain (ATA).
Overall, the insights obtained from the exploratory data analysis are as follows. Data for the target
variable are close to balanced. The association between numerical features and the target
variable is weak. Oldpeak (an ST-depression measure) correlates positively with heart
disease. Heart disease is negatively correlated with maximum heart rate. Interestingly,
there is a negative link between cholesterol and heart disease. Males are approximately
2.44 times more likely to suffer from heart disease than females. There are distinct differences between the types of chest pain. Patients with asymptomatic chest pain (ASY) are
about six times more likely to suffer heart disease than those with Atypical Angina chest
pain (ATA). Resting ECG: electrocardiogram values at rest are comparable. Patients with
ST-T wave abnormalities have a higher risk of developing heart disease than those who
do not. ExerciseAngina: people who have exercise-induced angina are nearly 2.4 times
more likely to have heart disease than people who don’t. The slope of the ST segment at
maximum exertion varies. ST Slope Up carries a considerably lower risk of cardiovascular
disease than the other two segments.
Performance evaluation
When dealing with imbalanced datasets, classification accuracy alone may not be the
most suitable performance metric. Therefore, authors often use additional performance
metrics to address this issue [63]. The confusion matrix is frequently employed for
expressing a classifier’s classification results, with diagonal elements indicating correctly
classified samples as positive or negative and off-diagonal elements indicating misclassification. As a consequence, performance metrics such as accuracy, precision, recall (sensitivity), F1-score, and the ROC curve are employed. Accuracy, F1-score,
recall, and precision can be calculated using Eqs. 3, 4, 5, and 6, respectively. These
formulas are based on the numbers of False Positive (FP), False Negative (FN), True Positive (TP), and True Negative (TN) samples in the test dataset [64].
Accuracy = (TP + TN) / (TP + FP + FN + TN)                    (3)

F1-score = 2 · (Precision · Recall) / (Precision + Recall)    (4)

Recall (sensitivity) = TP / (TP + FN)                         (5)

Precision (positive predictive value) = TP / (TP + FP)        (6)
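These four metrics follow directly from the confusion counts; a minimal Python sketch of Eqs. 3–6 (the counts below are illustrative, not taken from the paper's results):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. 3-6 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # Eq. 3
    recall = tp / (tp + fn)                             # Eq. 5 (sensitivity)
    precision = tp / (tp + fp)                          # Eq. 6 (positive predictive value)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 4
    return accuracy, precision, recall, f1

# Illustrative counts only
acc, prec, rec, f1 = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))  # 0.85 0.9 0.818 0.857
```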
Machine learning models
Studies are carried out using the collected dataset, which has approximately 918 rows.
The final version of the updated data was split into training and testing sets in order to
fit the model, with 70% of the data used for the learning set and 30% for the testing set.
Table 7 shows the shapes of three datasets: training, validation, and test. The training
set has 504 rows and 19 columns, while the validation and test sets both have 207 rows
and 19 columns. AdaBoost, Gradient Boost, Random Forest (RF), k-nearest neighbor
(KNN), Support Vector Machine (SVM), and Decision tree classifiers are used in this
study [64–66].
Table 7 Dataset shapes

Dataset      Shape
Training     (504, 19)
Validation   (207, 19)
Test         (207, 19)
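A dependency-free sketch of how a 918-row dataset could be shuffled and cut into partitions of these sizes (the helper and seed are illustrative; the paper does not specify its splitting code):

```python
import random

def train_val_test_split(n_rows, n_train=504, n_val=207, seed=42):
    """Shuffle row indices and cut them into train/validation/test index lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = train_val_test_split(918)
print(len(train), len(val), len(test))  # 504 207 207
```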
In order to develop a robust classifier with high precision, it is vital to use an appropriate evaluation approach. One such method is k-fold cross-validation, which
generates diverse data samples to determine the average correctness of a model. The
strategy of k-fold is a commonly used cross-validation technique, where a specified
value of k is chosen, such as five, and the data is divided into k subsets of equal size.
In each iteration, one of the k subsets is used as the test set, and the remaining k − 1
subsets are used for model learning. This process is repeated until all subsets have
been used as the test set once.
The k-fold cross-validation method uses the average of the computed scores as its performance metric. This approach provides a reliable estimate of the model's generalization ability, which is particularly useful when the data is limited and cannot be split
into separate learning and testing sets.
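The fold-generation step described above can be sketched in plain Python (a simplified illustration; library implementations such as scikit-learn's `KFold` additionally support shuffling and stratification):

```python
def kfold_indices(n_samples, k=5):
    """Yield (train, test) index lists; every sample lands in the test fold exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(10, k=5))
print([test for _, test in folds])  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

In practice, the model is fitted on each train list and scored on the corresponding test list, and the average of the k scores is reported.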
Finally, the best hyperparameter values for each algorithm are determined through
experimentation and optimization of the model, often through methods such as grid
search and Bayesian optimization. The best hyperparameter values can serve as a
starting point for developing new models or improving existing ones, as they provide insight into the values that have yielded the best performance for each algorithm.
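An exhaustive grid search of this kind can be sketched as follows; the grid mirrors the AdaBoost search space of Table 8, but the scoring function is a stand-in for a real cross-validated fit:

```python
from itertools import product

# Hypothetical grid mirroring the AdaBoost search space in Table 8
grid = {
    "n_estimators": list(range(100, 501, 5)),
    "learning_rate": [0.25, 0.5, 0.75, 0.9],
}

def cv_score(params):
    """Stand-in for the cross-validated score of a fitted model.
    Penalizes large ensembles and high learning rates so the toy
    search has a unique optimum; a real run would fit and score a model."""
    return 1.0 - 0.0001 * params["n_estimators"] - 0.05 * params["learning_rate"]

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # {'n_estimators': 100, 'learning_rate': 0.25}
```

Bayesian optimization replaces the exhaustive loop with a model-guided search over the same space, which matters when each evaluation requires a full cross-validated fit.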
The results of hyper-parameter optimization of Machine learning models are shown
in Table 8.
Table 8 presents the results of hyper-parameter optimization for four machine
learning models: Extra Trees, Random Forest, AdaBoost, and Gradient Boosting. For
each model, a range of hyper-parameters was explored using cross-validation, and the
Table 8 The results of hyper-parameter optimization of machine learning models

Extra Trees
  Parameter grid: n_estimators: [100, 105, …, 500]; criterion: ('gini', 'entropy'); max_depth: [5, 10, 15, 20]; min_samples_split: [2, 4, 6]; min_samples_leaf: [4, 5, 6]
  Best parameters: criterion='entropy', max_depth=15, min_samples_leaf=4, n_estimators=300
  Accuracy: 84.54%; AUC: 0.920

Random Forest
  Parameter grid: n_estimators: [100, 105, …, 500]; criterion: ('gini', 'entropy'); max_depth: [3, 7, 14, 21]; min_samples_split: [2, 5, 10]; min_samples_leaf: [3, 5, 7]; max_features: [None, 'sqrt']; max_leaf_nodes: [None, 5, 10, 15, 20]; min_impurity_decrease: [0.001, 0.01, 0.05, 0.1]; bootstrap: [True, False]
  Best parameters: max_depth=14, max_features='sqrt', max_leaf_nodes=15, min_impurity_decrease=0.001, min_samples_leaf=3, min_samples_split=10, n_estimators=200
  Accuracy: 85.52%; AUC: 0.924

AdaBoost
  Parameter grid: n_estimators: [100, 105, …, 500]; learning_rate: [0.25, 0.5, 0.75, 0.9]
  Best parameters: learning_rate=0.25, n_estimators=100
  Accuracy: 84.06%; AUC: 0.897

Gradient Boosting
  Parameter grid: boosting_type: ['gbdt', 'dart']; num_leaves: [20, 27, 34, …, 50]; max_depth: [-1, 3, 7, 14, 21]; learning_rate: [0.0001, 0.001, 0.01, 0.1, 0.5, 1]; n_estimators: [100, 105, …, 500]; min_split_gain: [0.00001, 0.0001, 0.001, 0.01, 0.1]; min_child_samples: [3, 5, 7]; subsample: [0.5, 0.8, 0.95]; colsample_bytree: [0.6, 0.75, 1]
  Best parameters: boosting_type='dart', colsample_bytree=1, learning_rate=0.5, max_depth=3, min_child_samples=7, min_split_gain=1e-05, num_leaves=30, subsample=0.5
  Accuracy: 88.9%; AUC: 0.925
best parameters were selected based on the highest accuracy and AUC scores. The
accuracy and AUC scores were calculated using a hold-out test set.
For the Extra Trees model, the best parameters were found to be criterion=’entropy’,
max_depth=15, min_samples_leaf=4, and n_estimators=300, resulting in an accuracy
of 84.54% and an AUC score of 0.920. Similarly, for the Random Forest model, the best
parameters were max_depth=14, max_features=’sqrt’, max_leaf_nodes=15, min_impurity_decrease=0.001, min_samples_leaf=3, min_samples_split=10, and n_estimators=200, resulting in an accuracy of 85.52% and an AUC score of 0.924.
The AdaBoost model achieved an accuracy of 84.06% and an AUC score of 0.897 with
the best parameters of learning_rate=0.25 and n_estimators=100. Finally, the Gradient Boosting model achieved the highest accuracy of 88.9% and the highest AUC score
of 0.925 with the best parameters of boosting_type=’dart’, colsample_bytree=1, learning_rate=0.5, max_depth=3, min_child_samples=7, min_split_gain=1e-05, num_
leaves=30, and subsample=0.5. Overall, the results indicate that hyper-parameter
optimization can significantly improve the performance of machine learning models,
and the Gradient Boosting model performed the best on this particular dataset.
The results of the Chi-Squared test are presented in Table 9. Based on the p-values,
which are less than 0.05, all discrete variables are included in the models as predictors.
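The test behind Table 9 can be sketched in plain Python; the contingency table below is purely illustrative and is not the study's data:

```python
def chi_squared_stat(table):
    """Pearson chi-squared statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = ExerciseAngina (no/yes), cols = disease (no/yes)
stat = chi_squared_stat([[300, 110], [70, 200]])
print(round(stat, 2))  # → 146.49
# With 1 degree of freedom, a statistic above 3.84 corresponds to p < 0.05,
# so the variable would be kept as a predictor.
```

A library routine such as `scipy.stats.chi2_contingency` additionally returns the exact p-value.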
The summary plot of Shapley values of feature importance in a machine learning
model provides insights into the relative importance of different features in making
predictions. The Shapley value is a concept from cooperative game theory that provides a principled way to attribute a share of the final prediction to each feature. In
a machine learning environment, the Shapley value of a feature represents the average
contribution of that feature to the model output across all possible subsets of features.
The calculation of Shapley values requires the evaluation of the model output for all possible subsets of features, which can be computationally expensive for high-dimensional
datasets. However, there are several efficient algorithms for approximating the Shapley
values, such as the KernelSHAP algorithm, which is based on sampling.
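For intuition, exact Shapley values can be sketched by brute force over feature orderings (feasible only for a handful of features; the toy model and feature names below are illustrative, not the paper's):

```python
import math
from itertools import permutations

def shapley_values(predict, features):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features are revealed to the model."""
    names = list(features)
    phi = {name: 0.0 for name in names}
    for order in permutations(names):
        revealed = {}
        prev = predict(revealed)
        for name in order:
            revealed[name] = features[name]
            cur = predict(revealed)
            phi[name] += cur - prev
            prev = cur
    n_orders = math.factorial(len(names))
    return {name: total / n_orders for name, total in phi.items()}

# Toy additive "model": absent features default to 0, so each Shapley
# value should equal the feature's own additive contribution.
def model(subset):
    return 2.0 * subset.get("oldpeak", 0.0) + 3.0 * subset.get("max_hr", 0.0)

phi = shapley_values(model, {"oldpeak": 1.5, "max_hr": -1.0})
print(phi)  # {'oldpeak': 3.0, 'max_hr': -3.0}
```

KernelSHAP approximates the same quantity by sampling subsets instead of enumerating all of them, which is what makes it tractable for real models.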
As shown in Fig. 7, the summary plot of Shapley values displays the top 20 predictors
of heart disease in order of relevance. Each point on the graph represents a training set
observation. When the points are to the right of the 0 lines, this suggests a greater risk
of being diagnosed with heart disease, whereas points to the left of the 0 line indicate a
lower likelihood. The values of each feature are represented by the color of the points,
with light orange indicating high feature values and dark blue indicating low feature
values. The shape of the points in each row is determined by the number of overlapping observations for that feature.

Table 9 The results of the Chi-Squared test

Variable         Chi statistic   p-value
ExerciseAngina   222.26          0.00000
ChestPainType    268.07          0.00000
ST_Slope         355.92          0.00000
Sex              84.15           0.00000
RestingECG       10.93           0.00423

Fig. 7 The summary plot of Shapley values of feature importance

Along with three independent features (Cholesterol, Age, and typical chest pain), nearly all of the variables in the plot are interaction terms that were included in the model. The variable in the first row represents the interaction
between SBP at Rest and ST Slope Up. People with an upward ST slope and high blood
pressure have a lower risk of heart disease, according to the Shapley values. The Shapley
values for the second variable, “Sex M ST slope flat,” show that male patients with a flat
ST Slope are more likely to develop cardiovascular disease. The fourth variable in the
scatter plot, Cholesterol Sex M, indicates that men with high cholesterol are more likely
to be diagnosed with cardiovascular disease.
In addition, the order of relevance in the summary plot is determined by the feature’s
average absolute Shapley value, which quantifies the average amount by which the characteristic affects the projected chance of heart disease. There are 18 features that contribute at least 0.1 on average to the model’s prediction. Table 10 provides a listing of
the final predictors chosen and the feature importance of each. “RestingBP” appears in five of the top 19 most significant predictors.
Model performance on the validation set
The ROC Curves (Fig. 8) illustrate the performance of the models at various thresholds. The y-axis indicates the True Positive Rate or Sensitivity of the models, which is
a measure of how well the model identifies patients with heart disease (true positives),
while the x-axis indicates the False Positive Rate, that is, the proportion of patients without the disease that the model incorrectly classifies as positive. A model with a curve toward the upper left corner of the graph, with a higher
true positive rate and a lower false positive rate, shows a greater capacity to differentiate between the classes. On the test set, all of the models depicted in Fig. 8 produce strong results.

Table 10 The list of the final predictors selected and their feature importance

Feature                                Importance
Cholesterol ST_Slope_Flat              0.941
RestingBP ST_Slope_Up                  0.569
Sex_M ST_Slope_Flat                    0.512
ChestPainType_ATA RestingECG_Normal    0.341
Cholesterol Sex_M                      0.319
Oldpeak Sex_M                          0.260
Cholesterol                            0.252
RestingBP ST_Slope_Flat                0.160
Cholesterol RestingECG_Normal          0.159
Age Sex_M                              0.153
RestingBP MaxHR                        0.151
Age MaxHR                              0.150
ChestPainType_ATA ST_Slope_Up          0.146
ChestPainType_NAP ST_Slope_Up          0.144
Age                                    0.127
RestingBP Oldpeak                      0.124
RestingBP Cholesterol                  0.118
ChestPainType_ATA                      0.115
Oldpeak ExerciseAngina_Y               0.100

Fig. 8 ROC curve comparison on the test set

Overall, Gradient Boosting has the highest Area Under the Curve at 0.927, but at specific thresholds, the Random Forest model offers somewhat
superior results, since the curve surpasses that of Gradient Boosting.
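The ROC construction just described can be sketched with a threshold sweep and trapezoidal integration (a simplified version that assumes distinct scores; the labels and scores below are illustrative):

```python
def roc_auc(y_true, scores):
    """Build the ROC curve by sweeping a threshold over the predicted
    scores, then integrate it with the trapezoidal rule."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Sort by score descending and accumulate (FPR, TPR) points.
    pairs = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(y_true, scores), 3))  # → 0.889
```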
By using the features with Shapley values greater than 0.1, the Extra Trees classifier achieves an AUC
of 0.89. After tweaking the model’s hyperparameters, the classifier achieves an average
accuracy of 88%, an F1-score of 89.5%, and a standard deviation of 6.7% on the validation
set. With an AUC of 0.917, the Random Forest model outperforms the Extra Trees classifier across all three criteria. On the validation set, the model achieves an average precision of 88.7% and an F1-score of almost 90%. Evidently, there is a minor performance
reduction in the AdaBoost model: the AUC declined to 0.91, and the overall accuracy and F1-score fell to 86.5% and 88%, respectively, although the model yields a smaller standard deviation than the others.

Fig. 9 Model performance on the validation set

Table 11 Classification report for Catboost_tuned model

Class          Precision   Recall   F1-score   Support
0              0.88        0.90     0.89       112
1              0.93        0.91     0.92       164
Accuracy                            0.91       276
Macro avg      0.90        0.91     0.91       276
Weighted avg   0.91        0.91     0.91       276

At 0.927, the Gradient Boosting model
has the greatest Area Under the Curve among the classifiers. In addition, the model
improves the validation set’s accuracy to about 87% and the F1-score to 89%.
Comparing the cross-validation results shown in the boxplots of Fig. 9, it is clear that
the Gradient Boosting model has the highest median F1-score of 90.3% and the highest
median accuracy of 88.5%. It also has the smallest standard deviation of the distribution,
at around 3. The Random Forest model comes in a close second with a median F1-score
of 89.7% and a median accuracy of 88.2%, albeit with slightly greater score variability.
The recall, precision, and accuracy values for the Catboost model are shown in Table 11. Catboost is used here to determine whether a patient has heart disease. Because the algorithm is highly sensitive to the majority class, the F1-score is used alongside precision when assessing its overall performance. Classification of heart disease was significantly improved by the Catboost method: the model achieved a precision of 93% for the “heart disease” class and 88% for the “no disease” class, and the overall accuracy of the Catboost classification model was 91%.
Table 12 illustrates the classification results of the various classifiers on the dataset.
The table reports the performance of various classifiers on a given dataset, measured
in terms of Accuracy, Precision, Recall, and F1 score. Comparing the results of the proposed technique against those of other classifiers such as SVM [54], XGBoost, AdaBoost, RandomForest [58], LinearDiscriminant [67], LightGBM, GradientBoosting, Catboost, ExtraTree, KNeighbors [56], and LogisticRegression [68] demonstrates the method’s utility.

Table 12 Comparative results on the Dataset using ML

Classifier            Accuracy   Precision   Recall   F1
XGBoost               0.8297     0.8980      0.8049   0.8489
AdaBoost              0.8659     0.9262      0.8415   0.8818
LinearDiscriminant    0.8696     0.9156      0.8598   0.8868
LightGBM              0.8732     0.9057      0.8780   0.8916
GradientBoosting      0.8768     0.9276      0.8598   0.8924
Catboost              0.8804     0.9226      0.8720   0.8966
ExtraTree             0.8804     0.9281      0.8659   0.8959
KNeighbors            0.8841     0.9074      0.8963   0.9018
SVM                   0.8841     0.8976      0.9085   0.9030
LogisticRegression    0.8841     0.9231      0.8780   0.9000
RandomForest          0.8877     0.9236      0.8841   0.9034
Catboost_tuned        0.9094     0.9317      0.9146   0.9231
method’s utility. The results of classifiers according to various metrics are displayed. The
highest performing classifier based on all measures is Catboost_tuned, which achieved
an accuracy of 0.9094, a precision of 0.9317, a recall of 0.9146, and F1 score of 0.9231.
Other top-performing classifiers include RandomForest, LogisticRegression, SVM, and
KNeighbors, with similar accuracy and precision scores, but slightly lower recall and F1
scores. In contrast, lower-performing classifiers such as XGBoost and AdaBoost exhibit
moderate accuracy and precision scores, but relatively lower recall and F1 scores. Overall, the results suggest that the choice of classifier can have a significant effect on the
performance of a predictive model.
The present study employs a confusion matrix (Fig. 10) to report the performance of
models in accurately predicting cardiac disease for a given set of patients, with due consideration to both correctly classified and misclassified instances. Specifically, the Gradient Boosting model is found to exhibit the highest proportion of True Positives (TP) and
True Negatives (TN) when evaluated on a test set. The computation of FN, FP, TN, and
TP, values for the cardiac disease class is carried out using the Gradient Boost model,
whereby the predicted values are expected to match the actual values. For instance, TP
corresponds to the value at cell 1 of the confusion matrix, while FN is computed by adding the relevant row values, excluding TP (i.e., FN = 12). Similarly, FP is calculated as the
total of column values, excluding TP, leading to a value of 11. Lastly, TN is determined
by the combination of all columns and rows except the class under consideration (i.e.,
cardiac disease), which yields a value of 81.
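The row/column bookkeeping described above can be sketched as follows; the matrix below uses the FN = 12, FP = 11, TN = 81 values quoted in the text, with a hypothetical TP cell, since the text does not state it:

```python
def binary_counts(cm, cls):
    """Derive TP, FN, FP, TN for one class from a confusion matrix
    (rows = actual, columns = predicted)."""
    n = len(cm)
    tp = cm[cls][cls]
    fn = sum(cm[cls][j] for j in range(n) if j != cls)   # rest of the row
    fp = sum(cm[i][cls] for i in range(n) if i != cls)   # rest of the column
    tn = sum(cm[i][j] for i in range(n) for j in range(n)
             if i != cls and j != cls)                   # everything else
    return tp, fn, fp, tn

# Hypothetical 2x2 matrix with the quoted FN/FP/TN; the TP cell is illustrative.
cm = [[103, 12],
      [11, 81]]
print(binary_counts(cm, cls=0))  # (103, 12, 11, 81)
```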
Discussion
Despite the vast amount of data produced by healthcare systems, medicine faces
unique obstacles in comparison to other data-driven businesses where machine
learning has flourished. The Health Insurance Portability and Accountability Act
(HIPAA) mandates strict, center-specific Institutional Review Boards (IRBs) to govern the usage of patient data. This significantly preserves patient privacy, but it has
unwittingly created data silos across the nation [47].

Fig. 10 The confusion matrix results for Extra Trees, RandomForest, AdaBoost, and GradientBoost classifiers

Consequently, the majority of published healthcare machine learning models rely on locally acquired datasets and
lack external validation. According to the Tufts predictive analytics and comparative effectiveness cardiovascular prediction model registry, 58% of cardiovascular prediction models have never been externally validated [69]. Heart-related disorders are one of the
leading causes of deaths and morbidity on a global scale [5–7].
It is common for those with heart disease to be unaware of their condition, and it is
difficult to predict their health condition and diagnose their disease in its early stages
in order to save their lives, minimize their complications and suffering, and reduce
the global burden of disease and mortality [9]. Machine learning models are capable
of accomplishing this difficult task and can be of tremendous assistance in the early
diagnosis and prediction of heart disorders [12–14]. Medical machine learning offers
a vast array of opportunities, including the discovery of hidden patterns that can be
utilized to generate diagnostic accuracy on any medical dataset.
Previous research has demonstrated that machine learning can aid in the prediction of cardiovascular illness [15, 16]. For the diagnosis of cardiac disorders, this
prior research employed various machine learning approaches, such as neural networks, Naive Bayes, Decision Tree, and SVM, and obtained varying degrees of accuracy [18, 19]. The accuracy of the proposed feature selection methodology algorithm
(CFS+Filter Subset Eval), a hybrid method that combines CFS and Bayes theorem,
was 85.5%, according to [70]. Shouman et al. [71] presented an integrated k-means
clustering with the Naive Bayes approach for enhancing the accuracy of Naive Bayes
in diagnosing patients with heart disease, with an accuracy of 84.5%. Using both Naive Bayesian classification and Jelinek–Mercer smoothing techniques, Rupali et al. [72]
developed decision support for the Heart Disease Prediction System (HDPS), with
Laplacian smoothing for approximating important patterns in the data while avoiding
noise; their accuracy was 86%.
Elma et al. [73] created a classifier for predicting heart illness that merged the distance-based approach K-nearest neighbor with a statistically-based NaiveBayes classifier
(cNK) and achieved an 85.92% accuracy rate. Dulhare et al. [74] improved cardiac disease prediction methods using Naive Bayes and particle swarm optimization, attaining
an accuracy of 87.91%.
To accurately predict CVDs in the present study, Shapley values were used to create a Gradient Boosting model with an Area Under the Curve of 0.927 for predicting the risk of a heart disease diagnosis. Using Shapley values, the authors discovered critical cardiac disease signs and their predictive power for a positive diagnosis. Interaction effects
between a patient’s medical information were some of the most relevant predictors in
the model, particularly in features such as Age, Cholesterol, Blood Pressure, ST Slope,
and Chest Pain kind. The proposed Catboost model offered the strongest results overall
and can be utilized for the early identification and diagnosis of heart disease, with an
overall F1-Score of 92.3% and an accuracy of 90.94%, when picking the optimal model.
Overall, the proposed model is superior to earlier approaches for diagnosing cardiac
disease.
Although this study is important, several limitations exist. First, this research depends solely on secondary data available at the selected cardiology and internal medicine departments; hence, there were some missing data, and some variables could not be included in the analysis. The second limitation is the cross-sectional design of the study, which could not examine the longitudinal effects of the risk factors on the development of CVDs.
The possible future orientation of this study is to improve prediction techniques by
combining various machine learning techniques and increase the accuracy and precision of CVD prediction and early diagnosis, which has been shown to be superior to
the majority of traditional state-of-the-art methods. Based on machine learning techniques, the suggested model for the prediction of heart disorders is a robust, effective,
and efficient method for the prediction and early detection of heart ailments. It obtained
and maximized classification performance with greater accuracy and precision percentages than other current models. One of the most significant outcomes of our proposed
machine learning algorithms is that they achieved good accuracy while displaying fewer
feature sets. This is crucial for clinical medical practice, which requires the most precise
and straightforward methods for confirming a diagnosis in order to make a final therapeutic decision. Nonetheless, there are obstacles to the generality of the CVD prediction
models reported in this study. Before being implemented into the clinical guidelines, the
suggested machine learning algorithm must investigate different population datasets
to minimize variation in CVD prevalence patterns and evaluate the possible impact on
physicians’ decision making or patient outcomes.
Conclusion
Prediction of cardiovascular diseases is crucial for assisting clinicians with early disease
diagnosis. Instead of replacing clinicians, machine learning will be a supplement to the
clinical portfolio, enhancing human-led decision-making and clinical practices. Furthermore, by using machine learning techniques, the cost of conducting a long list of
expensive clinical and laboratory investigations will be eliminated, reducing the financial
burden on patients and the healthcare system. This paper proposed new robust, effective, and efficient machine learning algorithms for predicting CVD based on symptoms,
signs, and other patients’ information from hospital records in order to improve the
early prediction of CVD development in its early stages and to ensure early intervention with a warranted recovery. The new technique was more accurate and precise than
existing state-of-the-art algorithms for the classification and prediction of heart
disease. Future research evaluating the performance of the proposed machine learning
algorithms on datasets containing a greater number of modifiable and non-modifiable
risk factors will be crucial for the development of a more accurate and robust system for
the prediction and early diagnosis of heart diseases.
Acknowledgement
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R293), Princess
Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Author contributions
All authors participated in the research idea, conceptualization, data collection, analysis and preparation of the manuscript for publication.
Funding
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R293), Princess
Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Availability of data and materials
The datasets generated and/or analyzed during the current study are not publicly available due to data privacy but are
available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Received: 14 August 2022 Accepted: 30 August 2023
References
1. Javeed A, Rizvi SS, Zhou S, Riaz R, Khan SU, Kwon SJ. Heart risk failure prediction using a novel feature selection method for feature refinement and neural network for classification. Mob Inf Syst. 2020;2020:1–11. https://doi.org/10.1155/2020/8843115.
2. Eckel R, Jakicic J, Ard JD. AHA/ACC guideline on lifestyle management to reduce cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. 2014. https://doi.org/10.1161/01.cir.0000437740.48606.d1. PMID: 24222015.
3. Anderson KM, Wilson PW, Odell PM, Kannel WB. An updated coronary risk profile. A statement for health professionals. Circulation. 1991;83(1):356–62. https://doi.org/10.1161/01.cir.83.1.356.
4. Azmi J, Arif M, Nafis MT, Alam MA, Tanweer S, Wang G. A systematic review on machine learning approaches for cardiovascular disease prediction using medical big data. Med Eng Phys. 2022;103825.
5. Day TE, Goldlust E. Cardiovascular disease risk profiles. Am Heart J. 2010;160(1):3. https://doi.org/10.1016/j.ahj.2010.04.019.
6. Alwan A. Global status report on noncommunicable diseases. World Health Organization; 2011. p. 293–298.
7. Tsao CW, Aday AW, Almarzooq ZI, Anderson CAM, Arora P, Avery CL, Baker-Smith CM, Beaton AZ, Boehme AK, Buxton AE, Commodore-Mensah Y, Elkind MSV, Evenson KR, Eze-Nliam C, Fugar S, Generoso G, Heard DG, Hiremath S, Ho JE, Kalani R, Kazi DS, Ko D, Levine DA, Liu J, Ma J, Magnani JW, Michos ED, Mussolino ME, Navaneethan SD, Parikh NI, Poudel R, Rezk-Hanna M, Roth GA, Shah NS, St-Onge M-P, Thacker EL, Virani SS, Voeks JH, Wang N-Y, Wong ND, Wong SS, Yaffe K, Martin SS. Heart disease and stroke statistics—2023 update: a report from the American Heart Association. Circulation. 2023. https://doi.org/10.1161/CIR.0000000000001123.
8. Wilson P, D'Agostino RB, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(12):1837–47. https://doi.org/10.1161/01.CIR.97.18.1837.
9. Mythili T, Mukherji D, Padalia N, Naidu A. A heart disease prediction model using SVM-decision trees-logistic regression (SDL). Int J Comput Appl. 2013;68(16):11–5.
10. Frieden TR, Jaffe MG. Saving 100 million lives by improving global treatment of hypertension and reducing cardiovascular disease risk factors. J Clin Hypertens. 2018;20(2):208.
11. Haissaguerre M, Derval N, Sacher F, Deisenhofer I, de Roy L, Pasquie J, Nogami A, Babuty D, Yli-Mayry S. Sudden cardiac arrest associated with early repolarization. N Engl J Med. 2008;358(19):2016–23.
12. Kumar PM, Lokesh S, Varatharajan R, Babu GC, Parthasarathy P. Cloud and IoT based disease prediction and diagnosis system for healthcare using fuzzy neural classifier. Future Gener Comput Syst. 2018;68:527–34.
13. Mohan S, Thirumalai C, Srivastava G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access. 2019;7:81542–54.
14. Kwon JM, Lee Y, Lee S, Park J. Effective heart disease prediction using hybrid machine learning technique. J Am Heart Assoc. 2018;7(13):1–11.
15. Esfahani HA, Ghazanfari M. Cardiovascular disease detection using a new ensemble classifier. In: 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, Iran. 2017;488–96.
16. Gandhi M, Singh SN. Cardiovascular disease detection using a new ensemble classifier. In: 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), Greater Noida, India. 2015;520–525.
17. Krittanawong C, Virk HUH, Bangalore S, Wang Z, Johnson KW, Pinotti R, Zhang H, Kaplin S, Narasimhan B, Kitai T, et al. Machine learning prediction in cardiovascular diseases: a meta-analysis. Sci Rep. 2020;10(1):16057.
18. Shouman TT, Stocker R. Integrating clustering with different data mining techniques in the diagnosis of heart disease. J Comput Sci Eng. 2013;20(1).
19. Motur S, Rao ST, Vemuru S. Frequent itemset mining algorithms: a survey. J Theor Appl Inf Technol. 2018;96(3).
20. Javeed A, Khan SU, Ali L, Ali S, Imrana Y, Rahman A. Machine learning-based automated diagnostic systems developed for heart failure prediction using different types of data modalities: a systematic review and future directions. Comput Math Methods Med. 2022;2022:1–30. https://doi.org/10.1155/2022/9288452.
21. Malki Z, Atlam E, Dagnew G, Alzighaibi AR, Ghada E, Gad I. Bidirectional residual LSTM-based human activity recognition. J Comput Inf Sci. 2020;13(3):1–40.
22. Malki Z, Atlam E-S, Hassanien AE, Dagnew G, Elhosseini MA, Gad I. Association between weather data and COVID-19 pandemic predicting mortality rate: machine learning approaches. Chaos Solitons Fractals. 2020;138:110137. https://doi.org/10.1016/j.chaos.2020.110137.
23. Atlam E-S, El-Raouf MMA, Ewis A, Ghoneim O, Gad I. A new approach to identify psychological impact of COVID-19 on university students' academic performance. Alex Eng J. 2021;61(7):5223–33.
24. Malki Z, Atlam E-S, Ewis A, Dagnew G, Reda A, Elmarhomy G, Elhosseini MA, Hassanien AE, Gad I. ARIMA models for predicting the end of COVID-19 pandemic and the risk of a second rebound. Neural Comput Appl. 2020;33(7):2929–2948. https://doi.org/10.21203/rs.3.rs-34702/v1.
25. Almars MM, Almaliki M, Noor TH, Alwateer MM, Atlam E. HANN: hybrid attention neural network for detecting COVID-19 related rumors. IEEE Access. 2022;10:12334–44.
26. Malki Z, Atlam E-S, Ewis A, Dagnew G, Ghoneim OA, Mohamed AA, Abdel-Daim MM, Gad I. The COVID-19 pandemic: prediction study based on machine learning model. Environ Sci Pollut Res. 2021;28(30):40496–506.
27. Manjunatha MFDH, Ibrahim Gad E-SA, Ahmed A, Elmarhomy G, Elmarhoumy M, Ghoneim OA. Parallel genetic algorithms for optimizing the SARIMA model for better forecasting of the NCDC weather data. Alexandria Eng J. 2020;60:1299–316.
28. Khan MA, Algarn F. A healthcare monitoring system for the diagnosis of heart disease in the IoMT cloud environment using MSSO-ANFIS. IEEE Access. 2020;8:122259–69.
29. Javeed A, Zhou S, Yongjian L, Qasim I, Noor A, Nour R. An intelligent learning system based on random search algorithm and optimized random forest model for improved heart disease detection. IEEE Access. 2019;7:180235–43. https://doi.org/10.1109/access.2019.2952107.
30. Worldometer. Coronavirus. Accessed October 2020. https://www.worldometers.info/coronavirus/.
31. WHO. Coronavirus. 2020. www.who.int/health-topics/.
32. Ali L, Rahman A, Khan A, Zhou M, Javeed A, Khan JA. An automated diagnostic system for heart disease prediction based on χ2 statistical model and optimally configured deep neural network. IEEE Access. 2019;7:34938–45. https://doi.org/10.1109/access.2019.2904800.
33. Ministry of Health. COVID-19. Accessed October 2020. https://covid19.moh.gov.sa/.
34. Ambale-Venkatesh B, Yang X, Wu CO, Liu K, Hundley WG, McClelland R, Gomes AS, Folsom AR, Shea S, Guallar E, et al. Cardiovascular event prediction by machine learning: the multi-ethnic study of atherosclerosis. Circ Res. 2017;121(9):1092–101.
35. Feng Y, Leung AA, Lu X, Liang Z, Quan H, Walker RL. Personalized prediction of incident hospitalization for cardiovascular disease in patients with hypertension using machine learning. BMC Med Res Methodol. 2022;22(1):1–11.
36. Adam P, Parveen A. Prediction system for heart disease using naïve Bayes. J Adv Comput Math Sci. 2012;3(3):290–4.
37. Tran H. A survey of machine learning and data mining techniques used in multimedia system. 2019;113:13–21.
38. Gnaneswar B, Jebarani ME. A review on prediction and diagnosis of heart failure. In: 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 17–18 March, Coimbatore, India. 2017;1–3. https://doi.org/10.1109/ICIIECS.2017.8276033.
39. Kusprasapta M, Ichwan M, Utami DB. Heart rate prediction based on cycling cadence using feedforward neural network. In: 2016 International Conference on Computer, Control, Informatics and its Applications (IC3INA), IEEE. 2016;72–76. https://doi.org/10.1109/IC3INA.2016.7863026.
40. Singh KY, Sinha N, Singh KS. Heart disease prediction system using random forest. In: International Conference on Advances in Computing and Data Sciences (ICACDS 2016), Communications in Computer and Information Science, Singapore. 2017;721:613–623. https://doi.org/10.1007/978-981-10-5427-3_63.
41. Priya RP, Kinariwala SA. Automated diagnosis of heart disease using random forest algorithm. Int J Adv Res Ideas Innovat Technol. 2017;3(2).
42. Tripoliti E, Fotiadis DI, Manis G. Automated diagnosis of diseases based on classification: dynamic determination of the number of trees in random forests algorithm. IEEE Trans Inf Technol Biomed. 2012;16(4).
43. Gonsalves AH, Thabtah F, Mohammad RMA, Singh G. Prediction of coronary heart disease using machine learning:
an experimental analysis. In: Proceedings of the 2019 3rd International Conference on Deep Learning Technologies,
2019;51–56.
44. Oikonomou EK, Williams MC, Kotanidis CP, Desai MY, Marwan M, Antonopoulos AS, Thomas KE, Thomas S, Akoumianakis I, Fan LM, et al. A novel machine learning-derived radiotranscriptomic signature of perivascular fat improves cardiac risk prediction using coronary CT angiography. Eur Heart J. 2019;40(43):3529–43.
45. El-Hasnony IM, Elzeki OM. Multi-label active learning-based machine learning model for heart disease prediction. Sensors. 2022;22(3):1184. https://doi.org/10.3390/s22031184.
46. Guleria P, Srinivasu PN, Ahmed S. AI framework for cardiovascular disease prediction using classification techniques. Electronics. 2022;11(24):4086. https://doi.org/10.3390/electronics11244086.
47. Javaid A, Zghyer F, Kim C, Spaulding EM, Isakadze N, Ding J, Kargillis D, Gao Y, Rahman F, Brown DE, et al. Medicine 2032: the future of cardiovascular disease prevention with machine learning and digital health technology. Am J Prev Cardiol. 2022:100379.
48. Alaa AM, Bolton T, Di Angelantonio E, Rudd JH, van der Schaar M. Cardiovascular disease risk prediction using automated machine learning: a prospective study of 423,604 UK Biobank participants. PLoS One. 2019;14(5):e0213653.
49. Ward A, Sarraju A, Chung S, Li J, Harrington R, Heide…
