Applying data mining techniques to predict vitamin D deficiency in diabetic patients

Abstract

Vitamin D is among the vitamins necessary for both adults’ and children’s health. It plays a significant role in calcium absorption, immune function, cell proliferation and differentiation, bone protection, skeletal health, the prevention of rickets, muscle health, heart health, disease pathogenesis and severity, glucose metabolism, glucose intolerance, insulin secretion, and diabetes. Because the 25-hydroxyvitamin D (25OHD) test, which is used to measure vitamin D, is expensive and may not be covered by healthcare benefits in many countries, this study aims to predict vitamin D deficiency in diabetic patients. The prediction method is based on data mining techniques combined with feature selection and uses historical electronic health records. Results obtained with all features were compared with those obtained after applying a filter-based feature selection algorithm, namely relief-F. Non-valuable features were eliminated effectively with the relief-F feature selection method without any performance loss in classification. The performances of the methods were evaluated using classification accuracy (ACC), sensitivity, specificity, F1-score, precision, kappa results, and receiver operating characteristic (ROC) curves. The analyses were conducted on a vitamin D dataset of diabetic patients, and the results show that the highest classification accuracy of 97.044% was obtained for the support vector machines (SVM) model with a radial kernel using 18 features.

Keywords

diabetes, logistic regression, random forest, relief-F, support vector machines, vitamin D

Introduction

Diabetes is a global public health problem affecting 387 million people, and by 2035 the number is estimated to increase to 592 million worldwide. Around 700 people in the world are diagnosed with diabetes daily, which is equivalent to one every 2 minutes.¹ Diabetic adults are 1.8 times more likely to be hospitalized for myocardial infarction and 1.5 times more likely to be hospitalized for cerebral vascular accidents. Reports of the International Diabetes Federation state that more people die every year from diabetes and its complications than from HIV/AIDS, tuberculosis, and malaria combined. In addition, diabetes places a significant burden on health systems. Medical costs for the care of diabetic patients are 2.3 times higher.^2–5 Poorly controlled diabetes may cause health complications that result in hospitalization, disability, and premature death.⁶

In recent years, research has shown that several factors can be associated with diabetes risk. One of these factors is vitamin D, whose recognized role has expanded to include modulation of the immune system, cell proliferation and differentiation, and glucose metabolism.^7–11 It is mainly known for its importance in maintaining skeletal health, yet research has shown that vitamin D also plays a wide range of roles, from reproduction to adult chronic diseases. Studies investigate the potential involvement of vitamin D in disease pathogenesis, severity, and perhaps treatment.^12,13 In France, a study aimed at characterizing a frail population of adults over 65 years old found that almost everyone (>95% of participants) had clinical vitamin D deficiency.¹⁴ Another study conducted in Iran investigated the prevalence of vitamin D deficiency with a sample size of 26,042 people, and the overall prevalence was reported as 0.56.¹⁵ A great deal of evidence has shown that vitamin D plays a role in abnormal glucose metabolism, glucose intolerance, varying insulin secretion, and diabetes.^16–20 Observational studies showed that the risk of diabetes was positively associated with decreased vitamin D concentrations. A systematic review supported this evidence: increasing vitamin D intake above 500 international units (IU)/day reduced the risk of diabetes by 13% compared with an intake of less than 200 IU/day. People with a 25OHD status higher than 25 ng/mL had a 43% lower risk of developing diabetes compared with those with a 25OHD status lower than 14 ng/mL.^21,22

Vitamin D is a secosteroid hormone synthesized through a chemical reaction in the skin during sun exposure.²³ It is naturally found only in a few foods and supplementation can raise vitamin D status. The positive effect of vitamin D supplementation to reduce the risk of rickets has been accepted for almost 100 years.^24,25

Because the 25OHD test, which is used to measure vitamin D, is expensive and may not be covered by healthcare benefits in many countries, this study aims to predict vitamin D deficiency in diabetic patients by using historical electronic health records. Data mining techniques were used for this purpose. Data mining is the process of finding unknown patterns in a dataset and using these patterns to build predictive models. This study analyzes various data mining techniques that can help medical staff diagnose vitamin D deficiency in diabetic patients accurately. These techniques are relief-F, SVM, random forest (RF), and logistic regression (LR). By using these techniques, we aimed to achieve the best possible accuracy in predicting vitamin D deficiency in diabetic patients. With the aid of this study, deciding on patients’ medical status should become more straightforward.

Related studies

We found that other authors and publications reported interesting findings related to vitamin D deficiency. However, studies on predictive models for vitamin D deficiency cannot be easily compared, due to the heterogeneity of their aims and of the recruited populations. For example, Guo et al. (2013) used the data of 494 Caucasian adults for vitamin D status prediction. The study used multiple linear regression and radial basis function support vector regression (RBF-SVR) to develop a 25OHD prediction score. The RBF-SVR model provided 74% accuracy.²⁶ Gonoodi et al. (2019) used the data from 988 adolescent girls (12–18 years old) to assess the risk factors for vitamin D deficiency with a decision tree (DT) model. The model achieved a sensitivity, specificity, and accuracy of 79.3%, 64%, and 77.8%, respectively.²⁷ Sambasivam et al. (2020) used k-nearest neighbor, DT, RF, adaboost, bagging classifier, extra trees, stochastic gradient descent, gradient boosting, SVM, and multi-layer perceptron to analyze vitamin D deficiency severity on the data from a total of 3044 college students (18–21 years old). RF performed the best and achieved a sensitivity of 96%, a negative predictive value of 96%, and a classification accuracy of 96%.²⁸ Carretero et al. (2021) used the data from 1002 hypertensive patients to predict vitamin D deficiency. They applied LR, SVM, RF, naive Bayes, and extreme gradient boosting. The SVM-based model using a radial kernel outperformed the other algorithms in terms of sensitivity (98%), negative predictive value (71%), and classification accuracy (73%).²⁹ Amiri et al. (2021) used LR and RF on the data from the survey of ultraviolet intake by nutritional approach study. They achieved an accuracy rate of 93% using RF to determine the factors affecting the response to vitamin D supplementation.³⁰ Padmaja et al. (2021) used machine learning algorithms including RF, multi-layer perceptron, k-nearest neighbor, SVM, decision tree, gradient boosting, stochastic gradient descent, adaboost classifier, extra trees classifier, and LR. In their study, the prognosis of vitamin D deficiency severity was estimated, and the extra trees classifier achieved an accuracy rate of 73.3%.³¹ To our knowledge, no studies with diabetic patients have been performed to date.

Materials and methods

Data collection

In this section, we present the details of the novel dataset that we created. The complete records of 406 diabetic patients were collected from two medical centers in southern Turkey between February 2016 and April 2019. Each record contains the features listed in Table 1.

Table 1.

Features of the dataset.

Feature name	Measurement unit	Mean ± SD
Gender	Male, female
Age	years	56.52 ± 12.79
Glucose	mg/dL	121.16 ± 42.99
Glycated hemoglobin (HbA1c)	%	6.55 ± 1.42
Creatinine	mg/dL	0.78 ± 0.19
Urea	mg/dL	31.58 ± 8.62
Uric acid	mg/dL	5.54 ± 1.37
Alanine transaminase (ALT)	U/L	21.50 ± 10.27
Aspartate aminotransferase (AST)	U/L	20.01 ± 6.65
Alkaline phosphatase (ALP)	U/L	73.88 ± 28.79
Lactate dehydrogenase (LDH)	U/L	160.24 ± 31.15
Gamma-glutamyl transferase (GGT)	U/L	28.51 ± 49.73
Albumin	g/dL	6.43 ± 8.88
Bilirubin	mg/dL	0.21 ± 0.11
Sodium	mEq/L	138.71 ± 3.16
Potassium	mEq/L	4.55 ± 0.43
Total cholesterol	mg/dL	207.26 ± 46.39
High-density lipoprotein (HDL)	mg/dL	48.06 ± 11.07
Low-density lipoprotein (LDL)	mg/dL	126.31 ± 34.20
Triglyceride	mg/dL	170.26 ± 81.20
Iron	µg/dL	72.27 ± 31.08
Unsaturated iron-binding capacity (UIBC)	µg/dL	267.20 ± 65.79
Calcium	mg/dL	9.66 ± 0.44
Phosphorus	mg/dL	3.73 ± 0.48
Magnesium	mg/dL	2.04 ± 0.23
Creatine kinase (CK)	U/L	93.12 ± 74.92
Antistreptolysin O (ASO)	IU/mL	52.42 ± 80.87
C-Reactive protein (CRP)	mg/dL	0.80 ± 1.35
White blood cell (WBC)	×10^3/µL	7.35 ± 1.89
Red blood cell (RBC)	×10^6/µL	4.68 ± 0.45
Hemoglobin (HGB)	g/dL	13.18 ± 1.44
Hematocrit (HCT)	%	40.27 ± 3.89
Mean corpuscular volume (MCV)	fL	86.30 ± 6.46
Mean corpuscular hemoglobin (MCH)	pg	28.27 ± 2.54
Red blood cell distribution width (RDW-CV)	%	14.16 ± 1.29
Platelet (PLT)	×10^3/µL	266.34 ± 66.00
Mean platelet volume (MPV)	fL	9.51 ± 1.22
Procalcitonin (PCT)	%	0.25 ± 0.06
Platelet distribution width (PDW)	%	49.44 ± 7.55
Lymphocytes (LY)	×10^3/µL	2.26 ± 0.63
Monocytes (MO)	×10^3/µL	0.42 ± 0.12
Neutrophils (NE)	×10^3/µL	4.30 ± 1.59
Eosinophil (EO)	×10^3/µL	0.21 ± 0.14
Basophils (BA)	×10^3/µL	0.04 ± 0.02
Sedimentation	mm/h	12.05 ± 8.88
Insulin	µIU/mL	10.85 ± 4.67
Triiodothyronine (T3)	pg/mL	2.95 ± 0.44
Thyroxine (T4)	ng/dL	1.17 ± 0.19
Thyroid-stimulating hormone (TSH)	µIU/mL	2.25 ± 1.74
Ferritin	ng/mL	45.32 ± 41.09
Vitamin B12 (VB12)	pg/mL	403.60 ± 274.88
Vitamin D deficiency	0,1

The dataset consists of 115 male and 291 female patients. The class label was vitamin D deficiency (0, “deficient” label, or 1, “sufficient” label). The ratio of the deficient class to the sufficient class was approximately 2.8:1 (298, “deficient”; 108, “sufficient”). Since studies on class imbalance have concentrated on imbalance ratios ranging from 1:4 up to 1:100, the class imbalance was considered to be of negligible importance in this study.³² The data used to form the dataset were collected manually, and redundant data were removed before the dataset was formed.

Figure 1 shows the overall flow of our model, which uses relief-F, SVM, RF, and LR to search for the best model to predict vitamin D deficiency in diabetic patients. First, pre-processing was applied to the raw data, and the features necessary for our study were recorded. Diabetic patients and vitamin D deficiency were identified manually. After pre-processing, the dataset was formed, and 406 observations of blood test results were listed for later analyses. Then, the relief-F scores of every feature in the training set were computed and sorted in descending order. Next, a subset of the training set containing the features with the best relief-F scores was generated and validated by using the testing set. During these two steps, 10-fold cross-validation, the most commonly used evaluation technique, was carried out to find the optimized values and avoid overfitting. The dataset was initially partitioned into 10 folds, where nine folds were used to train and the remaining fold was used to test the prediction model. The folds were then rotated so that every fold was used once for testing. The final performance metrics were averaged over the 10 estimates from the test folds, ensuring that 10 independent sets were used to test the model.³³

Figure 1.

The overall flow of our model.

This process is repeated until all of the features are ranked with respect to their relief-F scores.^34,35
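The 10-fold rotation described above can be sketched in a few lines of Python. This is only an illustration of the protocol, not the Weka procedure used in the study; the function names, the shuffling seed, and the `evaluate` callback (which stands in for training and scoring a classifier) are assumptions introduced here.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, starting at offset j.
    return [indices[j::k] for j in range(k)]

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k train/test rotations."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for j, test_idx in enumerate(folds):
        # The other k - 1 folds together form the training set.
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# 406 records, as in the dataset: six folds of 41 records and four of 40.
folds = k_fold_indices(406, k=10)
fold_sizes = sorted(len(fold) for fold in folds)

# Toy stand-in for a classifier score: the share of records held out per fold.
mean_share = cross_validate(406, lambda train, test: len(test) / 406)
```

Each record appears in exactly one test fold, so the averaged score is based on predictions for all 406 records.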

Machine learning methods

In clinical diagnosis, a small feature subset means lower test expenses. The advantage of limiting the number of input features is a model that remains predictive while being less computationally intensive. In this study, the relief-F algorithm was used for feature selection.

The purpose of machine learning is to recognize complex patterns in a given dataset automatically and allow for prediction in a new dataset. Many developers recognize that it can be easier to train a system by showing examples of desired responses than programming manually by anticipating desired responses for all possible inputs. The most widely used methods in machine learning are supervised learning methods. Each case in a supervised learning method is characterized by a label which is used to generate a function that maps the dataset by minimizing the classification error. Supervised learning methods used in this study are SVM, RF, and LR.^36,37

Relief-F algorithm

Relief-F, proposed by Kononenko, is the best-known variant of the relief-based algorithms. For each target instance, relief-F relies on a user parameter k that specifies how many nearest hits (neighbors of the same class) and nearest misses (neighbors of a different class) are used in the scoring update. In multi-class problems, relief-F finds the k nearest misses from each of the other classes and averages the weight update based on the prior probability of each class. This encourages the algorithm to estimate the ability of features to separate all pairs of classes, regardless of which two classes are closest to one another. Since the quality of the weight estimates is expected to become more reliable as the number of sampled target instances approaches the total number of instances, Kononenko proposed the simplifying assumption that every instance in the dataset serves as the target instance exactly once.³⁸
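The scoring update can be illustrated with a minimal Python sketch. It assumes numeric features only, range-normalized feature differences, and Manhattan distance for the neighbor search; these choices, the function name `relief_f`, and the toy data are assumptions made for illustration and do not reproduce the Weka implementation used in this study.

```python
from collections import Counter

def relief_f(X, y, k=3):
    """Return one relief-F weight per feature (higher means more relevant)."""
    n, n_feat = len(X), len(X[0])
    # Feature ranges scale every per-feature difference into [0, 1].
    spans = []
    for f in range(n_feat):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(X[a][f] - X[b][f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_feat))

    priors = {c: count / n for c, count in Counter(y).items()}
    weights = [0.0] * n_feat
    for t in range(n):  # every instance is the target exactly once
        for c, prior in priors.items():
            # k nearest neighbors of the target inside class c
            candidates = [i for i in range(n) if i != t and y[i] == c]
            nearest = sorted(candidates, key=lambda i: distance(t, i))[:k]
            if not nearest:
                continue
            if c == y[t]:
                scale = -1.0  # hits: penalize features that differ
            else:
                scale = prior / (1.0 - priors[y[t]])  # misses: prior-weighted reward
            for f in range(n_feat):
                mean_diff = sum(diff(f, t, i) for i in nearest) / len(nearest)
                weights[f] += scale * mean_diff / n
    return weights

# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_f(X, y, k=2)
ranking = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
```

In this toy example the informative feature receives the larger weight and is therefore ranked first.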

Support vector machines

SVM are powerful supervised learning algorithms proposed by Vapnik in 1995 as a statistical learning method for prediction. SVM use support vectors to identify the decision boundaries between classes, based on a linear machine in a high-dimensional feature space. SVM have enabled the development of fast training techniques for problems with a large number of input variables. The SVM algorithm creates a model that recognizes patterns in the training data, each instance belonging to one of two classes, and then predicts the class of new instances. Simply put, SVM are discriminative classifiers: the SVM algorithm classifies new samples by the most appropriate separating decision boundary, which is the hyperplane of the SVM.

If the training dataset is linearly separable, linear SVM can be used by choosing two boundaries that separate the samples. Consider the problem of separating the set of training vectors that belongs to two linearly separable classes,

(x_{i}, y_{i}), \quad x_{i} \in \mathbb{R}^{n}, \quad y_{i} \in \{+1, -1\}, \quad i = 1, \dots, n

(1)

where x_i is a real-valued n-dimensional input vector and y_i is a label determining the class of x_i. A separating hyperplane can be written as

w \cdot x + b = 0

(2)

where w is a weight vector, namely, w = {w₁, w₂, …, w_n}; n is the number of attributes; and b is a bias. The parameters w and b are constrained by

\min_{i} | w \cdot x_{i} + b | \geq 1

(3)

In canonical form, a separating hyperplane must satisfy the following constraints

y_{i} (w \cdot x_{i} + b) \geq 1

(4)

The hyperplane that optimally separates the data is the one that minimizes

\phi(w) = \frac{1}{2} (w \cdot w)

(5)
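The link between Equation (5) and the separating margin can be made explicit. Under the canonical constraints of Equation (4), the training points closest to the hyperplane satisfy |w · x_i + b| = 1, so each lies at distance 1/‖w‖ from it. The following standard step, added here for illustration, shows that maximizing the margin between the two classes is equivalent to minimizing Equation (5):

```latex
\text{margin} = \frac{2}{\lVert w \rVert},
\qquad
\max_{w, b} \frac{2}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w, b} \frac{1}{2} (w \cdot w)
\quad \text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1 .
```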

The optimal function is given by

\max_{\alpha} \left[ \sum_{i=1}^{n} \alpha_{i} - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} (x_{i} \cdot x_{j}) \right]

(6)

SVM handle non-linearly separable data by using kernel functions, and because training is a quadratic optimization problem, they avoid local minima. If the training dataset is not linearly separable, SVM map the input vectors into a high-dimensional feature space and build an optimal separating hyperplane in this space by means of a non-linear mapping, usually defined as

\varphi(\cdot) : \mathbb{R}^{n} \to \mathbb{R}^{n_{h}}

(7)

The optimal function becomes

\max_{\alpha} \left[ \sum_{i=1}^{n} \alpha_{i} - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j}) \right]

(8)

where

K(x_{i}, x_{j}) = \varphi(x_{i}) \cdot \varphi(x_{j})

(9)

is the kernel function performing the non-linear mapping into feature space.^39–45
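As a concrete instance of Equation (9), the sketch below evaluates a radial (Gaussian RBF) kernel in Python. The text does not state the kernel formula or its parameter, so the common form K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) and the value of `gamma` are assumptions made for illustration.

```python
import math

def rbf_kernel(x_i, x_j, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

def gram_matrix(X, gamma=0.5):
    """Matrix of K(x_i, x_j) for all pairs, as used in Equation (8)."""
    return [[rbf_kernel(a, b, gamma) for b in X] for a in X]

X = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = gram_matrix(X)
```

The kernel equals 1 for identical points and decays toward 0 as points move apart, so the classifier never has to compute the mapping φ explicitly.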

Random forest

RF is a collection of DTs. Rather than seeking the most important feature among all features when splitting a node, RF seeks the best feature among a random subset of features. Each DT in RF is built from randomly chosen training data. The predictive accuracy of any single DT may be low, but combining the predictions of many trees increases the accuracy of RF. The randomness makes RF less prone to overfitting, and by reducing the generalization error, RF is robust to noise and outliers and achieves high accuracy.

RF can be built using bagging in tandem with random attribute selection. Consider a training set, D, of d tuples. For each iteration, a training set D_i of d tuples is sampled with replacement from D. Let F be the number of attributes used to determine the split at each node; at each node, F attributes are selected at random as candidates for the split. The trees are grown to maximum size and are not pruned.^43,46–49
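The two sources of randomness and the combination of tree predictions can be sketched as follows. The sketch does not grow any trees; the helper names, the seed, and the choice F = 7 (close to the square root of 51 features) are assumptions introduced for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(n_tuples, rng):
    """Draw n_tuples row indices from D with replacement (the sample D_i)."""
    return [rng.randrange(n_tuples) for _ in range(n_tuples)]

def candidate_attributes(n_attributes, F, rng):
    """Pick F distinct attributes as split candidates for one node."""
    return rng.sample(range(n_attributes), F)

def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
D_i = bootstrap_sample(406, rng)              # training rows for one tree
candidates = candidate_attributes(51, 7, rng)  # attributes tried at one node
label = majority_vote([1, 0, 1, 1, 0])        # three of five trees predict 1
```

Because sampling is done with replacement, some records appear several times in D_i and others not at all, which is what makes the individual trees differ.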

Logistic regression

LR is a form of regression that examines the relationship between a binary dependent outcome (such as present or absent) and independent variables of any type (such as historical electronic health records). LR is widely used for prediction studies in healthcare and has been applied frequently in biology, medicine, economics, agriculture, veterinary science, and transportation. The predictor variables can be both continuous and categorical. The following equation describes the relationship between the predictor variables and disease presence:

\log \left( \frac{p}{1 - p} \right) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{n} x_{n}

(10)

where x₁, x₂, …, x_n denote the n predictor variables, y denotes the presence or absence of disease, p denotes the probability of disease presence (ie, the probability that y = 1), β₀ is a constant, and β₁, …, β_n are the regression coefficients of the predictor variables x₁, x₂, …, x_n.^50–54
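Solving Equation (10) for p gives the logistic function, p = 1 / (1 + exp(−(β₀ + β₁x₁ + … + β_n x_n))). A minimal Python sketch of this inversion is shown below; the coefficient values and inputs are hypothetical and are not estimates from this study.

```python
import math

def predict_probability(betas, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

betas = [-1.0, 0.8, -0.5]                    # hypothetical beta_0, beta_1, beta_2
p = predict_probability(betas, [2.0, 1.0])   # linear term: -1 + 1.6 - 0.5 = 0.1
log_odds = math.log(p / (1.0 - p))           # recovers the linear term
```

Taking the log-odds of the predicted probability returns the linear combination of predictors, which confirms that the two forms are equivalent.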

Performance metrics

In Table 2, a confusion matrix containing information about actual and predicted classifications is presented.

Table 2.

Confusion matrix representation.

	Predicted	
Actual	Positive	Negative
Positive	True positive (TP)	False negative (FN)
Negative	False positive (FP)	True negative (TN)

Performance metrics can be calculated by using the following equations:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(11)

\mathrm{Sensitivity} = \frac{TP}{TP + FN}

(12)

\mathrm{Specificity} = \frac{TN}{TN + FP}

(13)

\mathrm{F1\text{-}score} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP}

(14)

\mathrm{Precision} = \frac{TP}{TP + FP}

(15)

\mathrm{Kappa} = \frac{2 \cdot (TP \cdot TN - FN \cdot FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}

(16)

(16)
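Equations (11)–(16) translate directly into code. The sketch below applies them to the SVM counts reported in Table 5, treating “deficient” as the positive class; that choice of positive class is an assumption, since the text does not state it.

```python
def classification_metrics(tp, fn, fp, tn):
    """Evaluate Equations (11)-(16) for one 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    kappa_denominator = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fn + fp),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fn * fp) / kappa_denominator,
    }

# SVM counts from Table 5, with "deficient" as the positive class (assumption).
svm = classification_metrics(tp=298, fn=0, fp=12, tn=96)
accuracy_percent = round(svm["accuracy"] * 100, 3)
```

Computed this way, the accuracy, specificity, and kappa agree with the values reported for the SVM model. The remaining metrics depend on which class is taken as positive and on whether per-class or class-weighted averages are reported, which the text does not specify.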

Results and discussion

In this section, we present an empirical analysis that compares the results obtained with the reduced feature subset against those obtained with all features for the vitamin D deficiency problem in diabetic patients. The analyses were conducted on the novel vitamin D dataset of diabetic patients described in the Data collection section. There were two classes, deficient and sufficient vitamin D status, indicated by 0 and 1, respectively. We investigated the effects of machine learning techniques on predicting vitamin D deficiency in diabetic patients automatically with fewer features. To achieve this, we used the SVM, RF, and LR classification algorithms with 10-fold cross-validation and the relief-F feature selection algorithm from the Weka data mining software package. Since we also investigated important features, and for comparison purposes, in the first stage of our study we conducted analyses using all 51 features obtained from historical electronic health records. This baseline, obtained by using 51 features with the SVM, RF, and LR classification algorithms, is shown in Table 3, which lists each classifier’s classification accuracy, sensitivity, specificity, F1-score, precision, and kappa results. The accuracy rates of the SVM, RF, and LR classifiers are 97.044%, 82.759%, and 71.183%, respectively.

Table 3.

Classification accuracy, sensitivity, specificity, F1-score, precision, and kappa results for each classifier.

	ACC (%)	Sensitivity	Specificity	F1-score	Precision	Kappa
SVM	97.044	0.970	0.889	0.970	0.972	0.922
RF	82.759	0.828	0.861	0.835	0.858	0.605
LR	71.183	0.712	0.000	0.610	0.534	−0.043

Since we aimed to predict vitamin D deficiency in diabetic patients as accurately as possible with the minimum number of features, in the second stage we followed the steps of our model shown in Figure 1 and conducted analyses using the features ranked by the relief-F feature selection method from the Weka data mining software package. Based on our empirical studies, we chose the 18 top-ranked features to measure classification performance. Table 4 shows each classifier’s classification accuracy, sensitivity, specificity, F1-score, precision, and kappa results. The accuracy rates of the SVM, RF, and LR classifiers are 97.044%, 88.177%, and 74.138%, respectively. The SVM model also showed the highest performance across the given metrics. The sensitivity, specificity, F1-score, precision, and kappa results for the SVM model are 0.970, 0.889, 0.970, 0.972, and 0.922, respectively. By comparing accuracy, sensitivity, specificity, F1-score, precision, kappa results, and ROC curves for the SVM, RF, and LR classifiers, this study chose the optimum prediction model to forecast vitamin D deficiency in diabetic patients. The SVM model outperformed both RF and LR on the performance evaluation metrics. Thus, the SVM model is our optimum prediction model.

Table 4.

Classification accuracy, sensitivity, specificity, F1-score, precision, and kappa results for each classifier with their optimum number of features.

	Feature count	ACC (%)	Sensitivity	Specificity	F1-score	Precision	Kappa
SVM	18	97.044	0.970	0.889	0.970	0.972	0.922
RF	35	88.177	0.882	0.843	0.884	0.888	0.709
LR	14	74.138	0.741	0.046	0.647	0.735	0.056

The dataset we worked on is a two-class problem; as described in the Data collection section, redundant data were removed and the class imbalance is negligible. SVM have been reported to perform very well with medical datasets. The radial kernel provides better accuracy than other SVM kernels on non-linearly separable data.²⁹ Since our dataset has similar characteristics, the maximum margin hyperplane was determined by the SVM model using a radial kernel.

The most important parameter when using an RF is the number of DTs. A larger number of trees gives more stable models but consumes more computational resources.⁵⁵ In this study, using 100 trees to compute the RF results was empirically deemed appropriate.

LR is mostly used for multivariate analysis in the medical literature, but in this study LR performed worse, probably because of nonlinear relationships between the variables.⁵⁶

The ROC curves, which show the tradeoff between sensitivity and specificity, and the areas under the ROC curves (AUC) for the SVM, RF, and LR models are presented in Figures 2–4. For a ROC curve, a larger area under the curve indicates better classifier performance.
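How a ROC curve and its AUC are obtained from classifier scores can be sketched in Python as follows. The labels and scores are made-up values for illustration, not outputs of the models in this study, and the function names are assumptions.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # 1 = positive class (made-up data)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # classifier scores (made-up data)
area = auc(roc_points(labels, scores))    # 8 of 9 positive/negative pairs ordered correctly
```

The area equals the fraction of positive/negative pairs in which the positive instance receives the higher score, which is why a larger area indicates a better ranking of patients.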

Figure 2.

SVM model.

Figure 3.

RF model.

Figure 4.

LR model.

To summarize our prediction performance, the confusion matrices of the classifiers are given in Table 5. As seen in Table 5, the SVM classifier makes fewer errors than the other classifiers.

Table 5.

Confusion matrices for the SVM, RF, and LR models.

	SVM model		RF model		LR model	
	Predicted		Predicted		Predicted	
Actual	Deficient	Sufficient	Deficient	Sufficient	Deficient	Sufficient
Deficient	298	0	267	31	296	2
Sufficient	12	96	17	91	103	5

The highest classification accuracy with a reduced number of features was achieved with the SVM classifier.⁵⁷ Instead of considering all 51 variables, relief-F selected the 18 most relevant ones, which were used with the SVM classifier. Table 6 lists the features most relevant to vitamin D deficiency in diabetic patients, in descending order of variable importance.

Table 6.

Ranking of relative importance with relief-F for the best input variables.

Ranking of the attributes	Attributes
1	Gender
2	Magnesium
3	Glucose
4	Phosphorus
5	Age
6	Sedimentation
7	PCT
8	Iron
9	T3
10	PLT
11	LDL
12	RDW-CV
13	AST
14	Albumin
15	HDL
16	Triglyceride
17	MPV
18	VB12

Considering these attribute rankings, we conducted a further literature review. According to this review, gender significantly affects vitamin D status; the recommended amount of magnesium consumption is essential to obtain the optimal benefits of vitamin D; abnormal glucose metabolism is linked to vitamin D deficiency; the utilization of vitamin D in the body raises calcium and phosphorus levels; abnormally low levels of plasma 25OHD are more common in the elderly; the erythrocyte sedimentation rate is higher in patients with vitamin D deficiency; vitamin D supplementation may interfere with the ability to use PCT; a positive relationship exists between iron status and vitamin D; rats receiving a vitamin D restricted diet were made thyrotoxic by subcutaneous injection of T3; vitamin D deficiency may increase PLT and MPV; vitamin D supplementation significantly decreases LDL; elevated RDW-CV levels may be attributed to the adverse effects of risk factors such as vitamin D; vitamin D treated rats had significantly lower serum levels of AST; albumin belongs to a family of proteins that includes vitamin D binding protein (VDP); low vitamin D has been associated with low levels of HDL; low serum vitamin D levels in children are associated with high levels of triglyceride; and a significant negative correlation exists between vitamin D and B12. In this way, we confirmed our attribute selection with the literature.^58–74

Conclusions

In this study, a medical decision-making system that uses SVM, RF, and LR classifiers combined with relief-F feature selection was applied to diagnose vitamin D deficiency in diabetic patients. The analyses were conducted on a novel vitamin D dataset of diabetic patients. According to these analyses, the proposed model yielded the highest classification accuracy of 97.044% for a subset containing 18 features: gender, magnesium, glucose, phosphorus, age, sedimentation, PCT, iron, T3, PLT, LDL, RDW-CV, AST, albumin, HDL, triglyceride, MPV, and VB12. The proposed model reduced the time required to classify the test dataset by reducing the feature space without any accuracy loss in classification. Further performance measures such as sensitivity, specificity, F1-score, precision, kappa results, and ROC curves were also presented for the proposed model. In light of these results, our SVM model using a radial kernel combined with feature selection gave very promising results in predicting vitamin D deficiency in diabetic patients. Extracting knowledge from medical databases offers assistance that was unavailable before the mining process. Data mining for medical purposes will find its place in the years to come. Keeping the systems compliant with technical standards will probably be pivotal for the deeper use of different datasets in the decision-making process.^75,76 We consider that our model can be very helpful as an information platform that assists physicians in making better clinical decisions at an earlier stage through adaptation of existing healthcare systems. Furthermore, the dataset we created will be useful for researchers who work in this field.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Uğur Engin Eşsiz

Oya Hacire Yüregir

Esra Saraç

References

Guariguata

Whiting

Hambleton

, et al. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Res Clin Pract 2014; 103(2): 137–149.

Centers for Disease Control and Prevention . National diabetes statistics report. Atlanta, GA: U.S. Dept of Health and Human Services, 2020.

International Diabetes Federation . IDF diabetes Atlas. Brussels, Belgium: International Diabetes Federation, 2015.

Signorello

Schlundt

Cohen

, et al. Comparing diabetes prevalence between African Americans and whites of similar socioeconomic status. Am J Publ Health 2007; 97(12): 2260–2267.

American Diabetes Association . Diagnosis and classification of diabetes mellitus. Diabetes Care 2010; 33(Supplement_1): S62–S69.

Napoli

Chandran

Pierroz

IOF Bone and Diabetes Working Group , et al. Mechanisms of diabetes mellitus-induced bone fragility. Nat Rev Endocrinol 2017; 13(4): 208–219.

Goldsmith

. Vitamin D as an immunomodulator: risks with deficiencies and benefits of supplementation. Healthcare 2015; 3(2): 219–232.

Bhalla

Amento

Clemens

, et al. Specific high-affinity receptors for 1,25-dihydroxyvitamin D3 in human peripheral blood mononuclear cells: presence in monocytes and induction in T lymphocytes following activation. J Clin Endocrinol Metab 1983; 57(6): 1308–1310.

Provvedini

Tsoukas

Deftos

, et al. 1,25-Dihydroxyvitamin D 3 receptors in human leukocytes. Science 1983; 221(4616): 1181–1183.

10.

Bikle

. Nonclassic actions of Vitamin D. J Clin Endocrinol Metab 2009; 94(1): 26–34.

11.

Weyland

Grant

Howie-Esquivel

. Does sufficient evidence exist to support a causal association between vitamin d status and cardiovascular disease risk? An assessment using hill’s criteria for causality. Nutrients 2014; 6(9): 3403–3430.

12.

Ross

Manson

Abrams

, et al. The 2011 report on dietary reference intakes for calcium and vitamin D from the institute of medicine: what clinicians need to know. J Clin Endocrinol Metab 2011; 96(1): 53–58.

13.

Wobke

Sorg

Steinhilber

. Vitamin D in inflammatory diseases. Front Physiol 2014; 5: 244. [cited 2022 May 27];5.

14.

Leslie

Hankey

. Aging, nutritional status and health. Healthcare 2015; 3(3): 648–658.

15.

Vatandost

Jahani

Afshari

, et al. Prevalence of vitamin D deficiency in Iran: a systematic review and meta-analysis. Nutr Health 2018; 24(4): 269–278.

16.

Melamed

Michos

Post

, et al. 25-Hydroxyvitamin D levels and the risk of mortality in the general population. Arch Intern Med 2008; 168(15): 1629–1637.

17.

Palomer

González-Clemente

Blanco-Vaca

, et al. Role of vitamin D in the pathogenesis of type 2 diabetes mellitus. Diabetes Obes Metabol 2008; 10(3): 185–197.

18.

Chiu

Chu

VLW

, et al. Hypovitaminosis D is associated with insulin resistance and β cell dysfunction. Am J Clin Nutr 2004; 79(5): 820–825.

19.

Pittas

Sun

Manson

, et al. Plasma 25-hydroxyvitamin D concentration and risk of incident type 2 diabetes in women. Diabetes Care 2010; 33(9): 2021–2023.

20.

Thorand

Zierer

Huth

, et al. Effect of serum 25-hydroxyvitamin D on risk for type 2 diabetes may be partially mediated by subclinical inflammation: results from the MONICA/KORA Augsburg study. Diabetes Care 2011; 34(10): 2320–2322.

21.

Pittas

Dawson-Hughes

. Vitamin D and diabetes. J Steroid Biochem Mol Biol 2010; 121(1–2): 425–429.

22.

Sowell

Keen

Uriu-Adams

. Vitamin D and reproduction: from gametes to childhood. Healthcare 2015; 3(4): 1097–1120.

23.

Taghizadeh

Sharifan

Ekhteraee Toosi

, et al. The effects of consuming a low-fat yogurt fortified with nano encapsulated vitamin D on serum pro-oxidant-antioxidant balance (PAB) in adults with metabolic syndrome, a randomized control trial. Diabetes Metabol Syndr 2021; 15(6): 102332.

24.

Armas

LAG

Hollis

Heaney

. Vitamin D 2 is much less effective than vitamin D 3 in humans. J Clin Endocrinol Metab 2004; 89(11): 5387–5391.

25.

Holick

Biancuzzo

Chen

, et al. Vitamin D2 is as effective as vitamin d3 in maintaining circulating concentrations of 25-hydroxyvitamin D. J Clin Endocrinol Metab 2008; 93(3): 677–681.

26.

Guo

Lucas

Ponsonby

Ausimmune Investigator Group. A novel approach for prediction of vitamin D status using support vector regression. PLoS One 2013; 8(11): e79970.

27.

Gonoodi

Tayefi

Saberi-Karimian

, et al. An assessment of the risk factors for vitamin D deficiency using a decision tree model. Diabetes Metabol Syndr 2019; 13(3): 1773–1777.

28.

Sambasivam

Amudhavel

Sathya

. A predictive performance analysis of vitamin D deficiency severity using machine learning methods. IEEE Access 2020; 8: 109492–109507.

29.

Garcia Carretero

Vigil-Medina

Barquero-Perez

, et al. Machine learning approaches to constructing predictive models of vitamin D deficiency in a hypertensive population: a comparative study. Inf Health Soc Care 2021; 46(4): 355–369.

30.

Amiri

Nosrati

Sharifan

, et al. Factors determining the serum 25-hydroxyvitamin D response to vitamin D supplementation: data mining approach. Biofactors 2021; 47(5): 828–836.

31.

Padmaja

Battu Ramya Reddy

RVS

Pradhan

, et al. Prognosis of Vitamin D deficiency severity using SMOTE optimized machine learning models. Turk J Comput Math Educ 2021; 6: 4553–4567.

32.

Krawczyk

. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 2016; 5(4): 221–232.

33.

Santos

Soares

Abreu

, et al. Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Comput Intell Mag 2018; 13(4): 59–76. DOI: 10.1109/MCI.2018.2866730

34.

Akay

. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst Appl 2009; 36(2): 3240–3247.

35.

Ehrentraut

Ekholm

Tanushi

, et al. Detecting hospital-acquired infections: a document classification approach using support vector machines and gradient tree boosting. Health Inf J 2018; 24(1): 24–42.

36.

Jordan

Mitchell

. Machine learning: trends, perspectives, and prospects. Science 2015; 349(6245): 255–260.

37.

Caixinha

Nunes

. Machine learning techniques in clinical vision sciences. Curr Eye Res 2017; 42(1): 1–15.

38.

Urbanowicz

Melissa Meeker

WLC

Olson

, et al. Relief-based feature selection: introduction and review. J Biomed Inf 2018; 85: 189–203. DOI: 10.1016/j.jbi.2018.07.014

39.

Baymani

Salehi-M

Mansoori

. Applying norm concepts for solving interval support vector machine. Neurocomputing 2018; 311: 41–50.

40.

Aidu

Vapnik

. Estimating the probability density by the stochastic regularization method. Avtomat. i Telemekh. 1989; 50(4): 84–97.

41.

Amiri

Armano

. Segmentation and Feature Extraction of Heart Murmurs in Newborns. JOLST 2013; 1: 107–112.

42.

Cortes

Vapnik

. Support-vector networks. Mach Learn 1995; 20(3): 273–297.

43.

Han

Kamber

. Data mining: concepts and techniques. 3rd edition. Amsterdam, Netherlands: Elsevier, 2012.

44.

Aguiñaga

Ramirez

MAL

. Emotional states recognition, implementing a low computational complexity strategy. Health Inf J 2018; 24(2): 146–170.

45.

Scholkopf

Sung

Kah-Kay

Burges

CJC

, et al. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans Signal Process 1997; 45(11): 2758–2765.

46.

Shafi

Bukhari

Iqbal

, et al. Cleft prediction before birth using deep neural network. Health Inf J 2020; 26(4): 2568–2585.

47.

Byeon

. Developing a random forest classifier for predicting the depression and managing the health of caregivers supporting patients with Alzheimer’s Disease. Han J, editor. THC 2019; 27(5): 531–534.

48.

Witten

Frank

Hall

. Data mining: practical machine learning tools and techniques. 3rd edition. Burlington, MA: Morgan Kaufmann, 2011.

49.

Saberi-Karimian

Safarian-Bana

Mohammadzadeh

, et al. A pilot study of the effects of crocin on high-density lipoprotein cholesterol uptake capacity in patients with metabolic syndrome: a randomized clinical trial. Biofactors 2021; 47(6): 1032–1041.

50.

Reddy

Agrawal

. Predicting and explaining inflammation in Crohn’s disease patients using predictive analytics methods and electronic medical record data. Health Inf J 2019; 25(4): 1201–1218.

51.

Oğuzlar

. Lojistik Regresyon Analizi Yardımıyla Suçlu Profilinin Belirlenmesi. Turkey: Atatürk Üniversitesi İktisadi ve İdari Bilimler Dergisi, 2010, pp. 21–35.

52.

Agresti

. An introduction to categorical data analysis. 2nd edition. Hoboken, NJ: Wiley Interscience, 2007.

53.

Ayer

Chhatwal

Alagoz

, et al. Informatics in radiology: comparison of logistic regression and artificial neural network models in breast cancer risk estimation. Radiographics 2010; 30(1): 13–22. DOI: 10.1148/rg.301095057. Epub 2009 Nov 9. PMID: 19901087; PMCID: PMC3709515.

54.

Chhatwal

Alagoz

Lindstrom

, et al. A logistic regression model based on the national mammography database format to aid breast cancer diagnosis. AJR Am J Roentgenol 2009; 192(4): 1117–1127. DOI: 10.2214/AJR.07.3345. Erratum in: AJR Am J Roentgenol 2009; 192(5): 1167. PMID: 19304723; PMCID: PMC2661033.

55.

García-Carretero

Holgado-Cuadrado

Barquero-Pérez

. Assessment of classification models and relevant features on nonalcoholic steatohepatitis using random forest. Entropy 2021; 23(6): 763.

56.

Garcia-Carretero

Vigil-Medina

Barquero-Perez

. The use of machine learning techniques to determine the predictive value of inflammatory biomarkers in the development of type 2 diabetes mellitus. Metab Syndr Relat Disord 2021; 19(4): 240–248.

57.

Bellazzi

Zupan

. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inf 2008; 77(2): 81–97.

58.

Verdoia

Schaffer

Barbieri

Novara Atherosclerosis Study Group NAS, et al. Impact of gender difference on vitamin D status and its relationship with the extent of coronary artery disease. Nutr Metabol Cardiovasc Dis 2015; 25(5): 464–470. Epub 2015 Feb 7. PMID: 25791862. DOI: 10.1016/j.numecd.2015.01.009

59.

Uwitonze

Razzaque

. Role of magnesium in vitamin D activation and function. J Am Osteopath Assoc 2018; 118(3): 181–189.

60.

Tai

Need

Horowitz

, et al. Vitamin D, glucose, insulin, and insulin sensitivity. Nutrition 2008; 24(3): 279–285.

61.

DeLuca

. The Vitamin D system in the regulation of calcium and phosphorus metabolism. Nutr Rev 1979; 37(6): 161–193.

62.

Baker

Peacock

Nordin

. The decline in vitamin D status with age. Age Ageing 1980; 9(4): 249–252. DOI: 10.1093/ageing/9.4.249 PMID: 6971044.

63.

Kaya

Akçay

EÜ

Ertürk

, et al. The relationship between vitamin D deficiency and erythrocyte sedimentation rate in patients with diabetes. Turk J Med Sci 2018; 48(2): 424–429.

64.

Wolf

Wimalawansa

Razzaque

. Procalcitonin as a biomarker for critically ill patients with sepsis: effects of vitamin D supplementation. J Steroid Biochem Mol Biol 2019; 193: 105428.

65.

Azizi-Soleiman

Vafa

Abiri

, et al. Effects of iron on Vitamin D metabolism: a systematic review. Int J Prev Med 2016; 7(1): 126.

66.

Colston

Cleeve

HJW

. Effect of triiodothyronine on intestinal and kidney isoenzymes of alkaline phosphatase and on vitamin D metabolism in adult female rats. Comp Biochem Physiol B 1986; 83(3): 681–684.

67.

Cumhur Cure

Cure

Yuce

, et al. Mean platelet volume and vitamin D level. Ann Lab Med 2014; 34(2): 98–103.

68.

Schwetz

Scharnagl

Trummer

, et al. Vitamin D supplementation and lipoprotein metabolism: a randomized controlled trial. J Clin Lipidol 2018; 12(3): 588–596.

69.

Bujak

Wasilewski

Osadnik

, et al. The prognostic role of red blood cell distribution width in coronary artery disease: a review of the pathophysiology. Dis Markers 2015; 2015: 1–12.

70.

Seif

Abdelwahed

. Vitamin D ameliorates hepatic ischemic/reperfusion injury in rats. J Physiol Biochem 2014; 70(3): 659–666.

71.

Carter

. Structure of serum albumin. In: Advances in protein chemistry. Amsterdam, Netherlands: Elsevier, 1994, pp. 153–203.

72.

Kazlauskaite

Powell

Mandapakala

, et al. Vitamin D is associated with atheroprotective high-density lipoprotein profile in postmenopausal women. J Clin Lipidol 2010; 4(2): 113–119.

73.

Rodríguez-Rodríguez

Ortega

González-Rodríguez

UCM Research Group VALORNUT 920030, et al. Vitamin D deficiency is an independent predictor of elevated triglycerides in Spanish school children. Eur J Nutr 2011; 50(5): 373–378.

74.

Tariq

Lone

. Interplay of vitamin D, vitamin B12, homocysteine and bone mineral density in postmenopausal females. Health Care Women Int 2018; 39(12): 1340–1349.

75.

Orfanidis

Bamidis

Eaglestone

. Data quality issues in electronic health records: an adaptation framework for the Greek health system. Health Inf J 2004; 10(1): 23–36. DOI: 10.1177/1460458204040665

76.

Stilou

Bamidis

Maglaveras

, et al. Mining association rules from clinical databases: an intelligent diagnostic process in healthcare. Stud Health Technol Inf 2001; 84(Pt 2): 1399–1403. PMID: 11604957.