Machine learning predictive models of LDL-C in the population of eastern India and its comparison with directly measured and calculated LDL-C

Abstract

Background

LDL-C is a strong risk factor for cardiovascular disorders. The formulas used to calculate LDL-C have shown varying performance in different populations. Machine learning models can learn complex interactions between variables and may predict outcomes more accurately. The current study evaluated the performance of three machine learning models (random forests, XGBoost, and support vector regression [SVR]) in predicting LDL-C from total cholesterol, triglyceride, and HDL-C, in comparison with a linear regression model and several existing formulas for LDL-C calculation, in an eastern Indian population.

Methods

Lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar during 2019–2021 were included, for a total of 13,391 samples. Laboratory results were collected from the laboratory database. Seventy percent of the data were assigned to a training set and used to develop the three machine learning models and the linear regression formula. These models were then validated on the remaining 30% of the data (test set). Model performance was evaluated in comparison with six existing LDL-C calculation formulas.

Results

LDL-C predicted by the XGBoost and random forests models showed a strong correlation with directly estimated LDL-C (r = 0.98). These two machine learning models outperformed the six existing, commonly used LDL-C calculation formulas, such as Friedewald, in the study population. They also outperformed the other methods when compared across triglyceride strata.

Conclusion

Machine learning models such as XGBoost and random forests can predict LDL-C more accurately than conventional linear regression-based LDL-C formulas.

Keywords

Low density lipoprotein cholesterol, machine learning, Friedewald formula, Martin formula

Introduction

Cardiovascular diseases are the leading cause of mortality globally.¹ People with cardiovascular disease or with high-risk factors require early detection and management. LDL cholesterol is established as one of the strongest predictors of cardiovascular diseases.^2,3 As NCEP ATP III focuses on serum LDL-C level as the primary treatment target and the basis for risk categorization, accurate laboratory estimation of LDL-C is very important.⁴ The reference method for LDL-C estimation is beta quantification. However, this technique involves complex handling, requires a large sample volume, is expensive, and needs a long ultracentrifugation time.⁵ Hence, beta quantification is not suited to routine laboratory evaluation.

Many laboratories use direct homogeneous assays for LDL-C measurement.⁶ Compared with the reference method, many of these homogeneous assays measure LDL-C with a sufficient degree of accuracy and precision. Still, they are not performed routinely in clinical laboratories because many such methods are costly. Instead, most clinical laboratories use the Friedewald formula to calculate LDL-C.⁷ It is simple, cost-effective, and widely used for routine LDL estimation as well as in many clinical studies. However, this formula has some limitations: it cannot be used for samples with triglycerides above 400 mg/dL and can be applied only to fasting samples.⁸ Several formulas have been developed to overcome these limitations, including those of Chen, Vujovic, Anandaraja, Puavikai, Hatta, and Martin.^9–14 These formulas were validated in different populations and require further study before they can be applied to the general population. Previous studies comparing these formulas with the Friedewald formula gave contrasting results across triglyceride levels.^15–19

Most of these formulas were generated using linear regression. Advances in data science have made machine learning (ML) algorithms increasingly popular in clinical research. Machine learning models have already been used to predict the risk of developing cardiovascular disease,²⁰ delayed wound healing,²¹ early-onset neonatal sepsis (prediction and stratification),²² and hypercholesterolemia.²³ In most cases, multiple factors or risk factors interact in a complex manner to produce the outcome, and linear regression models sometimes fall short in predicting outcomes in these scenarios. Machine learning algorithms are an alternative approach that can overcome these limitations. They can learn the nonlinear and complex interactions between the predictor variables and can thereby improve prediction accuracy considerably.²⁴ A random forests model was used in one study to predict LDL-C, and its performance was found to be better than that of the Friedewald and Martin formulas.²⁵

In this light, we used three machine learning models (random forests, XGBoost, and support vector regression [SVR]) to predict LDL-C from total cholesterol, triglyceride, and HDL-C in the eastern Indian population. We compared the LDL-C predicted by these models with directly estimated LDL-C. We also compared the predictive performance of these machine learning models with that of a linear regression model and some existing formulas for LDL-C calculation.

Materials and methods

Study population

This retrospective comparative study was conducted in the Department of Biochemistry of AIIMS Bhubaneswar. It included the lipid profiles performed at the clinical biochemistry laboratory of AIIMS Bhubaneswar for outpatients and inpatients between January 2019 and March 2021. The majority of the outpatients and inpatients of the institute are residents of the eastern Indian states of Odisha and West Bengal. The study was approved by the Ethics Committee of AIIMS Bhubaneswar. Lipid profile data were collected from the laboratory database. Standard lipid profile analysis included the patient’s total cholesterol (TC), triglyceride (TG), HDL-C, and LDL-C measured on the same day. Lipid profile results with missing total cholesterol, triglycerides, HDL-C, or LDL-C, and results beyond the linearity limit of the specific assays used, were excluded. If the same patient had undergone lipid profile testing multiple times during the study period, only the first result was included in the dataset. After review of the laboratory data, 13,391 lipid profiles were selected for the current study.

Lipid profile testing

All lipid profile parameters were measured on a Beckman Coulter AU 5800 automatic chemistry analyzer in the AIIMS Bhubaneswar clinical biochemistry laboratory. Total cholesterol (TC) was estimated using the enzymatic cholesterol oxidase/PAP method, while triglyceride (TG) was estimated using the enzymatic glycerol phosphate oxidase/PAP method. HDL-C was measured by a direct homogeneous assay without precipitation (using anti-human β lipoprotein antibody). LDL-C was measured by a direct homogeneous assay that uses a selective protective agent to separate LDL from chylomicrons, HDL-C, and VLDL, followed by estimation with the cholesterol oxidase/PAP method. All tests were performed using Beckman Coulter AU reagents and calibrators, with Bio-Rad quality control material.

Machine learning

Machine learning models were built using the Python library scikit-learn. After data collection, the dataset was divided into a training dataset (70%) and a test dataset (30%). The LDL value was set as the target variable, whereas triglyceride, HDL, and total cholesterol were the independent variables. Three models were used, namely, a random forest regressor, a support vector regressor (SVR), and an XGBoost regressor. These models were first trained on the training dataset, and their performance was then assessed on the test set. The evaluation used metrics such as correlation and mean absolute difference (MAD).
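
The workflow described above (70/30 split, fit on the training set, evaluate on the test set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the data are synthetic values generated only so that the example runs, all hyperparameters are library defaults, and the variable names are hypothetical. XGBoost is left out to keep the sketch dependent on scikit-learn and NumPy only; an `xgboost.XGBRegressor` could be added to the `models` dictionary in the same way if that package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic lipid values (mg/dL), for illustration only. NOT the study data.
rng = np.random.default_rng(0)
n = 1000
tc = rng.normal(183, 40, n).clip(80, 400)       # total cholesterol
hdl = rng.normal(43, 10, n).clip(15, 100)       # HDL-C
tg = rng.lognormal(5.0, 0.5, n).clip(30, 900)   # triglycerides
# Hypothetical stand-in for directly measured LDL-C.
ldl = tc - hdl - tg / 6.5 + rng.normal(0, 5, n)

# Predictors: TC, TG, HDL-C. Target: LDL-C.
X = np.column_stack([tc, tg, hdl])
X_train, X_test, y_train, y_test = train_test_split(
    X, ldl, test_size=0.30, random_state=0
)

# Default hyperparameters; the paper does not report the settings used.
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(),
    "linear_regression": LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        # Pearson correlation between measured and predicted LDL-C
        "r": float(np.corrcoef(y_test, pred)[0, 1]),
        # Mean absolute difference between measured and predicted LDL-C
        "mad": float(mean_absolute_error(y_test, pred)),
    }

for name, metrics in results.items():
    print(f"{name}: r={metrics['r']:.2f}, MAD={metrics['mad']:.2f}")
```

With `test_size=0.30`, scikit-learn rounds the test-set size up, so 13,391 rows would give 4018 test rows and 9373 training rows, which is consistent with the split sizes reported in the Statistical analysis section.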

Derivation of formula using linear regression

Most of the existing LDL calculation formulas were derived using linear regression analysis. We therefore also applied multivariate linear regression to the training data to find the relationship between directly estimated LDL-C and total cholesterol, triglycerides, and HDL-C. The formula derived using multivariate linear regression was applied to the test data, and the LDL-C calculated with it was compared with directly estimated LDL-C. The accuracy of the formula was evaluated and compared with that of the ML models and six existing LDL calculation formulas.

LDL calculation formulas

The performance of LDL-C estimation by the machine learning algorithms and linear regression was compared with that of some existing LDL calculation formulas. LDL-C was calculated using the Friedewald, Martin, Chen, Vujovic, Anandaraja, and Puavikai formulas. To determine the LDL-C values as per the Martin equation, the LDL calculator was used (http://www.LDL-Calculator.com). In the test data, LDL was calculated for each patient using these formulas and compared with directly estimated LDL-C. The accuracy of these existing formulas was compared with that of the machine learning algorithms and the formula generated using linear regression.

Friedewald et al.⁷: LDL-C = TC − HDL-C − TG/5

Chen et al.⁹: LDL-C = (TC − HDL-C) × 0.9 − (TG × 0.1)

Vujovic et al.¹⁰: LDL-C = TC − TG/6.85 − HDL-C

Anandaraja et al.¹¹: LDL-C = 0.9 × TC − 0.9 × TG/5 − 28

Martin et al.¹⁴: LDL-C = TC − HDL-C − TG/novel factor

Puavikai et al.¹²: LDL-C = TC − HDL-C − TG/6
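
The fixed-coefficient formulas listed above translate directly into code. The sketch below is illustrative: the function names are hypothetical, all inputs are assumed to be in mg/dL, and the example lipid values are made up. The Martin equation is not included because its adjustable factor comes from a lookup table that is not reproduced in this article.

```python
def friedewald(tc, hdl, tg):
    """Friedewald et al.: TC - HDL-C - TG/5."""
    return tc - hdl - tg / 5


def chen(tc, hdl, tg):
    """Chen et al.: (TC - HDL-C) x 0.9 - (TG x 0.1)."""
    return (tc - hdl) * 0.9 - tg * 0.1


def vujovic(tc, hdl, tg):
    """Vujovic et al.: TC - TG/6.85 - HDL-C."""
    return tc - tg / 6.85 - hdl


def anandaraja(tc, tg):
    """Anandaraja et al.: 0.9 x TC - 0.9 x TG/5 - 28 (HDL-C is not used)."""
    return 0.9 * tc - 0.9 * tg / 5 - 28


def puavikai(tc, hdl, tg):
    """Puavikai et al.: TC - HDL-C - TG/6."""
    return tc - hdl - tg / 6


# Hypothetical example profile (mg/dL).
tc, hdl, tg = 200.0, 50.0, 150.0
estimates = {
    "Friedewald": friedewald(tc, hdl, tg),
    "Chen": chen(tc, hdl, tg),
    "Vujovic": vujovic(tc, hdl, tg),
    "Anandaraja": anandaraja(tc, tg),
    "Puavikai": puavikai(tc, hdl, tg),
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f} mg/dL")
```

For this single profile the five estimates already span roughly 120 to 128 mg/dL, which illustrates why the choice of formula can move a patient across a treatment threshold.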

Statistical analysis

Statistical analysis was performed using SPSS. A total of 13,391 lipid profiles were included in the study. Continuous variables are presented as mean and standard deviation. The 13,391 profiles were randomly divided into a training set (70% of the data) and a test set (30% of the data). Thus, 9373 lipid profiles were used to develop the machine learning models and the remaining 4018 lipid profiles were used to test them. On the assumption that directly measured LDL-C is the most accurate, LDL-C predicted by the three machine learning models, linear regression, and the six existing formulas was compared with the directly measured LDL-C. A paired t-test was used to compare means. The Pearson correlation test was performed to assess the correlation of directly measured LDL-C with the ML models, the linear regression model, and the six existing LDL-C calculation formulas. A p-value <0.05 was taken as statistically significant. To check the accuracy of the ML models, the linear regression formula, and the existing LDL-C formulas, the mean absolute difference (MAD) was calculated. MAD is the average of absolute deviations from the mean in a dataset. A model or formula with a low MAD score is better than one with a high MAD score.

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_{i} - \mu \right| \]

where n is the number of data values, x_i is each data value in the set, and μ is the mean value of the data set.
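
The two summary metrics used in this section can be sketched in plain Python. This is an illustration rather than the SPSS procedure used in the study. It assumes that the MAD score reported for each method is the mean of the per-sample absolute differences between predicted and directly measured LDL-C, which is how a mean absolute difference between paired values is usually computed; the formula above, read literally, describes deviations from the mean of a single data set. Function names and sample values are hypothetical.

```python
from math import sqrt


def mean_absolute_difference(predicted, measured):
    """Mean of |predicted - measured| over paired samples (assumed MAD score)."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical LDL-C values (mg/dL).
measured = [100, 120, 140, 160]
predicted = [102, 118, 143, 159]

mad = mean_absolute_difference(predicted, measured)
r = pearson_r(measured, predicted)
print(f"MAD = {mad:.2f} mg/dL, r = {r:.3f}")
```

A method can correlate strongly with the direct assay and still carry a systematic offset, which is why the correlation coefficient and the MAD score are reported side by side.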

We also assessed the performance of these models and formulas in classifying patients into the correct treatment groups as per NCEP ATP III guidelines.² Based on the directly estimated LDL-C, patients were classified into five groups: <100 mg/dL (optimal), 100–129 mg/dL (near optimal/above optimal), 130–159 mg/dL (borderline high), 160–189 mg/dL (high), and ≥190 mg/dL (very high). If a patient was classified into the same treatment group by a model or formula, the classification was taken as correct. The percentage of patients classified correctly into the appropriate treatment groups was calculated and compared. To check the accuracy of these models and formulas in classifying patients into the correct treatment groups, confusion matrices were constructed. Accuracy was calculated as the number of correct classifications divided by the total number of classifications.
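
The grouping and accuracy calculation described above can be sketched as follows. The cut-offs are the ATP III thresholds given in the text; the function names and example values are hypothetical, and non-integer values between groups (for example 129.5 mg/dL) are assumed to fall into the lower group.

```python
# Exclusive upper bounds (mg/dL) of the first four NCEP ATP III LDL-C groups.
ATP3_UPPER_BOUNDS = (100, 130, 160, 190)
ATP3_LABELS = (
    "optimal",
    "near optimal/above optimal",
    "borderline high",
    "high",
    "very high",
)


def atp3_group(ldl_c):
    """Return the index (0 to 4) of the ATP III group for an LDL-C value."""
    for index, upper in enumerate(ATP3_UPPER_BOUNDS):
        if ldl_c < upper:
            return index
    return len(ATP3_UPPER_BOUNDS)  # >= 190 mg/dL: very high


def concordance(predicted, measured):
    """Fractions of patients placed in the correct, a lower, or a higher group."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {"correct": 0, "lower": 0, "higher": 0}
    for p, m in zip(predicted, measured):
        predicted_group, measured_group = atp3_group(p), atp3_group(m)
        if predicted_group == measured_group:
            counts["correct"] += 1
        elif predicted_group < measured_group:
            counts["lower"] += 1
        else:
            counts["higher"] += 1
    total = len(measured)
    return {key: count / total for key, count in counts.items()}


# Hypothetical LDL-C values (mg/dL).
measured = [95, 110, 150, 170, 200]
predicted = [99, 131, 128, 175, 185]
result = concordance(predicted, measured)
print(result)
```

The `correct` fraction corresponds to the accuracy defined above (the main diagonal of the confusion matrix divided by the total), while `lower` and `higher` separate potential under-treatment from potential over-treatment.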

Since the performance of many existing formulas varies widely across TG ranges, we divided the test data into four groups based on triglycerides (TG <100, TG 100–199, TG 200–399, and TG ≥400 mg/dL). The performance of the machine learning models, the linear regression formula, and the existing formulas was assessed in each of these four groups.
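
A stratified evaluation of this kind can be sketched as below. The strata boundaries are those stated above; the row layout, function names, and sample values are hypothetical, and the per-stratum score is assumed to be the mean absolute difference between predicted and directly measured LDL-C.

```python
TG_STRATA = ("<100", "100-199", "200-399", ">=400")


def tg_stratum(tg):
    """Return the triglyceride stratum label for a TG value in mg/dL."""
    if tg < 100:
        return "<100"
    if tg < 200:
        return "100-199"
    if tg < 400:
        return "200-399"
    return ">=400"


def mad_by_tg_stratum(rows):
    """Mean absolute difference per TG stratum.

    rows: iterable of (tg, measured_ldl, predicted_ldl) tuples in mg/dL.
    Strata with no rows map to None.
    """
    differences = {label: [] for label in TG_STRATA}
    for tg, measured, predicted in rows:
        differences[tg_stratum(tg)].append(abs(predicted - measured))
    return {
        label: (sum(values) / len(values) if values else None)
        for label, values in differences.items()
    }


# Hypothetical (TG, measured LDL-C, predicted LDL-C) rows.
rows = [
    (80, 90, 92),
    (150, 115, 112),
    (180, 120, 121),
    (250, 130, 126),
    (450, 145, 155),
]
per_stratum = mad_by_tg_stratum(rows)
print(per_stratum)
```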

Results

In the current study, we retrospectively reviewed 13,391 lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar. The baseline characteristics of the population are shown in Table 1. Mean LDL-C was 113.75 mg/dL. In our study population, the median triglyceride/VLDL ratio was 6.52 (VLDL was calculated by subtracting directly estimated LDL-C from non-HDL cholesterol). Seventy percent of the data were randomly assigned to a training set to develop the three machine learning models and the linear regression formula for predicting LDL-C. The comparison of LDL-C predicted by the different ML algorithms, linear regression, and the different LDL-C formulas with directly measured LDL-C is shown in Table 2.

Table 1.

Characteristics of the Study Population (N = 13,391).

Characteristics	Value
Age (years)	46.33 ± 14.59
Sex
Male	8620 (64.4%)
Female	4771 (35.6%)
Total cholesterol (mg/dL)	183.32 ± 64.1
Triglycerides (mg/dL)	168.92 ± 139.54
HDL-C (mg/dL)	43.07 ± 13.42
Non-HDL cholesterol (mg/dL)	140.25 ± 57.27
LDL-C (mg/dL)	113.75 ± 42.89
TG/VLDL	6.52

Note: Values are presented as mean ± SD, except sex and TG/VLDL. Sex is presented as number and percentage. TG/VLDL is presented as median.

Table 2.

Comparison of LDL-C predicted by different ML algorithms, linear regression, and different LDL-C formulas with directly measured LDL-C (N = 4018).

Method	Mean ± SD (mg/dL)	MAD Score	Correlation (r)	p value^a
LDL—direct	113.75 ± 42.89
XGBoost	113.77 ± 42.07	4.52	0.98^b	0.868
Random forests	113.88 ± 41.54	3.12	0.98^b	0.367
SVR	110.82 ± 22.35	18.64	0.64^b	<0.001
Linear regression	113.84 ± 42.77	7.21	0.95^b	0.367
Friedewald et al.	106.47 ± 52.33	13.22	0.89^b	<0.001
Chen et al.	109.33 ± 47.51	9.25	0.94^b	<0.001
Vujovic et al.	115.59 ± 52.26	10.18	0.93^b	<0.001
Anandaraja et al.	106.59 ± 53.14	16.26	0.88^b	<0.001
Puavikai et al.	112.10 ± 52.16	10.82	0.91^b	<0.001
Martin et al.	112.02 ± 50.53	9.19	0.93^b	0.001

^ap value obtained by paired t-test, p <0.05 taken as significant.

^bCorrelation significant at the 0.01 level (two-tailed).

All three ML models used to predict LDL-C correlated significantly with directly estimated LDL-C. Among the three, LDL-C predicted by the XGBoost and random forests models showed a strong correlation with directly measured LDL-C (r = 0.98 for both), whereas SVR-predicted LDL-C showed a correlation coefficient of only 0.64 (Figure 1 and Table 2). With the training set, we generated a formula using multivariate linear regression analysis to calculate LDL-C from total cholesterol, triglycerides, and HDL-C. The formula was (0.80 × TC) − (0.66 × TG) − (0.73 × HDL) + 11.43. LDL-C calculated with this formula showed an r value of 0.95. The six existing LDL-C calculation formulas also showed a significant correlation with directly estimated LDL-C. Among them, the Chen et al. formula showed the highest correlation (r = 0.94), while the Anandaraja formula showed the lowest (r = 0.88). The Friedewald formula, which is the most popular formula currently in use, showed a correlation coefficient of 0.89. The Vujovic et al., Martin et al., and Puavikai et al. formulas showed a better correlation than the Friedewald formula (r = 0.93, 0.93, and 0.91, respectively). The XGBoost and random forests models showed better correlation than all six existing formulas, but SVR performed poorly compared with them. On comparing means by paired t-test, LDL-C predicted by XGBoost, random forests, and the linear regression formula showed no statistically significant difference from the directly measured LDL-C, while SVR and all six existing formulas showed a significant difference.

Figure 1.

Correlation of directly measured LDL-C with LDL-C predicted/calculated by different methods: (A) XGBoost (r = 0.98, p < 0.001); (B) random forests (r = 0.98, p < 0.001); (C) SVR model (r = 0.64, p < 0.001); (D) linear regression model (r = 0.95, p < 0.001); (E) Friedewald formula (r = 0.89, p < 0.001); (F) Chen (r = 0.94, p < 0.001); (G) Vujovic (r = 0.93, p < 0.001); (H) Anandaraja (r = 0.88, p < 0.001); (I) Puavikai (r = 0.91, p < 0.001); (J) Martin (r = 0.93, p <0.001).

To compare the accuracy of the different methods used to predict LDL-C, we calculated the mean absolute difference (MAD) score. Random forests showed the lowest MAD score (3.12). XGBoost was the second most accurate method, with a MAD score of 4.52 (Figure 2 and Table 2). SVR was the least accurate of all the methods used (MAD score 18.64). Among the six formulas currently in use, Martin et al. and Chen et al. performed best, with MAD scores of 9.19 and 9.25, respectively. The Friedewald formula showed a MAD score of 13.22.

Figure 2.

Mean absolute difference (MAD) of different methods used to predict/calculate LDL-C.

Since the LDL-C value is important in determining patient treatment, we also assessed how accurately these methods classify patients into the correct treatment groups. The random forests and XGBoost models performed well in assigning patients to the correct LDL-C groups: the random forests model classified 92% of patients into the correct group and XGBoost classified 90% (Figure 3). Among the existing LDL-C formulas, the Martin et al. formula showed the best concordance, correctly classifying 79% of patients into the proper treatment groups. The Chen et al. formula also performed well, classifying 78% of patients correctly. The Anandaraja formula was the least accurate of the existing formulas in assigning the correct LDL-C treatment groups. The Friedewald formula classified 74% of patients correctly, while 19% of patients were misclassified into a lower treatment group. The linear regression model classified patients better than all six existing formulas. The SVR model performed the worst and was inferior to all six existing formulas in classifying patients into treatment groups. Confusion matrices were also constructed to check the accuracy of these models and formulas in classifying patients into the correct LDL treatment groups (Figure 4). The random forests model performed best in classifying patients into the proper groups according to the ATP III classification (accuracy 0.92), followed by the XGBoost model (accuracy 0.90). Among the existing LDL-C formulas, the Martin et al. formula showed the highest accuracy (0.79). The SVR model was the worst at assigning patients to the proper LDL treatment groups (accuracy 0.65).

Figure 3.

Comparison of concordance in classification of patients to different treatment groups based on the NCEP ATPIII guidelines. Bars represent the percentage of patients correctly classified or misclassified into higher/lower treatment groups based on directly measured LDL-C.

Figure 4.

Confusion matrices for the classification of patients into LDL treatment groups as per NCEP ATP III guidelines. Columns represent the directly estimated LDL-C groups; rows represent the LDL-C groups calculated using the different models/formulas. Each cell represents the number of patients. Cell color represents the relative proportion (number of patients in an individual cell compared with the total number of patients in the corresponding column). The accuracy of each model/formula was calculated as the sum of correct classifications (values in the cells of the main diagonal of the confusion matrix) divided by the total number of classifications (n = 4018). (1) Random forests model (accuracy 0.92); (2) XGBoost model (accuracy 0.90); (3) SVR model (accuracy 0.65); (4) linear regression (accuracy 0.81); (5) Friedewald (accuracy 0.74); (6) Chen (accuracy 0.78); (7) Vujovic (accuracy 0.77); (8) Anandaraja (accuracy 0.67); (9) Puavikai (accuracy 0.76); (10) Martin (accuracy 0.79).

In previous studies, the formulas in use showed varying performance across triglyceride ranges. We therefore also assessed the performance of the ML models and these formulas in different triglyceride ranges. We divided the test set into four groups based on triglyceride value: <100, 100–199, 200–399, and ≥400 mg/dL. The comparison of the different methods in these triglyceride groups is shown in Table 3. In all four groups, the random forests model performed the best: it showed the lowest MAD score in every group, with a correlation equal or close to the highest in each group (Figure 5). XGBoost also performed well in all groups and was second behind the random forests model. The other ML model we used, SVR, showed the lowest correlation in all four triglyceride groups, lower than that of all six LDL-C calculation formulas. The formula we generated using multivariate linear regression had a lower MAD score than the six existing formulas in all the groups. Among the existing formulas, the Martin et al. and Chen et al. formulas performed better in all four groups than the Friedewald and Anandaraja formulas.

Table 3.

Comparison of LDL-C predicted by different ML algorithms, linear regression, and different LDL-C formulas with directly measured LDL-C in different triglyceride ranges.

Method	Mean ± SD (mg/dL)	MAD Score	Correlation (r)	p value^a
Triglyceride <100 mg/dL (N = 975)
LDL—direct	90.70 ± 32.57
XGBoost	90.75 ± 31.65	4.26	0.97^b	0.837
Random forests	91.07 ± 31.76	3.01	0.97^b	0.132
SVR	102.13 ± 17.00	20.46	0.57^b	<0.001
Linear regression	93.09 ± 30.44	6.13	0.97^b	<0.001
Friedewald et al.	90.48 ± 37.80	7.29	0.96^b	0.518
Chen et al.	87.69 ± 34.28	6.80	0.96^b	<0.001
Vujovic et al.	94.71 ± 37.97	7.50	0.96^b	<0.001
Anandaraja et al.	93.46 ± 40.59	11.87	0.92^b	<0.001
Puavikai et al.	93.09 ± 37.91	7.22	0.96^b	<0.001
Martin et al.	89.45 ± 37.35	7.17	0.96^b	<0.001
Triglyceride 100–199 mg/dL (N = 2047)
LDL—direct	114.56 ± 36.12
XGBoost	114.91 ± 35.78	4.07	0.98^b	0.015
Random forests	114.75 ± 35.51	2.66	0.98^b	0.200
SVR	110.78 ± 22.06	15.91	0.67^b	<0.001
Linear regression	114.43 ± 34.32	6.55	0.97^b	0.539
Friedewald et al.	110.43 ± 42.61	10.01	0.96^b	<0.001
Chen et al.	110.73 ± 38.59	8.20	0.97^b	<0.001
Vujovic et al.	118.09 ± 42.76	8.80	0.97^b	<0.001
Anandaraja et al.	109.81 ± 44.03	12.82	0.94^b	<0.001
Puavikai et al.	115.15 ± 42.70	8.77	0.96^b	0.028
Martin et al.	113.62 ± 41.12	8.11	0.97^b	<0.001
Triglyceride 200–399 mg/dL (N = 863)
LDL—direct	133.05 ± 49.30
XGBoost	132.16 ± 47.93	4.54	0.99^b	<0.001
Random forests	132.46 ± 47.25	3.00	0.99^b	0.021
SVR	119.37 ± 24.19	20.54	0.57^b	<0.001
Linear regression	130.75 ± 49.18	7.57	0.98^b	<0.001
Friedewald et al.	117.01 ± 60.72	19.50	0.98^b	<0.001
Chen et al.	126.33 ± 55.05	11.13	0.98^b	<0.001
Vujovic et al.	131.20 ± 60.95	11.77	0.98^b	0.001
Anandaraja et al.	115.06 ± 63.15	22.02	0.97^b	<0.001
Puavikai et al.	125.77 ± 60.85	13.84	0.98^b	<0.001
Martin et al.	129.81 ± 57.83	10.36	0.98^b	<0.001
Triglyceride ≥400 mg/dL (N = 133)
LDL—direct	145.02 ± 68.64
XGBoost	145.66 ± 66.76	13.22	0.91^b	0.792
Random forests	147.04 ± 61.09	11.92	0.9^b	0.439
SVR	119.76 ± 24.21	34.85	0.49^b	<0.001
Linear regression	148.55 ± 92.28	22.84	0.77^b	0.491
Friedewald et al.	94.29 ± 132.57	65.26	0.66^b	<0.001
Chen et al.	136.24 ± 105.11	31.20	0.75^b	0.152
Vujovic et al.	128.98 ± 121.63	40.82	0.72^b	0.034
Anandaraja et al.	98.22 ± 125.32	64.00	0.71^b	<0.001
Puavikai et al.	115.70 ± 125.36	49.29	0.7^b	<0.001
Martin et al.	137.53 ± 113.35	32.95	0.72^b	0.281

^ap value obtained by paired t-test, p <0.05 taken as significant.

^bCorrelation significant at the 0.01 level (two-tailed).

Figure 5.

Mean absolute difference (MAD) score of different methods to predict/calculate LDL-C in different triglycerides ranges.

The results of this study indicate that ML algorithms can be used to predict LDL-C from total cholesterol, triglycerides, and HDL-C, and that these models can perform better than traditional LDL-C calculation formulas. Among the three ML models we used, the random forests and XGBoost models predicted LDL-C accurately. They clearly outperformed the existing LDL-C calculation formulas, including the Friedewald formula, as well as the linear regression model we generated. The third ML model, SVR, performed worst among all the methods used. Among the six existing LDL-C calculation formulas applied to the eastern Indian study population, the Martin et al. and Chen et al. formulas were the most accurate.

Discussion

Machine learning algorithms can learn the complex relationships between variables and use this knowledge to predict outcomes. In this era of artificial intelligence, machine learning models are beginning to be explored for better prediction of the risk of various diseases.²⁶ Previous studies have already demonstrated the superior performance of some ML models in assessing cardiovascular disease severity compared with the conventional methods currently in use.²⁷ Because LDL-C is a strong marker of CAD and the primary target for therapy, accurate estimation of LDL-C is crucial.²⁸ Our aim was to validate the performance of ML models in predicting LDL-C. We applied three machine learning algorithms (XGBoost, random forests, and SVR) to predict LDL-C from total cholesterol, triglycerides, and HDL-C in data obtained from the population of eastern India and compared the predictions with directly estimated LDL-C. The performance of these models was compared with that of a linear regression model and six existing LDL-C calculation formulas. To the best of our knowledge, this is the first study in an Indian population seeking an ML model to predict LDL-C more accurately.

Our study demonstrated that the random forests and XGBoost machine learning models predict LDL-C far better than conventional formulas such as the Friedewald formula. These models showed a correlation coefficient of 0.98, compared with 0.94 for the Chen et al. formula, which had the highest correlation among the existing LDL-C formulas in the study population. The accuracy of these models was further demonstrated by the MAD score: XGBoost and random forests showed MAD scores of 4.52 and 3.12, respectively, while Friedewald showed a much higher MAD score (13.22). We also evaluated the accuracy of these models and formulas in classifying patients into the correct LDL-C treatment groups. The random forests and XGBoost models had the highest accuracy in classifying patients into the correct LDL-C treatment groups as per NCEP ATP III guidelines. This is important because wrong classification may lead to under- or over-treatment of patients. These two models also outperformed the other methods when compared across triglyceride strata. For patients with triglycerides ≥400 mg/dL, these models showed a good correlation of about 0.9, and there was no statistically significant difference between the mean values. While use of the Friedewald formula is limited to patients with triglycerides <400 mg/dL, these ML algorithms, trained on a large dataset, can predict LDL-C even at high triglyceride values. The XGBoost and random forests models also outperformed the linear regression model. Most LDL-C formulas were developed using multivariate linear regression analysis, and our results underline the advantage of ML models over linear regression in the prediction of LDL-C. It is noteworthy, however, that the other ML model we used, support vector regression (SVR), performed poorly compared with the six existing formulas. Previously, Singh et al. explored the performance of a random forests model in predicting LDL-C in comparison with the Friedewald and Martin equations. That study documented a correlation coefficient of 0.982 between directly estimated LDL-C and LDL-C predicted using random forests, similar to the present study. LDL-C predicted by the random forests model correlated better than the Martin and Friedewald formulas in all three triglyceride strata used for classification (triglyceride 0–150, 150–500, and >500 mg/dL). For the group with triglyceride >500 mg/dL, the previous study showed a correlation coefficient of 0.99 between the random forests model–predicted LDL-C and directly estimated LDL-C, whereas our study showed a lower correlation coefficient (0.90) for the random forests model in patients with triglyceride ≥400 mg/dL.²⁵

We evaluated six formulas (Friedewald, Martin, Chen, Vujovic, Puavikai, and Anandaraja) in our study population. In previous studies, the performance of these formulas varied greatly between populations. The Anandaraja formula was developed in the Indian population to predict LDL-C more accurately than the Friedewald formula.¹¹ However, some previous studies showed that its performance was not better than Friedewald in the Indian population.²⁹ Wadhwa et al. evaluated seven formulas in the Indian population and found the Vujovic formula to be better than other formulas such as Friedewald and Anandaraja.³⁰ The Friedewald formula is still the most popular of the LDL-C calculation formulas. Friedewald assumed a TG:VLDL ratio of 5 and used it to calculate LDL-C, but many studies have shown that this ratio can vary.³¹ Formulas such as Puavikai and Hatta use factors of 6 and 4, respectively, as the TG:VLDL ratio.^12,13 In our study population, the median TG:VLDL ratio was 6.48. Martin et al. suggested using an adjustable novel factor instead of a fixed factor. The Martin formula was found to perform better than Friedewald in many studies.^32,33 In our study also, the Martin et al. equation was more accurate than the Friedewald formula. Among the six formulas we used, Martin et al. and Chen et al. performed better than the Friedewald formula and the other formulas in our study population. With growing evidence of the poor performance of the Friedewald formula in different populations, it is time to consider switching to novel methods such as machine learning algorithms or better formulas such as the Martin equation for predicting LDL-C.

Our study also has some limitations. It used retrospectively collected lipid profile data from the laboratory database; hence, many clinical characteristics of the patients could not be included and analyzed. The ML models were more accurate than the linear regression model and can predict LDL-C more effectively. However, the complexity of these models is the major limitation to their application. Integrating these ML models into Electronic Health Record (EHR) systems could help overcome this barrier and would be beneficial for calculating LDL-C at the point of care and for planning appropriate LDL-lowering management for patients.

Conclusion

In the present study, XGBoost and random forests predicted LDL-C accurately and showed a strong correlation with directly estimated LDL-C. The performance of the XGBoost and random forests models in predicting LDL-C was superior to that of six existing, commonly used LDL-C calculation formulas and the formula we generated using linear regression. Physicians and laboratories could consider integrating these models into routine clinical practice for better prediction of LDL-C instead of using traditional formulas such as the Friedewald formula.

Footnotes

Acknowledgments

We are thankful to our colleagues and staff of the clinical biochemistry laboratory, AIIMS Bhubaneswar.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethical approval

This study was approved by Institutional Ethical committee of AIIMS Bhubaneswar (T/IM-NF/BIOCHEM/20/162).

Guarantor

Contributorship

APP and SK conceived the study. PP helped in acquisition of data. Development of the machine learning model and statistical analysis was done by ASR and SN. Drafting of the article was done by APP and SK. All authors reviewed and edited the article.

ORCID iDs

Anudeep P P

Suchitra Kumari

References

1. Chareonrungrueangchai, Wongkawinwoot, Anothaisintawee, et al. Dietary factors and risks of cardiovascular diseases: an umbrella review. Nutrients 2020; 12(4): 1088.

2. Penson, Pirro, Banach. LDL-C: lower is better for longer-even at low risk. BMC Med 2020; 18(1): 320–326.

3. Arnett, Blumenthal, Albert, et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol 2019; 74(10): e177–e232.

4. National Cholesterol Education Program (US). Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults. Third report of the National Cholesterol Education Program (NCEP) Expert Panel on detection, evaluation, and treatment of high blood cholesterol in adults (Adult Treatment Panel III). National Cholesterol Education Program, National Heart, Lung, and Blood Institute, National Institutes of Health, 2002.

5. Chung. Update on low-density lipoprotein cholesterol quantification. Curr Opin Lipidol 2019; 30(4): 273–283.

6. Nauck, Warnick, Rifai. Methods for measurement of LDL-cholesterol: a critical assessment of direct measurement by homogeneous assays versus calculation. Clin Chem 2002; 48(2): 236–254.

7. Friedewald, Levy, Fredrickson. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin Chem 1972; 18(6): 499–502.

8. Yano, Matsunaga, Harada, et al. Comparison of two homogeneous LDL-cholesterol assays using fresh hypertriglyceridemic serum and quantitative ultracentrifugation fractions. J Atheroscler Thromb 2019; 26: 979–988.

9. Chen, Zhang, Pan, et al. A modified formula for calculating low-density lipoprotein cholesterol values. Lipids Health Dis 2010; 9(1): 52.

10. Vujovic, Kotur-Stevuljevic, Spasic, et al. Evaluation of different formulas for LDL-C calculation. Lipids Health Dis 2010; 9(1): 27.

11. Anandaraja, Narang, Godeswar, et al. Low-density lipoprotein cholesterol estimation by a new formula in Indian population. Int J Cardiol 2005; 102(1): 117–120.

12. Puavilai, Laorugpongse. Is calculated LDL-C by using the new modified Friedewald equation better than the standard Friedewald equation. J Med Assoc Thai 2004; 87(6): 589–593.

13. Hata, Nakajima. Application of Friedewald's LDL-cholesterol estimation formula to serum lipids in the Japanese population. Jpn Circ J 1986; 50(12): 1191–1200.

14. Martin, Blaha, Elshazly, et al. Comparison of a novel method vs the Friedewald equation for estimating low-density lipoprotein cholesterol levels from the standard lipid profile. JAMA 2013; 310(19): 2061.

15. Gasko. Low-density lipoprotein cholesterol estimation by the Anandaraja's formula--confirmation. Lipids Health Dis 2006; 5(1): 18.

16. Onyenekwu, Hoffmann, Smit, et al. Comparison of LDL-cholesterol estimate using the Friedewald formula and the newly proposed de Cordova formula with a directly measured LDL-cholesterol in a healthy South African population. Ann Clin Biochem 2014; 51(6): 672–679.

17. de Cordova. A new accurate, simple formula for LDL-cholesterol estimation based on directly measured blood lipids from a large cohort. Ann Clin Biochem 2013; 50(1): 13–19.

18. Rasouli, Mokhtari. Calculation of LDL-cholesterol vs. direct homogenous assay. J Clin Lab Anal 2017; 31(3): e22057.

19. Karkhaneh, Bagherieh, Sadeghi, et al. Evaluation of eight formulas for LDL-C estimation in Iranian subjects with different metabolic health statuses. Lipids Health Dis 2019; 18(1): 231.

20. Quesada, Lopez-Pineda, Gil-Guillén, et al. Machine learning to predict cardiovascular risk. Int J Clin Pract 2019; 73(10): e13389.

21. Jung, Covington, Sen, et al. Rapid identification of slow healing wounds. Wound Repair Regen 2016; 24(1): 181–188.

22. Escobar, Puopolo, et al. Stratification of risk of early-onset sepsis in newborns ≥ 34 weeks’ gestation. Pediatrics 2014; 133(1): 30–36.

23. Pina, Helgadottir, Mancina, et al. Virtual genetic diagnosis for familial hypercholesterolemia powered by machine learning. Eur J Prev Cardiol 2020; 27(15): 1639–1646.

24. Weng, Reps, Kai, et al. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS One 2017; 12(4): e0174944.

25. Singh, Hussain, et al. Comparing a novel machine learning method to the Friedewald formula and Martin-Hopkins equation for low-density lipoprotein estimation. PLoS One 2020; 15(9): e0239934.

26. Ngiam, Khor. Big data and machine learning algorithms for health-care delivery. Lancet Oncol 2019; 20(5): e262–e273.

27. Yang, Jin, et al. Study of cardiovascular disease prediction model based on random forest in eastern China. Sci Rep 2020; 10(1): 5245.

28. Silverman, Ference, et al. Association between lowering LDL-C and cardiovascular risk reduction among different therapeutic interventions: a systematic review and meta-analysis. JAMA 2016; 316(12): 1289–1297.

29. Gupta, Verma, Singh. Does LDL-C estimation using Anandaraja’s formula give a better agreement with direct LDL-C estimation than the Friedewald’s formula? Indian J Clin Biochem 2012; 27(2): 127–133.

30. Wadhwa, Krishnaswamy. Comparison of LDL-cholesterol estimate using various formulae with directly measured LDL-cholesterol in Indian population. J Clin Diag Res 2016; 10(12): BC11–BC13.

31. Ephraim, Acheampong, Swaray, et al. Developing a modified low-density lipoprotein (M-LDL-C) Friedewald’s equation as a substitute for direct LDL-C measure in a Ghanaian population: a comparative study. J Lipids 2018; 2018: 7078409.

32. Piani, Cicero, Ventura, et al. Evaluation of twelve formulas for LDL-C estimation in a large, blinded, random Italian population. Int J Cardiol 2021; 330: 221–227.

33. Kang, Kim, Lee, et al. Martin’s equation as the most suitable method for estimation of low-density lipoprotein cholesterol levels in Korean adults. Korean J Family Med 2017; 38(5): 263.