Genetic factor increased identification efficiency of predictive models for dyslipidemia: a prospective cohort study

Background: Few studies have developed risk models for dyslipidemia, especially for rural populations. Further, the performance of genetic factors in predicting dyslipidemia has not been explored. The purpose of this study is to develop and evaluate prediction models for dyslipidemia, with and without a genetic factor, in a Chinese rural population. Methods: A total of 3596 individuals from the Henan Rural Cohort study were included in this study. All subjects were divided into a training set and a testing set in a ratio of 7:3. The conventional models and the conventional + GRS (genetic risk score) models were developed with COX regression, artificial neural network (ANN), random forest (RF), and gradient boosting machine (GBM) classifiers in the training set. Area under the receiver operating characteristic curve (AUC), net reclassification index (NRI), and integrated discrimination index (IDI) were used to assess the discrimination ability of the models, and calibration curves were used to show calibration ability in the testing set. Results: Compared to the lowest GRS quartile, the HR (95% CI) of individuals in the highest GRS quartile was 1.23 (1.07, 1.41) in the total population. Age, family history of diabetes, physical activity, BMI, TG, HDL-C, and LDL-C were included in the conventional models, and the AUCs of the COX, ANN, RF, and GBM classifiers were 0.702 (0.673, 0.729), 0.736 (0.708, 0.762), 0.787 (0.762, 0.811), and 0.816 (0.792, 0.839), respectively. After adding GRS, the AUC increased by 0.005, 0.018, 0.023, and 0.015 with the COX, ANN, RF, and GBM classifiers, respectively. The corresponding NRI values were 25.6%, 7.8%, 14.1%, and 18.1%, and the corresponding IDI values were 2.3%, 1.0%, 2.5%, and 1.8%. Conclusion: Genetic factors could improve the predictive ability for dyslipidemia risk model, suggesting

The established conventional models showed good performance in predicting dyslipidemia, especially with the gradient boosting machine (GBM) classifier. More importantly, the predictive ability of the models was significantly improved when the genetic factor was incorporated into the conventional models, suggesting the potential utility of genetic factors in predicting dyslipidemia in rural populations.

Background
Dyslipidemia is an important risk factor for the development of cardiovascular diseases (CVD) [1]. Studies have shown that about 20% of patients with atherosclerosis have either high triglyceride (TG) or low high-density lipoprotein cholesterol (HDL-C) levels [2], and that low HDL-C levels are associated with a higher incidence of heart disease and ischemic stroke [3]. Elevated serum levels of total cholesterol (TC), TG, and low-density lipoprotein cholesterol (LDL-C) can be used as independent predictors of CVD because of their close relationship with it [1,4,5]. The prevalence of dyslipidemia has declined over nearly a decade in developed countries such as the United States [6], while the prevalence in China, the biggest developing country, remains high and continues to grow [7]. A total of 9.2 million cardiovascular events are projected to occur in the Chinese population between 2010 and 2030 owing to serum cholesterol levels [8]. In rural areas, the age-standardized prevalence of dyslipidemia was 32.21% in adults, with relatively low rates of awareness, treatment, and control (15.07%, 7.23%, and 3.25%, respectively) [9]. The prevention of dyslipidemia remains a huge public health problem in China, especially in rural areas. The establishment of disease risk prediction models has received extensive attention globally as a means of preventing disease. In previous studies, effective disease prediction models for CVD and diabetes were built based on the Framingham study [10,11]. In recent years, researchers have established other effective risk models to diagnose and predict a variety of diseases [12-15]. Few studies have addressed prediction models for dyslipidemia [16-19], and most of them were limited to certain groups of people, such as children and adolescents.
As reported, a genetic risk score composed of multiple single nucleotide polymorphisms (SNPs) conferred strong risk prediction even though each SNP contributed little individually [20]. Although the role of SNPs in dyslipidemia is well known [21-23], no studies have examined how polygenic genetic risk scores (GRS) affect the prediction of dyslipidemia risk, especially in resource-limited areas. To that end, this study was conducted to establish dyslipidemia prediction models and to evaluate the performance of models incorporating genetic factors in predicting the occurrence of dyslipidemia in Chinese rural adults.

Study population
Participants were recruited from the Henan Rural Cohort study, which was registered in the Chinese Clinical Trial Register (Registration number: ChiCTR-OOC-15006699). The baseline examination and follow-up information have been previously described in detail [24]. In brief, the baseline investigation included a questionnaire interview, anthropometric measurements, and blood tests. At the follow-up survey, the subjects were asked about the occurrence of chronic diseases, including the type and duration of the disease, as well as the status of treatment and medication.
A total of 6930 individuals completed the follow-up survey, and 3596 individuals were finally analyzed after excluding participants who 1) had dyslipidemia at baseline; 2) were using lipid-lowering drugs; or 3) were missing important information about the key variables. The subjects were then randomly divided in a ratio of 7:3 into a training set (n = 2517) for model construction and a testing set (n = 1079) for performance evaluation of the models.

Definition of dyslipidemia
As reported by the Chinese guidelines on prevention and treatment of dyslipidemia in adults [7], dyslipidemia was defined as having one or more of the following conditions: TC ≥ 6.2 mmol/L (240 mg/dl); TG ≥ 2.3 mmol/L (200 mg/dl); HDL-C ≤ 1.0 mmol/L (40 mg/dl); LDL-C ≥ 4.1 mmol/L (160 mg/dl); or use of lipid-lowering drugs in the past two weeks.

Calculation of weighted genetic risk score (GRS)
A weighted GRS was calculated using 21 SNPs to assess the predictive performance of genetic factors. GRS was calculated by multiplying the weight of each SNP by the number of risk alleles and summing the products over all SNPs. As shown in Table S1, the weights of each SNP were calculated based on our own population. The mean value and standard deviation of GRS were 1.329 and 0.337, with values ranging from 0.195 to 2.451.

Classifier
The COX regression model, also known as the "proportional hazards model", is a semiparametric regression model. The model takes survival outcome and survival time as dependent variables. It has been widely used in medical follow-up studies and is by far the most used multi-factor analysis method in survival analysis.
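For reference, the proportional hazards model described above is conventionally written as follows. The notation is the standard textbook form and is not taken from the original text: h_0(t) is the unspecified baseline hazard, x_1, ..., x_p are the covariates (for example age, BMI, or GRS), and the beta_j are the regression coefficients, so that exp(beta_j) is the hazard ratio for a one-unit increase in x_j.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

The model is called semiparametric because h_0(t) is left unspecified while the covariate effects are parametric.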
Artificial neural network (ANN), which simulates neuron activity with a mathematical model, is an information processing system that imitates the structure and function of the brain's neural networks. Compared with traditional data processing methods, neural networks have clear advantages in processing fuzzy, random, and non-linear data, and are particularly suitable for systems with large scale, complex structure, and ambiguous information.
Random forest (RF) combines bagging ensemble learning with the random subspace technique. In the training set, numerous decision trees are grown on randomly resampled data. RF also introduces randomness into feature selection: at each node of a base decision tree, a subset of K attributes is randomly selected from the node's attribute set, and the optimal attribute within that subset is used for partitioning. Finally, RF outputs the class that receives the most votes across all trees.
Gradient boosting machine (GBM) trains a stronger classifier by combining different weak classifiers that are trained on the same training set. As an iterative algorithm, GBM compensates for the shortcomings of the current combination of weak classifiers by adding a new weak classifier at each iteration, and this series of iterations progressively improves the classification results. When training each weak classifier, GBM fits it to the residuals left on the training set by the preceding weak classifiers.
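The residual-fitting idea behind GBM can be illustrated with a minimal sketch. This is not the implementation used in the study: it assumes a single feature, decision stumps as weak learners, squared-error loss on a 0/1 outcome, and toy data invented for illustration. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean outcome, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (invented): one feature and a 0/1 outcome.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_gbm(xs, ys)
print("training MSE after 50 rounds:", mse(model, xs, ys))
```

Each round lowers the training error a little because the new stump is fitted to what the current ensemble still gets wrong. A production classifier would typically use a log-loss objective and multivariate trees instead.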

Statistical analysis
Statistical significance was inferred at a two-tailed P < 0.05. The t-test and chi-square test were used to compare differences in characteristics between the training and testing sets. All subjects were divided into quartiles according to GRS. Taking Q1 as the reference, we calculated the hazard ratios (HRs) of the remaining three groups in the total population, as well as in the training and testing sets.
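The weighted GRS and the quartile grouping can be sketched as follows. The SNP weights and genotypes are invented for illustration (the study used 21 SNPs with weights from Table S1), and the function names are hypothetical. Quartile cut points are taken from the cohort's own score distribution, which is one reasonable reading of the description above.

```python
import statistics


def weighted_grs(risk_allele_counts, weights):
    """Sum of (SNP weight x number of risk alleles) for one individual."""
    if len(risk_allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(w * n for w, n in zip(weights, risk_allele_counts))


def assign_quartiles(scores):
    """Label each score Q1..Q4 using the quartile cut points of the scores themselves."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    labels = []
    for s in scores:
        if s <= q1:
            labels.append("Q1")
        elif s <= q2:
            labels.append("Q2")
        elif s <= q3:
            labels.append("Q3")
        else:
            labels.append("Q4")
    return labels


# Hypothetical weights for three SNPs and risk-allele counts (0, 1, or 2) for eight people.
weights = [0.10, 0.05, 0.20]
genotypes = [
    [0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 0],
    [0, 0, 1], [2, 1, 0], [1, 0, 1], [2, 2, 2],
]
scores = [weighted_grs(g, weights) for g in genotypes]
labels = assign_quartiles(scores)
for score, label in zip(scores, labels):
    print(f"GRS = {score:.2f} -> {label}")
```

Q1 then serves as the reference group when hazard ratios for Q2 to Q4 are estimated.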
Previous studies identified 14 variables as predictors of dyslipidemia: age, sex, educational level, smoking, high-fat diet, more vegetable and fruit intake, family history of hyperlipidemia, physical activity, waist circumference (WC), family history of diabetes, BMI, TG, HDL-C, and LDL-C (Table S2) [16,18]. In the training set, all the variables were analyzed using univariate COX regression. Then, the variables with a significant effect on dyslipidemia entered the conventional models. The GRS described above was then incorporated into the conventional models to constitute the conventional + GRS models. COX regression, ANN, RF, and GBM were employed to construct the conventional models and conventional + GRS models. With the COX classifier, the conventional model and conventional + GRS model were constructed in the training set; for ANN, RF, and GBM, the prediction models were trained and tested through 10-fold cross-validation during the iteration process, which was repeated 100 times.
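One possible reading of the repeated 10-fold cross-validation described above is sketched below. The text does not specify how folds were drawn, so the reshuffling before every repeat, the seed, and the function name are assumptions made for illustration.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=100, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repeat
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            validation = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, validation


# Small usage example: 20 samples, 10 folds, 2 repeats -> 20 train/validation pairs.
pairs = list(repeated_kfold(n_samples=20, n_folds=10, n_repeats=2))
print(len(pairs), "splits; first validation fold:", pairs[0][1])
```

Within each repeat every sample is used for validation exactly once, so averaging over repeats reduces the dependence of the performance estimate on any single partition.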
Model performance was calculated in the testing set. With the COX classifier, the coefficients estimated in the training set were used to predict dyslipidemia risk in the testing set. For ANN, RF, and GBM, the parameters of each model were determined by grid search with 10-fold cross-validation on the training set to ensure the best performance. The discrimination of the models was assessed using the area under the receiver operating characteristic curve (AUC). Net reclassification index (NRI) and integrated discrimination index (IDI) were used to evaluate the improvement in predictive ability of the conventional models when adding GRS. The calibration of the models was assessed by calibration curves. Statistical analyses were performed with R 3.6.2 and Python 3.8.
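The three discrimination measures named above can be sketched for a binary outcome as follows. This is a simplified illustration rather than the study's code: it ignores censoring and follow-up time, uses the category-free (continuous) form of the NRI, and runs on invented predicted risks. The function names are hypothetical.

```python
def auc(y_true, scores):
    """Probability that a random case scores higher than a random non-case (ties count 0.5)."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    controls = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def continuous_nri(y_true, p_old, p_new):
    """Net upward movement of risk in cases minus net upward movement in non-cases."""
    def net_up(outcome):
        moves = [(new > old) - (new < old)
                 for y, old, new in zip(y_true, p_old, p_new) if y == outcome]
        return sum(moves) / len(moves)
    return net_up(1) - net_up(0)


def idi(y_true, p_old, p_new):
    """Change in discrimination slope (mean risk in cases minus mean risk in non-cases)."""
    def slope(probs):
        cases = [p for y, p in zip(y_true, probs) if y == 1]
        controls = [p for y, p in zip(y_true, probs) if y == 0]
        return sum(cases) / len(cases) - sum(controls) / len(controls)
    return slope(p_new) - slope(p_old)


# Invented predicted risks from a model without (p_old) and with (p_new) an added marker.
y = [1, 1, 1, 0, 0, 0]
p_old = [0.6, 0.4, 0.7, 0.5, 0.3, 0.2]
p_new = [0.7, 0.5, 0.7, 0.4, 0.3, 0.1]
print("AUC old/new:", auc(y, p_old), auc(y, p_new))
print("continuous NRI:", continuous_nri(y, p_old, p_new))
print("IDI:", idi(y, p_old, p_new))
```

A positive NRI or IDI means the added marker moved predicted risks in the right direction on balance: up for those who developed the outcome and down for those who did not.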

Baseline characteristics
The baseline characteristics of the training and testing sets are shown in Table 1. The average age of all subjects was 50.49 ± 12.16 years, and the proportion of men was 31.2%. No significant differences in demographic characteristics or lipid measurements were observed between the training and testing sets (all P > 0.05). , respectively, when adjusting for age, family history of diabetes, physical activity, BMI, TG, HDL-C, and LDL-C, which suggested a steady increase in the risk of dyslipidemia with rising GRS. Likewise, the adjusted and crude HRs showed the same steady increase in the training and testing sets.

Development and evaluation of the conventional models
In the training set, the 14 predictors were analyzed using univariate COX regression, and 8 variables (age, family history of diabetes, physical activity, WC, BMI, TG, HDL-C, and LDL-C) showed a statistically significant association with dyslipidemia. Considering the collinearity between WC and BMI, the conventional models were ultimately composed of age, family history of diabetes, physical activity, BMI, TG, HDL-C, and LDL-C (Table 3, above). It is worth noting that there was no collinearity between TG, HDL-C, and LDL-C. The AUCs of the 4 conventional models with different classifiers, and the differences between them, are shown in Fig. 1 and Table 4. In the testing set, the AUCs of the conventional models with the COX, ANN, RF, and GBM classifiers were 0.702 (0.673, 0.729), 0.736 (0.708, 0.762), 0.787 (0.762, 0.811), and 0.816 (0.792, 0.839), respectively.

Development and evaluation of the conventional models with GRS
The conventional + GRS models combined the conventional factors and GRS (Table 3, below). Table 4 shows the differences in discrimination between the conventional models and the conventional + GRS models. With the COX classifier, the addition of GRS improved the predictive ability of the conventional model only to a limited extent. The conventional model showed moderate discrimination, and the addition of GRS slightly increased the AUC to 0.707 (0.679, 0.734); the difference in AUC was 0.00491, which was not statistically significant (P = 0.0549). Nevertheless, the addition of GRS resulted in a continuous NRI of 25.6% (13.8%, 35.8%) and an IDI of 2.3% (1.1%, 3.7%), both of which were statistically significant. With the ANN classifier, the addition of GRS increased the AUC to 0.754 (0.727, 0.779); the difference in AUC was 0.0183 (P = 0.0031).
However, the continuous NRI and IDI were 7.8% (-2.7%, 18.5%) and 1.0% (-0.3%, 2.4%), neither of which was statistically significant. Additionally, the conventional + GRS model with the RF classifier showed significant improvements (NRI: 14.1% (1.1%, 26.1%); IDI: 2.5% (0.5%, 4.2%)), indicating that GRS made a meaningful contribution to predicting dyslipidemia. With the GBM classifier, the discrimination of the prediction model also improved significantly when GRS was added to the conventional model. Figure 2 shows the receiver-operating characteristic curves for the conventional and conventional + GRS models with the different classifiers. The results suggested that the addition of GRS could improve the prediction performance of the conventional models. In addition, the GBM classifier presented the best performance, with an AUC of 0.831 (0.808, 0.853) for the conventional + GRS model. Figure 3 shows the calibration of the conventional and conventional + GRS models. The calibration curves of the conventional + GRS models were closer to the reference line than those of the conventional models.
The Brier scores, which can be considered a "calibration" measure of a set of probabilistic predictions (the lower the Brier score, the better the calibration), also declined with the addition of GRS (by 0.048 for COX, 0.005 for ANN, and 0.006 for GBM), indicating that the models were better calibrated when incorporating GRS. Other statistics, such as sensitivity and specificity, are provided in Table S3 and also indicated that the predictive ability of the models was improved by adding GRS.
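For a binary outcome, the Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome. A minimal sketch follows. The probabilities are invented for illustration and the variable names are hypothetical; it does not reproduce the values reported above.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and observed 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Invented predictions from two hypothetical models for the same four people.
y = [1, 0, 1, 0]
p_conventional = [0.6, 0.4, 0.7, 0.2]
p_with_grs = [0.7, 0.3, 0.8, 0.2]

print("conventional:", brier_score(y, p_conventional))
print("conventional + GRS:", brier_score(y, p_with_grs))
```

Lower is better, and a perfect forecaster scores 0. Because the score depends on how far each probability sits from the observed outcome, it reflects discrimination as well as calibration, which is why it is usually read alongside the calibration curve.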

Discussion
As far as we know, this is the first study to explore the utility of genetic factors in the prediction of dyslipidemia in a resource-limited area based on a prospective study. The results suggested that participants in higher GRS quartiles had an increased risk of dyslipidemia onset compared with those in the lowest GRS quartile. The conventional models were then constructed with COX, ANN, RF, and GBM classifiers, and the model with the GBM classifier significantly outperformed the others. More importantly, the addition of GRS convincingly improved the capability of the conventional models to predict dyslipidemia, implying that genetic factors play a meaningful role in predicting the occurrence of dyslipidemia.
This study examined the association between the genetic factor (GRS) and dyslipidemia by dividing GRS into quartiles. A previous study divided all participants into 3 groups according to GRSs for LDL-C, HDL-C, and TG; the highest GRS groups all presented higher lipid levels than the lowest GRS groups for HDL-C, LDL-C, and TG [23]. Similarly, in this study, we found that a higher GRS was associated with a higher risk of dyslipidemia onset independently of age, family history of diabetes, physical activity, BMI, TG, HDL-C, and LDL-C. Although not every HR was statistically significant, dyslipidemia risk increased with each successive quartile of GRS, and a similar trend was observed in the training and testing sets. Together, these findings indicate a statistically significant increase in dyslipidemia risk with increasing GRS in the rural population.
Results showed that the conventional model with the GBM classifier presented the best predictive performance. Seven of the variables that were statistically significant in univariate COX regression analysis were finally included in the conventional models. Univariate COX regression identified baseline lipoproteins, including TG, HDL-C, and LDL-C, as predictors, which is reasonable because current plasma lipoprotein levels lead to future abnormal blood lipids. In addition, the HRs of the predictors in the conventional model were comparable to those reported in other studies [9,25-29].
Correspondingly, the HRs of these 7 variables were also consistent with those in earlier published studies [16,18,19]. Notably, the three serum lipid parameters showed no collinearity. The findings indicated that the GBM classifier could better predict the incidence of dyslipidemia, which is consistent with a previous study [30]. This might be because the GBM classifier can handle the intricate relationships between predictors and dyslipidemia.
Given the modest but significant association between GRS and dyslipidemia, the ability of GRS to predict the occurrence of dyslipidemia was then evaluated. All 4 classifiers (COX, ANN, RF, and GBM) showed that the discrimination and calibration of the prediction model were moderately improved by adding GRS to the conventional models. With the ANN classifier, the NRI and IDI did not improve significantly with the inclusion of GRS (P > 0.05), although their values increased slightly. Still, a major improvement was observed with the COX, RF, and GBM classifiers. As shown in an earlier study [22], GRSs for HDL-C, LDL-C, and TG proved valuable in predicting adult lipid levels across the transition from childhood to adulthood. Any abnormal lipid index can be defined as dyslipidemia; thus, GRS might have a predictive effect on dyslipidemia, and our results confirm this. Further, the results also suggested that machine learning techniques might perform better in disease prediction than the conventional statistical method, which was consistent with the results of other studies [31,32]. Likewise, the increases in other statistics (Table S3) showed that GRS played a relatively important role in dyslipidemia prediction. Overall, the results of this study revealed that GRS could be an important predictor of the occurrence of dyslipidemia.
As demonstrated in a previous study [33], disclosing coronary heart disease risk estimates that included genetic risk information resulted in lower levels of LDL-C than disclosing estimates based on conventional risk factors only. Genetic risk information for common diseases could be incorporated into conventional predictive models and used to guide treatment. Considering how lipid levels affect CVD [34,35], it is reasonable to infer that adding GRS to the prediction model of dyslipidemia might help individuals prevent abnormal blood lipid levels and thus contribute to the prevention of cardiovascular events.
Strengths and limitations: This research clarified the important role of genetic information in predicting dyslipidemia in a rural area, pointing to a guiding role for genetic information in the clinical prevention and treatment of dyslipidemia. To some extent, the research indicated that machine learning methods might have certain advantages in constructing disease prediction models. In addition, a cohort study was used to construct the conventional model and to analyze the relationship between genetic factors and dyslipidemia, making the results more convincing. Nevertheless, several limitations should be noted. First, combining the four lipid measurements (TC, TG, LDL-C, and HDL-C) into a single dyslipidemia outcome might obscure the effect of genetic information on each lipid index. Even so, genetic information clearly influenced blood lipids, providing a foundation for follow-up studies of genetic factors and lipid levels. Second, the decline in the Brier score was not tested statistically when assessing the calibration of the models. Third, the extrapolation of the conclusions is restricted by the lack of external validation; however, 30% of subjects were randomly selected for internal validation to increase the credibility of the study. Finally, representativeness might be limited because the recruited subjects came only from a rural area in China.

Conclusion
Based on the prospective cohort study, eight dyslipidemia prediction models were developed and evaluated, four without and four with the genetic factor (GRS). The conventional models included age, family history of diabetes, physical activity, BMI, TG, HDL-C, and LDL-C, and showed good performance in predicting dyslipidemia, especially with the GBM classifier. After adding the genetic factor, the prediction performance of the conventional models was effectively enhanced. The results set the stage for future research on the predictive ability of genetic factors for different lipid indexes.

Miaomiao Niu: drafted the manuscript. Liying Zhang, Jian Hou, Wenqian Huo, and Zhenxing Mao: modified the manuscript. All the authors contributed to the revision of the manuscript and approved the final manuscript.

Figure 1 Receiver-operating characteristic curves of conventional models with four classifiers. Abbreviations: ANN: artificial neural network; RF: random forest; GBM: gradient boosting machine.