Risk stratification of ST-segment elevation myocardial infarction (STEMI) patients using machine learning based on lipid profiles

Background Numerous studies have revealed the relationship between lipid expression and increased cardiovascular risk in ST-segment elevation myocardial infarction (STEMI) patients. Nevertheless, few investigations have focused on the risk stratification of STEMI patients using machine learning algorithms.
Methods A total of 1355 STEMI patients who underwent percutaneous coronary intervention were enrolled in this study during 2015–2018. Unsupervised machine learning (consensus clustering) was applied to the present cohort to classify patients into different lipid expression phenogroups, without the guidance of clinical outcomes. Kaplan-Meier curves were implemented to show prognosis during a 904-day median follow-up (interquartile range: 587–1316). In the adjusted Cox model, the association of cluster membership with all adverse events including all-cause mortality, all-cause rehospitalization, and cardiac rehospitalization was evaluated.
Results All patients were classified into three phenogroups, 1, 2, and 3. Patients in phenogroup 1 with the highest Lp(a) and the lowest HDL-C and apoA1 were recognized as the statin-modified cardiovascular risk group. Patients in phenogroup 2 had the highest HDL-C and apoA1 and the lowest TG, TC, LDL-C and apoB. Conversely, patients in phenogroup 3 had the highest TG, TC, LDL-C and apoB and the lowest Lp(a). Additionally, phenogroup 1 had the worst prognosis. Furthermore, a multivariate Cox analysis revealed that patients in phenogroup 1 were at significantly higher risk for all adverse outcomes.
Conclusion Machine learning-based cluster analysis indicated that STEMI patients with increased concentrations of Lp(a) and decreased concentrations of HDL-C and apoA1 are likely to have adverse clinical outcomes due to statin-modified cardiovascular risks.
Trial registration ChiCTR1900028516 (http://www.chictr.org.cn/index.aspx).
Supplementary Information The online version contains supplementary material available at 10.1186/s12944-021-01475-z.


Background
Dyslipidemia is considered a risk factor for atherosclerotic progression [1]. Plasma lipoproteins, including cholesterol esters, apolipoproteins, and triglycerides, can predict adverse outcomes in patients with coronary artery disease (CAD) [2–5]. Nevertheless, plasma lipoproteins may interact in complex physiological or pathophysiological ways, which complicates their analysis and integration in clinical settings [6,7]. Moreover, none of these lipoproteins has been identified as a "one-size-fits-all" marker of CAD prognosis.
ST-segment elevation myocardial infarction (STEMI) is recognized as the most acute manifestation of CAD. Although the prognosis of patients with STEMI has improved with the implementation of reperfusion and lipid-lowering strategies, in-hospital and 1-year mortality rates remain at 5-6% and 7-18%, respectively [8]. Statins, evidence-based drugs recommended for universal use, mainly decrease levels of low-density lipoprotein cholesterol (LDL-C) and reduce long-term mortality [9,10]. Recent evidence has demonstrated that lipid alterations beyond LDL-C are also associated with cardiovascular risk [11]. Several emerging medications with direct effects on lipoproteins other than LDL-C have also been investigated [12]. However, the contribution of other lipid profile components to the overall absolute STEMI-related risk assessment has not been fully evaluated. Hence, it is crucial to develop novel strategies that use lipid profiles to identify high-risk STEMI subgroups.
Unsupervised clustering, an agnostic approach, can segregate patients with similar phenotypes without the guidance of an a priori classification system [13]. Previous studies have used unsupervised cluster analysis to subgroup patients with heart failure, pulmonary artery disease, and CAD [14–16]. However, almost no studies have focused on unsupervised clustering in STEMI patients. Accordingly, this study aimed to generate lipid-derived phenogroups using an unsupervised machine learning method to identify high-risk patients with STEMI during follow-up.

Study population and design
In this study, patients diagnosed with STEMI were consecutively enrolled at the First Affiliated Hospital of Chongqing Medical University between December 2014 and December 2018. All participants had STEMI defined by (1) typical chest pain or equivalent symptoms persisting for more than 30 min, (2) continuous ST-segment elevation in at least two contiguous leads or new left bundle branch block on an electrocardiogram, and (3) elevated levels of a myocardial enzyme more than twice the upper limit value. Patients were excluded if they were admitted more than 24 h after symptom onset, had missing data, or did not receive primary percutaneous coronary intervention (PCI). Ultimately, 1355 patients were included in this study. After admission, the patients received medication in accordance with the guideline for STEMI therapy [17]. Written informed consent was provided by all participants, and the study was conducted according to the Declaration of Helsinki.
In this cohort, the lipid-associated phenotyping approach entailed (i) an unsupervised consensus clustering analysis to identify the STEMI phenogroups without the constraint of a priori clinical data, (ii) a comparison of clinical characteristics among the lipid-derived clusters, and (iii) a multivariate Cox analysis to validate the association of STEMI phenogroups with all adverse events during follow-up.

Data collection
Two physicians independently collected the demographic, clinical, laboratory, angiographic, and medication characteristics of the STEMI patients through the hospital record system. The Gensini score, which indicates atherosclerotic plaque burden, was calculated through angiography before the PCI [18,19]. The postprocedural thrombolysis in myocardial infarction grade was defined according to the operative record files.
Overnight fasting venous blood specimens were obtained for lipid profiles within 24 h of symptom onset. Lipid levels were measured with a Cobas c701 biochemistry analyzer from Roche Diagnostics (Basel, Switzerland). The following seven candidate lipoprotein variables displaying a strong cardiovascular risk association were chosen for further unsupervised clustering analysis: total cholesterol (TC), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), LDL-C, apolipoprotein A1 (apoA1), apolipoprotein B (apoB), and lipoprotein(a) [Lp(a)].
All participants included in the study were regularly contacted (typically every 3 months) via telephone interviews and office visits. The endpoints after discharge were defined as all-cause mortality, all-cause rehospitalization, and cardiac rehospitalization. During a 904-day median follow-up (interquartile range: 587-1316), 166 deaths were registered. All follow-up activities ended on May 1, 2020.

Unsupervised machine learning clustering analysis
The normality of the distribution of the seven lipoprotein variables was first assessed. Because Lp(a) had a skewed distribution, it was converted to Ln [Lp(a)], and the log-transformed variable was used in the subsequent analysis. Thereafter, the seven variables (TC, TG, HDL-C, LDL-C, apoA1, apoB, and Ln [Lp(a)]) were Z-score transformed (to a mean of 0 and variance of 1) to minimize the effect of variables with a larger variance on clustering. Unsupervised consensus clustering was then implemented to sort STEMI patients into phenogroups based on the lipid profile using the "ConsensusClusterPlus" package in R [20]. Consensus clustering was run with 1000 resampling iterations (80% of patients per subsample) over a range of cluster numbers (k = 2-20). Clustering stability at each k was assessed through the proportion of ambiguously clustered pairs (PAC) and consensus matrix heatmaps [20,21]. Four algorithms (namely, the k-means, hierarchical, partitioning around medoids, and k-medoids algorithms) with seven different distance metrics (28 total combinations) were compared through the "fpc" package to determine the clustering input parameters with the best internal validity. Ultimately, the k-medoids algorithm and Pearson distance were applied for consensus clustering, and k = 3 was chosen as the optimal number of clusters through PAC and consensus heatmap (Supplementary Table 1 and Supplementary Figure 1).
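The preprocessing and resampling steps described above can be illustrated with a minimal Python sketch. It is not the ConsensusClusterPlus implementation used in this study: the function names (zscore, pearson_distance, k_medoids, consensus_matrix, pac), the simple alternating k-medoids routine, the PAC bounds of 0.1 and 0.9, the reduced number of resamples, and the synthetic profiles are all assumptions made for illustration.

```python
import math
import random
import statistics


def zscore(values):
    """Scale one variable to mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def pearson_distance(a, b):
    """Distance between two patient profiles: 1 - Pearson correlation."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1.0 if den == 0 else 1.0 - num / den


def k_medoids(dist, items, k, rng, max_iter=50):
    """Alternating k-medoids on a precomputed distance matrix (illustrative only)."""
    medoids = rng.sample(items, k)
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in items:
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        updated = [min(members, key=lambda c: sum(dist[c][j] for j in members))
                   for members in clusters.values() if members]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return {i: min(medoids, key=lambda m: dist[i][m]) for i in items}


def consensus_matrix(rows, k, n_resamples=100, frac=0.8, seed=0):
    """Fraction of shared subsamples in which each pair falls in the same cluster."""
    rng = random.Random(seed)
    n = len(rows)
    dist = [[pearson_distance(a, b) for b in rows] for a in rows]
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        items = rng.sample(range(n), int(frac * n))
        labels = k_medoids(dist, items, k, rng)
        for i in items:
            for j in items:
                sampled[i][j] += 1
                together[i][j] += labels[i] == labels[j]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def pac(cm, lower=0.1, upper=0.9):
    """Proportion of pairs whose consensus lies strictly between the two bounds."""
    n = len(cm)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(lower < cm[i][j] < upper for i, j in pairs) / len(pairs)


# Synthetic example: 3 groups of 10 "patients" with 7 variables each (made-up data).
rng = random.Random(1)
shapes = [[1, 1, -1, 1, -1, 1, 0], [-1, -1, 1, -1, 1, -1, 0], [0, 0, -1, 0, -1, 0, 2]]
raw = [[s + rng.gauss(0, 0.3) for s in shapes[g]] for g in range(3) for _ in range(10)]
# Z-score each variable (column); a skewed variable such as Lp(a) would be
# log-transformed with math.log before this step.
columns = [zscore(col) for col in zip(*raw)]
profiles = [list(row) for row in zip(*columns)]
for k in (2, 3, 4):
    print(k, round(pac(consensus_matrix(profiles, k)), 3))
```

A lower PAC indicates that fewer patient pairs switch between "clustered together" and "clustered apart" across subsamples, which is why the value of k with the lowest PAC is usually read as the most stable solution.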

Principal component analysis (PCA)
To identify the discriminative performance of the unsupervised machine learning algorithm, principal component analysis (PCA) was applied as a dimensional reduction technique to summarize the overall clinical variation of the lipid profiles. The first three principal components (PCs) (accounting for more than 80% variance) were selected for further analysis (Supplementary Figure 2A). Differences in PC1, PC2, and PC3 among the three phenogroups were also identified (Supplementary Figure 2B-D). Finally, patients were mapped into a coordinate system based on the first three PCs (Supplementary Figure 3).
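As a rough illustration of the dimensional reduction step described above, the following Python sketch extracts the leading principal components of a small data matrix and reports the proportion of variance each one explains. It is a sketch under stated assumptions, not the analysis code of this study: the function names (covariance_matrix, principal_components), the use of power iteration with deflation, and the toy data are illustrative choices.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of the columns, plus the mean-centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def principal_components(rows, n_components, n_iter=500, seed=0):
    """Leading eigenpairs of the covariance matrix via power iteration with deflation."""
    cov, centered = covariance_matrix(rows)
    p = len(cov)
    total_var = sum(cov[i][i] for i in range(p))
    rng = random.Random(seed)
    m = [row[:] for row in cov]
    ratios, axes = [], []
    for _component in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(n_iter):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
        ratios.append(eigval / total_var)
        axes.append(v)
        # Deflation: remove the component just found before searching for the next.
        m = [[m[i][j] - eigval * v[i] * v[j] for j in range(p)] for i in range(p)]
    scores = [[sum(c[j] * axis[j] for j in range(p)) for axis in axes] for c in centered]
    return ratios, axes, scores


# Toy data (made up): three correlated variables for six subjects.
toy = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.7], [3.0, 6.2, 0.2],
       [4.0, 8.1, 0.9], [5.0, 9.8, 0.4], [6.0, 12.1, 0.6]]
ratios, axes, scores = principal_components(toy, 3)
print([round(r, 3) for r in ratios])
```

The cumulative sum of the explained-variance ratios is the quantity used to decide how many components to retain (here, the first three components covering more than 80% of the variance), and the per-subject scores are the coordinates used to map patients into the component space.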

Clinical comparison of phenogroups
Differences in demographic, clinical, laboratory, angiographic and medication characteristics among the lipid-derived phenogroups were compared. Continuous variables were summarized as mean (SD) or median (interquartile range) depending on whether their distribution was normal or non-normal, respectively, whereas categorical variables were summarized as frequencies (percentage). To examine the differences among the phenogroups, one-way analysis of variance and the Kruskal-Wallis test were conducted on the normally and non-normally distributed data, respectively. A Chi-squared test was used for categorical variables.
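The group comparisons described above can be sketched in plain Python for the three-phenogroup setting. This is a hedged illustration, not the R code used in this study: the function names (average_ranks, kruskal_wallis, chi_squared, chi2_sf_df2) and the example numbers are assumptions. The P value helper relies on the fact that a chi-squared variable with 2 degrees of freedom has upper-tail probability exp(-x/2), which applies to a Kruskal-Wallis test across three groups and to a 3 x 2 contingency table. One-way analysis of variance is omitted because its P value requires the F distribution.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    ties = {}
    for v in pooled:
        ties[v] = ties.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    return h / correction


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, r in enumerate(row_tot):
        for j, c in enumerate(col_tot):
            expected = r * c / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, (len(row_tot) - 1) * (len(col_tot) - 1)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


# Made-up values for three groups (not study data).
h = kruskal_wallis([[1.2, 1.9, 2.4, 3.1], [2.8, 3.5, 4.0, 4.4], [0.9, 1.1, 1.6, 2.0]])
print("Kruskal-Wallis H:", round(h, 3), "P:", round(chi2_sf_df2(h), 4))
stat, df = chi_squared([[30, 70], [45, 55], [50, 50]])  # rows: groups; columns: yes/no
print("Chi-squared:", round(stat, 3), "df:", df, "P:", round(chi2_sf_df2(stat), 4))
```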
Next, the all-cause mortality estimates obtained with the Kaplan-Meier curves were compared across different phenogroups through a log-rank test. Patients in different phenogroups who were re-admitted to the hospital due to all-cause or cardiac events (including recurrent myocardial infarction, heart failure, cardiogenic shock, arrhythmia, major bleeding, and cardiac mortality) were also compared via Kaplan-Meier curves.
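The survival comparison described above can be illustrated with a short Python sketch of the Kaplan-Meier estimator and a log-rank test. As a simplifying assumption, the log-rank function below compares only two groups at a time (1 degree of freedom), whereas the analysis in this study compared three phenogroups; the function names (kaplan_meier, logrank_two_groups) and the follow-up times are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (event: 1 = occurred, 0 = censored)."""
    surv, curve = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - failed / at_risk
        curve.append((t, surv))
    return curve


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-squared statistic and P value (1 degree of freedom)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({u for u, e in zip(all_times, all_events) if e}):
        n = sum(1 for u in all_times if u >= t)          # at risk, both groups
        n_a = sum(1 for u in times_a if u >= t)          # at risk, group A
        d = sum(1 for u, e in zip(all_times, all_events) if u == t and e)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = obs_minus_exp ** 2 / var if var > 0 else 0.0
    # Upper tail of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up follow-up times in days (not study data); 0 marks a censored patient.
group_a = ([120, 340, 560, 800, 900], [1, 1, 0, 1, 0])
group_b = ([400, 700, 950, 1100, 1300], [0, 1, 0, 0, 0])
print(kaplan_meier(*group_a))
print(logrank_two_groups(*group_a, *group_b))
```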
Multivariate Cox proportional hazards regression was implemented to explore the association of the phenogroups with all adverse outcomes. Multivariable models were adjusted for age, gender, history of diabetes, hypertension, smoking status, culprit artery, creatinine, left ventricular ejection fraction, high-sensitivity C-reactive protein, cardiac troponin I, time to balloon (h), and thrombolysis in myocardial infarction grade (≤II/III). All statistical analyses were conducted with R version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria). For these analyses, a P value ≤ 0.05 was considered statistically significant.
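To make the Cox model concrete, the sketch below fits a proportional hazards model with a single binary covariate (for example, membership in one phenogroup versus the others) by Newton-Raphson maximization of the partial likelihood. It is deliberately simplified and is not the adjusted multivariable model fitted in R for this study: the function name cox_single_covariate, the Breslow handling of tied event times, and the toy data are assumptions for illustration only.

```python
import math


def cox_single_covariate(times, events, x, n_iter=25, tol=1e-10):
    """Estimate log(hazard ratio) for one covariate; returns (beta, standard error)."""
    beta, hess = 0.0, 0.0
    for _ in range(n_iter):
        grad, hess = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored subjects contribute only through risk sets
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            w = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(w)
            s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
            s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
            grad += x_i - s1 / s0                  # score contribution
            hess -= s2 / s0 - (s1 / s0) ** 2       # second-derivative contribution
        if hess == 0:
            break  # covariate is constant in every risk set; beta is not identifiable
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    # The standard error uses the curvature from the last evaluated iteration.
    se = math.sqrt(-1 / hess) if hess < 0 else float("nan")
    return beta, se


# Toy data (made up): x = 1 marks the hypothetical higher-risk group.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
group = [1, 1, 0, 1, 0, 0]
beta, se = cox_single_covariate(times, events, group)
print("hazard ratio:", round(math.exp(beta), 2))
print("95% CI:", round(math.exp(beta - 1.96 * se), 2), "to", round(math.exp(beta + 1.96 * se), 2))
```

In a multivariable model the same partial likelihood is maximized over a vector of coefficients, so the scalar gradient and curvature become a score vector and an information matrix; a dedicated survival analysis library is the appropriate tool for that case.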

Results
Machine learning-based lipid-derived phenogroups
Figure 1 shows the overall design of the study. The three phenogroups that displayed a distinct lipid profile pattern were identified using an unsupervised machine learning algorithm. The lipid profile levels for the different phenogroups are illustrated in Fig. 2. Patients in phenogroup 1 had the lowest concentrations of apoA1 and HDL-C and moderate levels of TC, TG, LDL-C, and apoB, whereas those in phenogroup 2 had the lowest concentrations of TC, TG, LDL-C, and apoB and the highest levels of apoA1 and HDL-C. Conversely, patients in phenogroup 3 had the highest levels of TC, TG, LDL-C, and apoB and intermediate levels of HDL-C and apoA1. Lp(a) decreased from phenogroup 1 to 3. All seven lipoprotein variables were significantly different among the three phenogroups (P < 0.001, Table 1).
Pearson correlations among the seven lipoprotein variables were calculated (Fig. 3a). There were moderately strong positive associations (r > 0.5) among TC, LDL-C, and apoB and between apoA1 and HDL-C. The remaining pairs showed either weak positive or negative correlations.

Baseline characteristics among phenogroups
The baseline characteristics, including demographics, clinical signs, angiographic findings and medications, are presented in Table 2. Patients in phenogroup 2 were older (P < 0.001) and had the lowest percentage of men (P < 0.001), dyslipidemia (P < 0.001), and smoking (P < 0.001). Further, they had the highest percentage of left anterior descending artery occlusion (P = 0.02). Patients in phenogroup 2 had the lowest hemoglobin A1c (P < 0.001) but the highest hemoglobin (P < 0.001) and free thyroxine (P = 0.002). Patients in phenogroup 1 had the highest creatinine (P < 0.001) levels. Serum free triiodothyronine levels were lower in phenogroups 1 and 2 than in phenogroup 3 (P = 0.004). No significant difference in statin use (P = 0.56) was detected among the phenogroups, but aspirin (P = 0.003) and beta-blocker (P = 0.024) use were the highest, and diuretic (P = 0.004) use was the lowest in phenogroup 3.

Discussion
In this first unsupervised machine learning-based clustering study of STEMI patients, three distinct phenogroups were identified according to multiple serum lipoprotein levels, revealing different lipoprotein expression patterns and baseline characteristics. Patients in phenogroup 1 with the highest Lp(a) and lowest apoA1 and HDL-C had the worst prognosis in the adjusted Cox analysis.
STEMI, which is one of the most critical clinical presentations of CAD, is caused by plaque rupture or erosion with thrombotic obstruction of the epicardial coronary artery and subsequent transmural ischemia [22]. Despite the substantial improvement of prognosis among STEMI patients due to the development of reperfusion and preventive measures over several decades [23], STEMI remains the leading cause of mortality and morbidity globally [24,25]. Szummer et al. [26] reported that the first-year mortality of STEMI patients in Sweden remains at 14.1%, even with the wide implementation of a variety of treatment strategies, including PCI, use of statins and beta-blockers, dual antiplatelet therapy, and use of angiotensin-converting enzyme inhibitors/angiotensin-receptor blockers. Hence, better risk stratification of STEMI patients is necessary to further improve prognosis. Machine learning algorithms can identify underlying patterns in complex and heterogeneous data. Furthermore, unsupervised clustering analysis can shed light on the non-linear interactions among variables without a priori attention to clinical events [13]. Recently, machine learning-based approaches have been implemented to stratify patients with heart failure based on echocardiographic parameters [27–29]. Additionally, machine learning analyses have been used to phenomap prognostic categories and discover the responders of cardiac resynchronization therapy among heart failure patients through mixed-data phenotypic variables [14,30]. However, no studies have focused on recognizing the different patterns of lipoprotein expression through unsupervised consensus clustering in STEMI patients. Moreover, all lipoprotein variables included in this study were associated with cardiovascular risk. Hence, investigating the lipoprotein expression features of phenogroups with poor prognosis could be helpful for risk stratification.
Unsupervised clustering is an information-driven method that analyzes the intrinsic relationships in high-dimensional data and identifies specific subtypes of patients [20]. This method is also helpful in exploring complex lipoprotein variables. Furthermore, the analysis focuses on extracting valuable insights from the dataset without reference to clinical outcomes. Hence, this method provides an open-ended exploratory perspective on the data and can identify new lipoprotein phenogroups [31].
Surprisingly, phenogroup 3 with the highest levels of LDL-C, TC, apoB, and TG was associated with the best prognosis, whereas phenogroup 2 with the lowest levels of LDL-C, TC, TG, and apoB and highest levels of HDL-C and apoA1 had a relatively increased risk for adverse clinical outcomes. A likely explanation is that phenogroup 2 comprised much older patients and a larger percentage of female patients compared to the other phenogroups (in phenogroup 2, 29.4% of patients were female and 59.9% were ≥ 65 years old). Recent findings indicate that female STEMI patients have a higher risk of death than male STEMI patients [32]. Furthermore, patients in phenogroups 2 and 3 showed a similar risk of clinical outcomes after discharge in the multivariate Cox analysis, which indicates that higher LDL-C increases the risk in younger phenogroups. In contrast, patients in phenogroup 1 with the highest Lp(a) and lowest apoA1 and HDL-C levels had the worst clinical outcomes even after adjustment for differences in age and gender. Hence, the lipoprotein characteristics of phenogroup 1 warrant closer examination.
After years of lipid-lowering therapy development, statins have been widely used, especially in STEMI patients. The Statins Evaluation in Coronary Procedures and Revascularization (SECURE-PCI) study revealed that statin therapy during hospitalization brought significant benefits for STEMI patients undergoing PCI [33]. Statins primarily act on LDL-C, and high-intensity statin therapy is predicted to decrease LDL-C by more than 50% [34]. Furthermore, almost every patient included in the study was treated with a statin, and this standard post-STEMI treatment was equally distributed across the phenogroups. However, residual cardiovascular risk continues to be high despite statin therapy [35]. Lipoprotein variables, including HDL-C, apoA1 and Lp(a), have been reported as predictors of statin-modified cardiovascular risk. Mechanistically, the major lipid effect of statins is the lowering of circulating concentrations of LDL-C and TG [36]. However, the influence of statins on HDL-C is minimal [37]. A newly published meta-analysis, which included 20 randomized controlled trials in Asian populations, revealed that statin/ezetimibe combination therapy slightly increased HDL-C by 0.02 mmol/L [38]. Furthermore, it has been recently documented that an increase in the serum concentration of Lp(a) is associated with statin therapy [39].
ApoA1, the main protein constituent of high-density lipoprotein (HDL) particles, plays a critical role in reverse cholesterol transport and in anti-inflammatory, antithrombotic, and antioxidant activities [40]. Furthermore, emerging evidence indicates that HDL and apoA1 are correlated with the improvement of stent biocompatibility after PCI [41]. Other clinical trials identified that lower HDL-C is associated with cardiovascular events in patients with type 2 diabetes mellitus and stable ischemic heart disease even with optimal control of LDL-C levels [42,43]. In contrast, the unexpectedly high prognostic risk of patients in phenogroup 2, who had higher levels of HDL-C and apoA1, may be due to their older age and larger percentage of female patients compared with phenogroup 3.
Lp(a) consists of LDL-like particles containing apoB-100 and its covalently linked glycoprotein apo(a) particle, which is determined by the LPA gene [44]. A Mendelian randomization study demonstrated that elevated Lp(a) is a strong and causal risk factor of atherosclerotic cardiovascular disease [45]. Traditional lipid-lowering therapies including statins, fibrates, and ezetimibe do not lower Lp(a) levels efficiently [46]. New and emerging medicines such as proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors and antisense oligonucleotides targeting apolipoprotein(a) (IONIS-APO(a) Rx and IONIS-APO(a)-L Rx) could reduce Lp(a) by 30%-40% and 60%-70%, respectively [47–50]. Recent guidelines recommend that patients with extremely high Lp(a) levels should be treated with PCSK9 inhibitors instead of statins [9]. However, evidence on the clinical application of new Lp(a)-lowering drugs in STEMI patients is still lacking, and nearly none of the patients in the present study had received PCSK9 inhibitor therapy. Additionally, patients in phenogroups 1 and 3 could be classified into type IIa and type IIb according to Fredrickson classification, respectively [51,52]. Homma Y et al. [53] observed that simvastatin did not alter Lp(a) levels in either type IIa or type IIb dyslipidemia. Furthermore, an observational study reported that high concentrations of Lp(a) through a low number of LPA kringle-IV type-2 repeats were associated with a high risk of mortality in the general population [54]. Additionally, two prospective trials demonstrated that cardiovascular disease risk associated with elevated Lp(a) persisted at LDL-C levels below 2.5 mmol/L [55]. Moreover, individuals who underwent PCI with LDL-C levels below 2.6 mmol/L still had worse all-cause mortality and acute coronary syndrome outcomes when their levels of Lp(a) were increased [56].

Study strengths and limitations
This study is the first to identify the association of different lipoprotein phenogroups with prognosis in STEMI patients through machine learning analysis. More importantly, the relationship between lipid-derived phenogroups and outcomes remained significant after multivariable adjustment.
However, the limitations of the study should be noted. First, participants enrolled in this study were patients with STEMI from only one hospital; hence, prospective multicenter data may be needed in the future. Second, another independent dataset should be used for validation of the unsupervised clustering analysis. Third, the levels of lipoproteins during follow-up were not collected in this study, which may be important for further stratification work. Finally, limitations of the follow-up method precluded exploring the association between lipid-derived phenogroups and cardiac mortality, and the mean follow-up time of 2.5 years was shorter than in some other studies.

Conclusions
The present study identified three phenogroups with different lipoprotein features by using a machine learning algorithm in STEMI patients. Patients in phenogroup 1 with the highest Lp(a) but the lowest apoA1 and HDL-C had the highest mortality, all-cause rehospitalization, and cardiac rehospitalization rates at follow-up. This association remained significant in multivariable adjusted Cox models. Our findings indicate that STEMI patients with high Lp(a) and low HDL-C and apoA1 warrant particular attention, regardless of age and gender. The administration of Lp(a)-lowering drugs such as PCSK9 inhibitors and antisense oligonucleotides in STEMI patients with high Lp(a) may need to be recommended in future guidelines.