
Risk factor analysis and risk prediction study of obesity in steelworkers: model development based on an occupational health examination cohort dataset



Obesity is increasingly recognized as a grave public health concern globally. It is associated with prevalent diseases including coronary heart disease, fatty liver, type 2 diabetes, and dyslipidemia. Prior research has identified demographic, socioeconomic, lifestyle, and genetic factors as contributors to obesity. Nevertheless, the influence of occupational risk factors on obesity among workers remains under-explored. Investigating risk factors specific to steelworkers is crucial for early detection, prediction, and effective intervention, thereby safeguarding their health.


This research drew on a cohort study examining health effects among workers at an iron and steel company in Hebei Province, China, and involved 5469 participants. Predictor variables were selected through univariate analysis, multifactor analysis, and a review of the relevant literature. Three predictive models were then built: XG Boost, Support Vector Machine (SVM), and Random Forest (RF).


Univariate analysis and Cox proportional hazards regression modeling identified age, gender, smoking and drinking habits, dietary score, physical activity, shift work, exposure to high temperatures, occupational stress, and carbon monoxide exposure as key factors in the development of obesity in steelworkers. On the test set, accuracies were 0.819, 0.868, and 0.872 for XG Boost, SVM, and RF, respectively. Precision rates were 0.571, 0.696, and 0.765, while recall rates were 0.333, 0.592, and 0.481. The models achieved AUCs of 0.849, 0.908, and 0.912, with Brier scores of 0.128, 0.105, and 0.104, log losses of 0.409, 0.349, and 0.345, and calibration-in-the-large of 0.058, 0.054, and 0.051, respectively. Among these, the Random Forest model demonstrated superior performance.


The research indicates that obesity in steelworkers results from a combination of occupational and lifestyle factors. Of the models tested, the Random Forest model exhibited superior predictive ability, highlighting its significant practical application.


Obesity, a metabolic disorder, leads to various health and psychological issues [1]. The World Health Organization recognizes obesity as a major global public health challenge, impacting individual and societal health and escalating healthcare costs. Obesity risk factors are multifaceted, encompassing demographic and socioeconomic elements (age, gender, ethnicity, education, income, marital status, and residency) [2,3,4,5]; lifestyle factors (dietary status, smoking, alcohol consumption, and physical activity) [6,7,8,9,10]; and genetic influences [11, 12]. While some risk factors for obesity are immutable, others can be modified. Identifying modifiable risk factors is critical for developing effective prevention and intervention strategies to reduce obesity. For occupational groups, it is also essential to consider job-related exposure factors. Studies indicate that obesity prevalence in occupational groups significantly exceeds that in the general population [13]. This is particularly evident in the steel industry, characterized by diverse job roles, hazardous work conditions, a large workforce, limited health awareness, and unhealthy habits. A 2021 study of iron and steel workers in Beijing, Tianjin, and Hebei revealed an obesity rate of 63.16%, substantially higher than the 50.70% rate among Chinese adults [14], highlighting a major health risk for these workers.

Previous studies have shown that specific occupation-related factors in steel enterprises have a significant impact on obesity in steel workers [15, 16]. Steelworkers, frequently exposed to high temperatures, noise, and dust, and often engaged in shift work, face unique obesity risks compared to the general population. Given these findings, investigating occupational risk factors for obesity and devising protective strategies and measures is imperative. Early detection and lifestyle interventions for at-risk steelworkers can significantly reduce the incidence of obesity.

Recent advancements in medicine have seen the rapid evolution and widespread integration of machine learning (ML) technologies, particularly in diagnosing, prognosticating, and managing diseases. The use of ML to model epidemiological data is gaining prominence in published scientific literature. Compared to traditional methods, prior research suggests that ML techniques enhance the prediction of health outcomes [17]. While numerous studies have employed ML to forecast obesity prevalence [18,19,20,21], these models typically focus on disease risk within the general population and overlook specific characteristics of occupational groups. As a result, such models are not suitable for steelworkers. Consequently, there is a pressing need to develop a new obesity risk prediction model tailored to steelworkers, aiming to improve their health and quality of life. This study, using physical examination data of steelworkers from 2017 to 2022, aims to identify obesity risk factors specific to this group and determine the best obesity prediction model applicable to steelworkers.


Study subject

This study draws on the “Cohort Study on Health Effects of Occupational Populations in Beijing-Tianjin-Hebei Region,” part of the National Key Research and Development Initiative. A baseline survey, conducted from January to September 2017, focused on workers in an iron and steel enterprise in Tangshan City (ISCO-08: 8122). Four follow-up data collections were completed in 2019, 2020, 2021, and 2022. Inclusion criteria were: age 18 to 60 years; regular employment status in the organization; a working tenure exceeding one year; and a non-obese status at baseline. Exclusion criteria included a working tenure of less than one year, being over 60 years of age, loss to follow-up, or incomplete survey information. All participants provided informed consent. The North China University of Science and Technology Ethics Committee granted approval for the study on May 12, 2016, in accordance with the Declaration of Helsinki (approval number: 16040). Figure 1 depicts the participant selection procedure.

Fig. 1 The process of study participant selection

Information collection

The study used a survey questionnaire administered in one-on-one interviews with steelworkers, conducted by PhD and MSc students from North China University of Science and Technology. Licensed medical examiners followed standard testing procedures when performing physical examinations of these workers. Fasting blood samples were collected before 9:00 a.m. daily for laboratory analysis, using a Mindray automatic biochemical analyzer (BS-800) for standard blood biochemical testing.

Data collection primarily included: (1) demographic information such as age, education level, marital status, and household income; (2) lifestyle habits like smoking, drinking, exercising, and diet; (3) physical and laboratory tests including blood biochemistry, height, and weight; and (4) occupational hazard exposure, covering aspects like shift work, service duration, dust, high temperatures, noise, and CO exposure.

Obesity diagnostic criteria

Body mass index (BMI) was calculated from each respondent’s measured height and weight as weight (kg) divided by height squared (m²). The criteria for defining obesity based on BMI differ slightly internationally, reflecting regional population characteristics. In 2002, China conducted an extensive epidemiological survey of over 240,000 adults across 21 provinces, including Taiwan [22, 23], and established its obesity criterion: a BMI of ≥ 28.0 kg/m².
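As a minimal sketch of the criterion above, the snippet below computes BMI from height and weight and applies the 28.0 kg/m² cut-off. The function names and the example measurements are hypothetical and are not taken from the study data.

```python
OBESITY_CUTOFF = 28.0  # kg/m^2, the Chinese adult criterion cited in the text


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float) -> bool:
    """True when BMI meets or exceeds the 28.0 kg/m^2 cut-off."""
    return bmi(weight_kg, height_m) >= OBESITY_CUTOFF


# Hypothetical worker: 85 kg, 1.72 m
print(round(bmi(85, 1.72), 1), is_obese(85, 1.72))
```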

Variable definition


Smoking status

The three categories of smoking status among the participants were never smoked, former smoker, and current smoker, following the World Health Organization’s definitions [24]. ‘Current smoker’ denotes smoking for over six months at a minimum of one cigarette per day; ‘former smoker’ refers to those who had quit smoking for at least six months.


Alcohol consumption

Alcohol consumption was classified as never drinking, former drinker, or current drinker, as per guidelines from the Chinese Center for Disease Control and Prevention [25]. ‘Current drinker’ implies regular alcohol consumption for over six months, at least once per week; ‘former drinker’ denotes abstaining for at least six months.


Diet

The study assessed consumption of red meat, processed meats, sugary drinks, grains, vegetables, fruits, milk, nuts, and legumes, along with sodium intake. Dietary scores, based on the Dietary Approaches to Stop Hypertension (DASH) criteria [26], were assigned. Each food category was scored from 1 to 5 based on weekly intake frequency. The total dietary score ranged from 8 to 40. With a median DASH score of 25, this study divided dietary patterns into two categories: DASH < 25 and DASH ≥ 25.
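The scoring rule can be sketched as follows, assuming eight component scores of 1 to 5 each (which gives the stated 8 to 40 range). How the listed foods map onto the eight components is not specified in the text, so the function simply takes eight numbers; its name is hypothetical.

```python
DASH_CUTOFF = 25  # median score reported in the text


def dash_group(component_scores):
    """Sum eight component scores (1-5 each) and return (total, group label)."""
    if len(component_scores) != 8 or any(not 1 <= s <= 5 for s in component_scores):
        raise ValueError("expected eight component scores between 1 and 5")
    total = sum(component_scores)
    label = "DASH >= 25" if total >= DASH_CUTOFF else "DASH < 25"
    return total, label


# Hypothetical respondent
print(dash_group([4, 3, 3, 3, 3, 3, 3, 3]))
```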

Physical activity

This study assessed the physical activity of employees in the iron and steel industry using the International Physical Activity Questionnaire (IPAQ) [27]. The questionnaire covered daily work, transportation, lifestyle activities, exercise, recreation, sedentary time, and sleep duration. Each activity in the IPAQ was assigned a metabolic equivalent of task (MET) value (Table 1). An individual’s weekly level of a given activity was calculated as MET × weekly frequency × daily duration. The values for the various activities were summed to determine the total weekly physical activity level (MET-min/week). Based on intensity, frequency, and total weekly activity, physical activity levels were categorized as “low,” “medium,” or “high” (Table 2).

Table 1 The physical activity attributes and their MET assignments in the IPAQ long form
Table 2 Individual physical activity level grouping criteria
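The weekly total described above (MET × weekly frequency × daily duration, summed over activities) can be sketched as below. The MET values are commonly used IPAQ values assumed for illustration only; the study's own Table 1 assignments may differ, and the respondent is hypothetical.

```python
# Assumed MET values for illustration; not taken from the study's Table 1.
MET_VALUES = {"walking": 3.3, "moderate": 4.0, "vigorous": 8.0}


def weekly_met_minutes(activities):
    """Sum MET x days per week x minutes per day over all reported activities.

    `activities` is a list of (kind, days_per_week, minutes_per_day) tuples.
    """
    total = 0.0
    for kind, days_per_week, minutes_per_day in activities:
        total += MET_VALUES[kind] * days_per_week * minutes_per_day
    return total


# Hypothetical respondent
report = [("walking", 5, 30), ("moderate", 3, 40), ("vigorous", 1, 20)]
print(weekly_met_minutes(report))  # MET-min/week
```

Grouping into "low", "medium", or "high" would additionally need the Table 2 criteria, which are not reproduced here.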

High temperature

Following the national standard “Measurement of Physical Factors in the Workplace Part 7: High Temperature” [28], operations with a WBGT index ≥ 25 °C and a significant heat source are classified as high-temperature operations.


Noise

According to the national standard “Measurement of Physical Factors in the Workplace Part 8: Noise” [29], operations are considered noisy if the equivalent sound level exposure is ≥ 80 dB(A) over 8 h per day or 40 h per week.

Dust exposure

Cumulative personal dust exposure was computed following the national standard “Determination of dust in workplace air part 1: total dust concentration” [30], using the steel company’s actual daily monitoring data and on-site total dust concentration tests conducted by a qualified testing organization.

CO exposure

Following the national standard “Determination of Air Toxic Substances in Workplaces Inorganic Carbonaceous Compounds” [31], individual cumulative CO exposure was calculated based on on-site CO concentration assessments conducted by qualified testing companies and the daily actual test results from steel companies.

Occupational stress

A modified version of the Job Content Questionnaire (JCQ) [32] was used to quantify occupational stress. It consisted of three dimensions: job demands (5 items), job autonomy (9 items), and social support (8 items). Each item was rated on a 1 to 4 scale, with the total score for each dimension reflecting job demands, autonomy, and social support levels. Occupational stress was assessed using the demand/control (D/C) ratio, calculated as follows:

$$\mathrm{D/C}\;\mathrm{ratio}=\frac{\mathrm{Job}\;\mathrm{requirement}\;\mathrm{factor}\;\mathrm{score}}{\mathrm{Degree}\;\mathrm{of}\;\mathrm{job}\;\mathrm{autonomy}\;\mathrm{factor}\;\mathrm{score}\times C}$$

In this equation, C represents the ratio of job demand items to job autonomy items (5/9). A D/C ratio ≤ 1 denotes the absence of occupational stress, whereas a D/C ratio > 1 indicates occupational stress.
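A minimal sketch of the D/C ratio follows, assuming the dimension scores are simple sums of the 1 to 4 item ratings. The function names and the example ratings are hypothetical.

```python
N_DEMAND_ITEMS = 5
N_AUTONOMY_ITEMS = 9
C = N_DEMAND_ITEMS / N_AUTONOMY_ITEMS  # correction factor 5/9 from the text


def dc_ratio(demand_scores, autonomy_scores):
    """Demand/control ratio = demand total / (autonomy total x C)."""
    if len(demand_scores) != N_DEMAND_ITEMS or len(autonomy_scores) != N_AUTONOMY_ITEMS:
        raise ValueError("expected 5 demand items and 9 autonomy items")
    return sum(demand_scores) / (sum(autonomy_scores) * C)


def has_occupational_stress(demand_scores, autonomy_scores):
    """Occupational stress is flagged when the D/C ratio exceeds 1."""
    return dc_ratio(demand_scores, autonomy_scores) > 1


# Hypothetical respondent: high demands, moderate autonomy
demands = [3, 4, 3, 4, 3]
autonomy = [2, 3, 2, 3, 2, 3, 2, 3, 2]
print(round(dc_ratio(demands, autonomy), 2), has_occupational_stress(demands, autonomy))
```

Multiplying the autonomy total by C puts the two dimensions on the same scale, so a ratio of 1 means demands and autonomy are balanced per item.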

Shift work: Shift work was categorized as never, former (previously but not currently shifted), and current.

Sleep quality: Sleep quality was evaluated using the Athens Insomnia Scale (AIS). This scale includes 8 items, each scored from 0 to 3, with the total score determining the AIS score. According to AIS criteria: AIS < 4 indicates no sleep disorder; 4 ≤ AIS ≤ 6 suggests suspected insomnia; AIS > 6 confirms insomnia.
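The AIS thresholds above translate directly into a small classifier. This is an illustrative sketch; the function name and return format are hypothetical.

```python
def classify_ais(item_scores):
    """Map eight AIS item scores (0-3 each) to the categories given in the text."""
    if len(item_scores) != 8 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("AIS needs eight item scores, each 0-3")
    total = sum(item_scores)
    if total < 4:
        return total, "no sleep disorder"
    if total <= 6:
        return total, "suspected insomnia"
    return total, "insomnia"


# Hypothetical respondent
print(classify_ais([1, 1, 1, 1, 1, 0, 0, 0]))
```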

Sample size calculation

Sample size was first assessed on the basis of the overall outcome proportion. A literature review showed that the prevalence of obesity among steel company workers is 20.01% [14]. With a margin of error (δ) of 0.05, a minimum of 248 study subjects was required, as given by Eq. (2).


The number of predictor variables p was roughly 20, and the mean absolute prediction error (MAPE) was set at 0.05 as the target mean error for each predicted value [33]. Consequently, a minimum of 1,459 study subjects was deemed necessary, as shown in Eq. (3), where φ denotes the anticipated outcome proportion.

$$\begin{array}{c}n=exp\left(\frac{-0.508\ +\ 0.259\ ln\left(\varphi\right)\ +\ 0.504\;ln\left(p\right)\ -\ ln\left(MAPE\right)}{0.544}\right)\end{array}$$

Minimizing overfitting is critical for the model’s predictive accuracy. Riley et al. [34, 35] recommend choosing the sample size and the number of predictor variables so that the expected shrinkage is small (≤ 10%, i.e., an expected shrinkage factor S ≥ 0.9). To target an expected shrinkage of 10% and reduce model overfitting, the anticipated Cox-Snell R²_CS was set to 0.1, the expected shrinkage factor S to 0.9, and the number of predictor variables p to roughly 20. By Eq. (4), a minimum of 1125 study subjects was needed.

$$\begin{array}{c}n=\frac{p}{\left(S-1\right)\ln\left(1-\frac{R_{CS}^2}{S}\right)}\end{array}$$

Furthermore, the sample size should ensure only a small difference between the developed model’s apparent and adjusted values of R²_CS. With max R²_CS set at 0.75, the required sample size was calculated to be 497, as detailed in Eqs. (5) and (6).

$$\begin{array}{c}S'=\frac{R_{CS}^2}{R_{CS}^2+\delta\,\max\left(R_{CS}^2\right)}\end{array}$$
$$\begin{array}{c}n=\frac{p}{\left(S'-1\right)\ln\left(1-\frac{R_{CS}^2}{S'}\right)}\end{array}$$

Therefore, the study necessitated a minimum of 1,459 participants. With a total of 5,469 participants, the sample size was well-suited for the research objectives.
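The formulas in Eqs. (3) to (6) can be evaluated with a few lines of Python. This is a sketch only: the function names are hypothetical, and the inputs below are read from the text, where p is given only as "roughly 20". The exact inputs behind the reported minimums are not fully specified, so the values this sketch returns should be treated as indicative rather than as a reproduction of the reported figures.

```python
import math


def n_for_mape(phi, p, mape):
    """Eq. (3): sample size for a target mean absolute prediction error."""
    exponent = (-0.508 + 0.259 * math.log(phi) + 0.504 * math.log(p) - math.log(mape)) / 0.544
    return math.exp(exponent)


def n_for_shrinkage(p, r2_cs, s):
    """Eq. (4): sample size for an expected shrinkage factor s."""
    return p / ((s - 1) * math.log(1 - r2_cs / s))


def n_for_small_optimism(p, r2_cs, max_r2_cs, delta):
    """Eqs. (5)-(6): compute S' from delta, then apply the same form as Eq. (4)."""
    s_prime = r2_cs / (r2_cs + delta * max_r2_cs)
    return n_for_shrinkage(p, r2_cs, s_prime)


# Inputs as stated in the text (p is approximate there)
candidates = [
    n_for_mape(phi=0.2001, p=20, mape=0.05),
    n_for_shrinkage(p=20, r2_cs=0.1, s=0.9),
    n_for_small_optimism(p=20, r2_cs=0.1, max_r2_cs=0.75, delta=0.05),
]
# The binding requirement is the largest of the individual minimums.
print([math.ceil(n) for n in candidates], math.ceil(max(candidates)))
```

Taking the maximum mirrors the logic of the text, where the overall minimum is the largest requirement across the criteria.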

Model construction and evaluation

Three predictive models—XG Boost, Support Vector Machines, and Random Forests—were developed using Python 3.8.10. The sample data were randomly divided in a 7:2:1 ratio into training, test, and validation sets using the pandas and NumPy libraries in Python.
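A random 7:2:1 split can be sketched as follows. The authors report using pandas and NumPy; this illustration uses only the Python standard library and returns row indices, and the function name and seed are assumptions.

```python
import random


def split_indices(n_rows, seed=0):
    """Shuffle row indices and cut them 70% / 20% / 10% (train / test / validation)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(0.7 * n_rows)
    n_test = round(0.2 * n_rows)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder, about 10%
    return train, test, validation


train_idx, test_idx, val_idx = split_indices(5469)
print(len(train_idx), len(test_idx), len(val_idx))
```

Fixing the seed makes the partition reproducible, and assigning the remainder to the validation set guarantees that every row lands in exactly one subset.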

A comprehensive assessment and comparison of these models was conducted using the following metrics: (1) accuracy, (2) precision, (3) recall, (4) AUC, (5) calibration curve, (6) Brier score, (7) log loss, and (8) calibration-in-the-large. The first three are defined as follows:

$$\begin{array}{c}\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}\end{array}$$

$$\begin{array}{c}\mathrm{Precision}=\frac{TP}{TP+FP}\end{array}$$

$$\begin{array}{c}\mathrm{Recall}=\frac{TP}{TP+FN}\end{array}$$

Here, TP (True Positives) refers to correctly classified positive samples, FP (False Positives) to negative samples misclassified as positive, TN (True Negatives) to correctly classified negative samples, and FN (False Negatives) to positive samples misclassified as negative.

(4) AUC: The area under the ROC curve, or AUC, reflects the diagnostic value of the model. An AUC closer to 1 signifies superior diagnostic performance.

(5) Calibration curve: The model’s calibration is more accurate the closer this curve is to the diagonal line.

(6) Brier score: This metric is the mean squared difference between predicted probabilities and observed outcomes, with values ranging from 0 to 1. Values closer to 0 indicate better calibration; a score of 0.25 corresponds to an uninformative model that predicts a probability of 0.5 for every subject.

(7) Log loss: Commonly used in logistic regression and neural networks, as well as certain variants of the expectation-maximization algorithm, this metric evaluates the probabilistic output of a classifier.

(8) Calibration-in-the-large: This refers to the calibration curve’s intercept. A value closer to 0 indicates more accurate model calibration.
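Several of these metrics can be computed directly from observed labels and predicted probabilities, as in the sketch below. It is not the authors' code: the 0.5 classification threshold, the function names, and the toy data are assumptions. The calibration curve and calibration-in-the-large are omitted because they require fitting a recalibration model.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, Brier score, and log loss for binary outcomes."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, d in zip(y_true, y_pred) if t == 1 and d == 1)
    tn = sum(1 for t, d in zip(y_true, y_pred) if t == 0 and d == 0)
    fp = sum(1 for t, d in zip(y_true, y_pred) if t == 0 and d == 1)
    fn = sum(1 for t, d in zip(y_true, y_pred) if t == 1 and d == 0)
    n = len(y_true)
    eps = 1e-15  # clip probabilities so that log(0) never occurs
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n,
        "log_loss": -sum(
            t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
        ) / n,
    }


def auc(y_true, y_prob):
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data, purely illustrative
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
metrics = classification_metrics(y_true, y_prob)
auc_value = auc(y_true, y_prob)
print(metrics, auc_value)
```

The pairwise AUC is quadratic in sample size, which is acceptable for a small illustration; a rank-based computation would be preferable for thousands of subjects.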

Statistical analysis

The original database was compiled using Excel 2016. Statistical analyses were conducted with IBM SPSS 24.0. Count data were presented as composition ratios or rates and compared between groups using the Chi-square test; ordinal data were described in the same way and compared using the Kruskal-Wallis test. Cox proportional hazards regression modeling was used for multifactor analysis. All tests were two-sided, with the significance level α set at 0.05.

Quality control

Investigators were trained uniformly and strictly followed the inclusion and exclusion criteria. To ensure data accuracy, data entry was double-checked, combining computer and manual verification with logical error checks. Dedicated staff maintained and calibrated measurement instruments on a regular basis. The data were analyzed using appropriate statistical techniques to support the validity of the results.

Research findings

During the follow-up period, the incidence of new obesity cases among the study participants varied annually: 1,055 cases in 2019, 120 in 2020, 72 in 2021, and 74 in 2022. By the end of the follow-up, the total number of new obesity cases reached 1,319, comprising 1,246 males and 73 females. The overall obesity prevalence among steelworkers was 24.1%.

Single-factor analysis

The demographic characteristics of the study population indicated a decreasing risk of obesity with increasing age. Incidence rates were higher among males than females and varied significantly across marital statuses and educational levels (P < 0.05) (Table 3).

Table 3 Analysis of demographic characteristics of research objects

Behavioral lifestyle analysis of the steelworkers revealed that those with lower DASH diet scores had a significantly higher obesity incidence compared to those with higher scores. Additionally, obesity prevalence was higher among workers who smoked and consumed alcohol. Workers with low physical activity levels also showed a higher incidence of obesity compared to their more active counterparts, underscoring the potential role of unhealthy lifestyles as a risk factor for obesity. These findings are presented in Table 4.

Table 4 Analysis of the behavior and lifestyle of the study subjects

Analysis of occupational hazards indicated an upward trend in obesity prevalence among steelworkers with increasing age. Factors such as shift work, exposure to high temperatures, CO, and occupational stress were identified as obesity risk factors (Table 5).

Table 5 Analysis of occupational factor exposure of research subjects

Multifactor analysis

Multifactor analysis of steelworkers’ data was conducted using the Cox proportional hazards model. The influencing factors for obesity in steelworkers were identified as sex, age, smoking status, alcohol consumption, DASH diet score, physical activity, shift work, and CO exposure (Table 6).

Table 6 COX regression analysis of factors affecting obesity among steel workers

Model effectiveness evaluation

Incorporating results from both univariate and multivariate analyses, as well as relevant literature, the study selected 10 significant independent variables for the model: age, sex, smoking status, drinking status, DASH diet score, physical activity level, shift work, high-temperature exposure, CO exposure, and occupational stress.

Training on 3828 samples (70%) demonstrated that for the random forest model, precision, AUC, log loss, and calibration-in-the-large were 0.823, 0.873, 0.340, and 0.049, respectively. For the support vector machine model, accuracy, recall, and Brier scores were 0.861, 0.602, and 0.105, respectively. Initially, these two models performed better, with the XG Boost model lagging. Model parameters were refined during training and tested using validation set data. Results from 547 validation samples (10%) showed that the random forest model’s metrics—precision, AUC, Brier score, log loss, and calibration-in-the-large—were 0.684, 0.849, 0.122, 0.388, and 0.051, respectively, surpassing the other models. Testing on 1,094 test set samples (20%) confirmed that the random forest model’s accuracy, precision, AUC, log loss, Brier score, and calibration-in-the-large outperformed the other two models, indicating its optimal overall performance (Table 7).

Table 7 Evaluation of three risk models

The three models were compared in terms of the area under the ROC curve (AUC). In the training set, the XG Boost model had the lowest AUC, whereas the random forest model outperformed both of the other models. Similar results were observed for the test and validation sets, demonstrating the Random Forest model’s superior predictive capability. These results are illustrated in Fig. 2a-c.

Fig. 2 Three models’ ROC curves: a Training set; b Validation set; c Test set

The calibration curves of the random forest model in the training, test, and validation sets were closely aligned with the diagonal, indicating minimal bias. The calibration curves for all three models in the respective sets are displayed in Fig. 3a-c.

Fig. 3 Three models’ calibration curves: a Training set; b Validation set; c Test set

Additionally, the data were analyzed using a more traditional logistic regression model. This analysis revealed that the logistic regression model’s predictive performance was superior to that of the XG Boost model, yet inferior to the Random Forest and SVM models (Table 8). The calibration and ROC curves for the logistic regression model are presented in Fig. 4a-b.

Table 8 Evaluation indicators of logistic regression
Fig. 4 a ROC curves of logistic regression; b Calibration curves of logistic regression


Timely identification, diagnosis, and treatment are of great help for tertiary prevention. Machine learning techniques have recently enhanced the field of disease risk prediction. While obesity prediction in the general population has been extensively studied, research on occupational populations, particularly steelworkers, is limited. Occupational hazards are recognized risk factors for obesity in this group, but studies focusing on steelworkers are scarce [16, 36,37,38]. Steelworkers’ lifestyles, heavily influenced by their work environment and conditions, underscore the need to identify modifiable obesity risks in this demographic to develop effective prevention methods and policies. This study, conducted over five years with 5469 iron and steel workers, found a five-year cumulative obesity prevalence of 24.1% among these workers. The study suggests that obesity in steelworkers is influenced not only by lifestyle factors but also by various occupational factors. By constructing and comparing Random Forest, XG Boost, and Support Vector Machine risk prediction models, and referencing classical logistic regression model metrics, the Random Forest model emerged as the most effective in this study.

In this research, factors such as age, gender, DASH diet score, drinking and smoking habits, degree of physical activity, shift work, high-temperature exposure, CO exposure, and occupational stress were identified as significant in the development of obesity among steelworkers. Notably, shift work, high-temperature exposure, CO exposure, and occupational stress distinguish this group from the general population. The obesity rate was notably higher among workers engaged in or with a history of shift work, possibly due to disruptions in circadian rhythms and sleep-wake cycles, leading to abnormal lipid metabolism and insulin secretion disturbances. This finding aligns with the study by Grundy et al. [15]. Moreover, shift work often coincides with night light exposure, another significant factor in obesity development [39]. The effects of high-temperature exposure on obesity are not widely researched. Prolonged high-temperature exposure may reduce the activity of brown adipose tissue, which is necessary for maintaining constant body temperature, thus decreasing energy expenditure and increasing susceptibility to insulin resistance and fat accumulation. Epidemiological studies indicate a negative correlation between brown adipose tissue and obesity prevalence, with individuals having higher proportions of this tissue at a lower obesity risk [40, 41]. The findings on CO exposure in this study were unexpected. Prolonged excessive CO exposure may inhibit heme oxygenase (HO), leading to disturbances in lipid metabolism and thereby contributing to obesity development. In both animal and human studies, upregulation of HO has been shown to ameliorate obesity and enhance vascular function [42, 43]. CO, being a toxic and hazardous gas, necessitates vigilance in industrial settings.
Effective measures are required to protect workers from CO exposure, including enhanced protective measures and improved ventilation, especially considering that CO generation is often an inevitable aspect of production operations. Mental health disorders and negative emotions stemming from occupational stress can contribute to abnormal eating behaviors and sedentary lifestyles, further escalating the risk of obesity [44]. Associated depression and anxiety frequently lead to insomnia, a significant obesity risk factor [45, 46]. This study also identified smoking and alcohol consumption as lifestyle factors contributing to obesity, consistent with previous studies. Wannamethee et al. [10] found that heavy drinkers typically had higher BMIs than nondrinkers or moderate drinkers, but could reach BMI levels similar to those of nondrinkers after five years of controlled drinking. A 2017 study in Korea by Rha et al. [47] reported a positive association between alcohol consumption and central obesity prevalence. Furthermore, epidemiological evidence indicated a positive correlation between obesity prevalence and smoking duration, a finding echoed in a related study in China [48]. In a cross-sectional analysis of a Chinese multiethnic cohort, Tang et al. concluded that adherence to the DASH diet reduces obesity risk [49]. The outcomes of this study support these earlier findings: steelworkers with higher DASH diet scores had a lower risk of obesity.

This study not only evaluated the three machine learning models described above but also compared them with logistic regression, a traditional statistical prediction model. The limitations of logistic regression, particularly when dealing with non-independent disease risk factors and potential nonlinear relationships, impacted its predictive accuracy. Adjustments to the logistic regression model, such as transforming numerical variables into ordered categorical variables, improved its performance. This aligns with previous findings where logistic regression’s predictive power diminishes if data requirements are not met [50]. Casanova et al. [51] compared Random Forest and logistic regression in classifying 3,443 patients with diabetic retinopathy and found Random Forest to be more accurate. XG Boost is an enhancement of the GBDT-based boosting algorithm [52]. Despite its effectiveness, XG Boost was not the preferred method for predicting obesity in steelworkers due to its relatively lower performance on evaluation indices compared to the other models. Support Vector Machines have shown promise in previous obesity studies [53]; in this study, only the recall on the final test set was higher than that of the Random Forest model, and the differences in the other indicators were minimal. However, this model requires data preprocessing and parameterization for large sample sizes and presents challenges in monitoring and visualization. The Random Forest model has excelled in chronic disease prediction. Alghamdi et al. [54] used methods including decision trees, naive Bayes, logistic regression, and random forest for diabetes prediction in the Henry Ford Exercise Testing (FIT) project database, finding Random Forest to be the most effective.
In this study, the Random Forest model not only effectively differentiated between normal and abnormal BMI but also showed the highest agreement between predicted and actual results, making it particularly suited for analyzing obesity data among steelworkers. Additionally, the model can attribute internal importance to predictor variables, aiding in subsequent model visualization. Based on these findings, the Random Forest model is recommended for obesity risk prediction in steelworkers.

Study strengths and limitations

This five-year follow-up study included 5,469 individuals and was based on the Beijing-Tianjin-Hebei cohort, which supports the completeness and credibility of its findings. Unlike previous obesity studies, this research incorporated both conventional and occupational factors, aligning the conclusions more closely with the characteristics of the occupational population. This study is novel in using machine learning methods to predict obesity risk in steelworkers, providing new methodological support for future obesity-related disease prevention. Although previous studies have shown associations of high-temperature and CO exposure with obesity, their specific impacts on obesity development in steelworkers had not been explored until now.

However, the study has limitations. It did not include genetic data from steelworkers, considering genetics are immutable and their inclusion would not aid in providing practical obesity prevention recommendations. Furthermore, this study only built and completed internal validation of the model for predicting the risk of obesity in steelworkers; external validation was not conducted. Moreover, while the optimal model for predicting obesity in steelworkers was identified, further investigation is needed on how to effectively visualize and apply this model.


A five-year observational study involving 5,469 steelworkers found that age, sex, drinking and smoking habits, DASH diet score, physical activity level, shift work, exposure to high temperatures, and CO exposure were the main factors influencing the development of obesity in this group. A Random Forest model suited to predicting obesity in steelworkers was successfully developed and demonstrated superior predictive ability compared to the other models.

Availability of data and materials

The datasets used in this study are available from the corresponding author upon reasonable request.



Abbreviations

SVM: Support vector machine

RF: Random forest

ML: Machine learning

BMI: Body mass index

AIS: Athens insomnia scale

DASH: Dietary Approaches to Stop Hypertension

AUC: Area under the ROC curve


  1. Withrow D, Alter DA. The economic burden of obesity worldwide: a systematic review of the direct costs of obesity. Obes Rev. 2011;12:131–41.

    Article  CAS  Google Scholar 

  2. Mirzazadeh A, Sadeghirad B, Haghdoost AA, Bahrein F, Kermani MR. The prevalence of obesity in Iran in recent decade; a systematic review and Meta-analysis study. Iran J Public Health. 2009;38:1–11.

    Google Scholar 

  3. Cui HY. Analysis of overweight and obesity status and risk factors in Haidian District. China Public Health Manag. 2008;05:529–30.

  4. Yao YH, Zhong L, Liu YC, Fu Y, Zhu YJ, Pan Y, Liu JW, Yao Y, Han WQ, Li ZJ, et al. Epidemiological characteristics of overweight and obesity among adults in Jilin Province and investigation and analysis of influencing factors. J Jilin Univ (Medical Edition). 2013;39:1051–6.

    Google Scholar 

  5. Chen JQ, Brown TR, Russo J. Regulation of energy metabolism pathways by estrogens and estrogenic chemicals and potential implications in obesity associated with increased exposure to endocrine disruptors. Biochim Biophys Acta. 2009;1793:1128–43.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  6. Li ZW. A study on the relationship between dietary patterns and overweight and obesity in the community elderly in Beichen District. Tianjin. M.S: Tianjin Medical University; 2015.

    Google Scholar 

  7. Expert consensus on the prevention. And treatment of obesity in the Chinese population. Chin J Prev Med. 2022;23:321–39.

    Google Scholar 

  8. Brock DW, Thomas O, Cowan CD, Allison DB, Gaesser GA, Hunter GR. Association between insufficiently physically active and the prevalence of obesity in the United States. J Phys Act Health. 2009;6:1–5.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Yang X, Telama R, Leskinen E, Mansikkaniemi K, Viikari J, Raitakari OT. Testing a model of physical activity and obesity tracking from youth to adulthood: the cardiovascular risk in young finns study. Int J Obes (Lond). 2007;31:521–7.

    Article  PubMed  CAS  Google Scholar 

  10. Wannamethee SG, Shaper AG. Alcohol, body weight, and weight gain in middle-aged men. Am J Clin Nutr. 2003;77:1312–7.

    Article  PubMed  CAS  Google Scholar 

  11. Ma ZL. Association of STAT3 gene polymorphism with obesity and lipid metabolism disorders in Chinese Han population. M.S Southern Medical University. 2014;15:12258–69.

  12. Ou ZJ. Association of CRTC3 and UCP1 gene polymorphisms with obesity and lipid metabolism disorders in Chinese Han population. D Southern Medical University. 2013;03:99.

  13. Xiao MY, Wang C, Fan HM, Che CL, Lu Y, Cong LX, Gao XJ, Liu YJ, Yuan JX, Li SM, et al. Effect of shift work on overweight/obesity in male steel workers. Chin J Epidemiol. 2016;37:1468–72.

    CAS  Google Scholar 

  14. Wu JH. Construction and prediction of health index for workers in steel enterprises based on Beijing-Tianjin-Hebei occupational cohort. D North China University of Science and Technology. 2021;04:144.

  15. Grundy A, Cotterchio M, Kirsh VA, Nadalin V, Lightfoot N, Kreiger N. Rotating shift work associated with obesity in men from northeastern Ontario. Health Promot Chronic Dis Prev Can. 2017;37:238–47.

    Article  PubMed Central  Google Scholar 

  16. Flouris AD, Dinas PC, Ioannou LG, Nybo L, Havenith G, Kenny GP, Kjellstrom T. Workers’ health and productivity under occupational heat strain: a systematic review and meta-analysis. Lancet Planet Health. 2018;2:e521-531.

    Article  Google Scholar 

  17. Chatterjee A, Gerdes MW, Martinez SG. Identification of risk factors associated with obesity and overweight-a machine learning overview. Sens (Basel). 2020;20:2734.

    Article  Google Scholar 

  18. Zhang S, Tjortjis C, Zeng X-j, Qiao H, Buchan IE, Keane JA. Comparing data mining methods with logistic regression in childhood obesity prediction. Inform Syst Front. 2009;11:449–60.

    Article  CAS  Google Scholar 

  19. Golino HF, Amaral LS, Duarte SF, Gomes CM, Soares Tde J, Dos Reis LA, Santos J. Predicting increased blood pressure using machine learning. J Obes. 2014;2014:637635.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Zheng Z, Ruggiero K. Using machine learning to predict obesity in high school students. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2017;2017:2132–8.

    Article  Google Scholar 

  21. DeGregory KW, Kuiper P, DeSilvio T, Pleuss JD, Miller R, Roginski JW, Fisher CB, Harness D, Viswanath S, Heymsfield SB, et al. A review of machine learning in obesity. Obes Rev. 2018;19:668–85.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  22. China Working Group on Obesity: Guidelines for the prevention and control of overweight and obesity in Chinese adults (excerpt). J Nutr. 2004;01:1–4.

  23. China Obesity Working Group Data Aggregation and Analysis Collaborative Group: Predictive value of body mass index and waist circumference for abnormal risk factors of related Diseases in Chinese adults: a study of appropriate body mass index and waist circumference cut points. Chin J Epidemiol. 2002;01:10–5.

  24. Schultz H. Tobacco or health: a global status report. Ann Saudi Med. 1998;18:195.

    Article  PubMed  CAS  Google Scholar 

  25. Chinese Center for Disease Control and Prevention. : Report on the Monitoring of Chronic Diseases and Their Risk Factors in China 2010. Report on the Monitoring of Chronic Diseases and Their Risk Factors in China 2010; 2012.

  26. Fung TT, Chiuve SE, McCullough ML, Rexrode KM, Logroscino G, Hu FB. Adherence to a DASH-style diet and risk of coronary heart disease and stroke in women. Arch Intern Med. 2008;168:713–20.

    Article  PubMed  Google Scholar 

  27. Lou X, He Q. Validity and reliability of the International Physical Activity Questionnaire in Chinese Hemodialysis patients: a multicenter study in China. Med Sci Monit. 2019;25:9402–8.

    Article  PubMed Central  Google Scholar 

  28. Measurement of physical factors in the workplace Part 7: High temperature. vol. GBZ/T 189.7–2007. pp. 5p:A4: National Standard of the People’s Republic of China. ; 2007:5p:A4.

  29. Measurement of physical factors in. the workplace Part 8: Noise GBZ/T 189.8–2007.

  30. Determination of dust in workplace air Part 1. Total dust concentration. Vol. GBZ/T 192.1–2007. National Standard of the People’s Republic of China; 2007. p. 9.

  31. Labor Health Research Institute of Benxi Iron and Steel Company. : Determination of air toxic substances in the workplace Inorganic carbon-containing compounds. vol. GBZ/T 160.28–2004. pp. 5P.;A4; 2004:5P.;A4.

  32. Sun RC, Lan YJ. A study on the association between job fit and occupational stress among nursing staff. Chin J Prev Med. 2020;54:1197–201.

    CAS  Google Scholar 

  33. van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res. 2019;28:2455–74.

    Article  PubMed  Google Scholar 

  34. Riley RD, Snell KI, Ensor J, Burke DL, Harrell FE Jr, Moons KG, Collins GS. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. Stat Med. 2019;38:1276–96.

    Article  PubMed  Google Scholar 

  35. Van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat Med. 1990;9:1303–25.

    Article  Google Scholar 

  36. Hannerz H, Albertsen K, Nielsen ML, Tuchsen F, Burr H. Occupational factors and 5-year weight change among men in a Danish national cohort. Health Psychol. 2004;23:283–8.

    Article  PubMed  Google Scholar 

  37. Foraster M, Eze IC, Vienneau D, Schaffner E, Jeong A, Heritier H, Rudzik F, Thiesse L, Pieren R, Brink M. Long-term exposure to transportation noise and its association with adiposity markers and development of obesity. Environ Int. 2018;121:879–89.

    Article  Google Scholar 

  38. Kenji O, Keigo S, Junko I, Nozomi O, Kimiko T, Satoko N, Yoshito I, Norio K. Exposure to light at night, nocturnal urinary melatonin excretion, and Obesity/Dyslipidemia in the Elderly: a cross-sectional analysis of the HEIJO-KYO study. J Clin Endocrinol Metabolism. 2013;98:337–44.

    Article  Google Scholar 

  39. Mcfadden E, Jones ME, Schoemaker MJ, Ashworth A, Swerdlow AJ. The relationship between obesity and exposure to light at Night: cross-sectional analyses of over 100,000 women in the breakthrough generations study. Am Polit Sci Rev. 2014;65:358–75.

    Google Scholar 

  40. Kenny GP, Flouris AD. The human thermoregulatory system and its response to thermal stress. Protective Cloth. 2014:319–65.

  41. Kenny GP, Poirier MP, Metsios GS, Boulay P, Dervis S, Friesen BJ, Malcolm J, Sigal RJ, Seely AJ, Flouris AD. Hyperthermia and cardiovascular strain during an extreme heat exposure in young versus older adults. Temp (Austin). 2017;4:79–88.

    Google Scholar 

  42. Otterbein LE, Bach FH, Alam J, Soares M, Tao Lu H, Wysk M, Davis RJ, Flavell RA, Choi AM. Carbon monoxide has anti-inflammatory effects involving the mitogen-activated protein kinase pathway. Nat Med. 2000;6:422–8.

    Article  PubMed  CAS  Google Scholar 

  43. Peterson SJ, Dave N, Kothari J. The effects of heme oxygenase upregulation on obesity and the metabolic syndrome. Antioxid Redox Signal. 2020;32:1061–70.

    Article  PubMed  CAS  Google Scholar 

  44. Pan X-F, Wang L, Pan A. Epidemiology and determinants of obesity in China. Lancet Diabetes Endocrinol. 2021;9:373–92.

    Article  PubMed  Google Scholar 

  45. Jean-Louis G, Williams NJ, Sarpong D, Pandey A, Youngstedt S, Zizi F, Ogedegbe G. Associations between inadequate sleep and obesity in the US adult population: analysis of the national health interview survey (1977–2009). BMC Public Health. 2014;14: 290.

    Article  PubMed  PubMed Central  Google Scholar 

  46. Fan CX, Yang LL, Wang CF, Wu YL. Across-sectional study of the relationship between sleep duration and obesity in American adults. J Qingdao Univ School Med. 2016;52:169–71.

    Google Scholar 

  47. Rha EY, Kim HJ, Han K, Park Y, Yoo G. Gender-specific relationship between alcohol consumption and injury in the South Korean adults: a nationwide cross-sectional study. Med (Baltim). 2017;96: e5385.

    Article  Google Scholar 

  48. Liu TT, Zhou XT, Li WL, Peng YY, Liu SJ, Wang JJ, Ren T, Wang LP, Yuan P. Analysis of the current prevalence of overweight and obesity among adults in Mianyang City, Sichuan Province. J Sichuan Univ (Medical Edition). 2017;48:946–8.

    Google Scholar 

  49. Tang D, Xiao X, Chen L, Kangzhu Y, Deng W, Basang u, Yang S, Long L, Xie X, Lu J, et al. Association of dietary patterns with obesity and metabolically healthy obesity phenotype in Chinese population: a cross-sectional analysis of China multi-ethnic cohort study. Br J Nutr. 2022;128:2230–40.

    Article  PubMed  CAS  Google Scholar 

  50. Wang QQ, Yu SC, Qi X, Hu YH, Zheng WJ, Shi JX, Mo HY. Logistic family regression and its applications. Chin J Prev Med. 2019;53:6.

    Google Scholar 

  51. Casanova R, Saldana S, Chew EY, Danis RP, Greven CM, Ambrosius WT. Application of random forests methods to diabetic retinopathy classification analyses. PLoS ONE. 2014;9: e98587.

    Article  PubMed  PubMed Central  Google Scholar 

  52. Guan X, Zhang B, Fu M, Li M, Yuan X, Zhu Y, Peng J, Guo H, Lu Y. Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized COVID-19 patients: results from a retrospective cohort study. Ann Med. 2021;53:257–66.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  53. Selya AS, Anshutz D. Machine Learning for the Classification of Obesity from Dietary and Physical Activity Patterns. In: Advanced Data Analytics in Health. Edited by Giabbanelli PJ, Mago VK, Papageorgiou EI. Cham: Springer International Publishing; 2018;93:77–97.

  54. Alghamdi M, Al-Mallah M, Keteyian S, Brawner C, Ehrman J, Sakr S. Predicting Diabetes Mellitus using SMOTE and ensemble machine learning approach: the Henry Ford ExercIse Testing (FIT) project. PLoS ONE. 2017;12: e0179805.

    Article  PubMed  PubMed Central  Google Scholar 

Download references


Acknowledgements

The authors are grateful to the participants in this study and all members involved in collecting the baseline data.

Funding

This research was funded by the Youth Talent Promotion Program of the School of Public Health, North China University of Science and Technology (2023002).

Author information




Study design, Z.Z. and L.X.; Methodology, L.H., J.W. and R.M.; Project administration, Z.S., H.W., X.W. and J.C.; Software, Y.Z., H.W. (Huan Wang) and J.H.; Validation, Z.Z. (Ziqi Zhao) and H.Z.; Writing of the original draft, Z.Z.; Review of the manuscript, X.L. and L.X. All authors responded to the modification of the study protocol and approved the final manuscript.

Corresponding authors

Correspondence to Xiaoming Li or Ling Xue.

Ethics declarations

Ethics approval and consent to participate

The study received approval from the Ethics Committee of the North China University of Science and Technology (No. 16040). Informed consent was obtained from all participants.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. A copy of this licence is available from Creative Commons. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Zhao, Z., Lu, H., Meng, R. et al. Risk factor analysis and risk prediction study of obesity in steelworkers: model development based on an occupational health examination cohort dataset. Lipids Health Dis 23, 10 (2024).
