The Journal of Practical Medicine (实用医学杂志) ›› 2026, Vol. 42 ›› Issue (7): 1158-1164. doi: 10.3969/j.issn.1006-5725.2026.07.006

• Special Report: Tuberculosis •

运用机器学习构建肺癌与肺结核鉴别诊断模型

周游1,2,陈纪飞2,何希2,刘爱梅2,杨小兵2,黄一芳1

  1. 广西医科大学第一附属医院检验科,广西高校临床检验诊断学重点实验室 (广西 南宁 530021)
  2. 广西壮族自治区胸科医院科教科生物样本库 (广西 柳州 545005)
  • 收稿日期:2025-09-17 修回日期:2025-11-14 接受日期:2025-12-02 出版日期:2026-04-10 发布日期:2026-04-13
  • 通讯作者: 黄一芳 E-mail:YFY004462@sr.gxmu.edu.cn
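  • Funding:
    Guangxi Science and Technology Major Project (No. 桂科AA22096027); Guangxi Medical and Health Appropriate Technology Development and Promotion Application Project (S2023050)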

Constructing a differential diagnosis model for lung cancer and pulmonary tuberculosis using machine learning

You ZHOU1,2, Jifei CHEN2, Xi HE2, Aimei LIU2, Xiaobing YANG2, Yifang HUANG1

  1. Department of Clinical Laboratory, the First Affiliated Hospital of Guangxi Medical University, Key Laboratory of Clinical Laboratory Medicine of Guangxi Department of Education, Nanning 530021, Guangxi, China
  2. Department of Biological Sample Bank of Science and Education, Guangxi Zhuang Autonomous Region Chest Hospital, Liuzhou 545005, Guangxi, China
  • Received:2025-09-17 Revised:2025-11-14 Accepted:2025-12-02 Online:2026-04-10 Published:2026-04-13
  • Contact: Yifang HUANG E-mail:YFY004462@sr.gxmu.edu.cn

摘要:

目的 运用机器学习方法构建预测模型,用于肺癌与肺结核的鉴别诊断。 方法 回顾性分析2020年7月至2023年5月在广西壮族自治区胸科医院就诊的585例患者临床资料,年龄14 ~ 90岁,其中男457例,女128例。根据临床最终诊断结果,将585例病例分为肺癌组和肺结核组。比较两组病例肿瘤标志物检测结果的差异。运用Lasso和单因素logistic回归分析筛选肺癌与肺结核鉴别诊断的特征变量。构建随机森林模型,并对重要预测变量因素进行排序。构建Lasso-logistic回归模型;通过ROC曲线分析比较随机森林模型和Lasso-logistic回归模型的预测效能。 结果 肺癌组血清肿瘤标志物CA125、CEA、CYFRA21-1、NSE、SCCA明显高于肺结核组,差异有统计学意义(P < 0.05)。Lasso/单因素logistic回归分析筛选出肺癌与肺结核鉴别诊断的特征变量为性别、年龄、CA125、CEA、CYFRA21-1、NSE、SCCA。通过随机森林模型对特征变量进行重要预测变量排序,依次为CYFRA21-1、CEA、SCCA、NSE、CA125、年龄、性别;Lasso-logistic回归分析结果显示,CYFRA21-1、CEA、NSE水平和年龄是区分肺癌与肺结核的独立危险因素(P < 0.05);随机森林模型和Lasso-logistic回归模型鉴别诊断肺癌与肺结核的AUC、灵敏度、特异度、准确度、约登指数分别为0.938、90.38%、87.50%、0.888、0.779和0.958、86.54%、92.19%、0.879、0.787。 结论 肿瘤标志物CA125、CEA、CYFRA21-1、NSE、SCCA检测在肺癌和肺结核鉴别诊断中具有重要临床意义,本研究构建的随机森林模型和Lasso-logistic回归模型均能很好地区分肺癌和肺结核;Lasso-logistic回归模型确定,CYFRA21-1、CEA、NSE水平和年龄是鉴别肺癌与肺结核的独立危险因素。

关键词: 机器学习, 随机森林, 鉴别诊断, 肺癌, 肺结核

Abstract:

Objective To develop a machine learning-based predictive model for the differential diagnosis of lung cancer and pulmonary tuberculosis. Methods Clinical data from 585 patients (457 males and 128 females; age range 14 to 90 years) who attended Guangxi Zhuang Autonomous Region Chest Hospital between July 2020 and May 2023 were analyzed retrospectively. Based on the final clinical diagnosis, the patients were divided into a lung cancer group and a pulmonary tuberculosis group, and tumor marker test results were compared between the two groups. Lasso and univariate logistic regression analyses were used to select feature variables for differentiating lung cancer from pulmonary tuberculosis. A random forest model was constructed, and the predictive variables were ranked by importance. A Lasso-logistic regression model was also constructed, and the predictive performance of the two models was compared by ROC curve analysis. Results Serum levels of the tumor markers CA125, CEA, CYFRA21-1, NSE, and SCCA were significantly higher in the lung cancer group than in the pulmonary tuberculosis group (P < 0.05). Lasso and univariate logistic regression analyses identified sex, age, CA125, CEA, CYFRA21-1, NSE, and SCCA as feature variables for differentiating lung cancer from pulmonary tuberculosis. The random forest model ranked these variables by importance as follows: CYFRA21-1, CEA, SCCA, NSE, CA125, age, and sex. Lasso-logistic regression analysis indicated that CYFRA21-1, CEA, and NSE levels and age were independent risk factors for distinguishing lung cancer from pulmonary tuberculosis (P < 0.05). For the random forest model, the AUC, sensitivity, specificity, accuracy, and Youden index were 0.938, 90.38%, 87.50%, 0.888, and 0.779, respectively; for the Lasso-logistic regression model, the corresponding values were 0.958, 86.54%, 92.19%, 0.879, and 0.787. Conclusions The tumor markers CA125, CEA, CYFRA21-1, NSE, and SCCA have significant clinical value in the differential diagnosis of lung cancer and pulmonary tuberculosis. Both the random forest model and the Lasso-logistic regression model developed in this study discriminate effectively between lung cancer and pulmonary tuberculosis. The Lasso-logistic regression model identified CYFRA21-1, CEA, and NSE levels and age as independent risk factors for differentiating lung cancer from pulmonary tuberculosis.
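The Youden index reported for each model follows directly from its sensitivity and specificity (J = sensitivity + specificity - 1). The short Python sketch below is an illustration, not the authors' code: it recomputes J from the percentages quoted in the abstract and shows how the same metrics would be derived from a confusion matrix. The function names and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J = sensitivity + specificity - 1 (inputs as fractions in [0, 1])."""
    return sensitivity + specificity - 1.0


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Derive sensitivity, specificity, accuracy and Youden's J from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden": youden_index(sensitivity, specificity),
    }


# Sensitivity and specificity as quoted in the abstract, converted to fractions.
reported = {
    "random_forest": (0.9038, 0.8750),
    "lasso_logistic": (0.8654, 0.9219),
}
youden = {name: round(youden_index(se, sp), 3) for name, (se, sp) in reported.items()}
print(youden)  # {'random_forest': 0.779, 'lasso_logistic': 0.787}

# Hypothetical counts for illustration only (NOT from the study).
example = confusion_metrics(tp=45, fn=5, tn=40, fp=10)
print({key: round(value, 3) for key, value in example.items()})
```

Because J weights sensitivity and specificity equally, the two models end up with similar values even though the random forest favors sensitivity and the Lasso-logistic model favors specificity.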

Key words: machine learning, random forest, differential diagnosis, lung cancer, pulmonary tuberculosis
