The Journal of Practical Medicine ›› 2026, Vol. 42 ›› Issue (7): 1158-1164. doi: 10.3969/j.issn.1006-5725.2026.07.006

• Feature Reports: Tuberculosis

Constructing a differential diagnosis model for lung cancer and pulmonary tuberculosis using machine learning

You ZHOU1,2, Jifei CHEN2, Xi HE2, Aimei LIU2, Xiaobing YANG2, Yifang HUANG1

  1. Department of Clinical Laboratory, the First Affiliated Hospital of Guangxi Medical University, Key Laboratory of Clinical Laboratory Medicine of Guangxi Department of Education, Nanning 530021, Guangxi, China
  2. Department of Biological Sample Bank of Science and Education, Guangxi Zhuang Autonomous Region Chest Hospital, Liuzhou 545005, Guangxi, China
  • Received: 2025-09-17 Revised: 2025-11-14 Accepted: 2025-12-02 Online: 2026-04-10 Published: 2026-04-13
  • Contact: Yifang HUANG E-mail: YFY004462@sr.gxmu.edu.cn

Abstract:

Objective To develop a machine learning-based predictive model for differentiating lung cancer from pulmonary tuberculosis.

Methods A retrospective analysis was conducted on the clinical data of 585 patients who visited Guangxi Chest Hospital from July 2020 to May 2023. The patients were 14 to 90 years old and included 457 males and 128 females. Based on the final clinical diagnosis, the 585 cases were divided into a lung cancer group and a pulmonary tuberculosis group, and tumor marker test results were compared between the two groups. Lasso and univariate logistic regression analyses were used to screen feature variables for differentiating lung cancer from pulmonary tuberculosis. A random forest model was constructed, and the predictive variables were ranked by importance. A Lasso-logistic regression model was also constructed. The predictive performance of the random forest model and the Lasso-logistic regression model was compared by ROC curve analysis.

Results The levels of the serum tumor markers CA125, CEA, CYFRA21-1, NSE, and SCCA were significantly higher in the lung cancer group than in the pulmonary tuberculosis group (P < 0.05). Lasso and univariate logistic regression analyses identified the following feature variables for differentiating lung cancer from pulmonary tuberculosis: sex, age, CA125, CEA, CYFRA21-1, NSE, and SCCA. The random forest model ranked these variables by importance as follows: CYFRA21-1, CEA, SCCA, NSE, CA125, age, and sex. Lasso-logistic regression analysis indicated that age and the levels of CYFRA21-1, CEA, and NSE were independent risk factors for differentiating lung cancer from pulmonary tuberculosis (P < 0.05). For the random forest model, the AUC, sensitivity, specificity, accuracy, and Youden index for the differential diagnosis of lung cancer and pulmonary tuberculosis were 0.938, 90.38%, 87.50%, 0.888, and 0.779, respectively; for the Lasso-logistic regression model, they were 0.958, 86.54%, 92.19%, 0.879, and 0.787, respectively.

Conclusions The tumor markers CA125, CEA, CYFRA21-1, NSE, and SCCA hold significant clinical value in the differential diagnosis of lung cancer and pulmonary tuberculosis. The random forest model and the Lasso-logistic regression model developed in this study can effectively discriminate between lung cancer and pulmonary tuberculosis. The Lasso-logistic regression model identified age and the levels of CYFRA21-1, CEA, and NSE as independent risk factors for differentiating lung cancer from pulmonary tuberculosis.
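
The Youden index reported for each model can be related to the reported sensitivity and specificity through the standard definition J = sensitivity + specificity - 1. The short Python sketch below recomputes J from the percentages given in the abstract and compares the result with the published values. It is only an arithmetic illustration, not the authors' analysis code; the function name `youden_index` and the dictionary `reported` are hypothetical names introduced for this example.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J = sensitivity + specificity - 1 (inputs as fractions in [0, 1])."""
    return sensitivity + specificity - 1.0


# Values taken from the abstract, with percentages converted to fractions.
# The dictionary layout and key names are illustrative assumptions.
reported = {
    "random_forest": {"sensitivity": 0.9038, "specificity": 0.8750, "youden": 0.779},
    "lasso_logistic": {"sensitivity": 0.8654, "specificity": 0.9219, "youden": 0.787},
}

# Recompute J for each model, rounded to three decimals like the published figures.
recomputed = {
    name: round(youden_index(v["sensitivity"], v["specificity"]), 3)
    for name, v in reported.items()
}

if __name__ == "__main__":
    for name, j in recomputed.items():
        print(f"{name}: recomputed J = {j:.3f}, reported J = {reported[name]['youden']:.3f}")
```

Under this definition, the recomputed values agree with the reported Youden indices to three decimal places, and the Lasso-logistic regression model has the slightly higher J because its gain in specificity outweighs its lower sensitivity.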

Key words: machine learning, random forest, differential diagnosis, lung cancer, pulmonary tuberculosis

CLC Number: