Amirhossein Malakouti Semnani, Sohrab Kordrostami, Amirhossein Refahi Sheikhani, Mohammad Hossein Moattar
Volume 16, Issue 2 (8-2025)
Abstract
Insurance companies face the critical challenge of identifying “good customers”—policyholders who consistently pay premiums with minimal or no claims—within large, heterogeneous datasets. This study proposes and evaluates a hybrid machine learning framework to predict good customer status using an enhanced insurance dataset that integrates demographic, financial, and policy-related features. The framework combines an XGBoost classifier, a soft-voting ensemble of RandomForest and LightGBM, and a custom Transformer Encoder, with all models tuned using the Optuna hyperparameter optimization library to enhance predictive accuracy and interpretability.
The methodology includes preprocessing steps such as categorical encoding and standardization of numerical variables (e.g., age, BMI, premium with GST), followed by a novel label engineering scheme that defines good customers as those whose premiums exceed the mean plus one standard deviation and who have no claim history. The dataset is split into training (80%) and testing (20%) subsets. Two hybrid architectures are developed: Model A, which fuses the predicted probabilities from XGBoost and the Transformer Encoder using a 60–40 weighting, and Model B, which employs a soft-voting ensemble of RandomForest and LightGBM. Ablation studies quantify the contribution of each component, while performance is assessed using accuracy, AUC, F1-score, and Matthews Correlation Coefficient, supported by visual tools such as correlation heatmaps, ROC curves, and confusion matrices.
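The labeling rule and the two probability-combination schemes described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names are hypothetical, the use of the sample standard deviation is an assumption, and so is the reading of the 60–40 weighting as 0.6 for XGBoost and 0.4 for the Transformer Encoder (the abstract does not state which model receives which weight). The probability lists stand in for the outputs of already-trained models.

```python
from statistics import mean, stdev


def label_good_customers(premiums, claim_counts):
    """Label a customer 1 if premium > mean + 1 SD and claim count is 0, else 0."""
    # Assumption: sample standard deviation; the abstract does not specify which.
    threshold = mean(premiums) + stdev(premiums)
    return [int(p > threshold and c == 0) for p, c in zip(premiums, claim_counts)]


def fuse_probabilities(p_xgb, p_transformer, w_xgb=0.6):
    """Model A style fusion: weighted average of two positive-class probabilities."""
    # Assumption: XGBoost gets 0.6 and the Transformer gets the remaining 0.4.
    w_tr = 1.0 - w_xgb
    return [w_xgb * a + w_tr * b for a, b in zip(p_xgb, p_transformer)]


def soft_vote(*prob_lists):
    """Model B style soft voting: unweighted mean of each model's probabilities."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


def to_labels(probabilities, cutoff=0.5):
    """Convert fused probabilities to 0/1 predictions (cutoff is an assumption)."""
    return [int(p >= cutoff) for p in probabilities]


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the study's dataset.
    premiums = [100, 110, 120, 130, 300]
    claims = [0, 0, 0, 0, 0]
    print(label_good_customers(premiums, claims))

    fused = fuse_probabilities([0.9, 0.2], [0.4, 0.3])
    print(to_labels(fused))
```

A fixed weighted average keeps the fusion step transparent and cheap to ablate: setting `w_xgb` to 1.0 or 0.0 reproduces the single-model variants that an ablation study would compare against.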
Experimental results show that Model A attains an accuracy of 0.8720 and an AUC of 0.9140, whereas Model B achieves an accuracy of 0.8850 and an AUC of 0.9260 after systematic hyperparameter tuning. Removing either the Transformer or XGBoost markedly degrades Model A, while omitting RandomForest or LightGBM leads to smaller performance drops in Model B, underscoring the value of ensemble diversity. Overall, the proposed framework provides a practical tool for insurance customer segmentation and profitability-oriented decision-making, and its open-source implementation facilitates replication, extension with additional features or larger datasets, and potential real-time deployment in operational insurance environments.