Comparative Analysis of Data Normalization Effects on RFM-Based Customer Segmentation Using K-Means and DBSCAN

Nabilah Zahra; Tyas Arum; Zerafica Patriawan; Dwika Pertiwi; Much Muslim

Authors

Nabilah Aulia Zahra, Department of Computer Science, Universitas Negeri Semarang, Indonesia
Tyas Sekar Arum, Department of Computer Science, Universitas Negeri Semarang, Indonesia
Zerafica Adhyasti Dinar Patriawan, Department of Computer Science, Universitas Negeri Semarang, Indonesia
Dwika Ananda Agustina Pertiwi, Department of Technology Management, Universiti Tun Hussein Onn Malaysia, Malaysia
Much Aziz Muslim, Department of Computer Science, Universitas Negeri Semarang, Indonesia

Keywords:

DBSCAN algorithm, K-Means algorithm, Data normalization, RFM analysis, Customer segmentation

Abstract

Customer segmentation is widely used to understand customer transaction patterns and to support more effective business strategies. A common approach combines Recency, Frequency, and Monetary (RFM) analysis with clustering techniques. However, previous studies report inconsistent findings on how data normalization affects clustering quality, particularly across different datasets and algorithms. This study analyzes the effect of data normalization on RFM-based customer segmentation using the K-Means and DBSCAN algorithms. The analysis was conducted on two transaction datasets, Online Retail II and TransJakarta, under three preprocessing scenarios: no normalization, Min-Max normalization, and Z-Score normalization. Cluster quality was evaluated using the Silhouette Score and the Davies–Bouldin Index (DBI). On the Online Retail II dataset, K-Means performed best without normalization, with a Silhouette Score of 0.9845, while DBSCAN formed valid clusters only after Z-Score normalization. On the TransJakarta dataset, both algorithms also performed best on unnormalized data, and DBSCAN identified up to 20 clusters along with a number of noise points. These findings show that the effect of normalization is not uniform across datasets and clustering methods. By comparing three normalization scenarios on two datasets with different transaction characteristics, this study provides empirical evidence for the importance of tailoring preprocessing strategies to the characteristics of the data and the mechanisms of the algorithm used.
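
As an illustration of the three preprocessing scenarios named in the abstract, the following minimal sketch builds an RFM table from a handful of transactions and derives its Min-Max and Z-Score normalized versions. It is not the authors' code: the toy transactions, the snapshot date, and the helper names (rfm_table, min_max, z_score, scale) are hypothetical, and the Z-Score is assumed to use the population standard deviation.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical toy transactions: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2024, 1, 5), 120.0),
    ("A", date(2024, 3, 1), 80.0),
    ("B", date(2024, 2, 10), 15.0),
    ("C", date(2024, 1, 20), 900.0),
    ("C", date(2024, 2, 25), 1100.0),
    ("C", date(2024, 3, 9), 1000.0),
]
snapshot = date(2024, 3, 10)  # assumed reference date for recency


def rfm_table(rows, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for cust, day, amount in rows:
        last, freq, total = table.get(cust, (None, 0, 0.0))
        last = day if last is None or day > last else last
        table[cust] = (last, freq + 1, total + amount)
    return {c: ((today - last).days, f, m) for c, (last, f, m) in table.items()}


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def scale(table, scaler):
    """Apply a scaler to each of the R, F, and M columns independently."""
    customers = list(table)
    columns = zip(*(table[c] for c in customers))
    scaled = [scaler(list(col)) for col in columns]
    return {c: tuple(col[i] for col in scaled) for i, c in enumerate(customers)}


rfm = rfm_table(transactions, snapshot)   # scenario 1: no normalization
rfm_minmax = scale(rfm, min_max)          # scenario 2: Min-Max normalization
rfm_zscore = scale(rfm, z_score)          # scenario 3: Z-Score normalization
```

In the raw table the monetary column spans thousands while frequency spans single digits, so distance-based methods such as K-Means and DBSCAN would be dominated by monetary value unless the features are rescaled. Each of the three tables could then be clustered and scored, for example with scikit-learn's KMeans, DBSCAN, silhouette_score, and davies_bouldin_score; the abstract does not state which implementation the study used.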