Beginner-Friendly Data Science Project using the K-Means Algorithm in RStudio

Fachrul Razy
17 min read · Dec 7, 2024


Photo by Chris Liverani on Unsplash

Introduction

As a beginner who wants to learn Data Science and Machine Learning, it can sometimes seem overwhelming: mathematics, statistics, and many new buzzy tech terms.

But worry no more! In this article we will delve into one of the simplest yet most powerful Machine Learning algorithms: K-Means clustering, a method used to group data into meaningful categories. Whether it's customer segmentation, identifying patterns in nature, or analyzing everyday trends, K-Means is widely used across industries. In this beginner-friendly project, you'll use RStudio, a user-friendly platform for data analysis, to explore the algorithm step by step.

Business Study Case

The goal of this project is to cluster customers based on several attributes such as profession, resident type, and annual spending value.

The Dataset

The customer dataset contains 50 records that we will use to train our K-Means clustering model.

Data Science Process

In this project, we will follow these steps:

Data Science Process with Clustering Method

Setup Environment

First, download RStudio Desktop from here according to the OS you use, then install it.

Once it is installed, open RStudio and create a new R Notebook file.

We will use an R Notebook since it can be exported to an HTML document, which makes documentation easier.

To learn the markdown syntax used for documentation, we can read the documentation here.

Project Structure

The project structure is quite simple: save the R Notebook file alongside the customer dataset in the same folder.

Since our dataset is ready, we will continue directly to the Data Preprocessing stage.

Data Preprocessing Stage

In this stage, we will examine the dataset. The dataset consists of 7 fields:

  • Customer_ID : Customer code in a mixed text format, CUST- followed by numbers.
  • Customer.Name : Customer name in text format.
  • Gender : Only two categories, Male and Female.
  • Age : Customer age in numeric format.
  • Profession : Customer profession, also a text category, consisting of Self-Employed, Student, Professional, Housewife, and Entrepreneur.
  • Resident.Type : Type of residence of our customer; for this dataset there are only two categories, Cluster and Sector.
  • AnnualSpendValue : The amount of money spent in a year.

First, we read the dataset and save it to the customers variable:

customers <- read.csv("customer_segments.txt", sep="\t")
customers

Then, we use the field_used variable to store a vector of the field names to convert:

field_used <- c("Gender", "Profession", "Resident.Type")
customers[field_used]

Convert the field_used columns into numeric form and then join the result back to the customers variable:

# convert to numeric fields
customers_matrix <- data.matrix(customers[field_used])

# join the data
customers <- data.frame(customers, customers_matrix)

customers
# check data structure
str(customers)

Normalization of Values

The AnnualSpendValue field contains values in the millions, which would make the sum-of-squared-errors calculation in kmeans very large. Normalizing the values makes the calculation simpler and easier to digest without reducing accuracy. This is especially useful when the amount of data is very large, for example 200 thousand rows. Normalization can be done in many ways; for this case, we simply divide so that values in the millions become small numbers.

customers$AnnualSpendValue = customers$AnnualSpendValue/1000000
customers$AnnualSpendValue

Create Master Data

After merging the data, we can see which numeric codes the text categories were actually converted into.

The goal is to keep this as a reference so that later, when new data arrives, we can "map" it into numeric data that is ready for the clustering algorithm.

We can check the unique numeric values of "Profession" using the unique function:

unique(customers[c("Profession","Profession.1")]) 

Filling in Master Data

We create the master data by summarizing the categorical values and their numeric codes into variables that we call master data.

Profession <- unique(customers[c("Profession","Profession.1")])
Gender <- unique(customers[c("Gender","Gender.1")])
Resident.Type <- unique(customers[c("Resident.Type","Resident.Type.1")])

With this final step of filling in the master data, we have reached the end of the data preprocessing stage.

Clustering Stage

Clustering

Clustering is the process of dividing objects into several groups (clusters) based on the level of similarity between one object and another. Some examples of clustering:

  • Grouping humans by age: babies, toddlers, children, teenagers, adults, old.
  • Grouping customers based on their purchasing power: regular and premium.

Many algorithms have been developed to perform clustering automatically, one of the most popular is K-Means which will be used in this case.

K-Means

K-means is an algorithm that divides data into a number of partitions in a simple way: by finding the proximity of each point to a number of average values, or means.

There are two key concepts that also give rise to the name k-means:

  • The number of desired partitions, represented by the letter k
  • Finding the “distance” of each point to a number of observed cluster average values, represented by the means
Image source: Kaggle
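The assignment step described above can be sketched in a few lines of R. This is a toy illustration with made-up one-dimensional data, not the article's dataset:

```r
# Toy data: four points and two candidate cluster means (k = 2)
points <- c(1.0, 1.2, 5.0, 5.3)
means  <- c(1.0, 5.0)

# For each point, compute the squared distance to every mean
# and assign the point to the closest one
assignment <- sapply(points, function(p) which.min((p - means)^2))
assignment
# the first two points fall in cluster 1, the last two in cluster 2
```

In full k-means, this assignment step alternates with recomputing each mean from its members until the clusters stop changing.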

K Means Function

In this project we will segment the customers directly, using our preprocessed data with the kmeans function in R.

The kmeans function is usually accompanied by a call to the set.seed function. This "standardizes" the random number sequence used by kmeans so that we get the same output on every run. Here is an example of using the combination of set.seed and kmeans.

set.seed(100)
kmeans(x=customers[c("Age","Profession.1")], centers=3)

In the example above, we divide customer data based on “Age” and “Profession” into 3 segments.

Sometimes the data parameters and the number of segments alone are not enough. A third parameter, nstart, is needed: the number of random starting configurations generated internally by R. From the number we provide, the algorithm will choose the best of those starts. Here is a modification of the function call with an additional nstart parameter of 25.

We can modify field_used to include several fields, as below:

field_used <- c("Gender.1", "Age", "Profession.1", "Resident.Type.1", "AnnualSpendValue")

# K-Means
set.seed(100)

# kmeans function to create 3 clusters with 25 random starts, saved to the segmentation variable

segmentation <- kmeans(x=customers[c("Age","Profession.1")], centers=3, nstart=25)

# show k-means result
segmentation

Data Analysis / Exploration Stage

In this stage we will explore the cluster result. First we check our segmentation using field_used variable.

set.seed(100)
segmentation <- kmeans(x=customers[field_used], centers=5, nstart=25)
segmentation
# Merging Cluster Result
segmentation$cluster
# Using segmentation cluster to Customers cluster
customers$cluster <- segmentation$cluster
str(customers)

This clustering vector is a vector of cluster numbers. In our results it contains the numbers 1 to 5, matching the number of clusters we requested.

The vector starts with the number 2, which means the first row of our dataset is allocated to cluster 2. From the output we can also see that the second element has the value 1, meaning the second row is allocated to cluster 1, and so on. The last row (the 50th) has cluster number 5.

This result can be accessed through the cluster component of the result object: segmentation$cluster

Analyzing Cluster Sizes

When we look at the result above we can see the output:

K-means clustering with 5 clusters of sizes 14, 10, 9, 12, 5

This means that k-means has divided the customer dataset into 5 clusters, where:

  • Cluster 1 contains 14 data points
  • Cluster 2 contains 10 data points
  • Cluster 3 contains 9 data points
  • Cluster 4 contains 12 data points
  • Cluster 5 contains 5 data points

The total is 50, which matches the total number of customer records. Let's verify this, starting from cluster 1: take the customers whose cluster column equals 1 using the which function.

which(customers$cluster == 1)
# count the length of cluster 1
length(which(customers$cluster == 1))

As we can see, Cluster 1 holds 14 data positions.

We can analyze the other clusters using the same code:

# Cluster 2
length(which(customers$cluster == 2))
# Cluster 3
length(which(customers$cluster == 3))
...
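As a side note, R's built-in table function counts all cluster sizes in one call. A minimal self-contained sketch with a made-up cluster vector (a stand-in for customers$cluster):

```r
# Hypothetical cluster assignments for eight data points
cluster_vector <- c(1, 2, 1, 3, 2, 1, 3, 1)

# table() tallies how many points fall in each cluster at once
table(cluster_vector)

# equivalent to counting one cluster at a time:
length(which(cluster_vector == 1))   # 4
```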

Show Data on the Nth Cluster

# show data on Cluster 1
customers[which(customers$cluster == 1),]

As we can see above, all members are female and aged between 14 and 25 years. Profession, spending value, and resident type are quite varied. Now we will look at the second cluster (Cluster 2):

customers[which(customers$cluster == 2),]

As we can see, the majority Profession is Professional and the average annual spend is around 5 million, except for those who work as housewives and entrepreneurs.

Analyzing Cluster Means Result

Cluster means are the average value or central point (centroid) of all points in each cluster.

segmentation$centers

Here is the explanation of the above result:

  • Gender.1 : shows the average value of the gender data converted to numeric, with the number 1 representing Male and the number 2 representing Female.
  • Age : representation of the initial dataset without undergoing any conversion.
  • Profession.1 : shows the average value of Profession data for each cluster that has been converted into numeric.
  • Resident.Type.1 : shows the representation of the Resident.Type data converted into numeric form, with the number 1 representing Cluster and 2 representing Sector.
  • AnnualSpendValue : the average of annual spend value, where Cluster 3 & 5 have a higher value compared to the other three clusters.

Sum of Squares Results Analysis

The concept of sum of squares (SS) is the sum of the “squared distances” of the differences between each data point and its mean or centroid. This SS can be the mean or centroid for each cluster or the entire data. Sum of squares in other data science literature is often referred to as Sum of Squared Errors (SSE).

The greater the SS value, the wider the difference between each data point in the cluster. Based on this concept, here is an explanation for the kmeans output results above:

  1. The value 316.73367 is the SS for the 1st cluster, 108.49735 is the SS for the 2nd cluster, and so on. Smaller values are potentially better.
  2. total_SS : the SS of all points against the global mean, not per cluster. This value is constant and is not affected by the number of clusters.
  3. between_SS : total_SS minus the sum of the SS values of all clusters.
  4. (between_SS / total_SS) : the ratio of between_SS to total_SS. The higher the percentage, generally the better.
# Comparing two clustering results, one with 2 centers and one with 5
# 2 centers
set.seed(100)
kmeans(x=customers[field_used], centers=2, nstart=25)
# 5 centers
set.seed(100)
kmeans(x=customers[field_used], centers=5, nstart=25)
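To make the SS concept concrete, here is a self-contained sketch with toy two-dimensional data (not the customer dataset) that recomputes one cluster's SS by hand and checks it against the withinss component returned by kmeans:

```r
# Two well-separated toy groups
set.seed(100)
toy <- data.frame(x = c(1, 1.1, 0.9, 8, 8.2, 7.9),
                  y = c(2, 2.1, 1.9, 9, 9.1, 8.9))
km <- kmeans(toy, centers = 2, nstart = 25)

# Recompute the SS of cluster 1: squared distances of its members to its centroid
members  <- toy[km$cluster == 1, ]
centroid <- km$centers[1, ]
manual_ss <- sum((members$x - centroid["x"])^2 + (members$y - centroid["y"])^2)

all.equal(manual_ss, km$withinss[1])                  # TRUE
# between_SS is total_SS minus the sum of all within-cluster SS values
all.equal(km$betweenss, km$totss - km$tot.withinss)   # TRUE
```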

Available Components

The last part of our analysis is the section about available components: the nine components of the k-means object that we can use to inspect its details.

All of these components can be accessed using the $ accessor. For example, with our kmeans result stored in the segmentation variable, we can access the withinss component with the command segmentation$withinss.
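A quick self-contained sketch (toy data, not our segmentation) of listing and accessing those components:

```r
# Fit k-means on a tiny toy data frame
set.seed(100)
km <- kmeans(data.frame(x = c(1, 2, 8, 9), y = c(1, 2, 8, 9)), centers = 2)

# The nine components of a kmeans object
names(km)
# "cluster" "centers" "totss" "withinss" "tot.withinss"
# "betweenss" "size" "iter" "ifault"

# Each one is accessed with the $ operator
km$withinss
km$size
```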

We have completed the use of the K-Means algorithm with the kmeans function on the dataset prepared in the data preparation stage. The kmeans function is simple to use, but its output carries rich information, namely:

  1. Size / number of data points in each cluster
  2. The average value (centroid) of each cluster
  3. The cluster assignment vector
  4. The sum of the squared distances from each point to its centroid (Sum of Squares or SS)
  5. Other information components

By analyzing this output, we are able to combine the cluster number with the original data. In addition, we also know how close each data point is to its cluster so that it becomes our provision to determine the optimal number of clusters.

Finding the Best Number of Clusters

From the information generated by the kmeans function, the Sum of Squares (SS) metric or often called the Sum of Squared Errors (SSE) is very important to be used as a basis for determining the most optimal number of clusters. Theoretically, here are some things we can observe with SS:

  • The fewer the number of clusters generated, the greater the SS value.
  • Likewise, the greater the number of clusters generated, the smaller the SS value.
  • Because of its quadratic nature, if there is a significant difference between each cluster combination, the difference in SS value will be greater.
  • As the number of clusters increases, the difference in each SS becomes smaller. If we plot the total SS for each number of clusters on a line graph, it looks as follows.
sse <- sapply(1:10, function(param_k) {
  kmeans(customers[field_used], param_k, nstart=25)$tot.withinss
})

library(ggplot2)
sum_max_cluster <- 10
ssdata <- data.frame(cluster=c(1:sum_max_cluster), sse)
ggplot(ssdata, aes(x=cluster, y=sse)) +
  geom_line(color="red") + geom_point() +
  ylab("Within Cluster Sum of Squares") + xlab("Number of Clusters") +
  geom_text(aes(label=format(round(sse, 2), nsmall=2)), hjust=-0.2, vjust=-0.5) +
  scale_x_continuous(breaks=1:sum_max_cluster)

Note that further to the right, the difference between each point becomes smaller. The line graph has the shape of an elbow, and to optimize the number of clusters we usually take the elbow point; in the example above we can take 4 or 5. Decision-making based on this elbow plot is usually called the Elbow Method.

If we break down the code above, we can see two main blocks. The first block simulates the number of clusters against the SS. The elbow method metric used as the basis for the decision is the Sum of Squares (SS), or more precisely the tot.withinss component of the kmeans object.

The code computes the tot.withinss value for each number of clusters and stores the results as a vector in R. For this we use sapply, which calls the kmeans function over a range of cluster counts; here we use the range 1 to 10.

sse <- sapply(1:10, function(param_k) {
  kmeans(customers[field_used], param_k, nstart=25)$tot.withinss
})

The second code block visualizes the Sum of Squares (SS), or Sum of Squared Errors (SSE), vector that we produced above. We use ggplot for the visualization; the dataset is a data frame combining sse with the value range 1:10, built with the following command.

library(ggplot2)
sum_max_cluster <- 10
ssdata <- data.frame(cluster=c(1:sum_max_cluster), sse)
ggplot(ssdata, aes(x=cluster, y=sse)) +
  geom_line(color="red") + geom_point() +
  ylab("Within Cluster Sum of Squares") + xlab("Number of Clusters") +
  geom_text(aes(label=format(round(sse, 2), nsmall=2)), hjust=-0.2, vjust=-0.5) +
  scale_x_continuous(breaks=1:sum_max_cluster)

By utilizing the Sum of Squares (SS), or Sum of Squared Errors (SSE), value, we can decide the optimal number of segments to use. This is done by simulating the number of clusters from 1 to the maximum we want; in this material we iterate from 1 to 10. After getting the SS value for each number of clusters, we plot them on a line graph and use the elbow method to determine the optimal number of clusters.

Evaluation

Packaging K-Means Model

After preparing the data, applying the kmeans algorithm, and deciding on the best number of clusters, the next stage is to "package" or "wrap" all the conversion references and the kmeans object so they can be used to process new data and be useful to the business. The stages are as follows:

  1. Naming the clusters with characteristics that are easier to understand. We store this in the Segment.Customers variable.
  2. Combining the Segment.Customers, Profession, Gender, Resident.Type, and segmentation variables into one list-type object stored in the Identity.Cluster variable.
  3. Saving the Identity.Cluster object to a file so that it can be used later; this file can be called a model.

Naming the Segment

We can recall the above code to see the Segmentation’s centers

segmentation$centers

Let's name the clusters:

  1. We name the first cluster Silver Youth Gals because the average age is 20 and the majority are students and professionals with an average spend of 5.9 million (IDR).
  2. We name the second cluster Silver Mid Professional because the average age is 52 years and spending is around 6 million.
  3. We name the third cluster Diamond Professional because the average age is 42 years, spending is the highest, and all of them are professionals.
  4. We name the fourth cluster Gold Young Professional because the average age is 31 years, they are students and professionals, and spending is quite large.
  5. We name the last cluster Diamond Senior Member because the average age is 61 years and spending is above 8 million.

The following code creates a data frame named Segment.Customers which consists of two columns.

Segment.Customers <- data.frame(
  cluster = c(1, 2, 3, 4, 5),
  Nama.Segmen = c("Silver Youth Gals", "Silver Mid Professional",
                  "Diamond Professional", "Gold Young Professional",
                  "Diamond Senior Member")
)

Merging References

So far we have formed the following data assets:

  • A customer dataset "enriched" with additional columns holding the text-to-numeric conversions, and with the AnnualSpendValue field normalized.
  • A kmeans object with k = 5, selected using the Sum of Squares (SS) metric.
  • Reference variables mapping the numeric codes to the original text categories for the Gender, Profession, and Resident.Type columns.
  • A data.frame variable named Segment.Customers containing the cluster names derived from analyzing the centroid characteristics.
  • A vector of the fields used, field_used.

It would be great if all of them were combined into one list-type variable; this will be our model, which can be saved to a file and used whenever needed.

In the following task, we will name this list with Identity.Cluster. The command is as follows:

Identity.Cluster <- list(Profession=Profession, Gender=Gender, Resident.Type=Resident.Type, Segmentation=segmentation, Segment.Customers=Segment.Customers, field_used=field_used)

Saving Object into File

The objects that have been merged in the previous section already have all the assets needed to allocate new data to the appropriate segments. To save this object to a file we use the saveRDS function. This file can then be reopened as an object in the future. For example, the command to save the Identity.Cluster object to the cust_cluster.rds file is as follows.

saveRDS(Identity.Cluster,"cust_cluster.rds")
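Here is a self-contained sketch of the same round trip using a temporary file (a stand-in for cust_cluster.rds) and a small made-up list:

```r
# A hypothetical model object: any R object can be serialized this way
model <- list(field_used = c("Gender.1", "Age"),
              centers = matrix(1:4, nrow = 2))

path <- tempfile(fileext = ".rds")
saveRDS(model, path)           # write the object to disk

restored <- readRDS(path)      # readRDS returns the object, so assign it
identical(model, restored)     # TRUE: the round trip preserves the object
```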

Testing the K-Means Model

This means that the K-Means results and the related variables we produced must be usable in real cases so that the cycle is complete. The real case for our clustering is quite simple: new data should automatically help marketing and CRM teams quickly identify which segment a customer belongs to. With fast identification, the organization or business can move quickly with effective marketing messages and win the competition.

Adding New Data

First we create a new R Notebook file, then we create a data.frame with one row whose column names are exactly the same as in the initial dataset.

new_cust_data <- data.frame(
  Customer_ID = "CUST-110",
  Customer.Name = "Susi Reini",
  Age = 21,
  Gender = "Wanita",
  Profession = "Pelajar",
  Resident.Type = "Cluster",
  AnnualSpendValue = 3.5
)

new_cust_data

Loading Clustering Objects from File

To open the file, we use the readRDS function. The command is very simple; here is an example that opens the cust_cluster.rds file we saved previously.

Identity.Cluster <- readRDS(file="cust_cluster.rds")

Identity.Cluster

Merge with Reference Data

With the new data in place and the object containing the reference data read back in, we can merge the new data to obtain the numeric conversions of the Gender, Profession, and Resident.Type fields.

The goal is to find the customer's segment using the numeric data produced by the merge. We combine the data with the merge function, which joins two data frames by matching column names and their contents.

For example, the following command will combine the new data variable with the Identity.Cluster$Profession variable.

merge(new_cust_data, Identity.Cluster$Profession)

The process of merging the data is as follows:

  • The new data variable and Identity.Cluster$Profession share the same column name, namely Profession.
  • The Profession column is then used as the "key" to combine these two variables.
  • The Profession value of the new data, "Pelajar" (Student), is also found in Identity.Cluster, which makes the merge succeed.
  • The merge also brings along the Profession.1 column, whose value for this profession is 3.
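The same mechanics can be seen in a self-contained toy example (hypothetical values, not the real dataset), where the shared column name acts as the join key:

```r
# New record and a small reference table sharing the "Profession" column
new_data  <- data.frame(Customer.Name = "Susi", Profession = "Student")
reference <- data.frame(Profession   = c("Student", "Professional"),
                        Profession.1 = c(3, 4))

# merge() joins on the common column name and its matching contents
merged <- merge(new_data, reference)
merged$Profession.1   # the numeric code 3 is carried over from the reference
```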

The following code performs the complete merge against the reference data:

new_cust_data <- merge(new_cust_data, Identity.Cluster$Profession)
new_cust_data <- merge(new_cust_data, Identity.Cluster$Gender)
new_cust_data <- merge(new_cust_data, Identity.Cluster$Resident.Type)

new_cust_data

Defining Cluster

This stage determines, for business purposes, which segment the new data belongs to. We do this by:

finding the minimum (closest) squared distance from the numeric columns of the new data to the corresponding centroids of all existing clusters

which.min(sapply( 1:5, function( x ) sum( ( data[column] - objectkmeans$centers[x,])^2 ) ))

where:

  • which.min: a function to find the index of the minimum value
  • 1:5: the range of cluster numbers from 1 to 5 (or more, according to the number of clusters)
  • sapply: used to iterate over the range (in this case 1 to 5)
  • function(x): the function applied per iteration, with x taking the values 1 to 5
  • (data[column] - objectkmeans$centers[x,])^2: the squared distance of the data to a centroid. Remember, centers is a component of the kmeans object.
  • sum: used to sum the squared distances
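The distance formula above can be tried on its own with made-up centroids (not our model's actual centers):

```r
# Three hypothetical 2-column centroids, one row per cluster
centers <- rbind(c(1, 1), c(5, 5), c(9, 9))
new_point <- c(4.8, 5.2)

# Squared distance from the new point to each centroid,
# then the index of the smallest one
nearest <- which.min(sapply(1:3, function(x) sum((new_point - centers[x, ])^2)))
nearest   # cluster 2 is the closest
```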

Here is the complete code to determine the cluster for our case, as in the example above:

# added new data
new_cust_data <- data.frame(
  Customer_ID = "CUST-110",
  Customer.Name = "Susi Reini",
  Age = 21,
  Gender = "Wanita",
  Profession = "Pelajar",
  Resident.Type = "Cluster",
  AnnualSpendValue = 3.5
)

# load the rds object file
Identity.Cluster <- readRDS(file="cust_cluster.rds")

# merging the references data
new_cust_data <- merge(new_cust_data, Identity.Cluster$Profession)
new_cust_data <- merge(new_cust_data, Identity.Cluster$Gender)
new_cust_data <- merge(new_cust_data, Identity.Cluster$Resident.Type)

# determine which cluster new data goes into
Identity.Cluster$Segment.Customers[which.min(sapply(1:5, function(x)
  sum((new_cust_data[Identity.Cluster$field_used] -
         Identity.Cluster$Segmentation$centers[x,])^2))), ]

Closing

We have finally reached the end of the project. It shows how new customer data is analyzed by our model, which outputs a cluster or segment number. This means we have gone through the full step-by-step cycle of building and using customer segmentation on our data.
