K-Means Clustering

Awais Junaid
7 min read · Sep 5, 2021

Task 10👨🏻‍💻Summer Program 2021

Task Description 📄

📌 Create a blog/article/video explaining k-means clustering and its real use case in the security domain

What is K-Means?

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible.

What is Clustering?

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that the data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean distance or correlation-based distance.
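To make the two similarity measures mentioned above concrete, here is a minimal sketch in plain Python. The function names and the sample vectors are chosen only for illustration; correlation-based distance is taken here to be 1 minus the Pearson correlation, which is one common definition and an assumption on my part.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation of two equal-length vectors.

    Assumes neither vector is constant (otherwise the norm is zero).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Illustrative vectors: q is exactly 2 * p.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
dist_euclid = euclidean_distance(p, q)   # the points are far apart in space
dist_corr = correlation_distance(p, q)   # but they follow the same pattern
```

The two measures can disagree: `p` and `q` are some distance apart in Euclidean terms, yet their correlation-based distance is essentially zero because they rise and fall together. Which measure fits depends on whether magnitude or shape matters for the data at hand.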

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. In this article, we will learn what the K-means clustering algorithm is and how it works, along with a Python implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to be created, chosen in advance: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group, whose members have similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories present in an unlabeled dataset on its own, without the need for any training labels.

The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats the process until the cluster assignments stop changing, that is, until it can find no better clusters. The value of K must be chosen before the algorithm runs.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of (squared) distances between the data points and the centroids of their corresponding clusters.
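The quantity K-means tries to make small is usually written as the within-cluster sum of squared distances. The sketch below computes it for a given assignment; the function name, the sample points, and the hand-picked centroids are illustrative assumptions, not part of any library.

```python
def within_cluster_sum_of_squares(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its own centroid.

    labels[i] is the index (into centroids) of the cluster that points[i] belongs to.
    """
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# Illustrative data: two tight pairs of points, each pair with its own centroid.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (9.0, 8.0)]
wcss = within_cluster_sum_of_squares(points, centroids, labels)
```

A lower value means the points sit closer to their centroids. Each iteration of K-means either lowers this value or leaves it unchanged, which is why the algorithm eventually stops.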

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best positions for the K center points (centroids) through an iterative process.
  • Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

Hence each cluster contains data points that have something in common, and it lies away from the other clusters.

The diagram below illustrates how the K-means clustering algorithm works:

The K-means algorithm works through the following steps:

Step-1: Choose the number K, which decides the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid. This forms the K clusters.

Step-4: Compute the mean of the points in each cluster and place that cluster's new centroid there.

Step-5: Repeat the third step, that is, reassign each data point to the closest of the new centroids.

Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.

Step-7: The model is ready.
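The steps above can be sketched in plain Python as follows. This is a minimal illustration under some assumptions of mine, not a production implementation: the initial centroids are drawn from the dataset itself, a fixed `seed` keeps the run repeatable, a cluster that ends up empty keeps its old centroid, and `max_iterations` is a safety cap. All function and parameter names are made up for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_centroid(point, centroids):
    """Index of the centroid nearest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K initial centroids (here: K distinct points from the data).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [closest_centroid(p, centroids) for p in points]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its cluster's points.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean_point(members)
    return centroids, labels

# Illustrative data: one group near (1, 1) and one near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)  # Step 1: K = 2
```

On this small, well-separated dataset the first three points end up in one cluster and the last three in the other. On harder data the result can depend on the initial centroids, so it is common to run the algorithm several times with different seeds and keep the best result.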

Visual representation:

We take K=2, so the dataset will be split into two clusters. We start by selecting two points as the initial centroids; they are not part of our dataset. Consider the image below:

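Next, we assign each data point to its closest centroid. To do this, we draw the median line between the two centroids, that is, the perpendicular bisector of the segment joining them. Points on each side of the line are closer to the centroid on that side. Consider the image below: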

Let’s color the two resulting groups blue and yellow for clear visualization.

We then compute the center of gravity (the mean) of the points in each cluster, which gives us the new centroids, as below:

Next, we reassign each data point to its closest new centroid. For this, we repeat the same process of finding the median line between the centroids. The median line will look like the image below:

In the image above, we can see that one yellow point is on the left side of the line and two blue points are on the right side. These three points are therefore reassigned to the other centroid.

We repeat the process by again finding the center of gravity of each cluster, so the new centroids will be as shown in the image below:

Since we have new centroids, we again draw the median line and reassign the data points. The image will be:

In the image above, no data point lies on the wrong side of the line, so no reassignment is needed and our model is formed. Consider the image below:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:

Use case of K-Means clustering in the security domain

With the increasing number of cyber and other computer-related crimes, computer and mobile data security has become a matter of concern. Secure, uncontaminated data is also very important for data mining and for improving the performance of a system. The paper summarized here proposes a system framework that can collect information and then generate alerts in real time. The proposed scheme is simulated using the K-means clustering algorithm, one of the most popular clustering algorithms, to determine its efficiency and accuracy.

IT departments are working hard to ensure the confidentiality, integrity, and availability of the system infrastructure of their institutions or organizations. They are struggling to put measures in place to detect threats in real time. The paper mainly focuses on proposing a system that can detect possible threats and alert the relevant authority about the incidents. The proposed infrastructure consists of several modules that are designed to operate in a pipeline-feedback loop.
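One way clustering can feed such alerts is sketched below. This is my own simplified illustration, not the paper's actual design: it assumes centroids have already been learned (for example by K-means) from past normal activity, and it flags any new event that lies farther than a chosen threshold from every centroid. The feature choice, the centroid values, the threshold, and all function names are hypothetical.

```python
import math

def nearest_centroid_distance(event, centroids):
    """Euclidean distance from an event to the closest centroid."""
    return min(
        math.sqrt(sum((x - c) ** 2 for x, c in zip(event, centroid)))
        for centroid in centroids
    )

def find_alerts(events, centroids, threshold):
    """Return the events that do not fit any known cluster of normal activity."""
    return [e for e in events
            if nearest_centroid_distance(e, centroids) > threshold]

# Hypothetical centroids of "normal" behaviour.
# Assumed features: (logins per hour, megabytes transferred per hour).
normal_centroids = [(5.0, 20.0), (40.0, 150.0)]

# Hypothetical incoming events; the last one looks nothing like either cluster.
incoming_events = [(6.0, 22.0), (38.0, 160.0), (300.0, 5.0)]

alerts = find_alerts(incoming_events, normal_centroids, threshold=25.0)
```

Here only the third event is flagged, because it is far from both centroids. In practice the threshold would have to be tuned on real data, and the features would usually be scaled so that no single one dominates the distance.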

The proposed system also aims to simplify the security detection system and make it user-friendly. It integrates associative spatial prediction with spatio-temporal clustering to help in visualizing criminal activities such as burglaries and robberies, among other unlawful activities. An example that relates to their system is a research study carried out by [5]. That work outlines various techniques for expediting the process of resolving crimes through the identification of crime patterns with data mining. To achieve the projected tasks, the paper proposes applying clustering together with geographical analysis to predict and identify any possible criminal activity. Another system, known as the VU, focuses on finding insights, detecting influencer mentions, detecting trending stories, and suggesting content to post from social media platforms. That system is targeted at gathering social intelligence for marketing purposes.

Conclusion:

A system capable of generating security intelligence from the Internet has been proposed, along with the several modules that make it up. Simulating it with the K-means clustering algorithm allows its efficiency and accuracy to be determined. The image below shows what K-means clustering looks like: the sample data is shown first, and then the same data organized into clusters.
