K-means clustering and its real use case in the security domain

Amit Pandey
8 min read · Jul 19, 2021

First of all, we need to understand a few basic terms, which are explained below:

→Clustering :

Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
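To make the choice of similarity measure concrete, here is a minimal sketch in plain Python that compares Euclidean distance with a correlation-based distance. The function names and the toy vectors `p` and `q` are illustrative assumptions, not taken from any library or dataset.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: near 0 when the vectors rise and fall together."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

p = [1.0, 2.0, 3.0]
q = [10.0, 20.0, 30.0]  # same profile as p, much larger magnitude

print(euclidean_distance(p, q))    # large, because the magnitudes differ
print(correlation_distance(p, q))  # about 0, because the profiles match
```

The two measures disagree on this pair, which is why the right choice depends on whether magnitude or shape matters for the application.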

Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. Here we cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.

Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have ground-truth labels against which to compare the output of the clustering algorithm and evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.

In this post, we will cover only K-means, which is considered one of the most widely used clustering algorithms due to its simplicity.

→K-means Algorithm :

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The K-means algorithm works as follows:

  1. Specify the number of clusters K.
  2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
  3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn’t changing:
  • Compute the squared distance between each data point and every centroid.
  • Assign each data point to the closest cluster (centroid).
  • Recompute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
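The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the `seed` and `max_iter` parameters, the rule that an empty cluster keeps its previous centroid, and the toy points are all choices made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: choose K distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the average of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data (made up): two groups along the x-axis.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))  # [(1.0, 0.0), (11.0, 0.0)]
print(labels)             # the first three points share one label, the last three the other
```

On data this well separated, every choice of starting centroids ends in the same two groups. On real data that is not guaranteed.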

The approach K-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster. The M-step computes the centroid of each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).

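The objective function is:

$$J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2$$

where $m$ is the number of data points and $w_{ik} = 1$ for data point $x_i$ if it belongs to cluster $k$; otherwise, $w_{ik} = 0$. Also, $\mu_k$ is the centroid of cluster $k$.

It’s a minimization problem in two parts. We first minimize $J$ w.r.t. $w_{ik}$ while treating $\mu_k$ as fixed. Then we minimize $J$ w.r.t. $\mu_k$ while treating $w_{ik}$ as fixed. In the first part, $J$ is linear in $w_{ik}$, so it is minimized by putting each point’s single nonzero weight on the centroid with the smallest squared distance; this updates the cluster assignments (E-step). In the second part, we differentiate $J$ w.r.t. $\mu_k$, set the derivative to zero, and recompute the centroids from the cluster assignments of the previous step (M-step). Therefore, the E-step is:

$$w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}$$

In other words, assign the data point $x_i$ to the closest cluster, judged by its squared distance from the cluster’s centroid.

And the M-step is:

$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{i=1}^{m} w_{ik} \, (x_i - \mu_k) = 0 \quad \Rightarrow \quad \mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x_i}{\sum_{i=1}^{m} w_{ik}}$$

This translates to recomputing the centroid of each cluster to reflect the new assignments.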
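The two-part minimization described above can be checked numerically. The sketch below keeps the E-step and the M-step as separate functions and records the objective J after each one. The function names, the one-hot list `w`, the toy points, and the deliberately poor starting centroids are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def e_step(points, centroids):
    """w[i][k] = 1 if centroid k is the nearest to point i, else 0."""
    w = []
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
        w.append([1 if k == nearest else 0 for k in range(len(centroids))])
    return w

def m_step(points, w, centroids):
    """mu_k = (sum_i w[i][k] * x_i) / (sum_i w[i][k])."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, row in zip(points, w) if row[k] == 1]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)  # empty cluster: keep the old centroid
    return new_centroids

def objective(points, centroids, w):
    """J = sum_i sum_k w[i][k] * ||x_i - mu_k||^2."""
    return sum(
        row[k] * squared_distance(p, centroids[k])
        for p, row in zip(points, w)
        for k in range(len(centroids))
    )

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
centroids = [(0.0, 0.0), (1.0, 0.0)]  # deliberately poor starting centroids
history = []
for _ in range(4):
    w = e_step(points, centroids)
    history.append(objective(points, centroids, w))  # J after the E-step
    centroids = m_step(points, w, centroids)
    history.append(objective(points, centroids, w))  # J after the M-step

print(history)    # J never goes up from one entry to the next
print(centroids)  # [(1.0, 0.0), (11.0, 0.0)]
```

Because each step minimizes J with the other set of variables held fixed, neither step can increase J, which is why the recorded values only go down or stay flat.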

A few things to note here:

  • Since clustering algorithms, including K-means, use distance-based measurements to determine the similarity between data points, it’s recommended to standardize the data to have a mean of zero and a standard deviation of one, since the features in a dataset almost always have different units of measurement, such as age vs. income.
  • Given the iterative nature of K-means and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and not converge to the global optimum. Therefore, it’s recommended to run the algorithm with several different initializations of the centroids and pick the results of the run that yielded the lowest sum of squared distances.
  • A stable assignment of examples is the same thing as no further change in the within-cluster variation, i.e. the sum of squared distances from each point to its cluster’s centroid.
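The first two notes can be sketched together: standardize the features, then run K-means from several random initializations and keep the run with the lowest sum of squared distances (SSE). This is a minimal sketch; the helper names, the `n_init` and `seed` parameters, and the (age, income) rows are assumptions made for illustration.

```python
import random
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run; returns (sum of squared distances, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, labels

def best_of_restarts(points, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])

# Made-up (age, income) rows: unscaled, income would dominate every distance.
rows = [(25, 30000), (60, 31000), (27, 90000), (58, 91000), (26, 32000), (59, 89000)]
scaled = standardize(rows)
sse, labels = best_of_restarts(scaled, k=3)
print(sse, labels)
```

Since the best run is chosen by SSE, adding restarts can only keep or lower the reported SSE for a given seed.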

The use case of K-means clustering in the security domain is given below:

We can use K-means clustering in malware behavior detection.

The increase in malware attacks raises the risk across the information technology industry, which spans multiple sectors, and especially in cyber security. Malware detection techniques therefore play a vital role in detecting attacks that can have a high impact on the cyber world. One unsupervised machine learning technique, clustering, is able to detect malware attacks by identifying the behavior of the malware. However, current research shows a paucity of analysis in detecting malware behavior, and the sources that can be used to identify malware attacks are limited. Thus, this paper introduces a clustering detection model that uses the K-means clustering approach to detect malware behavior in registry data based on the features of the malware. Clustering techniques, which use unsupervised algorithms in machine learning, play an important role in grouping similar malware characteristics by studying the behavior of the malware. In the experiment, malware features were selected and extracted from computer registry data and then used in the proposed clustering detection model to be clustered as normal or suspicious behavior. The results indicate that the proposed model is capable of clustering normal and suspicious data into two separate groups with a high detection rate of more than 90 percent accuracy. Ultimately, the main contribution of these findings is that the proposed framework can be used to cluster registry data in order to detect malware.
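The text above does not say which registry features were extracted, so the following is only a hypothetical sketch of what that step could look like: counting registry operations per process and turning the counts into fixed-length vectors. The operation names in `OPERATIONS`, the event-log format, the process names, and the key paths are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical feature columns: how often a process performed each registry operation.
OPERATIONS = ("RegSetValue", "RegCreateKey", "RegDeleteKey", "RegQueryValue")

def extract_features(events):
    """Map each process name to a vector of per-operation counts."""
    counts = defaultdict(Counter)
    for process, operation, _key_path in events:
        counts[process][operation] += 1
    return {
        process: tuple(ops[name] for name in OPERATIONS)
        for process, ops in counts.items()
    }

# Made-up event log: (process, operation, registry key path).
events = [
    ("notepad.exe", "RegQueryValue", r"HKCU\Software\ExampleApp\Settings"),
    ("notepad.exe", "RegQueryValue", r"HKCU\Software\ExampleApp\Recent"),
    ("sample.exe", "RegCreateKey", r"HKCU\Software\ExampleKey"),
    ("sample.exe", "RegSetValue", r"HKCU\Software\ExampleKey\Value"),
    ("sample.exe", "RegSetValue", r"HKCU\Software\ExampleKey\Other"),
    ("sample.exe", "RegDeleteKey", r"HKCU\Software\ExampleKey"),
]

features = extract_features(events)
print(features["notepad.exe"])  # (0, 0, 0, 2)
print(features["sample.exe"])   # (2, 1, 1, 0)
```

Vectors like these would then be standardized and passed to K-means with K = 2 so that the two clusters can be read as normal and suspicious behavior.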

Here we go:

Developing a malware detection model using the K-means clustering approach is a demanding task in the Intrusion Detection System (IDS) field. Even though clustering techniques offer a number of advantages in grouping similar malware characteristics, unsupervised algorithms applied specifically to registry information are absent among existing machine learning techniques. First, clustering is one of the best methods for recognizing similar binaries and putting them in one group, as used by [1]–[4]. Other researchers [4]–[9] show that K-means clustering is the best way to recognize malware in malware analysis. However, none of them apply this method to registry information to analyze the malware. Thus, there is still little significant malware analysis in this area. Second, malware analysis using registry information has been explored by previous researchers [10]–[15] with different methods of malware detection. Yet the K-means algorithm has still not been adopted as an alternative for clustering malware data, which leaves this detection method little explored. Even so, the use of K-means clustering for malware detection in the Windows registry has been reviewed by [16] in their survey, and the K-means clustering method seems promising in the malware detection field. Thus, this paper addresses two issues: the lack of data for detecting malware behavior and the lack of further analysis in detecting malware behavior. A K-means clustering detection model, from the point of view of data mining and particularly of clustering methods, is a notable area that can be explored to overcome these issues. The need for continuous IDS improvement in terms of the accuracy of malware analysis, the detection time, and a suitable detection approach is the motivation for this research.

Therefore, the objective of this research is, first, to generate registry information for detecting malware behavior and, second, to propose clustering analysis of registry information for malware detection. This research focuses on K-means clustering as a method to analyze malware in the Windows registry, one that accurately identifies normal and suspicious behavior with minimum false positives and false negatives as well as maximum true positives and true negatives. In addition, the detection method is designed so that it can operate accurately in identifying intrusions in a host-based intrusion detection system (HIDS).
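The true/false positive and negative counts mentioned above can be computed as in the following minimal sketch. It assumes the clusters have already been named "suspicious" and "normal" and that ground-truth labels are available for evaluation. The function name and the sample labels are made up, and the detection rate is taken here to mean TP / (TP + FN), which is an assumption since the text does not define it.

```python
def detection_metrics(predicted, actual, positive="suspicious"):
    """Confusion counts and rates, treating `positive` as the malware class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    tn = sum(1 for p, a in pairs if p != positive and a != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        # Assumed definition: share of truly suspicious samples that were flagged.
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

S, N = "suspicious", "normal"
predicted = [S, S, N, N, S, N, N, N]  # made-up cluster labels
actual = [S, S, S, N, N, N, N, N]     # made-up ground truth

metrics = detection_metrics(predicted, actual)
print(metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])  # 2 1 1 4
print(metrics["accuracy"])  # 0.75
```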

— — — — — — — — — — -

Final thoughts:

Intrusion Detection Systems (IDS) are used as malware detectors globally, which has led many researchers to explore this field. In this research project, a clustering method is proposed for better malware detection. The motivation is that the lack of analysis of malware behavior leads to low malware detection, owing to the limited sources of information, especially in the Windows registry, for identifying malware activities. Clustering techniques, which use unsupervised algorithms in machine learning, play an important role in grouping similar malware characteristics, but this approach is absent from the malware analysis environment, specifically for registry information. The purpose of this research project is therefore to study registry information and to propose clustering analysis of registry information to improve malware detection. The research project was conducted successfully and proposes a clustering analysis model for registry information to improve malware detection. Based on the results, the proposed method has a detection rate of more than 90%, which shows that it has a high rate of detecting malware based on the features of an unknown file. This research project benefits the community by providing guidance and steps on how to overcome the stated problem. Finally, it is hoped that the community can appreciate the importance of this research project and apply it in the areas of Information Technology and Computer Science.

Thanks For Visiting 🙏
