Calculating a fuzzy kmeans membership matrix with R and Rcpp

by Błażej Moska, computer science student and data science intern 

Suppose that we have performed clustering K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat clustering using one of the fuzzy kmeans functions available in R (like fanny, for example), but since it is slightly different implementation the results could also be different and for some reasons we don’t want them to be changed. Knowing the equation we can construct this matrix on our own, after using the kmeans function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| }) ^{ \frac{2}{m-1} } } $$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals number of observation and total number of columns equals number of variables included in clustering. \(m\) is a parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1 so they are easy and convenient to compare. In this example I will use \(m = 2\) so the Euclidean distance will be calculated.

To make computations faster I also used the Rcpp package, then I compared speed of execution of function written in R with this written in C++.

In implementations for loops were used, although it is not a commonly used method in R (see this blog post for more information and alternatives), but in this case I find it more convenient.

Rcpp (C++ version)

#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {  
/* data is a matrix with observations(rows) and variables, 
   centers is a matrix with cluster centers coordinates, 
   m is a parameter of equation, c is a number of clusters
*/
  
  int c=centers.rows();
  int rows = data.rows();
  int cols = data.cols(); /*number of columns equals number of variables, the same as is in centers matrix*/
  double tempDist=;        /*dist and tempDist are variables storing temporary euclidean distances */
  double dist=;      
  double denominator=;    //denominator of “main” equation
  
  NumericMatrix result(rows,c);    //declaration of matrix of results
  
  for(int i=;i<rows;i++){
    for(int j=;j<c;j++){
      for(int k=;k<c;k++){
        for(int p=;p<cols;p++){
          tempDist = tempDist+pow(centers(j,p)-data(i,p),2);      
         //in innermost loop an euclidean distance is calculated.
          dist = dist + pow(centers(k,p)-data(i,p),2);              
/*tempDist is nominator inside the sum operator in the equation, dist is the denominator inside the sum operator in the equation*/
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        denominator = denominator+pow((tempDist/dist),(2/(m-1)));
        tempDist = ;
        dist = ;
      }
      result(i,j) = 1/denominator;    
// nominator/denominator in the  main equation
      denominator = ;                           
    }
  }
  return result;
}

We can save this in a file with .cpp extension. To compile it from R we can write:

sourceCpp("path_to_cpp_file")

If everything goes right our function fuzzyClustering will then be available from R.

R version

fuzzyClustering=function(data,centers,m){

  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(,nrow=rows,ncol=c)  #defining membership matrix
  denominator <- 
  
  for(i in 1:rows){
    for(j in 1:c){
      tempDist <- sqrt(sum((centers[j,]-data[i,])^2))  #euclidean distance, nominator inside a sum operator 
      for(k in 1:c){
        Dist <- sqrt(sum((centers[k,]-data[i,])^2))    #euclidean distance, denominator inside a sum operator 
        denominator <- denominator +((tempDist/Dist)^(2/(m-1))) #denominator of an equation 
      }    
      result[i,j] <- 1/denominator    #inserting value into membership matrix
      denominator <-            
    }
  }
  return(result);
}

Result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the sums of rows can be slightly different than 1:

          1     2     3     4     5     6     7     8     9    10  
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188  
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051  
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062  
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031  
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045  
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048  
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046  
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062  
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055 
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

Performance comparison

Shown below is the output of Sys.time for the C++ and R versions, running against a simulated matrix with 30000 observations, 3 variables and 10 clusters.
The hardware I used was a low-cost notebook Asus R556L with Intel Core i3-5010 2.1 GHz processor and 8 GB DDR3 1600 MHz RAM memory.

C++ version:

  user    system  elapsed
  0.32    0.00    0.33 

R version:

  user    system  elapsed
  15.75    0.02     15.94 

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.