Tutorial

RcppShark is basically a bridge to allow creating packages based on Shark. It also allows for rapid prototyping using Shark as the underlying machine learning library. Therefore, the most important part is to understand how to use RcppShark in your own packages. For this reason, RcppShark contains (the usual) package skeleton creator. To develop your own package using RcppShark, you simply call after installing RcppShark

library (RcppShark)

RcppShark.package.skeleton (“myOwnPackage”)

(make sure your own name is valid). This will create a empty package structure where you can freely use Shark in the C++ part. Let us wrap the KMeans algorithm, following the tutorial from the Shark homepage (http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/kmeans.html).

For this we want basically two functions: One that will ‘learn’ the clustering and cluster the given training data, and one that will ‘predict’ the clustering to new data. Let us start with the first part. We start with the C++ code, and will only add R code, if it is necessary. Rcpp will export our C++ function to R in a very nice fashion, so it is not always necessary to add R code, at least not for our little tutorial.

Now we can write the corresponding C++ function. For this we follow the Shark tutorial by creating a new file in the src folder of our new package, called KMeans.cpp, and importing all the headers we need:

#include <shark/Algorithms/KMeans.h> //k-means algorithm
#include <shark/Models/Clustering/Centroids.h>//model performing hard clustering of points
#include <shark/Models/Clustering/HardClusteringModel.h>//model performing hard clustering of points

#include <Rcpp.h>
#include "utils.h" // some conversion helpers

using namespace shark;
using namespace std;

using namespace Rcpp;

Note that we have skipped the CSV and normalizer header, as we have full control over both in R, and added Rcpp as well as our small utils.h header which contains some code to convert from the Shark to the Rcpp data format (and back). Lets continue with the main routine. This will take a matrix, the number of clusters \(k\) to find and will return a list with the following two information: The final clustering in form of a model consisting of the centroids and a vector with the cluster assignments. So we start with the header:

//' Simple KMeans Train
//' 
//' @export
//'
// [[Rcpp::depends(BH)]]    
// [[Rcpp::export]] 
List SharkKMeansTrain (NumericMatrix X, ssize_t k) {

The first thing to do is to convert the given data in form of a matrix into an object, Shark can work with. For this, we call the corresponding function from the utils.h class:

    // convert data
    UnlabeledData<RealVector> data = NumericMatrixToUnlabeledData(X);
    std::size_t elements = data.numberOfElements();

Now comes the Shark code. For more details on this, please consult the Shark tutorial. We drop here everything that we do not need, e.g. printing information.

    // compute centroids using k-means clustering
    Centroids centroids;
    kMeans (data, k, centroids);

    // convert cluster centers/centroids
    Data<RealVector> const& c = centroids.centroids();
    NumericMatrix cM = DataRealVectorToNumericMatrix (c);
    
    // cluster training data we are given and convert to Rcpp object
    HardClusteringModel<RealVector> model(&centroids);
    Data<unsigned> clusters = model(data);
    NumericVector labels = LabelsToNumericVector (clusters);

Finally we just need to put all the bits into a list and return it to R:

    // return solution found 
    Rcpp::List rl = R_NilValue;
    rl = Rcpp::List::create(
        Rcpp::Named("labels") = labels,
        Rcpp::Named("centroids") = cM );
    return (rl);

Please note that for sake of brevity we dropped all kind of checks. For production code such checks obviously should be done.

Before we test this code, let us quickly write the corresponding predict function (which is more or less already part of the training function, so we could drop that there):

//' Simple KMeans Predict
//' 
//' @export
//'
// [[Rcpp::depends(BH)]]    
// [[Rcpp::export]] 
List SharkKMeansPredict (NumericMatrix X, NumericMatrix centroids) {

    // convert data
    UnlabeledData<RealVector> data = NumericMatrixToUnlabeledData (X);
    std::size_t elements = data.numberOfElements();

    // convert centroids
    Centroids c (NumericMatrixToDataRealVector (centroids));
    
    // cluster data we are given and convert to Rcpp object
    HardClusteringModel<RealVector> model (&c);
    Data<unsigned> clusters = model (data);
    NumericVector labels = LabelsToNumericVector (clusters);

    // return solution found 
    rl = Rcpp::List::create(Rcpp::Named("labels") = labels);
    return (rl);
}

Running this code is now easy, as by Rcpp::export tag, Rcpp will automatically create a wrapper to be called from R. This is good enough for us, though usually you really want to wrap the C++ function in a nice R routine. So we can now write a small function that will use our KMeans routine by creating first some example data and then calling our KMeans:

library (devtools)
load_all (".")

# convert iris to matrix
data = as.matrix(iris[,1:4])

model = SharkKMeansTrain (data, 3)
labels = SharkKMeansPredict (data, model$centroids)

Note that we need the devtools package, as we develop a new package. The output should be look similar to this:

> model
$labels
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
[112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
[149] 2 1

$centroids
         [,1]     [,2]     [,3]     [,4]
[1,] 5.006000 3.428000 1.462000 0.246000
[2,] 5.901613 2.748387 4.393548 1.433871
[3,] 6.850000 3.073684 5.742105 2.071053

> labels
$labels
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
[112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
[149] 2 1

For convenience every new created skeleton contains the above KMeans example.