Quick Start

Martin Vincent

2016-12-28

Quick Start (for msgl version 2.3.5)

1. Load the msgl library in R

library(msgl)
## Loading required package: Matrix
## Loading required package: sglOptim
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel

2. Load your data

Load data containing N samples and p features (covariates):

x <- # load design matrix (of size N x p)
classes <- # load class labels (a vector of size N)
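
As a concrete sketch, assume for illustration that your data sit in a file named expression.csv with samples in rows, features in columns, and a column named class holding the labels (both the file name and the column name are hypothetical, not part of the package):

d <- read.csv("expression.csv", row.names = 1)  # hypothetical input file
classes <- factor(d$class)                      # class labels (a factor of length N)
x <- as.matrix(d[, names(d) != "class"])        # design matrix (N x p)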

For the purpose of this tutorial we will load a data set consisting of microRNA normalized expression measurements of primary cancer samples.

data(PrimaryCancers)
x[1:5,1:5]
##              let.7a     let.7c     let.7d     let.7e       let.7f
## P-544-ME -1.1052510 -0.9213983 -0.7200146 -0.9448098 -0.591417505
## P-554-NW -1.0956835 -1.0879221 -0.6100223 -0.9538088 -0.554779014
## P-559-OI -1.1271169 -1.0914447 -0.6889379 -1.0823322 -0.736167409
## P-564-MO -1.2465982 -1.2719367 -0.7614792 -1.2006796 -0.784319518
## P-579-MY -0.6194332 -0.4971233 -0.5169694 -0.9004003  0.009509523
dim(x)
## [1] 165 371
table(classes)
## classes
##    Breast       CCA Cirrhosis       CRC        EG       HCC     Liver 
##        17        20        17        20        18        17        20 
##  Pancreas  Squamous 
##        20        16

Hence, p = 371, N = 165 and the number of classes K = 9; this implies that the multinomial classification model has 9*(371+1) = 3348 parameters (the intercept is counted as an additional feature, which is why the fitting routines below report 372 features and 3.348k parameters).
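
These counts can be verified directly from the loaded data:

nlevels(factor(classes))                   # number of classes K: 9
ncol(x)                                    # number of features p: 371
nlevels(factor(classes)) * (ncol(x) + 1)   # parameters in the multinomial model: 3348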

Let us take out a small test set:

idx <- 1:10
x.test <- x[idx,]
x <- x[-idx,]
classes.test <- classes[idx]
classes <- classes[-idx]

3. Estimate error using cross validation

Choose lambda (fraction of lambda.max) and alpha, with alpha = 1 for lasso, alpha = 0 for group lasso and alpha in the range (0,1) for sparse group lasso.

Use msgl::cv to estimate the error for each lambda in a sequence decreasing from the data derived lambda max to lambda * lambda max. Lambda max is the lambda at which the first penalized parameter becomes non-zero. A smaller lambda will take longer to fit and include more features. The following command runs a 10 fold cross validation for each lambda value in the lambda sequence, using 2 parallel units (via the foreach and doParallel packages).

cl <- makeCluster(2)
registerDoParallel(cl)

fit.cv <- msgl::cv(x, classes, fold = 10, alpha = 0.5, lambda = 0.1, use_parallel = TRUE)
## Running msgl 10 fold cross validation (dense design matrix)
## 
##  Samples:  Features:  Classes:  Groups:  Parameters: 
##        155        372         9      372       3.348k
stopCluster(cl)

(for the current version no progress bar will be shown)
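
If you prefer not to set up a parallel backend, the cross validation may also be run sequentially by leaving out the cluster setup and the use_parallel argument (this is simply the same call without parallelization; it is slower but otherwise equivalent up to the random fold assignment):

fit.cv <- msgl::cv(x, classes, fold = 10, alpha = 0.5, lambda = 0.1)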

Get a summary of the validated models. We have now cross validated one model for each lambda value in the sequence. We may get a summary of this validation by doing:

fit.cv
## 
## Call:
## msgl::cv(x = x, classes = classes, alpha = 0.5, lambda = 0.1, 
##     fold = 10, use_parallel = TRUE)
## 
## Models:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##        1     1.00        1.4         11.3    0.93
##       20     0.64       12.7           75    0.54
##       40     0.40       31.9        168.3    0.23
##       60     0.25       43.3        226.7    0.17
##       80     0.16         54        282.8    0.14
##      100     0.10       65.5        342.3    0.14
## 
## Best model:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##       86     0.14         58        302.3    0.14

Hence, the best model is obtained using lambda index 86 and it has a cross validation error of 0.14. The expected number of selected features is 58 and the expected number of parameters is 302.3.
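
The index of the best model may also be extracted programmatically with best_model, as done in the steps below. Err should, to the best of my knowledge, return the estimated error rates via the sglOptim package; if it is not available in your version, the printed summary above contains the same information:

best_model(fit.cv)               # index of the model with the lowest estimated error: 86
Err(fit.cv)[best_model(fit.cv)]  # its cross validation error: 0.14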

4. Fit the final model

Use msgl::fit to fit the final models (one for each lambda value in the sequence).

fit <- msgl::fit(x, classes, alpha = 0.5, lambda = 0.1)
## 
## Running msgl (dense design matrix) 
## 
##  Samples:  Features:  Classes:  Groups:  Parameters: 
##        155        372         9      372       3.348k

Get a summary of the estimated models

fit
## 
## Call:
## msgl::fit(x = x, classes = classes, alpha = 0.5, lambda = 0.1)
## 
## Models:
## 
##  Index:  Lambda:  Features:  Parameters: 
##        1     1.00          2           13
##       20     0.64         11           65
##       40     0.40         32          171
##       60     0.25         43          230
##       80     0.16         48          250
##      100     0.10         67          345

5. Take a look at the estimated models

As we saw in the previous step, the model with index 86 had the best cross validation error. We may take a look at its included features using the command:

features(fit)[[best_model(fit.cv)]] # Non-zero features in best model
##  [1] "Intercept"   "let.7c"      "miR.10a"     "miR.17"      "miR.21"     
##  [6] "miR.27a"     "miR.34a"     "miR.92a"     "miR.99b"     "miR.122"    
## [11] "miR.129.3p"  "miR.130b"    "miR.133a"    "miR.133b"    "miR.135b"   
## [16] "miR.138"     "miR.139.5p"  "miR.143"     "miR.147b"    "miR.148a"   
## [21] "miR.181a"    "miR.182"     "miR.187"     "miR.191"     "miR.192"    
## [26] "miR.196b"    "miR.199a.3p" "miR.203"     "miR.205"     "miR.210"    
## [31] "miR.214"     "miR.216a"    "miR.221"     "miR.223"     "miR.224"    
## [36] "miR.302b"    "miR.338.3p"  "miR.484"     "miR.505"     "miR.508.3p" 
## [41] "miR.518f"    "miR.526b"    "miR.532.3p"  "miR.548d.3p" "miR.570"    
## [46] "miR.615.5p"  "miR.625"     "miR.628.5p"  "miR.654.3p"  "miR.885.5p" 
## [51] "miR.891a"    "miR.492"     "miR.506"

Hence, 53 features are included in the model, which is close to the expected number based on the cross validation estimate.
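
This count may be verified directly, since features returns the names of the non-zero feature blocks:

length(features(fit)[[best_model(fit.cv)]])  # number of features in the best model: 53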

The sparsity structure of the parameters belonging to these 53 features may be viewed using

parameters(fit)[[best_model(fit.cv)]]
## 9 x 53 sparse Matrix of class "lgCMatrix"
##    [[ suppressing 53 column names 'Intercept', 'let.7c', 'miR.10a' ... ]]
##                                                                          
## Breast    | | | . . . . | . | . . | | . . . . | | | | | | | | | | | | | |
## CCA       | . | | | | | . | | | . . . . | . . . | | . | . . | . | | . | |
## Cirrhosis | . . . | | . . . | | | . . | | | . . . . . . | . | | | . . | .
## CRC       | | | | . | | | | | . | . . | . | . | | . | | . | | . | . | . .
## EG        | . . | . . . . . | | . | | | | . | | . . | . | | | . | . | | .
## HCC       | . | | . | | | . | | | . . | | . . . | . . | | . . | | . . | |
## Liver     | | | . | | | | | | . . . . | | | | | | | . | . | | . . | . | .
## Pancreas  | | | | | | | | . | | | | . | | . . | . | . | . . . | . . | | |
## Squamous  | . . | . . . | . | . | | | . | | . . | | . | | | . | | | | | .
##                                                    
## Breast    | | | . . . | . . | . | | | | . | . . . .
## CCA       | | . | | | | | | . | | . | | | | . | | |
## Cirrhosis | | | . | . . | | | | | | | | | | | | | |
## CRC       | . | . . . . . . . . | | | . . . | | . .
## EG        | . | | | . | | | | | | | | | . | | | . |
## HCC       | | | . | | . . . | | . . . | . . | . . .
## Liver     | . | . | | | | . | | | | . . . | | . | .
## Pancreas  . . . | . | | . . . | | | . . . | | | | |
## Squamous  | | | . | . . . . . . . . . . . . . . . .
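
From this sparsity pattern we may also count the non-zero parameters per class (the Matrix package, loaded as a dependency above, sums a logical sparse matrix by treating TRUE as 1):

rowSums(parameters(fit)[[best_model(fit.cv)]])  # non-zero parameters per class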

We may also take a look at the estimated parameters (or coefficients)

coef(fit, best_model(fit.cv))[,1:5] # Parameters for the first 5 features of the best model
## 9 x 5 sparse Matrix of class "dgCMatrix"
##            Intercept     let.7c     miR.10a       miR.17      miR.21
## Breast     1.6855207 -0.1835078  0.42757013  .            .         
## CCA       -5.6945211  .         -0.07849334  0.445093420 -0.24076978
## Cirrhosis  2.5466498  .          .           .            0.19283340
## CRC       -4.3557707  1.4881267 -1.34554218 -1.665796232  .         
## EG         3.1864493  .          .           0.255711038  .         
## HCC        0.6712489  .          1.33418100 -0.337340600  .         
## Liver      1.9736792 -0.6841190  0.02635424  .            0.29593275
## Pancreas   2.4886878 -0.5117927 -0.97662326  1.608377508 -0.03819527
## Squamous   3.2801128  .          .          -0.004670857  .

If we count the total number of non-zero parameters in the model we get 275, which is close to the expected number based on the cross validation estimate.
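
This count may be computed from the sparsity pattern, since parameters returns a logical sparse matrix:

sum(parameters(fit)[[best_model(fit.cv)]])  # total number of non-zero parameters: 275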

6. Use your model for predictions

Load test data containing M samples and p features (in this tutorial we simply use the test set x.test and classes.test that we set aside in step 2).

x.test <- # load matrix with test data (of size M x p)

Use the final model to predict the classes of the M samples in x.test.

res <- predict(fit, x.test)

res$classes[,best_model(fit.cv)] # Classes predicted by best model
##   P-544-ME   P-554-NW   P-559-OI   P-564-MO   P-579-MY   P-590-OU 
## "Squamous"    "Liver"       "EG"    "Liver"      "CRC"      "CRC" 
##   P-598-PO   Q-199-AB   Q-250-GS   Q-278-DK 
##    "Liver"       "EG"      "CCA"       "EG"
classes.test # True classes
##  P-544-ME  P-554-NW  P-559-OI  P-564-MO  P-579-MY  P-590-OU  P-598-PO 
##    Breast Cirrhosis        EG     Liver       CRC       CRC     Liver 
##  Q-199-AB  Q-250-GS  Q-278-DK 
##        EG       CCA       CRC 
## Levels: Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous
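
Comparing the predicted and true classes shows that 7 of the 10 test samples are classified correctly:

sum(res$classes[, best_model(fit.cv)] == as.character(classes.test))  # correct predictions: 7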

We may also get the estimated probabilities for each of the classes

res$response[[best_model(fit.cv)]]
##             P-544-ME    P-554-NW    P-559-OI    P-564-MO    P-579-MY
## Breast    0.34498690 0.002792285 0.005996772 0.007163726 0.131685017
## CCA       0.14133557 0.021275517 0.051999915 0.028327222 0.044122906
## Cirrhosis 0.01886155 0.367601180 0.020613525 0.227294778 0.004426578
## CRC       0.04713724 0.002614093 0.074877699 0.007438449 0.747329294
## EG        0.02909340 0.044067912 0.697064421 0.037769510 0.012597340
## HCC       0.03239314 0.037234555 0.022754389 0.058457432 0.009365618
## Liver     0.01312646 0.513399253 0.037706577 0.624709821 0.002725093
## Pancreas  0.02725561 0.008346044 0.083880751 0.006756721 0.039025695
## Squamous  0.34581013 0.002669161 0.005105950 0.002082341 0.008722459
##              P-590-OU    P-598-PO   Q-199-AB   Q-250-GS    Q-278-DK
## Breast    0.015382429 0.004562008 0.04342990 0.02062219 0.019833523
## CCA       0.020756347 0.017019175 0.14454934 0.64051753 0.191789619
## Cirrhosis 0.010877781 0.132216804 0.01218625 0.09572533 0.013338561
## CRC       0.763032118 0.006113376 0.13220219 0.03607028 0.057734792
## EG        0.100358916 0.048757494 0.51183520 0.03327640 0.583803006
## HCC       0.044016223 0.092917684 0.05615446 0.04741534 0.008915731
## Liver     0.007124862 0.684468259 0.01349342 0.01305300 0.003731763
## Pancreas  0.030202909 0.011685993 0.06707729 0.08586810 0.078074532
## Squamous  0.008248416 0.002259208 0.01907196 0.02745183 0.042778474
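
The predicted class of a sample corresponds to the row with the largest probability in its column. A quick way to see how confident the classifier is in each prediction is to take the column-wise maximum:

apply(res$response[[best_model(fit.cv)]], 2, max)  # highest class probability per test sample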