speaq2, function illustrations

Charlie Beirnaert

2016-11-14

speaq2

To illustrate the possibilities of speaq2, the spiritual successor to the original speaq package, the wine dataset is used. This dataset is also utilized in the paper so with this vignette you will be able to reproduce the results.

Note that there is a plotting function available in speaq2 (for raw spectra) but the features are naturally plotted with ggplot2. The code for these plots are also included in this vignette and can be used in your own workflows.

Plotting the raw wine data

library(speaq2)
data(Winedata)
Spectra.wine <- as.matrix(Winedata$spectra )
ppm.wine <- as.numeric(Winedata$ppm) 
wine.color <- Winedata$wine.color 
wine.origin <- Winedata$origin 
# all spectra
drawSpecPPM(Y.spec = Spectra.wine, 
            X.ppm = ppm.wine, 
            title = 'Wine data spectra', 
            groupFactor = wine.color, 
            legend.extra.x = 1, 
            legend.extra.y = 1.1)

The drawSpecPPM() plotting function indicates that there might be a gap in the data (which is correct as the original authors deleted the region between 5.0 and 4.5 ppm). This warning can be of importance for the interpretation of the plot as in this case the original authors deleted the data, not by setting the values to 0, but by effectively removing it from the matrix and ppm vector. This produces a plot that appears continuous but is in fact not.

The drawSpecPPM() function also indicates that the groupFactor is not a factor and (succesfully) attempts to do the conversion. The next plot is an excerpt of the wine NMR spectra.

# small excerpt by defining the region of interest
drawSpecPPM(Y.spec = Spectra.wine, 
            X.ppm = ppm.wine, 
            groupFactor = as.factor(wine.color), 
            title = 'Raw wine data excerpt', 
            legend.extra.x = 1.1, 
            legend.extra.y = 1.0, 
            ROI.ppm = 3.6, 
            ROI = NULL, 
            roiWidth.ppm = 0.15, 
            legendpos = "topright" )

From spectra via peaks to aligned peaks (features)

Now that we’ve had a look at the spectra it is time to convert these to peaks by using the getWaveletPeaks() function. This takes about 50 seconds, with nCPU = 4 on a 2.5GHz machine (nCPU is set to 1 for the vignette build but should be changed).

wine.peaks <- getWaveletPeaks(Y.spec=Spectra.wine, 
                             X.ppm=ppm.wine, 
                             baselineThresh = 10,
                             SNR.Th = -1, 
                             nCPU = 2, 
                             include_nearbyPeaks = TRUE) # nCPU set to 2 for the vignette build


wine.aligned <- PeakAligner(Y.peaks = wine.peaks,  
                            min.samp.grp = 5, 
                            grouping.window.width = 200)

There are two notes to be made on some important parameters: 1. The include_nearbyPeaks = TRUE option also detects peaks in the tails of larger other peaks. At first this might seem the prefered option as you want to detect as much peaks as possible but we will see later that often it is better to exclude them as these small peaks can cause problems later on. 2. The grouping.window.width parameter is chosen to be twice the size of the default value: it is advised to choose this larger when working with spectra that exhibit large between sample shifts. The wine dataset is an extreme example of large between sample shifts (caused by the substantial pH differences between wines of different colour)

Now we can plot the detected peaks and the aligned peaks. The dataset after alignment contains both the original ppm values of every peak (in the peakPPM variable) but also the group information (found in the peakIndex variable). By calling the AddPlottingStuff() function the groupPPM variable is added so we also have the ppm value of the groups (the link is: groupPPM <- ppm.wine[peakIndex])

Plotting the peak data

# adding labels to the dat a for plotting and the group ppm values
library(grid)
library(gridBase)
library(gridExtra)
library(ggplot2)
.pardefault <- par(no.readonly = T)
wine.peaks.plot <- AddPlottingStuff(Y.peaks = wine.peaks, 
                                    X.ppm = ppm.wine, 
                                    groupLabels = wine.color )

wine.aligned.plot <- AddPlottingStuff(Y.peaks = wine.aligned, 
                                      X.ppm = ppm.wine, 
                                      groupLabels = wine.color )
#ROI.ppm <- 3.6
#roiWidth.ppm <- 0.15
ROI.ppm <- 1.330
roiWidth.ppm <- 0.025

pp1 <- ggplot(wine.peaks.plot[wine.peaks.plot$peakPPM > (ROI.ppm - roiWidth.ppm ) &
                              wine.peaks.plot$peakPPM < (ROI.ppm + roiWidth.ppm ) ,], 
              aes(x = peakPPM, y = peakValue, colour = label) ) + 
       geom_point() + 
       theme_bw() + 
       xlim(c(ROI.ppm + roiWidth.ppm, ROI.ppm - roiWidth.ppm)) +
       scale_colour_manual(values = c("rose" = "#00BF7D","red" = "#F8766D","white" = "#00B0F6")) +  
       labs(x = "ppm", y = "peak value") + 
       ggtitle("Wavelet based peak detection") +
       theme(legend.title = element_blank(),
             legend.position = c(0.061,0.75),
             legend.background = element_rect(colour = "black", size = 0.3, linetype = 1),
             legend.key = element_blank(),
             legend.text = element_text(size = 10),
             text = element_text(size = 12),
             axis.text.y = element_text(angle = 90, hjust = 0.5),
             axis.ticks.length = unit(0.2,"cm"), 
             axis.title.x = element_text(margin = margin(10,0,0,0)),
             axis.title.y = element_text(margin = margin(0,20,0,0)),
             plot.title = element_text(lineheight = 0.8, face = "bold", margin = margin(0,0,20,0)),
             plot.margin = unit(c(0.5,1,0.5,0.5), "cm")) 
     


pp2 <- ggplot(wine.aligned.plot[wine.aligned.plot$peakPPM > (ROI.ppm - roiWidth.ppm ) &
                                wine.aligned.plot$peakPPM < (ROI.ppm + roiWidth.ppm ) ,],
              aes(x = groupPPM, y = peakValue, colour = label) ) +
       geom_point() + 
       theme_bw() + 
       xlim(c(ROI.ppm + roiWidth.ppm, ROI.ppm - roiWidth.ppm)) +
       scale_colour_manual( values = c("rose" = "#00BF7D","red" = "#F8766D","white" = "#00B0F6")) +  
       labs(x = "ppm", y = "peak value") + 
       ggtitle("Aligned peaks") +
       theme(legend.title = element_blank(),
             legend.position = c(0.061,0.748),
             legend.background = element_rect(colour = "black",size = 0.3, linetype = 1),
             legend.key = element_blank(), 
             legend.text = element_text(size = 10),
             text = element_text(size = 12),
             axis.text.y = element_text(angle = 90, hjust = 0.5),
             axis.ticks.length = unit(0.2,"cm"), 
             axis.title.x = element_text(margin = margin(10,0,0,0)),
             axis.title.y = element_text(margin = margin(0,20,0,0)),
             plot.title = element_text(lineheight = 0.8, face="bold", margin = margin(0,0,20,0)),
             plot.margin = unit(c(0.5,1,0.5,0.5), "cm")) 

plot.new()

grid.newpage()
pushViewport(viewport(layout = grid.layout(13, 1)))

#Draw ggplot1
pushViewport(viewport(layout.pos.row = (6:9)))
print(pp1, newpage = FALSE)
popViewport()

#Draw ggplot2
pushViewport(viewport(layout.pos.row = (10:13)))
print(pp2, newpage = FALSE)
popViewport()

#Draw bsae plot
pushViewport(viewport(layout.pos.row = (1:5)))
par(fig = gridFIG(), new = TRUE)
drawSpecPPM(Y.spec = Spectra.wine, 
            X.ppm = ppm.wine, 
            groupFactor = as.factor(wine.color), 
            title = "Raw wine data excerpt",
            ROI.ppm = ROI.ppm, 
            roiWidth.ppm = roiWidth.ppm,  
            nAxisPos = 6,
            manual.colours = c("#F8766D","#00BF7D", "#00B0F6"),
            legend.extra.x = 0.3, 
            legend.extra.y = 0.8)
popViewport()

The plot above shows clearly what the basis of speaq2 does, convert spectra accurately to peak data and align these peaks. The quality of the alignment can be checked by plotting the silhouette values. this can be done with the SilhouetR() function.

SilhouetteValues <- SilhouetR(DataMatrix = wine.aligned$peakPPM, 
                              GroupIndices = wine.aligned$peakIndex)
## DataMatrix is not a matrix, attempting conversion with the assumption of only 1 variable (1 column)
Silh_plot <- ggplot(SilhouetteValues, aes(SilhouetteValues)) +
             geom_freqpoly(binwidth = 0.03) +
             theme_bw()
Silh_plot

It is clear that the grouping is very good. To be absolutely shure we can check the mean silhouette value of every group to see if there are groups that should be regrouped. This regrouping, if needed, can be done with the regroupR() function.

groups <- unique(SilhouetteValues$GroupIndices)
Ngroups <- length(groups)
sil_means <- matrix(NA, ncol = 3, nrow = Ngroups)

for (k in 1:Ngroups) {
    sil_means[k, 1] = groups[k]
    sil_means[k, 2] = mean(SilhouetteValues$SilhouetteValues[SilhouetteValues$GroupIndices == 
        groups[k]])
    sil_means[k, 3] = mean(wine.aligned$peakSNR[wine.aligned$peakIndex == groups[k]])
}

sil_means <- sil_means[order(sil_means[, 2]), ]
colnames(sil_means) <- c("groupIndex", "avg_silhouette_val", "avg. SNR")
head(sil_means)
##      groupIndex avg_silhouette_val  avg. SNR
## [1,]       5313          0.2557844 32.261401
## [2,]       5050          0.4549954 11.669748
## [3,]       2330          0.4848325 31.757387
## [4,]       4711          0.4904269  5.754990
## [5,]       5552          0.5011936  6.933832
## [6,]       3287          0.5099114  3.559727

As it turns out, there is a group with a low average silhouette value (0.25) but a fairly high average signal-to-noise ratio (if this group had a low SNR ratio it could just be a noise group). This indicates that these are non-noise peaks which are grouped incorrectly, the follwoing plot acknowledges this

faulty.groupIndex <- sil_means[1,1]
ROI.ppm <- ppm.wine[faulty.groupIndex]
roiWidth.ppm <- 0.1
pp1 <- ggplot(wine.peaks.plot[wine.peaks.plot$peakPPM > (ROI.ppm - roiWidth.ppm ) &
                              wine.peaks.plot$peakPPM < (ROI.ppm + roiWidth.ppm ) ,],
              aes(x = peakPPM, y = peakValue, colour = label) ) + 
       geom_point() + 
       xlim(c(ROI.ppm + roiWidth.ppm, ROI.ppm - roiWidth.ppm)) +
       scale_colour_manual( values = c("rose" = "#00BF7D","red" = "#F8766D","white" = "#00B0F6")) +  
       labs(x = "ppm", y = "peak value") + 
       ggtitle("Wavelet based peak detection") +
       theme_bw() + 
       theme(legend.title = element_blank(),
             legend.position = c(0.061,0.75),
             legend.background = element_rect(colour = "black", size = 0.3, linetype = 1),
             legend.key = element_blank(), 
             legend.text = element_text(size=10),
             text = element_text(size = 12),
             axis.text.y = element_text(angle=90, hjust=0.5),
             axis.ticks.length = unit(0.2,"cm"), 
             axis.title.x = element_text(margin=margin(10,0,0,0)),
             axis.title.y = element_text(margin=margin(0,20,0,0)),
             plot.title = element_text(lineheight = 0.8, face = "bold", margin = margin(0,0,20,0)),
             plot.margin = unit(c(0.5,1,0.5,0.5), "cm"))


pp2 <- ggplot(wine.aligned.plot[wine.aligned.plot$groupPPM > (ROI.ppm - roiWidth.ppm ) &
                                wine.aligned.plot$groupPPM < (ROI.ppm + roiWidth.ppm ) ,], 
              aes(x = groupPPM, y = peakValue, colour = label) ) +
       geom_point() + 
       theme_bw() + 
       xlim(c( ROI.ppm + roiWidth.ppm, ROI.ppm - roiWidth.ppm)) +
       scale_colour_manual( values = c("rose" = "#00BF7D","red" = "#F8766D","white" = "#00B0F6")) +  
       labs(x = "ppm", y = "peak value") + 
       ggtitle("Aligned peaks") +
       theme(legend.title = element_blank(),
             legend.position = c(0.061,0.748),
             legend.background = element_rect(colour = "black", size = 0.3, linetype = 1),
             legend.key = element_blank(), 
             legend.text = element_text(size = 10),
             text = element_text(size = 12),
             axis.text.y = element_text(angle = 90, hjust = 0.5),
             axis.ticks.length = unit(0.2,"cm"), 
             axis.title.x = element_text(margin = margin(10,0,0,0)),
             axis.title.y = element_text(margin = margin(0,20,0,0)),
             plot.title = element_text(lineheight = 0.8, face="bold", margin = margin(0,0,20,0)),
             plot.margin = unit(c(0.5,1,0.5,0.5), "cm"))

plot.new()

grid.newpage()
pushViewport(viewport(layout = grid.layout(13, 1)))

#Draw ggplot1
pushViewport(viewport(layout.pos.row = (6:9)))
print(pp1, newpage = FALSE)
popViewport()

#Draw ggplot2
pushViewport(viewport(layout.pos.row = (10:13)))
print(pp2, newpage = FALSE)
popViewport()

#Draw bsae plot
pushViewport(viewport(layout.pos.row = (1:5)))
par(fig = gridFIG(), new = TRUE)
drawSpecPPM(Y.spec = Spectra.wine, 
            X.ppm = ppm.wine, 
            groupFactor = as.factor(wine.color), 
            title = "Raw wine data excerpt",
            ROI.ppm = ROI.ppm, 
            roiWidth.ppm = roiWidth.ppm,  
            nAxisPos = 6,
            manual.colours = c("#F8766D","#00BF7D", "#00B0F6"),
            legend.extra.x = 0.3, 
            legend.extra.y = 0.8)
popViewport()