Finding Thematic Structure in the David Dataset

Stephen Woloszynek


We’ll start with a dataset from David et al., which contains 16S amplicon sequencing data from 2 individuals across 350 time points. It includes a sequence abundance table (similar to an OTU table, but of a Dada2 flavor), a dataframe of metadata, and a taxonomy table that were generated by a group of students for a class project, so if you are interested, the repository can be found here: Note that the David data was generated via the Dada2 pipeline:, with Silva as a reference database for identifying taxonomy.


The data is “lazyloaded,” so it’ll load into memory when called via simply typing DAVID.

            2202:DonorA 2202:DonorB

UBERON:feces 312 176 UBERON:saliva 258 0

This is also the case for the Gevers et al. data, which is available by typing GEVERS.


We’ll be utilizing topic modeling to identify useful structure in the data, so it may be obvious that we probably want to reshape our data into documents. We provide a function that easily preprocesses abundance data so it can be piped into our topic modeling wrappers. The function= prepare_data takes quite a few arguments. Two are required:

#>           00001 00002 00003 00004 00005
#> ERR531441  7232 44702     0  4562 17174
#> ERR531442  1807     0   756  5386  3899
#> ERR531443 20396 19382     0  9155 10203
#> ERR531444 14325   686     0   394 11702
#> ERR531445  3534  6751 13093  4323  3533

Those two arguments are enough to ultimately identify topics in an unsupervised (i.e., no sample information) fashion; however, there are still some arguments you should be aware of:

#>       Kingdom    Phylum          Class         Order          
#> 00001 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
#> 00002 "Bacteria" "Firmicutes"    "Clostridia"  "Clostridiales"
#> 00003 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
#> 00004 "Bacteria" "Firmicutes"    "Clostridia"  "Clostridiales"
#> 00005 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"

Now, if we have sample metadata available, and we’d like to infer relationships between sample metadata and topics, then we can pass them in as the following arguments:

#>                  ID Day         Site       Donor            Cal
#> ERR531441 ERR531441 233 UBERON:feces 2202:DonorA <not provided>
#> ERR531442 ERR531442 169 UBERON:feces 2202:DonorB <not provided>
#> ERR531443 ERR531443 138 UBERON:feces 2202:DonorA     1124.26268
#> ERR531444 ERR531444 267 UBERON:feces 2202:DonorA <not provided>
#> ERR531445 ERR531445  96 UBERON:feces 2202:DonorB <not provided>

The ability to utilize metadata was the primary reason we chose to use the Structural Topic Model (STM) as our topic model package (and topic model approach for that matter) of choice – see The formula determines the way in which our sample metadata will influence our samples over topics distribution and hence will allow to make inferences with respect to this relationship further down the pipeline. Briefly, the formula is similar to a regression equation, with each variable getting its own weight to be estimated by the model. The dependent variable (the output) is the probability of a sample containing a given topic, modeled by a multivariate logistic normal distribution. Once everything is fit, we can explore the covariate effects by focusing on these weight estimates in the posterior distribution.

Because time is the most important variable in the David dataset, we’ll use Days as a continuous covariate. Also, there are certainly differences in sample site (feces versus saliva), so we’ll set that as a factor. Now, microbiomes fluctuate over time, so we can hypothesize that the relationship between time and topic frequency will be nonlinear; that is, for a given sample, the probability of containing a particular topic will possibly “wiggle” as a function of time, as opposed to strictly increase or decrease. To account for such variation, we’ll use a spline. The STM has a user-friendly spline wrapper s(), which we’ll use here. Regarding that binary site factor, modeling it is straightforward; we’ll compare the frequency of topics given samples for one site versus the other. The direction of this comparison is what the refs argument is for in the prepare_data function. If we set refs to “UBERON:saliva,” then UBERON:feces will be set to 1 in our design matrix. Positive weight estimates for this site covariate implies increases probability of containing a given topic for UBERON:feces samples. While not critical in this example, it’s more important when dealing with multi-level factors, when we’d like to compare a bunch of treatments to a control group, for example.

Let’s run the prepare_date function. We’ll set cn_normalize to false since we are not dealing with OTUs (the lookup table adjusts for copy number as a function of OTU ID; we’ll normalize later in the pipeline since we are using Dada2 with Silva and not OTUs with GreenGenes.)

CLEAN <- prepare_data(otu_table=DAVID$ABUND,
                      formula=~Site + s(Day),

The output is a themetadata class object, containing our preprocessed data. This function performs the following:

This object can be directly passed into our STM wrapper, find_topics. The only thing we have to do is specify the number of topics, which we’ll keep small for the sake of time.


system.time(TOPICS <- find_topics(CLEAN,K=15))
#> Registered S3 methods overwritten by 'stm':
#>   method            from           
#>   makepredictcall.s themetagenomics
#>   predict.s         themetagenomics
#>    user  system elapsed 
#>  84.076   2.836  86.986

We now have an object of class topics. We can print some information by just typing its name:

#> A topics object containing a topic model with 15 topics, 745 samples and 1493 discrete taxa.

We can also visualize it by running vis(TOPICS), but let’s hold off for now. We can obtain the same visualization, but with additional information, if we now measure the effects of our covariates found in the formula we passed into prepare_data. We have a simple method to do so

system.time(TOPIC_EFFECTS <- est(TOPICS))
#>    user  system elapsed 
#>  16.516   0.420  16.946

Like our topics object, this effects object also prints out useful information when called:

#> An effects object containing topic effects information.
#> 2 total covariate(s):
#> factor   SiteUBERON:feces (N=745, N+=487)
#> numeric  Day (mean=169.11, median=163.00, std=100.97)
#> Topic weights outside 10%-90% uncertainty interval:
#> SiteUBERON:feces:    T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15
#> Day: T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15

This object contains all of our covariate information, ranging from estimates and uncertainty intervals to posterior predictions, depending on the type of covariate provided in the formula. Let’s go through some of these estimates to give you an idea of what is being estimated. We initially passed in Site and Day. Site is a binary factor, and we set its reference as “UBERON:feces.” Its effect estimate will result from exploring the posterior predictive distribution. For a given topic, we calculate the predicted topic probability given the weights estimated in our model, averaging across samples, but with \(X_{feces}=1\). We do the same for \(X_{feces}=0\), and then we calculate the difference between both predictions. This essentially measures the effect of site by comparing when the effect is “on” versus when it’s “off.” We average these prediction estimates across a bunch of posterior samples (which can be controls via the npoints argument in est), allowing us to obtain a mean, as well as “uncertainty intervals.”:

#>        estimate          10%         90%
#> T1   0.02942558  0.018316797  0.04100150
#> T2   0.08030180  0.064227548  0.09753639
#> T3   0.03294297  0.022171878  0.04488854
#> T4   0.01214722  0.002617126  0.02196663
#> T5   0.13850028  0.120143127  0.15773370
#> T6   0.02097989  0.013179266  0.02914373
#> T7   0.13715480  0.120522138  0.15362203
#> T8   0.05912621  0.050163746  0.06827866
#> T9   0.01015018  0.004219495  0.01647574
#> T10  0.15871532  0.137784249  0.18024631
#> T11  0.09944359  0.081426714  0.11750389
#> T12 -0.99034131 -0.994587535 -0.98595606
#> T13  0.06304571  0.053958686  0.07216377
#> T14  0.09214673  0.069227756  0.11566539
#> T15  0.05714455  0.040571777  0.07454943

The continuous covariate Day is treated somewhat differently, particularly because we used a spline. Our intention for these estimates was to guide the user during the visualizations. Say you had 100 topics, which one should you investigate first? We’ll, that was the idea. For splines, because there are potentially so many weights and because they’re nonlinear, it’s difficult to simply obtain a single predicted value or slope that’s meaningful. Instead, we calculate the predicted topic probabilities given samples from the posterior predictive distribution, spanning over the continuous covariates range. We then calculate the standard deviation of these predictions, giving us an idea of the variation in the nonlinear fit. This informs us which topics to explore first. Here are the results:

#>        estimate         10%         90%
#> T1  0.025506325 0.018675388 0.032226392
#> T2  0.028631102 0.020657008 0.037159408
#> T3  0.014962251 0.009733842 0.020779986
#> T4  0.018213034 0.013150961 0.023597713
#> T5  0.029865644 0.021769596 0.038015309
#> T6  0.019217310 0.013428274 0.025105037
#> T7  0.024020959 0.017576645 0.030667326
#> T8  0.016959005 0.012914066 0.021206060
#> T9  0.009572787 0.006262823 0.013418182
#> T10 0.054347029 0.044308528 0.064558294
#> T11 0.035858424 0.027139635 0.044565975
#> T12 0.005194069 0.002135625 0.008121778
#> T13 0.017392357 0.012612329 0.022702763
#> T14 0.084434015 0.071762685 0.097088061
#> T15 0.054936807 0.042797990 0.066888528

We have quite a few visualizations at our disposal. Using the method vis and declaring the type of visualization we prefer, we can deploy an interactive shiny application. We’ll start by simply looking at the taxa:


If we start at the top, there’s a set of blue, gray, and red points with (quite small) uncertainty intervals. These are simply the estimates and intervals we printed out above.

To its left, you should see that the covariate these points correspond to is UBERON:feces. Just to remind you, we set saliva as our reference level, hence the covariate name, and these points represent the weight estimates with topic probability as the dependent variable. Red and blue points indicate respective positive and negative values whose uncertainty intervals don’t enclose 0; hence, fecal samples are somewhat associated with the majority of topics, whereas salival samples are strongly associated with topic 12. If we shift our focus to the middle figures, this topic-covariate relationship should be clearer:

Note that this row of figures borrows heavily from LDAvis: The left figure ordinates the topics over samples distribution, showing the top 2 axes in terms of variation explained. The distance (or method) can be chosen, but defaults to Jensen Shannon. Here, we can see that the topic that had a negative estimate (blue), and hence was associated with salival samples, is quite dissimilar to the others. The figure to the right shows the overall frequency of the taxa in our dataset. If we click a topic of interest in the ordination figure, say topic 5, then the bar plot will adjust and show the relative frequency of topic 5 taxa (red bars) relative to these taxa sample-wide distribution:

If we adjust the the lambda slider, we can manipulate this figure further. The slider controls the relevancy, which in turn dictates the set of taxa shown. Relevancy is defined as follows: \[r := \lambda \log p(taxa|topic) + (1-\lambda) \log p(taxa|topic)/p(word)\]

So, at the default value of 1, we see the words with maximum probability in the words over topics distribution that estimated directly by the topic model. The figure will therefore show us the words most probable in the selected topic. If we move to the slider to say, 0.2 (see below), the we are weighting this probably by the overall probability of a given word. We’ll now see words that are not only highly probably within a given topic, but also highly probably in that topic relative to other topics. For more information, see

The bottom row has an interactive notework figure that shows topic correlations, which were estimated directly by the topic model through the multivariate logistic normal distribution. This figure indicates which topics co-occur across samples, with the size of the edges reflecting the degree of the positive correlation. The node colors and sizes are set to be consistent with the ordination plot:

We should now have a decent grasp of which topics are important with respect to site, which topics co-occur, which taxa co-occur, and which taxa associate with which topics. That’s a lot of information, but hopefully this shiny app makes it manageable. At this point, though, you might be asking about that Day covariate. This is a time series dataset after all. Well, themetagenomics has a better visualization for continuous covariates, so we’ll use that:


This app should look familiar. Again, we’ll start from the top. As mentioned before, these estimates have a different interpretation. Had we assumed a linear association between topic frequency and time, looking at the weights themselves (i.e., the slopes) would suffice, but we used a spline. We therefore show the Spearman correlation estimates between the posterior predictive distribution and the raw samples over topics distribution. This at least gives us an idea of where the nonlinear estimates map the best. Let’s select topic 14:

There’s a clear trend: the probability of a sample containing topic 14 is highest during the first half year and then becomes effectively 0. Now, we can take this a step further. When we performed our effects estimation, for a given continuous covariate, all factors are compared to their respective reference level (hence the need of expanding factors to dummy variables). Thus, the red line is the trend over time with all covariate weights held fixed (that is, averaged across all samples) and the weight for site set to 1, whereas the blue line (the reference) is when site is set to 0. Although the uncertainty is larger early, between days 40 and 175, there seems to be a discernible difference between the frequency of topic 14 across these samples. What does that mean taxonomically though? I mean, this is microbiome data! Well, if we scroll down, we can see a bunch of scatter plots. These are the true abundances (i.e., the abundances from the raw data) from highest probability taxa in topic 14, plotted as a function of Day. Each point represents a sample.

We now have some understanding of which taxa are associated with the trend we’re seeing in topic 14, specifically B. plebeius, B. uniformis, and B. crossotus. They each have high probability in the first half of the series, which abruptly plummets. We again have the ability to adjust lambda to adjust the relevancy. Let’s set it to 0.1.


We’ll shift gears now. We’re going to explore the predicted functional content in our topics. Because this data was fit with Dada2 using Silva, we have to use Tax4Fun and not PICRUSt. We can predict KEGG annotations by simply using the predict method on our topics object from before. Recall how we didn’t copy number normalize? Now we will.

tmp <- tempdir()
FUNCTIONS <- predict(TOPICS,reference='silva_ko',reference_path=tmp,cn_normalize=TRUE,sample_normalize=FALSE,scalar=25)
#>    user  system elapsed 
#>   1.840   0.064   1.908

This functions object contains all of the predicted KO terms for Silva annotated abundances. There are a few settings you can adjust when running Tax4Fun, so definitely experiment, but the argument worth noting is scalar. Typically, we pass counts into PICRUSt or Tax4Fun; however, since we are using topic frequencies, we need to scale our topic over taxa distribution into pseudocounts. The scalar argument dictates what we are going to scale the maximum abundances to. It defaults to 100, but for the sake of time (since these models can take a while to fit), we set it to 25.

Now that we have these predicted functions, we need to identify meaningful genes as a function of topic. We’ll do that via a multilevel model. We could try our luck using maximum likelihood (ML), but given the sparsity of this type of data, rarely do these models converge. Instead, we’ll initialize our target parameters using ML and then fit our model via Hameltonian Monte Carlo (HMC) using Stan. Like the other commands we’ve seen thus far, we use a method called est on our functions class. There are a few arguments we can adjust. First, we can choose the level of the KEGG hierarchy to use as our factor-levels in our pathway factor (my apologies for the multiple use of “level”). We’ll choose level 3 since level 2 is often too broad to capture any effects. Second, we can choose the priors. These default to t-distributions. We’ll go with something a little more tighter and hence informative to hopefully speed of convergence: 3 normal priors. We could also choose a Laplace prior, which effectively is the Bayesian lasso. Considering the high dimensionality of these pathway groupings, it is certainly worth trying.

It should be noted that we are choosing the number of chains (so the number of runs to perform from independent starting conditions) and also the number of cores (to run these chains in parallel). For this vignette, it’ll be set to 2, since most machines should be able to parallelize across 2 cores. It is recommended to use more cores if they are available, however.

FUNCTION_EFFECTS <- est(FUNCTIONS,level=3,iters=300,

Overall, running the that function takes about 20 minutes: 15 for the actual multilevel model, about 5 for calculate the initial values via ML, and another 5 to extract results. Nevertheless, prioritizing speed left us with a model that failed to converge: 16 parameters had Rhat values greater than 1.1. We’ll refit the model with more iterations, but pass in our current fit to speed through the time usually devoted to compiling. We’ll initialize at our last sample from our current fit. We’ll use the method resume and double our original number of iterations:

FUNCTION_EFFECTS2 <- resume(FUNCTION_EFFECTS,init_type='last',iters=600,chains=2,cores=2)

Finally, we can visualize our output in the same way we did before. For function visualizations, however, we need to pass in both the function effects object and the topic effects we fit before.


Again, we are guided by the scatter plot at the top. The second row contains a heatmap showing the topics over taxa distribution, which we were indirectly introduced to in previous plots. We can hover over clusters to identify the key bacteria. This heatmap simply allows us to tie the significant functions to specific clusters of co-occurring taxa. Also note we can adjust the distance metric used for clustering. Now, if we move down to the third row, we’ll see the function-topic heatmap:

This heatmap shows essentially the function-topic interactions from the hierarchical model we just fit. Since we have weights that vary as a function of topic and other weights that vary as a function of pathway, these weights show shifts in gene abundance specific to certain function-topic combinations. Yellow and blue imply high and low abundances of genes within a given function-topic combination, respectively. The crosses indicate whether the interaction weight enclosed 0 given the user defined uncertainty interval (95% in this case). If we click this cell, a table will generate below:

These are the genes found within this particular function-topic combination, ordered in terms of decreasing abundance. With this information, we can identify gene sets of interest in topics we have previously identified as important with respect to our target covariates. Note, however, that these interactions aren’t very meaningful since we chose such a small scalar; the predicted gene abundances are quite small.