# Introduction to phybreak package

#### 2017-07-30

phybreak reconstructs who infected whom during an infectious disease outbreak, based on pathogen sequences taken from the infected hosts (patients, farms, …) and the time at which these sequences were taken. The model and MCMC sampling method are described in Klinkenberg et al (2017). Since that publication, some features have been added, which are described at the end of this vignette.

The general workflow is:

1. make a phybreakdata object with all data.
2. make a phybreak object with the data, giving all settings for the analysis.
3. run the MCMC-chain, first without sampling until the chain has converged, then with sampling until sufficient samples have been taken.
4. analyse the output and create a consensus transmission tree.

## Create a phybreakdata object

The method takes for each host a sampling day and a DNA sequence. The sequences have to be aligned, and can be provided in DNAbin (package ape}), phyDat (package phangorn), or matrix format (each host on a row, lower-case letters for nucleotides). The sampling times can either be numeric or Date. The following data are from an outbreak of foot-and-mouth disease (FMD) in the UK in 2007, and were published by Cottam et al. (2008):

library(phybreak)

names(sequences_FMD2007) <- c("IP1b/1", "IP1b/2", "IP2b", "IP2c",
"IP3b", "IP3c", "IP4b", "IP5", "IP6b",
"IP7", "IP8")
dates_FMD2007 <- as.Date(c("2007/08/03", "2007/08/04", "2007/08/06",
"2007/08/07", "2007/09/12", "2007/09/15",
"2007/09/13", "2007/09/17", "2007/09/21",
"2007/09/24", "2007/09/29"))

# these data are also included with the package
data("FMD_2007")
sequences_FMD2007 <- FMD_2007$sequences dates_FMD2007 <- FMD_2007$dates

These sequences are in DNAbin format.

A phybreakdata object is created by calling phybreakdata, with the sequences and sample times as arguments.

dataset_FMD2007 <- phybreakdata(sequences = sequences_FMD2007,
sample.times = dates_FMD2007)
#> Warning in phybreakdata(sequences = sequences_FMD2007, sample.times =
#> dates_FMD2007): all nucleotide ambiguity codes are turned into n

There may be a warning saying that the sequence data contains letters other than a, c, g, or t, which are most likely ambiguity codes. Currently, phybreak treats all these codes as missing (as code n, i.e. any nucleotide).

head(dataset_FMD2007)
#> $sequences #> 11 sequences with 8176 character and 30 different site patterns. #> The states are a c g t #> #>$sample.times
#>       IP1b/1       IP1b/2         IP2b         IP2c         IP3b
#> "2007-08-03" "2007-08-04" "2007-08-06" "2007-08-07" "2007-09-12"
#>         IP4b         IP3c          IP5         IP6b          IP7
#> "2007-09-13" "2007-09-15" "2007-09-17" "2007-09-21" "2007-09-24"
#>          IP8
#> "2007-09-29"
#>
#> $sample.hosts #> IP1b/1 IP1b/2 IP2b IP2c IP3b IP4b IP3c IP5 #> "IP1b/1" "IP1b/2" "IP2b" "IP2c" "IP3b" "IP4b" "IP3c" "IP5" #> IP6b IP7 IP8 #> "IP6b" "IP7" "IP8" The phybreakdata object contains three elements: sequences (in class phyDat), sample times and sample hosts. The last element is a vector giving the hosts from which the samples were taken. Because no separate host names were entered (through argument host.names), the sample names were copied into this vector. It is also possible to separately enter sample.names, which will override the names with the sequences or sample times data. It is also possible to provide the sequence data in matrix format, such as these data from an FMD outbreak in 2001 Cottam et al. (2008): # load the data data("FMD_2001") # extract names and dates by splitting the rownames namesdates_FMD2001 <- strsplit(rownames(FMD_2001), "_") #split the rownames names_FMD2001 <- sapply(namesdates_FMD2001, '[', i = 1) #1st element is name dates_FMD2001 <- as.numeric(sapply(namesdates_FMD2001, '[', i = 2)) #2nd element is date # make phybreakdata object dataset_FMD2001 <- phybreakdata(sequences = FMD_2001, sample.times = dates_FMD2001, sample.names = names_FMD2001) #> Warning in phybreakdata(sequences = FMD_2001, sample.times = #> dates_FMD2001, : names in sequences don't match sample.names and are #> therefore overwritten In addition to the warning about the nucleotide codes, there is now another warning telling you that the provided sample names do not match the names in the sequence and time data. head(dataset_FMD2001) #>$sequences
#> 15 sequences with 8196 character and 58 different site patterns.
#> The states are a c g t
#>
#> $sample.times #> A K B L N O C F P D G M I J E #> 40 41 42 50 52 54 60 63 70 81 84 87 98 104 63 #> #>$sample.hosts
#>   A   K   B   L   N   O   C   F   P   D   G   M   I   J   E
#> "A" "K" "B" "L" "N" "O" "C" "F" "P" "D" "G" "M" "I" "J" "E"

Note that the names in the sample.times element are indeed the names from names_FMD2001.

## Create a phybreak object

With a phybreakdata object we can make a phybreak object, which requires specification of the model that will be used for analysis. The complete model has four submodels for which choices have to be made in creating the phybreak object.

1. The transmission model. This model consists of a generation interval distribution. The generation interval is the time between infection of a primary case and a secondary case by this primary case. It is assumed that this interval follows a gamma distribution with mean gen.mean and shape parameter gen.shape, for both of which values should be provided. Whereas the shape will be fixed during the analysis, the mean can be estimated (est.gen.mean = TRUE), for which a prior can be provided with mean prior.mean.gen.mean and standard deviation prior.mean.gen.sd.
2. The sampling model. This model consists of a sampling interval distribution. The sampling interval is the time between infection and sampling of a case. This interval also follows a gamma distribution, with mean sample.mean and shape parameter sample.shape. As with the generation interval, the mean can be estimated (est.sample.mean = TRUE), with prior mean prior.mean.sample.mean and standard deviation prior.mean.sample.sd.
3. The within-host model. This model describes the size of the within-host pathogen population as a function of the time since infection of a host. This model is used for the coalescent process within hosts, i.e. how the phylogenetic minitrees in all hosts have arisen. The first choice is which model to use, through the argument wh.model, which can be 1, 2, or 3. In model 1, it is assumed that there is only a single lineage present within a host at any time after infection. As a consequence, coalescence takes place at the time of transmission to another host, or at sampling. In model 2, it is assumed that the within-host pathogen population is infinite, but that there is a strict bottleneck at transmission. As a consequence, all coalescent nodes are placed ‘just after’ infection of a host. In model 3, it is assumed that the within-host pathogen population grows linearly with slope wh.slope. An initial value should be provided, and it is possible to estimate this parameter (est.wh.slope = TRUE) with a gamma prior distribution with mean prior.wh.mean and shape prior.wh.shape.
4. The mutation model. A Jukes-Cantor model is assumed for mutation, with mutation rate mu. An initial value can be given, and mu is always estimated with a uniform distribution for log(mu).

More detail about these models is given in Klinkenberg et al. (2017). The default settings are to estimate all estimable parameters and set the shape parameters of generation and sampling intervals at 3, creating pretty wide and therefore not very informative distributions. If you do have some information on these distributions (especially the sampling interval distribution) from other sources, that can significantly improve performance.

To carry out the analysis of the FMD-2007 outbreak, as in Klinkenberg et al. (2017), we will create a phybreak object as follows:

phybreak_FMD2007 <- phybreak(dataset = dataset_FMD2007)

plotTrans(phybreak_FMD2007)

plotPhylo(phybreak_FMD2007)

Because phybreak creates random initial phylogenetic and transmission trees, plots will look slightly different each time you re-run this code. The first plot shows the initial transmission tree, the second plot the phylogenetic tree, in which a change of color indicates a transmission event. The default initial sampling and generation intervals are 1, which is too short for foot-and-mouth disease. As a result, the transmission tree is initialized as a chain of subsequent infections. By changing the initial condition to a value that is more realistic, you may decrease the required time for the MCMC chain to converge (shorter burn-in). For instance, by starting with a mean of 14 days for both intervals:

phybreak_FMD2007_2 <- phybreak(dataset = dataset_FMD2007, gen.mean = 14, sample.mean = 14)

plotTrans(phybreak_FMD2007_2)

plotPhylo(phybreak_FMD2007_2)

The created phybreak object is a list with 5 slots: the data (d), the variables describing the tree topology (v), the parameters (p), additional helper information for running the MCMC-chain, such as prior distributions (h), and the samples from the posterior distribution (s).

## Run the MCMC-chain

With the created phybreak object we can now sample from the posterior by running an MCMC-chain. There are two functions available for that, burnin.phybreak and sample.phybreak. The difference is that the former runs the chain but only returns the phybreak object with a changed current state (the slot with posterior samples (s) is returned unchanged), whereas the latter also samples from the chain and returns these samples. Because just after initialization the MCMC chain still has to converge, you can start without sampling from the chain:

phybreak_FMD2007 <- burnin.phybreak(phybreak_FMD2007, ncycles = 5000)
#> keepphylo = 0.2
#>    cycle      logLik         mu  gen.mean  sam.mean parsimony (nSNPs = 27)
#>     1589   -11689.84   2.24e-05      7.63      11.3        27
#>     3320   -11690.34   1.64e-05      8.49      11.8        27

Approximately every 10 seconds, information is given about the current state of the chain, such as how many update cycles have passed, the log-likelihood, the mutation rate, the mean generation and sampling intervals, and the parsimony score of the phylogenetic tree. To start with the latter, the parsimony score is the (minimum) number of mutations that can explain (the current state of) the phylogenetic tree. In the initial phase of the chain the parsimony score will generally decrease, but it can never become lower than the number of SNPs in the dataset (shown between brackets). The printed output can be used to get a rough idea of convergence: as long as the parameters increase or decrease systematically, or the log-likelihood increases or parsimony score decreases, the chain has not converged. (Conversely, if they do seem to have stabilized, convergence is not certain and should always be checked after sampling!). As you can see from this output, because the FMD dataset is small, it seems to converge quickly even if initialized with unrealistic mean generation and sampling intervals of 1 day. You can check the current state by plotting the trees or through get.parameters:

get.parameters(phybreak_FMD2007)
#>           mu  mean.sample     mean.gen     wh.slope
#> 2.068144e-05 1.058950e+01 7.128172e+00 1.077374e+00

When you do want to sample from the chain, you need to specify the number of samples and optionally a thinning interval. The default thinning interval is 1, which means that after every cycle (which updates all parameters once, and the trees once for every host in the dataset), the current state is stored. It is also possible to thin after running the chain (or not at all), but you should never use different thinning intervals with a single phybreak object.

phybreak_FMD2007 <- sample.phybreak(phybreak_FMD2007, nsample = 25000)
#> keepphylo = 0.2
#>   sample      logLik         mu  gen.mean  sam.mean parsimony (nSNPs = 27)
#>     1643    -11690.3   2.85e-05      6.61      9.82        27
#>     3343   -11692.55   2.52e-05      7.52      11.8        27
#>     5025   -11686.34   2.91e-05      5.95      8.11        27
#>     6722   -11689.54    2.1e-05      5.84      8.61        27
#>     8415   -11690.13   2.21e-05      10.2      9.34        27
#>    10095   -11688.87   2.22e-05      6.23      11.9        27
#>    11791   -11683.93    2.7e-05      5.87      8.57        27
#>    13473   -11689.63   2.74e-05      9.35         9        27
#>    15132   -11701.27   2.98e-05      12.2      11.4        27
#>    16820   -11678.43   4.11e-05      6.64      9.99        27
#>    18515   -11676.89   3.18e-05       7.3      6.75        27
#>    20200   -11686.86   4.56e-05      8.25      8.02        27
#>    21894   -11684.63   1.99e-05       5.7      9.57        27
#>    23570   -11694.47   3.28e-05      12.9      9.82        27

## Analyse the output

After running sample.outbreak, the phybreak object contains samples from the MCMC-chain. We can use the functionality of the coda package to analyse the chain, e.g. making trace plots, calculating effective sample sizes (ESS), and calculating summary statistics of the parameters and (continuous) variables:

library(coda)
mcmc_FMD2007 <- get.mcmc(phybreak_FMD2007)

effectiveSize(mcmc_FMD2007)
#>              mu              mS              mG           slope
#>        471.3680        158.3283       1437.7575        630.9731
#>     tinf.IP1b/1     tinf.IP1b/2       tinf.IP2b       tinf.IP2c
#>       2153.9484       1504.6650        913.8980       1219.6180
#>       tinf.IP3b       tinf.IP4b       tinf.IP3c        tinf.IP5
#>        444.5503        528.2760        441.9709        119.6333
#>       tinf.IP6b        tinf.IP7        tinf.IP8 infector.IP1b/1
#>        427.4344        696.9914        645.4573       4583.9410
#> infector.IP1b/2   infector.IP2b   infector.IP2c   infector.IP3b
#>       5388.8598       3477.8341       5921.0758       1063.7995
#>   infector.IP4b   infector.IP3c    infector.IP5   infector.IP6b
#>        591.9112       1294.6390        634.5000       2488.8959
#>    infector.IP7    infector.IP8          logLik
#>       2650.3230       3230.8465        572.8704

The ESSs are generally large enough (i.e. >200), with the exception of the mean sampling interval and the infection time of farm IP5, which have ESSs between 100 and 200. Note that if you run the code yourself, these numbers will be different. Another test is to plot the chain, e.g. for the mean sampling interval:

plot(mcmc_FMD2007[, "mS"])

To increase the ESS you could take more samples by running sample.phybreak again; the MCMC chain will continue where it stopped. For now, we’ll stick with the current chain. The mcmc object can also be used to get summary statistics of the parameters and variables:

summary(mcmc_FMD2007[, c("mu", "mS", "mG", "slope")])
#>
#> Iterations = 1:25000
#> Thinning interval = 1
#> Number of chains = 1
#> Sample size per chain = 25000
#>
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#>
#>            Mean        SD  Naive SE Time-series SE
#> mu    2.833e-05 6.997e-06 4.425e-08      3.223e-07
#> mS    1.017e+01 3.048e+00 1.928e-02      2.422e-01
#> mG    7.718e+00 1.986e+00 1.256e-02      5.239e-02
#> slope 1.070e+00 5.728e-01 3.623e-03      2.280e-02
#>
#> 2. Quantiles for each variable:
#>
#>            2.5%       25%       50%       75%     97.5%
#> mu    1.666e-05 2.318e-05 2.769e-05 3.265e-05 4.398e-05
#> mS    5.382e+00 7.988e+00 9.801e+00 1.190e+01 1.726e+01
#> mG    4.794e+00 6.338e+00 7.401e+00 8.727e+00 1.254e+01
#> slope 2.766e-01 6.478e-01 9.690e-01 1.390e+00 2.461e+00

The summary statistics of the model parameters are useful (so I selected only these), but summaries of the variables are better obtained through specific phybreak functions. The reason is that infection times in the mcmc object are only relative to the first sampling day, and that infectors are categorical variables and should be considered within the context of the complete tree. From the summary you see that the median sampling interval is 9.8 days, and the median generation interval 7.4 days. The prime summaries from the analysis are the consensus transmission trees: who was infected by whom, obtained through transtree. The naive way would be to choose for each host (each farm in this case) the infector that is most sampled:

transtree(phybreak_FMD2007, method = "count")
#>        infector support inf.times.Q2.5 inf.times.Q50 inf.times.Q97.5
#> IP1b/1   IP1b/2 0.37196     2007-07-17    2007-07-27      2007-08-01
#> IP1b/2   IP1b/1 0.38988     2007-07-16    2007-07-27      2007-08-02
#> IP2b     IP1b/1 0.48248     2007-07-23    2007-08-01      2007-08-04
#> IP2c     IP1b/1 0.43044     2007-07-16    2007-07-29      2007-08-05
#> IP3b       IP4b 0.49016     2007-08-28    2007-09-06      2007-09-10
#> IP4b        IP5 0.47568     2007-08-21    2007-09-02      2007-09-10
#> IP3c       IP4b 0.54600     2007-08-20    2007-09-03      2007-09-12
#> IP5        IP2b 0.91000     2007-08-07    2007-08-19      2007-09-08
#> IP6b       IP3b 0.83664     2007-09-01    2007-09-12      2007-09-18
#> IP7        IP3b 0.58664     2007-09-07    2007-09-15      2007-09-21
#> IP8         IP7 0.67408     2007-09-08    2007-09-19      2007-09-26

The column support gives the proportion of posterior trees which had that identified infector. The three right-most columns give the median and 95% credible interval of the infection times (days). The problem with this method, is that you may get circular relations. E.g. in this case, farm IP1b/1 and farm IP1b/2 would have infected one another, which is of course impossible (note that in fact, these were two samples from the same farm). The Edmonds’ algorithm removes the cyclical relations as described in Klinkenberg et al. (2017). That results in the following transmission tree reconstruction:

transtree(phybreak_FMD2007, method = "edmonds")
#>        infector support inf.times.Q2.5 inf.times.Q50 inf.times.Q97.5
#> IP1b/1    index 0.36872     2007-07-17    2007-07-27      2007-08-01
#> IP1b/2   IP1b/1 0.38988     2007-07-16    2007-07-27      2007-08-02
#> IP2b     IP1b/1 0.48248     2007-07-23    2007-08-01      2007-08-04
#> IP2c     IP1b/1 0.43044     2007-07-16    2007-07-29      2007-08-05
#> IP3b       IP4b 0.49016     2007-08-28    2007-09-06      2007-09-10
#> IP4b        IP5 0.47568     2007-08-21    2007-09-02      2007-09-10
#> IP3c       IP4b 0.54600     2007-08-20    2007-09-03      2007-09-12
#> IP5        IP2b 0.91000     2007-08-07    2007-08-19      2007-09-08
#> IP6b       IP3b 0.83664     2007-09-01    2007-09-12      2007-09-18
#> IP7        IP3b 0.58664     2007-09-07    2007-09-15      2007-09-21
#> IP8         IP7 0.67408     2007-09-08    2007-09-19      2007-09-26

Now IP1b/1 is identified as the index case (first case of the outbreak, infected from outside), and the support for this (0.369) is only slightly lower than the support for infection by IP1b/2 (0.372). This tree can also be plotted:

plotTrans(phybreak_FMD2007, plot.which = "edmonds")

The plot shows who was infected by whom according to the Edmonds’ consensus tree, with coloured arrows indicating the support. See help(plotTrans) for information about the colours and about all options to adjust the plot to your taste. The grey shapes indicate the median posterior generation interval distribution. The Edmonds’ tree creates a correct transmission tree which maximal total support, but is not related to a matching phylogenetic tree. The maximum parent credibility (mpc) tree is that tree among the sampled trees with maximum total support. Because it is one of the (in our case 25000) sampled trees, it does come with a matching phylogenetic tree. In our example, the mpc tree is the same as the Edmonds’ tree:

transtree(phybreak_FMD2007, method = "mpc")
#>        infector support inf.times.Q2.5 inf.times.Q50 inf.times.Q97.5
#> IP1b/1    index 0.36872     2007-07-17    2007-07-27      2007-08-01
#> IP1b/2   IP1b/1 0.38988     2007-07-16    2007-07-27      2007-08-02
#> IP2b     IP1b/1 0.48248     2007-07-23    2007-08-01      2007-08-04
#> IP2c     IP1b/1 0.43044     2007-07-16    2007-07-29      2007-08-05
#> IP3b       IP4b 0.49016     2007-08-28    2007-09-06      2007-09-10
#> IP4b        IP5 0.47568     2007-08-21    2007-09-02      2007-09-10
#> IP3c       IP4b 0.54600     2007-08-20    2007-09-03      2007-09-12
#> IP5        IP2b 0.91000     2007-08-07    2007-08-19      2007-09-08
#> IP6b       IP3b 0.83664     2007-09-01    2007-09-12      2007-09-18
#> IP7        IP3b 0.58664     2007-09-07    2007-09-15      2007-09-21
#> IP8         IP7 0.67408     2007-09-08    2007-09-19      2007-09-26

With the mpc tree, it is not only possible to plot the transmission tree (with plotTrans or plot), but also the phylogenetic tree:

plotPhylo(phybreak_FMD2007, plot.which = "mpc")

Other available consensus trees are the maximum clade credibility tree (mcc, phylogenetic tree only). That method scores trees by looking at the clades (sets of tips descending from an internal node), and scores the clades by their frequencies in all sampled posterior trees; the tree score is the product of all clade scores. Finally, as an equivalent to the mcc tree, there is the maximum transmission cluster credibility score (mtcc, transmission and phylogenetic trees). That method scores trees by looking at transmission clusters, defined as hosts with all ‘progeny’ in the transmission tree. As far as I’m aware, the ‘mtcc’ method had not been described in literature, and from what I’ve seen it tends to create many hosts without secondary infections, but it may give some insight into support for clusters of cases where it is uncertain which case was the first.

transtree(phybreak_FMD2007, method = "mtcc")
#>        infector support inf.times.Q2.5 inf.times.Q50 inf.times.Q97.5
#> IP1b/1    index 1.00000     2007-07-17    2007-07-27      2007-08-01
#> IP1b/2   IP1b/1 0.38376     2007-07-16    2007-07-27      2007-08-02
#> IP2b     IP1b/1 0.88452     2007-07-23    2007-08-01      2007-08-04
#> IP2c     IP1b/1 0.61172     2007-07-16    2007-07-29      2007-08-05
#> IP3b       IP4b 0.83320     2007-08-28    2007-09-06      2007-09-10
#> IP4b        IP5 0.90892     2007-08-21    2007-09-02      2007-09-10
#> IP3c       IP4b 0.52140     2007-08-20    2007-09-03      2007-09-12
#> IP5        IP2b 0.95524     2007-08-07    2007-08-19      2007-09-08
#> IP6b       IP3b 0.72704     2007-09-01    2007-09-12      2007-09-18
#> IP7        IP3b 0.83488     2007-09-07    2007-09-15      2007-09-21
#> IP8         IP7 0.76584     2007-09-08    2007-09-19      2007-09-26

Note that support for the index case is 1 by definition, because the index case plus all progeny is the complete outbreak. Finally, infectorsets gives you more sampled infectors for each host (or a subset), ordered by support. To leave out barely supported infectors you can set a threshold support for inclusion per infector and/or a cumulative support percentile for all infectors:

infectorsets(phybreak_FMD2007, which.hosts = "all", percentile = 0.95, minsupport = 0)
#> $IP1b/1 #> infector support #> 1 IP1b/2 0.37196 #> 2 index 0.36872 #> 3 IP2c 0.21924 #> #>$IP1b/2
#>   infector support
#> 1   IP1b/1 0.38988
#> 2    index 0.37484
#> 3     IP2c 0.19816
#>
#> $IP2b #> infector support #> 1 IP1b/1 0.48248 #> 2 IP1b/2 0.31192 #> 3 IP2c 0.17860 #> #>$IP2c
#>   infector support
#> 1   IP1b/1 0.43044
#> 2   IP1b/2 0.30132
#> 3    index 0.22944
#>
#> $IP3b #> infector support #> 1 IP4b 0.49016 #> 2 IP3c 0.31160 #> 3 IP5 0.09764 #> 4 IP6b 0.08112 #> #>$IP4b
#>   infector support
#> 1      IP5 0.47568
#> 2     IP3c 0.37072
#> 3     IP3b 0.08272
#> 4     IP2b 0.06380
#>
#> $IP3c #> infector support #> 1 IP4b 0.54600 #> 2 IP5 0.34556 #> 3 IP3b 0.06720 #> #>$IP5
#>   infector support
#> 1     IP2b 0.91000
#> 2     IP4b 0.03448
#> 3     IP3c 0.02468
#>
#> $IP6b #> infector support #> 1 IP3b 0.83664 #> 2 IP4b 0.05384 #> 3 IP7 0.04752 #> 4 IP3c 0.03560 #> #>$IP7
#>   infector support
#> 1     IP3b 0.58664
#> 2      IP8 0.23416
#> 3     IP6b 0.16172
#>
#> \$IP8
#>   infector support
#> 1      IP7 0.67408
#> 2     IP3b 0.24696
#> 3     IP6b 0.07040

## Further information

The above text gives the main workflow for outbreak analysis of one dataset with phybreak. Not all functions have been discussed. See help(get.phybreak) for functions to extract elements from a phybreak object; help(logLik.phybreak) to obtain the log-likelihood for the different submodels; help(plotTrans) and help(plotPhylo) for more grip on the plots; help(phylotree) to obtain the consensus phylogenetic trees; and help(thin.phybreak) for thinning the MCMC chain. Finally, the phybreak package also allows simulation of outbreaks with complete datasets as output. You can use this to investigate identifiability of parameters and transmission trees for particular scenarios. See help(sim.phybreak) for more information.

Since publication of Klinkenberg et al. (2017) I have worked on extending the package. The main feature currently available is the possibility to deal with multiple samples per host. If you have such data, you need to define in phybreakdata for each sample from which host it was taken. In the foot-and-mouth data, two samples were actually taken from the same farm. Then, the complete analysis would go as follows:

hosts_FMD2007 <- c("IP1b", "IP1b", "IP2b", "IP2c",
"IP3b", "IP3c", "IP4b", "IP5", "IP6b",
"IP7", "IP8")

dataset_FMD2007 <- phybreakdata(sequences = sequences_FMD2007,
sample.times = dates_FMD2007,
host.names = hosts_FMD2007)
phybreak_FMD2007 <- phybreak(dataset = dataset_FMD2007)
phybreak_FMD2007 <- burnin.phybreak(phybreak_FMD2007, ncycles = 5000)
phybreak_FMD2007 <- sample.phybreak(phybreak_FMD2007, nsample = 25000)
transtree(phybreak_FMD2007, method = "edmonds")
#>      infector support inf.times.Q2.5 inf.times.Q50 inf.times.Q97.5
#> IP1b    index 0.61152     2007-07-17    2007-07-26      2007-08-01
#> IP2b     IP1b 0.70492     2007-07-22    2007-07-31      2007-08-04
#> IP2c     IP1b 0.60252     2007-07-16    2007-07-29      2007-08-04
#> IP3b     IP4b 0.49240     2007-08-27    2007-09-06      2007-09-10
#> IP4b      IP5 0.48708     2007-08-18    2007-09-02      2007-09-10
#> IP3c     IP4b 0.56248     2007-08-17    2007-09-04      2007-09-12
#> IP5      IP2b 0.95092     2007-08-02    2007-08-19      2007-09-05
#> IP6b     IP3b 0.82836     2007-08-31    2007-09-12      2007-09-18
#> IP7      IP3b 0.56900     2007-09-07    2007-09-15      2007-09-21
#> IP8       IP7 0.66356     2007-09-07    2007-09-18      2007-09-26
plotTrans(phybreak_FMD2007, plot.which = "edmonds")