ADMUR: Ancient Demographic Modelling Using Radiocarbon

This vignette provides a comprehensive guide to modelling population dynamics using the R package ADMUR, and accompanies the publication ‘Directly modelling population dynamics in the South American Arid Diagonal using 14C dates’, Philosophical Transactions B, 2020, A. Timpson et al. https://doi.org/10.1098/rstb.2019.0723

Throughout this vignette, R code blocks often use objects created earlier in the vignette in previous code blocks. However, the manual for each function provides examples with self sufficient R code blocks.

The motivation for creating the ADMUR package is to provide a robust framework to infer population dynamics from radiocarbon datasets, given the uncontroversial assumption that (to a first order of approximation) the archaeological record contains more dateable anthropogenic material from prehistoric periods when population levels were greater. Unfortunately, the spatiotemporal sparsity of radiocarbon data conspires with the wiggly nature of the calibration curve to encourage the overinterpretation of such datasets, often leading to colourful but statistically unjustified interpretations of population dynamics. No statistical method can (or ever will) be able to perfectly reconstruct the true population dynamics from such a dataset. ADMUR is no exception to this, but provides tools to infer a plausible yet conservative reconstruction of population dynamics.

The ADMUR package can be installed directly from the CRAN in the usual way:

`install.packages('ADMUR')`

Alternatively it can be installed from GitHub, after installing and loading the ‘devtools’ package on the CRAN:

```
install.packages('devtools')
library(devtools)
install_github('UCL/ADMUR')
```

Either way, the ADMUR package can then be locally loaded:

`library(ADMUR)`

A summary of the available help files and data sets included in the package can be browsed, which include a terrestrial anthropogenic ^{14}C dataset from the South American Arid Diagonal:

```
help(ADMUR)
help(SAAD)
```

Datasets must be structured as a data frame that include columns ‘age’ and ‘sd’, which represent the uncalibrated ^{14}C age and its error, respectively.

`1:5,1:8] SAAD[`

```
## UniqID site lat long age sd LabNo Material_D
## 1 237 Villa del Mar -17.62295 -71.34017 6360 60 Beta-71133 bone
## 2 240 Yara -17.51998 -71.36716 5020 60 Beta-80970 bone
## 3 505 El Ahogado -17.96667 -70.88333 3515 40 Pa-1769 charcoal
## 4 506 El Ahogado -17.96667 -70.88333 3535 60 Pa-1768 charcoal
## 5 507 El Ahogado -17.96667 -70.88333 3660 40 Pa-1789 charcoal
```

Citations are available as follows:

`citation('ADMUR')`

```
##
## To cite the ADMUR package in publications please use both the
## following:
##
## A. Timpson, R. Barberena, M. G. Thomas, C. Méndez, K. Manning.
## Directly modelling population dynamics in the South American Arid
## Diagonal using 14C dates. 2020. Philosophical Transactions of the
## Royal Society B. https://doi.org/10.1098/rstb.2019.0723
##
## ADMUR: Ancient Demographic Modelling Using Radiocarbon. Adrian
## Timpson. University College London. Research Department of Genetics,
## Environment and Evolution (GEE), Darwin Building, Gower Street,
## London, WC1E 6BT. 2020 https://github.com/UCL/ADMUR
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
```

The algorithm used by ADMUR to calculate model likelihoods of a ^{14}C dataset uses several functions to first calibrate ^{14}C dates. These functions are also intrinsically useful for ^{14}C date calibration or for generating a Summed Probability Distribution (SPD).

Generating a single calibrated date distribution or SPD requires either a two-step process to give the user full control of the date range and temporal resolution, or a simpler one step process using a wrapper function that automatically estimates a sensible date range and resolution from the dataset, performs the two step process internally, and plots the SPD.

- Use the function summedCalibratorWrapper()

```
data.frame( age=c(6562,7144), sd=c(44,51) )
data <- summedCalibratorWrapper(data) x <-
```

Notice the function assumes the data provided were all ^{14}C dates. However, if you have other kinds of date such as thermoluminescence you can specify this. Non-^{14}C types are assumed to be in calendar time, BP. You can also specify a particular calibration curve:

```
data.frame( age=c(6562,7144), sd=c(44,51), datingType=c('14C','TL') )
data <- summedCalibratorWrapper(data=data, calcurve=shcal20) x <-
```

Generating the SPD without the wrapper gives you more control, and requires a two-step process:

- Convert a calibration curve to a CalArray using the function makeCalArray()
- Calibrate the
^{14}C dates through the CalArray using the function summedCalibrator().

This is useful for improving computational times if generating many SPDs, for example in a simulation framework, since the CalArray needs generating only once.

```
data.frame( age = c(9144), sd=c(151) )
data <- makeCalArray( calcurve=intcal20, calrange=c(8000,13000) )
CalArray <- summedCalibrator(data, CalArray)
cal <-plotPD(cal)
```

The CalArray is essentially a two-dimensional probability array of the calibration curve, and can be viewed using the plotCalArray() function. Calibration curves vary in their temporal resolution, and the preferred resolution can be specified using the parameter **inc** which interpolates the calibration curve. It would become prohibitively time and memory costly if analysing the entire 50,000 year range of the calibration curve at a 1 year resolution (requiring a 50,000 by 50,000 array) and in practice the default 5 year resolution provides equivalent results to 1 year resolution for study periods wider than c.1000 years.

```
makeCalArray( calcurve=shcal20, calrange=c(5500,6000), inc=1 )
x <-plotCalArray(x)
```

It is worth noting that the algorithm used by this package to calibrate ^{14}C dates gives practically equivalent results to those from OxCal generated using oxcAAR and Bchron

However, there are two fringe circumstances where these software programs differ substantially: at the border of the calibration curve; and if a date has a large error.

Consider the real ^{14}C date [MAMS-13035] https://doi.org/10.1016/j.aeae.2015.11.003 age: 50524 +/- 833 BP calibrated through intcal13, which only extends to 46401BP. Bchron throws an error, whilst OxCal applies a one-to-one mapping between Conventional Radiocarbon (CRA) time and calendar time for any date (mean) beyond the range of the calibration curve. The latter is in theory a reasonable way to mitigate the problem, however OxCal applies this in a binary manner that can create peculiarities. Instead ADMUR gradually fades the calibration curve to a one-to-one mapping between the end of the curve and 60,000 BP.

A ^{14}C date is typically reported as a mean date with an error, which is often interpreted as representing a symmetric Gaussian distribution before calibration. However, a Gaussian has a non-zero probability at all possible years (between -\(\infty\) and +\(\infty\)), and therefore cannot fairly represent the date uncertainty which must be skewed towards the past. Specifically, if we consider the date in CRA time, it must have a zero probability of occurring in the future. Alternatively, if we consider the date as a ^{14}C/^{12}C ratio, it cannot be smaller than 1 (the present). Therefore ADMUR assumes a ^{14}C date error is lognormally distributed with a mean equal to the CRA date, and a variance equal to the CRA error squared. This naturally skews the distribution away from the present. In practice, this difference is undetectably trivial for typical radiocarbon errors since the lognormal distribution approximates a normal distribution away from zero. However, theoretically the differences can be large if considering dates with large errors that are close to the present.

A naive approach to generating an SPD as a proxy for population dynamics would be to sum all dates in the dataset, but a more sensible approach is to sum the SPDs of each phase. The need to bin dates into phases is an important step in modelling population dynamics to adjust for the data ascertainment bias of some archaeological finds having more dates by virtue of a larger research interest or budget.

Therefore phaseCalibrator() generates an SPD for each phase in a dataset, and includes a binning algorithm which provides a useful solution to handling large datasets that have not been phased. For example, consider the following 8 dates from 2 sites:

```
subset( SAAD, site %in% c('Carrizal','Pacopampa') )
data <-2:7] data[,
```

```
## site lat long age sd LabNo
## 1192 Carrizal -6.063056 -79.49806 3640 100 Beta-31075
## 1193 Carrizal -6.063056 -79.49806 4390 110 Beta-18920
## 1194 Carrizal -6.063056 -79.49806 4450 100 Beta-31073
## 1195 Carrizal -6.063056 -79.49806 4620 100 Beta-31074
## 1196 Carrizal -6.063056 -79.49806 4690 120 Beta-27417
## 1205 Pacopampa -6.200000 -79.01000 2385 155 SI-794
## 1206 Pacopampa -6.200000 -79.01000 2765 135 SI-792
## 1207 Pacopampa -6.200000 -79.01000 2855 95 SI-793
```

The data have not already been phased (do not include a column ‘phase’) therefore the default binning algorithm calibrates these dates into four phases. this is achieved by binning dates that have a mean ^{14}C date within 200 ^{14}C years of any other date in that respective bin. Therefore Pacopampa.1 comprises samples 1207 and 1206, Pacopampa.2 comprises sample 1205, Carrizal.1 comprises samples 1196 and 1195 and 1194 and 1193, and Carrizal.2 comprises sample 1192:

```
makeCalArray( calcurve=shcal20, calrange=c(2000,6000) )
CalArray <- phaseCalibrator(data=data, CalArray=CalArray)
x <-plotPD(x)
```

Finally, the distributions in each phase can be summed and normalised to unity. It is straight forward to achieve this directly from the dataframe created above:

```
as.data.frame( rowSums(x) )
SPD <-
# normalise
SPD/( sum(SPD) * CalArray$inc ) SPD <-
```

Alternatively, the wrapper function summedPhaseCalibrator() will perform this entire workflow internally:

```
summedPhaseCalibrator( data=data, calcurve=shcal20, calrange=c(2000,6000) )
SPD <-plotPD(SPD)
```

A CPL model lends itself well to the objectives of identifying specific demographic events. Its parameters are the (x,y) coordinates of the hinge points, which are the relative population size (y) and timing (x) of these events. Crucially, this package calculates model likelihoods (the probability of the data given some proposed parameter combination). This likelihood is used in a search algorithm to find the maximum likelihood parameters; to compare models with different numbers parameters to find the best fit without overfitting; in Monte-Carlo Markov Chain (MCMC) analysis to estimate credible intervals of those parameters; and in a goodness-of-fit test to check that the data is a typical realisation of the maximum likelihood model and its parameters.

Theoretically a calibrated date should be a continuous Probability Density Function (PDF), however in practice a date is represented as a discrete vector of probabilities corresponding to each calendar year, and therefore is a Probability Mass Function (PMF). This discretisation provides the advantage that numerical methods can be used to easily calculate relative likelihoods, provided the model is also discretised to the same time points.

A toy() model is provided to demonstrate how this achieved. First, we simulate a plausible ^{14}C dataset and calibrate it. The function simulateCalendarDates() automatically covers a slightly wider date range to ensure simulated ^{14}C dates are well represented around the edges:

```
set.seed(12345)
350
N <-
# randomly sample calendar dates from the toy model
simulateCalendarDates(toy, N)
cal <-
# Convert to 14C dates.
uncalibrateCalendarDates(cal, shcal20)
age <- data.frame(age = age, sd = 50, phase = 1:N, datingType = '14C')
data <-
# Calibrate each phase, taking care to restrict to the modelled date range with 'remove.external'
makeCalArray(shcal20, calrange = range(toy$year))
CalArray <- phaseCalibrator(data, CalArray, remove.external = TRUE) PD <-
```

The argument ‘remove.external = TRUE’ ensures any calibrated phases with less than 50% of their probability mass within the modelled date range are excluded, reducing the effective sample size from 350 to 303. This is a crucial step to avoid mischievous edge effects of dates outside the date range. Similarly, notice we constrained the CalArray to the modelled date range. These are important to ensure that we only model the population across a range that is well represented by data. To extend the model beyond the range of available data would be to assume the absence of evidence means evidence of absence. No doubt there may be occasions when this is reasonable (for example if modelling the first colonisation of an island that has been well excavated, and the period before arrival is evidenced by the absence of datable material), but more often the range of representative data is due to research interest, and therefore the logic of only including dates with at least 50% of their probability within the date range is that their true dates are more likely to be internal (within the date range) than external.

`print( ncol(PD) )`

`## [1] 303`

Finally we calculate the overall log likelihood of the model given the data:

`loglik(PD=PD, model=toy)`

`## [1] -2266.526`

For comparison, we can calculate the overall likelihood of a uniform model given exactly the same data. Intuitively this should have a lower likelihood, since our dataset was randomly generated from the non-uniform toy population history:

```
convertPars(pars=NULL, years=5500:7500, type='uniform')
uniform.model <-loglik(PD=PD, model=uniform.model)
```

`## [1] -2311.632`

And indeed the toy model is thirty nine million trillion times more likely than the uniform model:

`exp( loglik(PD=PD, model=toy) - loglik(PD=PD, model=uniform.model) )`

`## [1] 3.882277e+19`

Having established how to calculate the relative likelihood of any proposed model given a dataset, we can use any out-of-the-box search algorithm to find the maximum likelihood model. This first requires us to describe the PD of any population model in terms of a small number of parameters, rather than a vector of probabilities for each year.

We achieve this using the Continuous Piecewise Linear (CPL) model, which is defined by the (x,y) coordinates of its hinge points.

When performing a search for the best 3-CPL model coordinates (given a dataset), only five of these eight values are free parameters. The x-coordinates of the start and end (5500 BP and 7500 BP) are fixed by the choice of date range. Additionally, one of the y-coordinates must be constrained by the other parameters, since the total probability (area) must equal 1. As a result, an n-CPL model will have 2n-1 free parameters.

We use the function convertPars() to map our search parameters to their corresponding PD coordinates. This allows us to propose independent parameter values from a uniform distribution between 0 and 1, and convert them into coordinates that describe a corresponding CPL model PD. This parameter-to-coordinate mapping is achieved using a modified stick breaking Dirichlet process.

The Dirichlet Process (not to be confused with the Dirichlet distribution) is an algorithm that can break a stick (the x-axis date range) into a desired number of pieces, ensuring all lengths are sampled evenly. The length (proportion) of remaining stick to break is chosen by sampling from the Beta distribution, such that we use the Beta CDF (with \(\alpha\) = 1 and \(\beta\) = the number of pieces still to be broken) to convert an x-parameter into its equivalent x-coordinate value. We extend this algorithm for use with the CPL model by also converting y-parameters to y-coordinates as follows:

- Fix the y-value of the first hinge (H1, x = 5500 BP) to any constant (y = 3 is arbitrarily chosen since the mapping function below gives 3 for an average y-parameter of 0.5).
- Use the mapping function \(f(y) = (1/(1-y))^2)-1\) to convert all remaining y-parameters (between 0 and 1) to y-values (between 0 and +\(\infty\)).
- Calculate the total area, given the y-values and previously calculated x-coordinates.
- Divide y-values by the total area, to give the y-coordinates of the final PDF.

The parameters must be provided as a single vector with an odd length, each between 0 and 1 (y,x,y,x,…y). For example, a randomly generated 6-CPL model will have 11 parameters and 7 hinges:

```
set.seed(12345)
convertPars( pars=runif(11), years=5500:7500, type='CPL' )
```

```
## year pdf
## 1 5500.000 3.261064e-06
## 2 5950.539 4.771822e-07
## 3 6580.113 1.299434e-06
## 4 6929.120 3.426049e-06
## 5 7307.354 1.357384e-05
## 6 7395.293 1.031902e-02
## 7 7500.000 7.915814e-08
```

Note: The Area Breaking Algorithm is a heuristic that ensures all parameter space is explored and therefore the maximum likelihood parameters are always found. However, unlike the one-dimensional stick-breaking process, its mapping of random parameters to PD coordinates is not perfectly even, and we welcome ideas for a more elegant algorithm.

Any preferred search algorithm can be used. For example, the JDEoptim function from DEoptimR uses a differential evolution optimisation algorithm that performs very nicely for this application. We recommend increasing the default NP parameter to at least 20 times the number of parameters, and repeating the search to ensure consistency:

```
library(DEoptimR)
JDEoptim(lower = rep(0,5),
best <-upper = rep(1,5),
fn = objectiveFunction,
PDarray = PD,
type = 'CPL',
NP = 100,
Trace = TRUE)
```

```
convertPars(pars=best$par, years=5500:7500, type='CPL')
CPL <- summedPhaseCalibrator( data=data, calcurve=shcal20, calrange=c(5500,7500) )
SPD <-plotPD(SPD)
lines(CPL$year, CPL$pdf, lwd=2, col='firebrick')
legend(x=6300, y=max(CPL$pdf), cex=0.7, lwd=2, col='firebrick', bty='n', legend='best fitted 3-CPL')
text(x=CPL$year, y=CPL$pdf, pos=3, labels=c('H1','H2','H3','H4'))
```

The ADMUR function mcmc() uses the Metropolis-Hastings algorithm to search joint parameter values of an n-CPL model, given a the calibrated probability distributions of phases in a ^{14}C dataset (PDarray). In principle the starting parameters do not matter if burn is of an appropriate length, but in practice it is more efficient to start in a sensible place such as the maximum likelihood parameters:

` mcmc(PDarray=PD, startPars=best$par, type='CPL', N=100000, burn=2000, thin=5, jumps =0.025) chain <-`

The acceptance ratio (AR) and raw chain (before burn-in and thinning) can be sanity checked. Ideally we want the AR somewhere in the range 0.3 to 0.5 (this can be tuned with the ‘jumps’ argument), and the raw chain to resemble ‘hairy caterpillars’:

```
print(chain$acceptance.ratio)
par(mfrow=c(3,2), mar=c(4,3,3,1))
'steelblue'
col <-for(n in 1:5){
plot(chain$all.pars[,n], type='l', ylim=c(0,1), col=col, xlab='', ylab='', main=paste('par',n))
}
```