# polymapR: linkage mapping in outcrossing polyploids

#### 2019-02-01

In this tutorial, we will go step by step through the basic steps of performing linkage analysis in polyploids using polymapR. Most of the theory behind the functions described here can be found in a recent Bioinformatics publication of Bourke et al. (2018) at https://doi.org/10.1093/bioinformatics/bty371. The example dataset is a simulated dataset of an hypothetical tetraploid outcrossing plant species with five chromosomes. Chromosomes have random pairing behaviour. The population consists of 207 individuals originating from an F1 cross between two non-inbred parents. Frequently occurring data imperfections like missing values, genotyping errors and duplicated individuals are taken into account.

## 1. A (very) short introduction to R and the polymapR package.

In R, a function can be called by typing function_name() in the console and pressing [Enter]. (Optional) arguments go within the parentheses, followed by a = sign and the value you want to give to the argument. Arguments are separated by a comma. All possible arguments and default values of a function are in the help file, which is called by typing ? followed by the function name. Getting the help file for the commonly used function seq looks like this:

?seq

And generating a sequence of numbers from 2 to 14 by an increment of 3 using seq looks like this:

seq(from = 2,
to = 14,
by = 3)
## [1]  2  5  8 11 14

Although copy pasting the commands from this document to your R console will do for this tutorial, we recommend to have a bit more experience in using R than above before using polymapR. This will allow you better to troubleshoot and check out data output. A good place to start would be http://tryr.codeschool.com or any of the plethora of R tutorials on the web.

The polymapR package is literally a collection of functions, each defined by their function name and arguments. It relies on dosage data of molecular markers, and therefore is logically used after the packages fitTetra or fitPoly. As for any project, it is best to first create a new folder which will be the main working directory for the mapping project. Set the working directory to your new folder (setwd() or use the drop-down menu) and copy the marker dosage file from fitTetra there.

## 2. Install polymapR

polymapR is available on CRAN at https://cran.R-project.org/package=polymapR which is probably where you accessed this vignette. As with other packages on CRAN, it can be installed directly from within R using the command:

install.packages("polymapR")

## 3. Logging function calls and output

Most exported function in polymapR have a log argument. The default setting is NULL. In that case, all messages are returned to the standard output, which means your R console if you are calling the functions interactively. However, if it is set to a file name (e.g. log.Rmd or log.txt) all function calls and output are written to that file. It automatically creates one if it does not exist yet. The log file is written in R markdown (http://rmarkdown.rstudio.com/). The advantage of markdown is that it is readable as plain text, but it can also easily be rendered into a nicely formatted html or pdf. In order to do so, give your logfile a .Rmd extension and use the knitr package to turn it into a pdf, docx or html.

## 4. Data importing

### 4.1 Reading in marker dosage data

We assume that the user has used fitTetra or some alternative method to convert signal intensity ratios into marker dosages. polymapR takes as its starting point such data, but we will impose some further quality checks before we can begin mapping (such as allowed numbers of missing values). We also need to check the markers for skewness (differences in the observed and expected marker segregation) which can cause problems later in the clustering. For this, we have incorporated a number of functions developed for the fitTetra package [1], and included in the currently-unpublished R package fitPolyTools developed by Roeland Voorrips.

###### But first, a note on polyploid terminology

In the polymapR package, and throughout this vignette, we often refer to terms like “simplex”, “duplex” or “nulliplex” etc. These terms describe the dosage scores of SNP markers, essentially the count of the number of “alternative” alleles present at a particular locus in an individual. A simplex x nulliplex marker is therefore a marker that for which the maternal dosage is 1 (mothers always come first) and paternal dosage is 0. To save space, we also refer to these markers as SxN markers, which is synonymous with 1x0 markers. Sometimes statements are made concerning simplex x nulliplex markers - it should be understood that these statements also usually apply to nulliplex x simplex markers - markers for which the segregating allele is carried by the father rather than the mother.

The first step is to import the marker data. The layout of the dataset should be marker name, parental scores followed by offspring scores (so-called wide data format):

Marker P1 P2 F1.001 F1.002..
mymarker1 1 0 1 1
mymarker2 2 4 3 2
mymarker3 0 3 2 2

In cases where the parents have been genotyped more than once, a consensus parental score is required. Missing parental scores are not allowed at this stage, and should have been imputed already if desired.

The data should be in some format that can be read easily into R, such as .csv, .dat or .txt file. In the example a .csv file is read in. For other file types check out ?read.delim.

ALL_dosages <- read.csv("tetraploid_dosages.csv",
stringsAsFactors = FALSE,
row.names = 1) #first column contains rownames
class(ALL_dosages)
## [1] "data.frame"

By default, read.csv returns data in a data frame. All polymapR functions that use dosage data only accept dosage data when the dosages are in a matrix. A matrix is a two-dimensional R object with all elements of the same type (e.g. character, integer, numeric). Markernames and individual names should be specified in rownames and columnnames respectively. To use the data with functions, the data should be converted from a data.frame in to a matrix by the as.matrix function:

ALL_dosages <- as.matrix(ALL_dosages)

class(ALL_dosages)
## [1] "matrix"
head(ALL_dosages[,1:5])
##             P1 P2 F1_001 F1_002 F1_003
## Zm_rs005505  2  2      2      3      2
## Zm_ts097822  1  2      1      1      2
## Ac_ws072452  1  0      1     NA      1
## Ac_ts073123  0  2      1     NA      2
## Ap_ws071152  3  1      3      2     NA
## St_rs076767  1  0      1      1      1
dim(ALL_dosages)
## [1] 3000  209

If you are testing the package and would like a sample dataset to work with, a set of tetraploid marker dosage data is provided within the package itself. First load the package and then the dosage data as follows:

library(polymapR)
data(ALL_dosages)

### 4.2 Checking for skewness

The next step before we proceed further is to check whether the marker scores in the F1 correspond to those expected according to the parental dosages. We do this using the checkF1 function (from fitPolyTools). However, we first load the polymapR package into the global environment:

library(polymapR)

Now run checkF1:

F1checked <- checkF1(dosage_matrix = ALL_dosages,parent1 = "P1",parent2 = "P2",
F1 = colnames(ALL_dosages)[3:ncol(ALL_dosages)],
polysomic = TRUE, disomic = FALSE, mixed = FALSE, ploidy = 4)

head(F1checked)
##   m  MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_NA P1 P2
## 1 1 Zm_rs005505       2       2    6   39   90   45    4    23  2  2
## 2 2 Zm_ts097822       1       2   14   79   83   10    0    21  1  2
## 3 3 Ac_ws072452       1       0  102   83    0    0    0    22  1  0
## 4 4 Ac_ts073123       0       2   27  121   37    0    0    22  0  2
## 5 5 Ap_ws071152       3       1    1   48   77   55    0    26  3  1
## 6 6 St_rs076767       1       0   92   88    0    0    0    27  1  0
##   bestfit frqInvalid_bestfit Pvalue_bestfit matchParent_bestfit
## 1 18I81_0             0.0000         0.9187                 Yes
## 2  1551_0             0.0000         0.4724                 Yes
## 3    11_0             0.0000         0.1624                 Yes
## 4   141_0             0.0000         0.4160                 Yes
## 5   121_1             0.0055         0.1165                 Yes
## 6    11_0             0.0000         0.7656                 Yes
##   bestParentfit frqInvalid_bestParentfit Pvalue_bestParentfit
## 1       18I81_0                   0.0000               0.9187
## 2        1551_0                   0.0000               0.4724
## 3          11_0                   0.0000               0.1624
## 4         141_0                   0.0000               0.4160
## 5         121_1                   0.0055               0.1165
## 6          11_0                   0.0000               0.7656
##   matchParent_bestParentfit q1_segtypefit q2_parents q3_fracscored
## 1                       Yes        1.0000          1        0.7222
## 2                       Yes        1.0000          1        0.7464
## 3                       Yes        1.0000          1        0.7343
## 4                       Yes        1.0000          1        0.7343
## 5                       Yes        0.9278          1        0.6860
## 6                       Yes        1.0000          1        0.6739
##   qall_mult qall_weights
## 1    0.7222       0.9383
## 2    0.7464       0.9436
## 3    0.7343       0.9410
## 4    0.7343       0.9410
## 5    0.6365       0.8901
## 6    0.6739       0.9275

There is the possibility to check for shifted markers (incorrectly genotyped) but we will not concern ourselves with that right now. The main arguments to be specified are what the parental identifiers are (parent1 and parent2), what the identifiers of the F1 population are (argument F1 - note we just take the column names of the dosages matrix from column 3 to the end), and which sort of inheritance model we wish to compare against (here we have specified polysomic to be TRUE and both disomic and mixed to be FALSE. Note that mixed allows for one parent to be fully disomic and the other fully polysomic, but does not refer to the situation of a mixed inheritance pattern within one parent, associated with segmental allopolyploidy - this is because we have no fixed expectation to compare with if this is the case). The argument ploidy specifies the ploidy level of parent 1, if parent 2 has a different ploidy (for instance in a tetraploid x diploid cross) then we can use the ploidy2 argument. The results are stored in the data.frame F1checked, which has quite a few columns. We are interested in seeing whether there is consistency between the parental scores and the F1. In general, if there is good consistency between the parental and offspring scores we will have a high value for qall_mult and qall_weights, which are aggregated quality measures of q1, q2 and q3. In the sample datatset provided with polymapR there are no skewed markers present. However, you may find it useful to examine the output of checkF1 and remove markers which show conflict between the parental and offspring scores, as these markers will likely cause issues in the later mapping steps.

## 5. Generate summary data

Now we summarise the marker data in terms of the different segregation types, number of missing values, number of inadmissible scores (which could be genotyping errors or possible double reduction scores). Run the summary function on the marker data as follows:

mds <- marker_data_summary(dosage_matrix = ALL_dosages,
ploidy = 4,
pairing = "random",
parent1 = "P1",
parent2 = "P2",
progeny_incompat_cutoff = 0.05)
## Calculating parental info...
## Checking compatability between parental and offspring scores...
##
## ####parental_info
##
##         P2_0   P2_1   P2_2   P2_3   P2_4
## -----  -----  -----  -----  -----  -----
## P1_0       0    312    227     49      3
## P1_1     300    625    356     67      4
## P1_2     230    323    175     29      2
## P1_3      66     65     30      5      0
## P1_4       2      5      3      0    122
##
## ####offspring_incompatible
##
##         P2_0   P2_1   P2_2   P2_3   P2_4
## -----  -----  -----  -----  -----  -----
## P1_0      NA   0.00   0.00   0.56   2.42
## P1_1    0.00   0.00   0.00   0.28   0.85
## P1_2    0.00   0.00   0.00   0.12   0.00
## P1_3    0.45   0.33   0.14   0.19     NA
## P1_4    0.97   0.48   0.64     NA   1.07
##
## ####Incompatible individuals:
##
## None
pq_before_convert <- parental_quantities(dosage_matrix = ALL_dosages,
las = 2)

##              1x0   2x0   3x0   4x0   0x1   1x1   2x1   3x1   4x1   0x2   1x2   2x2   3x2   4x2   0x3   1x3   2x3   3x3   0x4   1x4   2x4   4x4
## ----------  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
## frequency    300   230    66     2   312   625   323    65     5   227   356   175    30     3    49    67    29     5     3     4     2   122

If you want to know exactly what these functions are doing and what can be typed as an argument just use ?marker_data_summary and/or ?parental_quantities.

## 6. Convert marker dosages to simple segregations; remove non-segregating data

Simplex x nulliplex markers segregate in a 1:1 ratio in the offspring, but this segregation is also possible with triplex x nulliplex, triplex x quadriplex and simplex x quadriplex, and the segregating allele in each case comes from the same parent. We can therefore re-code all of these segregation types into the simplest 1 x 0 case, taking care that we handle possible double reduction scores correctly. A list of the conversions in tetraploid we apply which simplify the subsequent analysis and mapping work are:

 3x0,3x4,1x4 -> 1x0 0x3,4x3,4x1 -> 0x1 2x4 -> 2x0 4x2 -> 0x2 3x3 -> 1x1 3x1 -> 1x3 3x2 -> 1x2 2x3 -> 2x1

Note that some classes such as 2x2 cannot be converted into any other category, and we do not convert non-segregating classes such as 4x4 as these are of no further use for mapping or QTL analysis. To convert the marker data:

segregating_data <- convert_marker_dosages(dosage_matrix = ALL_dosages)
## Warning in convert_marker_dosages(dosage_matrix = ALL_dosages): There are dosages greater than 4 or less than 0 in the dataset.
##             If they include parental dosages, markers are removed from the output.
##             Otherwise the dosage is made missing.
##
## ####Marker dosage frequencies:
##
##         P2_0   P2_1   P2_2   P2_3
## -----  -----  -----  -----  -----
## P1_0       0    366    230      0
## P1_1     370    630    386    132
## P1_2     232    352    175      0
##
## markers not converted: 2615
##
## markers 1 parent converted: 193
##
## markers 2 parents converted: 192
##
## non-segregating markers deleted: 127
pq_after_convert <- parental_quantities(dosage_matrix = segregating_data)

##              1x0   2x0   0x1   1x1   2x1   0x2   1x2   2x2   1x3
## ----------  ----  ----  ----  ----  ----  ----  ----  ----  ----
## frequency    370   232   366   630   352   230   386   175   132

## 7. Quality checks on marker data

There are a number of quality checks which are needed before we begin mapping. For example, we do not want to include markers which have too many missing values (“NA” scores) as this may be an indication that the marker has performed poorly and may have many errors associated with it. We would also like to screen out individuals that have many missing values as this might be an indication that the DNA quality of that sample was poor, or duplicate individuals which give no extra information and should probably also be removed / combined. Because there are no fixed rules for how stringent we should be with quality checks (5% missing values ok? 10% missing values ok? 20%?), some visual aids are produced in order to help with these decisions.

### 7.1 Missing value rate per marker

We first screen the marker data for missing values using screen_for_NA_values. This function allows the user to remove rows or columns with a certain rate of missing values from the dataset. The dosage matrix, which is specified as segregating_data contains markers in rows and individuals in columns. In R, rows are usually specified by the number 1, and columns by 2. The argument margin lets you specify by margin number whether you want to screen markers (margin = 1) or individuals (margin = 2) If the argument cutoff is set to NULL, it takes user input on the rate of markers to be screened. In the example below cutoff is specified, but because decisions are made after data inspection, we recommend to decide on the cutoff value after inspection of the data by specifying cutoff = NULL. A threshold of 10% missing values is sensible, as a compromise between high-confidence marker data versus keeping a large proportion of the markers for mapping and QTL analysis. In general, the most problematic markers tend to have higher rates of missing values so any reasonable threshold here should be fine. If problems are encountered later in the mapping, it might be worthwhile to re-run this step with a more stringent (e.g. 5%) threshold.

screened_data <- screen_for_NA_values(dosage_matrix = segregating_data,
margin = 1, # margin 1 means markers
cutoff =  0.10,
print.removed = FALSE) 

##
## Number of removed markers :  1458
##
##
## There are now 1415 markers leftover in the dataset.

### 7.2 Missing value rate per individual

As in step 7.1 we can also visualise the rate of missing values per individual, which can help us to make a decision about whether certain individuals should be removed from the dataset before commencing the mapping. In order to screen the individuals for missing values we enter:

screened_data2 <- screen_for_NA_values(dosage_matrix = screened_data,
cutoff = 0.1,
margin = 2, #margin = 2 means columns
print.removed = FALSE)

##
## Number of removed individuals :  2
##
##
## There are now 205 individuals leftover in the dataset.

Again, it is not clear what an acceptable threshold for the rate of missing values should be, but we recommend a cut-off rate of 0.1 – 0.15 (10-15%) missing values allowed.

### 7.3 Duplicate individuals

It is also desirable to be able to check whether there are any duplicated individuals in the population as can sometimes happen (or due to mix-ups in the DNA preparation and genotyping). The function screen_for_duplicate_individuals does this for you. Note that a duplicate can remain in the dataset, but they add no further information for mapping or QTL analysis, and may even distort the analysis (by giving an inaccurate population size). As for the function screen_for_NA_values, cutoff can be set to NULL to accept user input. Especially for screen_for_duplicate_individuals it makes sense to make a decision on the cut off after inspection of the data and figures. To screen the data for duplicate individuals, enter:

screened_data3 <- screen_for_duplicate_individuals(dosage_matrix = screened_data2,
cutoff = 0.95,
plot_cor = T)
##
## Combining F1_060 & F1_081 into F1_060
##
## Combining F1_046 & F1_085 into F1_046
## Warning in screen_for_duplicate_individuals(dosage_matrix =
## screened_data2, : Multiple duplicates of single genotype identified at this
## threshold. Attempting to merge...

##
## Combining F1_032 & F1_100 into F1_032
##
## Combining F1_032 & F1_114 into F1_032
##
## Combining F1_079 & F1_101 into F1_079
##
## Combining F1_106 & F1_113 into F1_106
##
## Combining F1_112 & F1_199 into F1_112
##
## #### 7 individuals removed:
##
##      _        _        _        _
## [1,] "F1_081" "F1_085" "F1_100" "F1_114"
## [2,] "F1_101" "F1_113" "F1_199" ""
## _        _        _        _
## -------  -------  -------  -------
## F1_081   F1_085   F1_100   F1_114
## F1_101   F1_113   F1_199

Here we see that there were a number of duplicate pairs of individuals identified (red points in the output graph). Each duplicate pair is merged into a consensus individual (conflicts are made missing), keeping the name of the first individual. This is summarised in the output as well.

### 7.4 Duplicated markers

Finally, we can also check for duplicated markers (those which give us no extra information about a locus) and remove these from the dataset before mapping, using the screen_for_duplicate_markers function. This function returns a list with 2 items - the filtered_dosage_matrix which we are interested in now, and bin_list, which we are possibily interested in again later. It is a good idea to save the output of this function, as you might like to add back the duplicate markers after mapping.

screened_data4 <- screen_for_duplicate_markers(dosage_matrix = screened_data3)

##
|
|                                                                 |   0%
|
|=================================================================| 100%
## Marked and merged 3 duplicated markers
dim(screened_data4$filtered_dosage_matrix)  ## [1] 1412 200 filtered_data <- screened_data4$filtered_dosage_matrix 

Before we start the mapping, it is a good idea to run the summary function again, this time on the filtered_data object, so we have a clear record of the numbers of markers going forward to the mapping stage.

pq_screened_data <- parental_quantities(dosage_matrix = filtered_data)

##              1x0   2x0   0x1   1x1   2x1   0x2   1x2   2x2   1x3
## ----------  ----  ----  ----  ----  ----  ----  ----  ----  ----
## frequency    175   111   194   301   183   101   198    88    61

## 8. Simplex x nulliplex markers – defining chromosomes and homologues

The first step in actually mapping the markers is to define linkage groups for the markers. We use the simplex x nulliplex markers to define the linkage groups (chromosomes) and following that, the homologues (4 per linkage group are expected in a tetraploid). We use the LOD scores to cluster markers, which increases dramatically for tight coupling-phase estimates.

To estimate the recombination frequencies, LOD and phasing between all marker pairs of P1 enter:

SN_SN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
target_parent = "P1",
other_parent = "P2",
ploidy = 4,
pairing = "random"
)
###### A note on arguments and multi-core processing

There is much more to the linkage function. Checkout ?linkage to see the arguments that can be passed to it. One important feature is that it can run in parallel on multiple cores. This can be very time-efficient if you are working with a large number of markers. The number of cores to use can be specified using the ncores argument. Before doing so, first check the number of cores on your system using parallel::detectCores(). Use at least one core less than you have available to prevent overcommitment.

It is a good idea to first visualise the relationship between the recombination frequency and LOD score, which gives us some insight into the differences between the information content of the different phases, as well as highlighting possible problems in the data. This can be done by using the plot_linkage_df function:

plot_linkage_df(SN_SN_P1, r_max = 0.5)

In some situations, it is possible to identify outlying datapoints on this plot, which do not fit the expected relationship. We have implemented a function to check this for the case of pairs of simplex x nulliplex markers, as these are the markers that we use in the subsequent steps of clustering and identifying chromosome and homologue linkage groups. If you are unsure, you can run the SNSN_LOD_deviation function anyway to see whether your data fits the expected pattern:

P1deviations <- SNSN_LOD_deviations(linkage_df = SN_SN_P1,
ploidy = 4,
N = ncol(filtered_data) - 2, #The F1 population size
alpha = c(0.05,0.2),
plot_expected = TRUE,
phase="coupling")

Here we see there are a few “outliers” in the coupling pairwise data (identified by the stars) from the expected relationship between r and LOD (the bounds of which are shown by the dotted lines - these can be widened as necessary using the alpha argument for the upper and lower tolerances around the “true” line, usually in a trial and error approach). In our example dataset, these highlighted points are hardly outliers and we can proceed. In other datasets, you may want to remove these starred pairwise estimates before continuing with marker clustering, as it is possible they may cause unrelated homologue clusters to clump together and prevent a clear clustering structure from being defined. To do so, remove them from the linkage data frame as follows:

SN_SN_P1.1 <- SN_SN_P1[SN_SN_P1$phase == "coupling",][-which(P1deviations > 0.2),] We are now in a position to cluster the marker data into chromosomes using the cluster_SN_markers function. This function clusters markers over a range of LOD thresholds. In the example below this is over a range from LOD 3 to 10 in steps of 1 (LOD_sequence = c(3:10)). If the argument plot_network is set to TRUE this function will plot the network of the lowest LOD threshold (LOD 3 in this case), and overlays the clusters of the higher LOD scores. This setting is not recommended with large number of marker pairs. Your computer will most likely run out of memory. P1_homologues <- cluster_SN_markers(linkage_df = SN_SN_P1, LOD_sequence = seq(3, 10, 0.5), LG_number = 5, ploidy = 4, parentname = "P1", plot_network = FALSE, plot_clust_size = FALSE) ## Total number of edges: 894 The function cluster_SN_markers returns a list of cluster data frames, one data frame for each LOD threshold. The first plot shows two things - on the normal y-axis, the number of clusters over a range of LOD thresholds (x-axis) - and on the right-hand y-axis, the number of markers that do not form part of any cluster, which is shown by the blue dashed line. In the cluster_SN_markers function, we specify the ploidy (4 for a tetraploid) and the expected number of chromosomes (LG_number = 5) and therefore there are 4 x 5 = 20 homologue clusters expected. If we examine the second plot, we see how these clusters stay together or fragment as the linkage stringency (LOD score) increases. Here, we see clearly that at LOD 3.5, there are 5 clusters which all eventually divide into 4 sub-clusters each. This is the ideal scenario - we have identified both the chromosomal linkage groups and the homologues using only the simplex x nulliplex marker data. Note that it is the coupling-phase estimates which define the homologue clusters and the repulsion-phase estimates which provide the associations between these coupling-clusters (homologues). In practice, things might not always be so orderly. In noisy datasets with higher numbers of genotyping errors, or at higher ploidy levels, the repulsion-phase linkages between homologues will be lost among false positive linkages between un-linked markers. This means that all markers will likely clump together at lower LOD scores. You can manually check how many clusters there are at a specific LOD score, and how many markers are in a cluster: P1_hom_LOD3.5 <- P1_homologues[["3.5"]] t <- table(P1_hom_LOD3.5$cluster)
print(paste("Number of clusters:",length(t)))
## [1] "Number of clusters: 5"
t[order(as.numeric(names(t)))]
##
##  1  2  3  4  5
## 37 36 29 31 41
P1_hom_LOD5 <- P1_homologues[["5"]]
t <- table(P1_hom_LOD5$cluster) print(paste("Number of clusters:",length(t))) ## [1] "Number of clusters: 20" t[order(as.numeric(names(t)))] ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 13 9 10 4 12 7 9 8 3 13 6 7 4 10 6 11 17 6 6 12 Based on the plots (or the output as shown) you should be able to decide which LOD threshold best separates the data. In this example, at a LOD of 3.5 we identify chromosomes and anywhere between LOD 4.5 and LOD 7 we have identified our homologues (4 x 5 = 20 clusters). In this parent (P1) we have therefore solved the marker clustering problem without any difficulty. It remains to gather this into a structure that can be used at later stages. To do this, we need two LOD values which define chromosomal groupings (here, LOD_chm = 3.5) and homologue groupings (here, LOD_hom = 5). We can then use the define_LG_structure function with these arguments as follows (note the argument LG_number refers to the expected number of chromosomes for this species): LGHomDf_P1 <- define_LG_structure(cluster_list = P1_homologues, LOD_chm = 3.5, LOD_hom = 5, LG_number = 5) ## ## ####SxN Marker(s) lost in clustering step at LOD 5: ## ## |_ | ## |:-----------| ## |Ls_rs099908 | head(LGHomDf_P1) ## SxN_Marker LG homologue ## 1 Ac_ns035629 1 1 ## 3 Ac_ns045766 1 1 ## 7 Ac_ns096474 1 1 ## 22 Ac_ts072686 1 1 ## 24 Ac_ws013795 1 1 ## 42 Ap_rs075640 1 1 However, you may find that it is not possible to identify chromosomal groups as simply as this. We therefore recommend that homologue groups be first identified, and then another marker type be used to provide associations between these clusters, and hence help form chromosomal linkage groups. This can be done as follows: • Generate homologue clusters using the cluster_SN_markers function at a high LOD. • Use the Duplex x Nulliplex (or Simplex x Simplex) marker information to put the clusters back together into chromosomal groups. It makes sense to use only SxN marker pairs in coupling phase for the homologue clustering (since these associations define the homologues). Note that the most likely phase per marker pair (coupling or repulsion) was already estimated by our previous call to the linkage function. SN_SN_P1_coupl <- SN_SN_P1[SN_SN_P1$phase == "coupling",] # select only markerpairs in coupling

LOD_sequence = c(3:12),
LG_number = 5,
ploidy = 4,
parentname = "P1",
plot_network = FALSE,
plot_clust_size = FALSE)
## Total number of edges: 682

In the dot plot above, we see that all chromosomal associations have now been removed with the repulsion information.

###### A note on nested lists

In the polymapR package, output is often given as a nested list. Some basic information on lists is available on e.g. r-tutor. We use nested lists because we are often dealing with hierarchical data: marker data > homologue > linkage group > organism. Some basic R knowledge is needed to retrieve, visualize or filter data from such a nested list. However, if you would like to visualize it without R (e.g. in your favourite spreadsheet program), you can choose to export a nested list using write_nested_list. It will keep its hierarchical structure by writing it to directories that have the same structure as the the nested list. Here’s an example: write_nested_list(list = P1_homologues, directory = "P1_homologues")

Next, we use linkage to the DxN markers to provide bridging linkages between homologue clusters, using the bridgeHomologues function. The first time we run this function, we use a LOD threshold of 4 for evidence of linkage between a SxN and a DxN marker. However, there may also be some false positive linkages at this level. It might therefore be necessary to run the function again at a higher LOD (say LOD 8) to avoid such false positives. In doing so, we might lose true linkage information. Therefore, it’s a good idea to run the function at LOD 4 first, then increase to a higher LOD if necessary, and build up a picture of which clusters form part of the same chromosome. First calculate linkage between SxN and DxN markers:

SN_DN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(2,0),
target_parent = "P1",
other_parent = "P2",
ploidy = 4,
pairing = "random")

Then, use bridgeHomologues to use these SxN - DxN linkages to find associations (“bridges”) between homologue clusters. Note that we must provide a particular division of the Simplex x Nulliplex data - in this example, we use the division at LOD 6 which gave us 20 clusters (our 20 putative homologues):

LGHomDf_P1_1 <- bridgeHomologues(cluster_stack = P1_homologues_1[["6"]],
LOD_threshold = 4,
automatic_clustering = TRUE,
LG_number = 5,
parentname = "P1")

This results in an ideal situation of 5 linkage groups each with exactly 4 homologues:

table(LGHomDf_P1_1$LG, LGHomDf_P1_1$homologue)
##
##      1  2  3  4
##   1 13  9  4 10
##   2 12  7  9  8
##   3  3 13  5  7
##   4  4 10  6 11
##   5 17 12  6  6

However, clustering of P2 is a bit less ideal. The following redoes everything for P2:

SN_SN_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
target_parent = "P2",
other_parent = "P1",
ploidy = 4,
pairing = "random"
)
SN_SN_P2_coupl <- SN_SN_P2[SN_SN_P2$phase == "coupling",] # get only markerpairs in coupling P2_homologues <- cluster_SN_markers(linkage_df = SN_SN_P2_coupl, LOD_sequence = c(3:12), LG_number = 5, ploidy = 4, parentname = "P2", plot_network = FALSE, plot_clust_size = FALSE) ## Total number of edges: 743 SN_DN_P2 <- linkage(dosage_matrix = filtered_data, markertype1 = c(1,0), markertype2 = c(2,0), target_parent = "P2", other_parent = "P1", ploidy = 4, pairing = "random") LGHomDf_P2 <- bridgeHomologues(cluster_stack = P2_homologues[["6"]], linkage_df = SN_DN_P2, LOD_threshold = 4, automatic_clustering = TRUE, LG_number = 5, parentname = "P2") ## Not all homologue clusters are represented in the graph. ## ## ####Unrepresented homologues ## ## cluster size ## -------- ----- ## 12 3 ## 21 2 table(LGHomDf_P2$LG,LGHomDf_P2$homologue) ## ## 1 2 3 4 ## 1 11 9 5 0 ## 2 13 13 7 11 ## 3 13 11 11 8 ## 4 9 9 13 11 ## 5 10 10 6 9 Linkage group 2 possesses five “homologues” instead of four - it is therefore likely we have a number of fragmented homologues. However, we don’t know which clusters are actual homologues, and which are fragments. We can try to determine this using two methods. ### 8.2 Using the function cluster_per_LG cluster_per_LG does not rely on markers in repulsion, but uses variation in LOD thresholds between markers in coupling. Run it as follows: cluster_per_LG(LG = 2, linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",],
LG_hom_stack = LGHomDf_P2,
LOD_sequence = c(3:10), # The first element is used for network layout
modify_LG_hom_stack = FALSE,
network.layout = "stacked",
nclust_out = 4,
label.offset=1.2)
## Total number of edges: 190

The network layout is based on the first LOD score you provide in LOD_sequence (which is LOD = 3 in this example). The darker the background, the more consistent the clustering is over the range of LOD scores. Change the LOD_sequence if the network is not as required (4 clusters in the case of a tetraploid) and re-run. If the network looks as you would like to, use the cluster_per_LG function but with a LOD_sequence of length 1 (the first LOD score, so again LOD = 3 in this example), and specify modify_LG_hom_stack = TRUE to make a modification to the cluster numbering:

LGHomDf_P2_1 <- cluster_per_LG(LG = 2,
linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",], LG_hom_stack = LGHomDf_P2, LOD_sequence = 3, modify_LG_hom_stack = TRUE, network.layout = "n", nclust_out = 4) ## Total number of edges: 190 table(LGHomDf_P2_1$homologue, LGHomDf_P2_1$LG) ## ## 1 2 3 4 5 ## 1 11 11 13 9 10 ## 2 9 7 11 9 10 ## 3 5 13 11 13 6 ## 4 0 13 8 11 9 ### 8.3 Cross-ploidy populations (e.g. tetraploid x diploid to give triploid F1) polymapR is also able to map triploid populations, and may be extended to other cross-ploidy populations if there is demand for that in future. However, because of differences in parental ploidy levels, we have implemented a mapping approach that assumes that the parent of higher ploidy level is parent 1. Usually in polymapR a cross is understood to be between “maternal parent” x “paternal parent”. However this is just a convenient convention, and parent numbering can be reversed if so desired. In the case of e.g. triploid populations, there is no flexibility anymore, parent 1 must be tetraploid and parent 2 must be diploid. The next point to be aware of is that the segregation of markers in the diploid parent is assumed to follow disomic inheritance (there is no other option in a diploid), whereas the segregation of the tetraploid parent is assumed to follow tetrasomic inheritance. Mixtures with preferential pairing in the tetraploid parent are currently not implemented for triploid populations. Because of this disomic behaviour in the diploid parent, we are unable to use the normal clustering function (cluster_SN_markers) for identifying the homologues. Proceed as normal with the steps described above for parent 1. However in order to proceed correctly for parent 2 (diploid), you will need to provide the markertype as c(0,1) and not c(1,0). This is easy to understand since a 1x0 marker in a triploid population is Aaaa x aa while a 0x1 marker is aaaa x Aa. These are clearly different marker types and we cannot reverse them. To access the correct linkage functions for the diploid parent, therefore, you will need to think in terms of 1x0 markers segregating from parent 1, and 0x1 markers segregating from parent 2. If you are confused about what marker types are possible in a triploid population, run the following, which gives all possible marker combinations (and the segregation ratios in the F1): get("seg_p3_random",envir=getNamespace("polymapR")) ## dosage1 dosage2 f1_0 f1_1 f1_2 f1_3 ## 1 0 0 1 0 0 0 ## 2 0 1 1 1 0 0 ## 3 0 2 0 1 0 0 ## 4 1 0 1 1 0 0 ## 5 1 1 1 2 1 0 ## 6 1 2 0 1 1 0 ## 7 2 0 1 4 1 0 ## 8 2 1 1 5 5 1 ## 9 2 2 0 1 4 1 ## 10 3 0 0 1 1 0 ## 11 3 1 0 1 2 1 ## 12 3 2 0 0 1 1 ## 13 4 0 0 0 1 0 ## 14 4 1 0 0 1 1 ## 15 4 2 0 0 0 1 To run the preliminary linkage analysis of 0x1 markers in parent 2, we also need to specify the ploidy of parent 2, in this case ploidy2 = 2. Note that the target parent is P2 as this is the parent we are now interested in, but markertype1 = c(0,1). For this we need a special dataset with triploid data, a sample of which is available from the polymapR package also: data("TRI_dosages") SN_SN_P2.tri <- linkage(dosage_matrix = TRI_dosages, markertype1 = c(0,1), target_parent = "P2", #Note the target parent is P2 other_parent = "P1", ploidy = 4, ploidy2 = 2, pairing = "random" ) Have a look at the plot of r versus LOD score - here you should notice that both the coupling and repulsion phases are equally informative now and cannot therefore be separated by LOD score as before: plot_linkage_df(SN_SN_P2.tri) If we try to cluster using the old function, we get the following result - no separation of the homologues, but the chromosomal linkage groups are identified: P2_homologues.tri <- cluster_SN_markers(linkage_df = SN_SN_P2.tri, LOD_sequence = seq(3, 10, 1), LG_number = 3, ploidy = 2, #because P2 is diploid.. parentname = "P2", plot_network = FALSE, plot_clust_size = FALSE)  ## Total number of edges: 345 ## Please note: the number of clusters is smaller than number of expected homologues. ## Not all expected homologues are represented. To separate the 0x1 markers into homologues, we use the phase information from the linkage analysis directly, using the phase_SN_diploid function. To learn more about how this function works, input ?phase_SN_diploid first, to get an idea of the arguments used. LGHomDf_P2.tri <- phase_SN_diploid(linkage_df = SN_SN_P2.tri, cluster_list = P2_homologues.tri, LOD_chm = 4, #LOD at which chromosomes are identified LG_number = 3) #number of linkage groups ## Total number of edges: 161 ## Complete phase assignment possible using only coupling information at LOD 4 The rest of the analysis should be pretty much the same as for tetraploids or hexaploids, although you should check each time you use a new function whether that function has a ploidy2 argument, and if it does, use it! ## 9. Assigning SxS and DxN markers and consensus linkage group (LG) names Assigning simplex x simplex markers to a homologue and linkage group is done by calculating linkage between SxS and SxN markers and after that assigning them based on this linkage using assign_linkage_group: SN_SS_P1 <- linkage(dosage_matrix = filtered_data, markertype1 = c(1,0), markertype2 = c(1,1), target_parent = "P1", other_parent = "P2", ploidy = 4, pairing = "random") P1_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P1, LG_hom_stack = LGHomDf_P1, SN_colname = "marker_a", unassigned_marker_name = "marker_b", phase_considered = "coupling", LG_number = 5, LOD_threshold = 3, ploidy = 4) ## ## ####Marker(s) showing ambiguous linkage to more than one LG: ## ## |_ | ## |:-----------| ## |Ap_ts069042 | ## ## In total, 296 out of 301 markers were assigned. ## ## ####Marker(s) not assigned: ## ## _ _ _ _ ## ------------ ------------ ------------ ------------ ## Ac_ws033297 Ap_ts089267 St_ns048176 St_ts030296 ## Zm_ws010615 head(P1_SxS_Assigned) ## Assigned_LG LG1 LG2 LG3 LG4 LG5 Hom1 Hom2 Hom3 Hom4 ## Ac_ns002135 4 0 0 0 7 0 0 7 0 0 ## Ac_ns002510 2 0 7 0 0 0 0 0 0 7 ## Ac_ns024650 2 0 11 0 0 0 11 0 0 0 ## Ac_ns028519 5 0 0 0 0 5 0 5 0 0 ## Ac_ns028533 5 0 0 0 0 3 0 0 3 0 ## Ac_ns029178 3 0 0 12 0 0 0 12 0 0 ## Assigned_hom1 Assigned_hom2 Assigned_hom3 Assigned_hom4 ## Ac_ns002135 2 NA NA NA ## Ac_ns002510 4 NA NA NA ## Ac_ns024650 1 NA NA NA ## Ac_ns028519 2 NA NA NA ## Ac_ns028533 3 NA NA NA ## Ac_ns029178 2 NA NA NA SN_SS_P2 <- linkage(dosage_matrix = filtered_data, markertype1 = c(1,0), markertype2 = c(1,1), target_parent = "P2", other_parent = "P1", ploidy = 4, pairing = "random") P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2, LG_hom_stack = LGHomDf_P2_1, SN_colname = "marker_a", unassigned_marker_name = "marker_b", phase_considered = "coupling", LG_number = 5, LOD_threshold = 3, ploidy = 4) ## ## In total, 280 out of 300 markers were assigned. ## ## ####Marker(s) not assigned: ## ## _ _ _ _ ## ------------ ------------ ------------ ------------ ## Ac_ts005083 Ac_ws069529 Ap_ws005733 Ap_ws040208 ## Ls_rs020124 Ls_rs094027 Ls_ts045044 Ls_ws082695 ## Ps_rs063315 Ps_ts001531 Ps_ws041352 Ps_ws078864 ## St_ns019072 St_ns024470 St_rs035771 St_ts001557 ## St_ts058903 St_ws039972 St_ws050714 Zm_ns098754 ##  As simplex x simplex markers are present in both parents, we can define which linkage groups correspond with each other between parents. After this, one of the linkage groups of the parents should be renamed and the SxS markers assigned again according to the new linkage group names: LGHomDf_P2_2 <- consensus_LG_names(modify_LG = LGHomDf_P2_1, template_SxS = P1_SxS_Assigned, modify_SxS = P2_SxS_Assigned) ## ## ####Original LG names ## ## 1 2 3 4 5 ## --- --- --- --- --- --- ## 1 0 0 0 47 0 ## 2 48 0 0 0 0 ## 3 0 0 54 0 0 ## 4 0 58 0 0 0 ## 5 0 0 0 0 68 ## ## ####Modified LG names ## ## 1 2 3 4 5 ## --- --- --- --- --- --- ## 1 47 0 0 0 0 ## 2 0 48 0 0 0 ## 3 0 0 54 0 0 ## 4 0 0 0 58 0 ## 5 0 0 0 0 68 P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2, LG_hom_stack = LGHomDf_P2_2, SN_colname = "marker_a", unassigned_marker_name = "marker_b", phase_considered = "coupling", LG_number = 5, LOD_threshold = 3, ploidy = 4) ## ## In total, 280 out of 300 markers were assigned. ## ## ####Marker(s) not assigned: ## ## _ _ _ _ ## ------------ ------------ ------------ ------------ ## Ac_ts005083 Ac_ws069529 Ap_ws005733 Ap_ws040208 ## Ls_rs020124 Ls_rs094027 Ls_ts045044 Ls_ws082695 ## Ps_rs063315 Ps_ts001531 Ps_ws041352 Ps_ws078864 ## St_ns019072 St_ns024470 St_rs035771 St_ts001557 ## St_ts058903 St_ws039972 St_ws050714 Zm_ns098754 ##  Since we now have a consistent linkage group numbering, we can also assign the DxN markers: P1_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P1, LG_hom_stack = LGHomDf_P1, SN_colname = "marker_a", unassigned_marker_name = "marker_b", phase_considered = "coupling", LG_number = 5, LOD_threshold = 3, ploidy = 4) ## ## In total, 111 out of 111 markers were assigned. P2_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P2, LG_hom_stack = LGHomDf_P2_2, SN_colname = "marker_a", unassigned_marker_name = "marker_b", phase_considered = "coupling", LG_number = 5, LOD_threshold = 3, ploidy = 4) ## ## In total, 101 out of 101 markers were assigned. ## 10. Assign all other markertypes Now that we have a backbone of SxN markers with consistent linkage group names, it is time to assign all other marker types to a linkage group and homologue using their linkage to SxN markers. Since we already did this for DxN and SxS markers, there is no need to re-do this work for these marker types. The function homologue_lg_assignment finds all markertypes that have not been assigned yet, does the linkage analysis and assigns them to a linkage group and homologue. As linkage is calculated for a lot of marker combinations, this might take a while. marker_assignments_P1 <- homologue_lg_assignment(dosage_matrix = filtered_data, assigned_list = list(P1_SxS_Assigned, P1_DxN_Assigned), assigned_markertypes = list(c(1,1), c(2,0)), LG_hom_stack = LGHomDf_P1, target_parent = "P1", other_parent = "P2", ploidy = 4, pairing = "random", convert_palindrome_markers = FALSE, LG_number = 5, LOD_threshold = 3, write_intermediate_files = FALSE ) Also do this for P2: marker_assignments_P2 <- homologue_lg_assignment(dosage_matrix = filtered_data, assigned_list = list(P2_SxS_Assigned, P2_DxN_Assigned), assigned_markertypes = list(c(1,1), c(2,0)), LG_hom_stack = LGHomDf_P2_2, target_parent = "P2", other_parent = "P1", ploidy = 4, pairing = "random", convert_palindrome_markers = TRUE, LG_number = 5, LOD_threshold = 3, write_intermediate_files = FALSE ) Next, to make sure the marker linkage group assignment is consistent across parents we run the check_marker_assignment function, removing any bi-parental markers if they show linkage to different chromosomes (which suggests a problem with these markers): marker_assignments <- check_marker_assignment(marker_assignments_P1,marker_assignments_P2) ## 11. Finish the linkage analysis Since all markers have been assigned to a linkage group, we now do the linkage calculations per linkage group of every marker type combination (so not only with SxN markers). This is done with the finish_linkage_analysis function: all_linkages_list_P1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1,
dosage_matrix = filtered_data,
target_parent = "P1",
other_parent = "P2",
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "random",
LG_number = 5)

all_linkages_list_P2 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2, dosage_matrix = filtered_data, target_parent = "P2", other_parent = "P1", convert_palindrome_markers = TRUE, # convert 3.1 markers ploidy = 4, pairing = "random", LG_number = 5) The output is returned in a list: str(all_linkages_list_P1) Using the assignment data, we can also split this list in homologues using split_linkage_info: all_linkages_list_P1_split <- split_linkage_info(all_linkages = all_linkages_list_P1, marker_assignment = marker_assignments_P1, ploidy = 4) all_linkages_list_P2_split <- split_linkage_info(all_linkages = all_linkages_list_P2, marker_assignment = marker_assignments_P2, ploidy = 4) This results in a nested list, which you can write to files: str(all_linkages_list_P1_split) write_nested_list(nested_list = all_linkages_list_P1, directory = "linkages_P1") ## 12. Optional: Marker binning If you are using JoinMap to do the ordering, the number of markers than can be used per homologue is limited. If the number of markers per homologue approaches or exceeds 100, we first need to perform some sort of binning strategy, putting co-segregating markers into “bins” and only using one of them as a representative marker for the others in the bin. Marker binning can be done on one linkage_df separately, using the marker_binning function, or by using its wrapper marker_binning_list which performs binning across all linkage groups: all_linkages_list_P1.binned <- marker_binning_list(dosage_matrix = filtered_data, linkage_list = all_linkages_list_P1_split, target_parent = "P1", other_parent = "P2", return_removed_marker_info = TRUE) all_linkages_list_P2.binned <- marker_binning_list(dosage_matrix = filtered_data, linkage_list = all_linkages_list_P2_split, target_parent = "P2", other_parent = "P1", return_removed_marker_info = TRUE) ## 13. Optional: Writing to a .pwd file If the marker ordering is to be done by the JoinMap GUI, the linkage data can be written to JoinMap compatible .pwd files using write.pwd or its wrapper write_pwd_list: write_pwd_list(linkages_list = all_linkages_list_P1$linkage_list,
target_parent = "P1",
binned = T)

write_pwd_list(linkages_list = all_linkages_list_P2$linkage_list, target_parent = "P2", binned = T) ## 14. Marker ordering The basis of producing linkage maps has now been achieved, that is the pairwise recombination frequencies have been estimated and LOD scores for these estimates have been calculated for all markers. There are two options in the polymapR package for marker ordering: weighted linear regression (based on the JoinMap algorithm) or multi-dimensional scaling. The first is a reliable and widely used algorithm, but is extremely slow for higher marker numbers. The latter option has recently been developed by Katherine Preedy and Christine Hackett [2] and offers similar levels of accuracy as the weighted linear regression but at much higher speed. Our approach is to create genetic maps of the homologues as well as an integrated map across all hoologues and parents. As an example we use the multidimensional scaling approach, using the function MDSMap_from_list. This function applies the estimate.map function from the MDSMap package to a list of linkage dataframes. Below, we are using the default settings of estimate.map, however, they can be changed by supplying extra arguments. maplist_P1_LG1 <- MDSMap_from_list(all_linkages_list_P1_split$LG1,
write_to_file = FALSE
)
maplist_P2_LG1 <- MDSMap_from_list(all_linkages_list_P2_split$LG1, write_to_file = FALSE) By calling the MDSMap_from_list function you automatically produce the input .txt files that MDSMap requires in the correct format. Note that unless you provide a different folder name (by specifying mapdir), the files will be overwritten each time you run the function. We generally recommend using the default mapping settings of the estimate.map function (so using the method of principal curves in 2 dimensions with LOD2 as weights), but there are a number of options that can be specified which are described in the manual of that package (check out https://cran.R-project.org/package=MDSMap). Output plots can be saved as .pdf files by specifying write_to_file = TRUE, in the same mapdir folder that contains map input files as well. ## 14.1 Creating an integrated chromosomal linkage map In the previous example we created separate linkage maps per homologue. However, it is also possible to create an integrated chromosomal map using MDSMap, by simply supplying all the linkage information for the whole linkage group together. To create an integrated map of all linkage groups, we first combine the linkage information together and then run the MDSMap_from_list function which prepares the files and passes them to MDSMap for the mapping: linkages <- list() for(lg in names(all_linkages_list_P1)){ linkages[[lg]] <- rbind(all_linkages_list_P1[[lg]], all_linkages_list_P2[[lg]]) } integrated.maplist <- MDSMap_from_list(linkages) ## 14.2 Optional: Adding back duplicated markers If you ran the function screen_for_duplicate_markers before mapping, you may want to add back the duplicate markers to the map. This can be achieved using the add_dup_markers function. If you also want to include the duplicate markers on a phased map file (next section), then you also need to update the marker_assignments as well (otherwise, just leave the marker_assignments argument out). This is done as follows: complete_mapdata <- add_dup_markers(maplist = integrated.maplist, bin_list = screened_data4$bin_list,
marker_assignments = marker_assignments)

Note that the output of this function is again a list, which we might like to re-assign as follows:

integrated.maplist_complete <- complete_mapdata$maplist marker_assignments_complete <- complete_mapdata$marker_assignments

Note that if we are producing a phased maplist, we also need the marker dosages matrix to be present (in order to check whether the phasing corresponds with the parental scores). Therefore it is no longer appropriate to use filtered_data if we have added back duplicates (where the duplicate markers have been removed). Using screened_data3 from earlier would be the correct choice.

## 14.3 Phasing an integrated map

One final step that is useful is to generate phased linkage maps from the integrated linkage map and marker assignments. This provides information on the coverage of markers across the parental homologues. To phase the markers, we use the create_phased_maplist function as follows:

phased.maplist <- create_phased_maplist(maplist = integrated.maplist,
dosage_matrix.conv = filtered_data,
ploidy = 4,
marker_assignment.1 = marker_assignments$P1, marker_assignment.2 = marker_assignments$P2)

The N_linkages option specifies the minimum number of significant linkages of a marker to a chromosomal linkage group to be confident of its assignment. Note that this level of significance (e.g. LOD > 3) was already defined in the homologue_lg_assignment function earlier. There are a number of other arguments with create_phased_maplist, including original_coding which produces a phased mapfile in the original, uncoverted format. This may have advantages for tracking marker alleles, although it has the disadvantage that the homologues appear to be more saturated with markers than they actually are!

## 14.4 Optional: Independent homologue map integration

In some situations it may be desirable to independently integrate homologue maps rather than directly generating integrated chromosomal maps as we did in the previous section. For example, there may be cases where the rate of recombination is different between parents and you would like to compare integrated chromosomal maps using different methods.

The LPMerge package [3] in R is able to integrate homologue maps in a polyploid but has a number of limitations, one of them being that it is unable to recognise when the maps to be integrated are not orientated correctly. A genetic position of 0 cM merely means we are at the outermost marker position in a linkage group – it does not correspond to a “0” position in any real sense. We therefore have to first test whether the homologue maps are consistently ordered before proceeding with the integration step.

In order to test for relative ordering, we need at least 2 markers in common (bridge markers) between each pair of homologue maps (Figure 1). Of course, more than 2 markers in common is better (and desirable for map integration) but we will consider it as the absolute minimum. A simple test to see whether the order is consistent as in Figure 1.a is then to perform a correlation on the marker positions, and see whether the result is positive or negative.

Both the correlation and integration steps have been put together into a single function, orient_and_merge_maps. LPmerge produces a number of integrated maps, based on an input parameter called the maximum interval size. A maximum interval size of one means that only adjacent markers on the underlying homologue maps are considered in the estimation of root mean square error (RMSE) in the final consensus map; a higher interval size means non-adjacent markers are also considered. The default setting is 4 – we recommend increasing this to 8. More details can be found in the literature on the subject [3]. The “best” map is generally the one with the minimum RMSE (but you may also like to select the integrated map with the smallest difference between the length of the consensus map and the mean map length of the homologue maps). Integrating 8 homologues of a single linkage group would look like so:

integrated_map_LG2 <- orient_and_merge_maps(maplist_P1[["LG2"]],
maplist_P2[["LG2"]],
connection_threshold = 3,
plot_graph=TRUE)

A code to provide a list of integrated maps of all linkage groups would look like this:

integrated_maps <- list()
for(i in 1:5){
integrated_maps[[paste0("LG", i)]] <- orient_and_merge_maps(maplist_P1[[paste0("LG", i)]],
maplist_P2[[paste0("LG", i)]],
LPmerge_interval = 2,
connection_threshold = 3,
plot_graph=TRUE)
}

## 15. Plotting a map

A well-know software program for plotting maps is MapChart [4]. polymapR can write out MapChart compatible files (.mct) using the function write.mct. There are many plotting options available in MapChart, here we only currently incorporate a single option, namely showMarkerNames, which by default is FALSE in anticipation of high-density maps. Further formatting can be achieved within the MapChart environment itself.

However, polymapR also comes with its own simple built-in function to plot maps, namely the plot_map function:

plot_map(maplist = integrated.maplist)

We might also want to visualise the integrated map output using the plot_phased_maplist function (but here we limit it to a single linkage group). Note that we need to have first used the create_phased_maplist, described earlier.

plot_phased_maplist(phased.maplist = phased.maplist[1], #Can plot full list also, remove "[1]"
ploidy = 4,
cols = c("black","grey50","grey50"))

This function can also visualise phased hexaploid maps, or triploid maps (tetraploid x diploid) if ploidy2 is specified.

## 16. Evaluating map quality

MDSMap produces output that can be examined each time a map is produced, namely the principal coordinate plots with the principal curve (on which the map is based) - highlighting possible outlying markers that do not associate well with the rest of the markers, and a plot of the nearest-neighbour fit (nnfit) for the markers along each linkge group, giving an indication of markers with high stress. Both of these can help the user to evaluate the quality of the map (for more details, we recommend reading the MDSMap publication [2]) and often results in a number of rounds of mapping, where problematic markers are identified and removed followed by subsequent remapping and re-evaluation. Once the maps have been produced it is also an idea to get a general overview of the map quality using the check_map function. Note that by default this function expects lists as input arguments, enabling diagnostics to be run over all maps. Here we run it for a single linkage group, LG1:

check_map(linkage_list = list(linkages$LG1), maplist = list(integrated.maplist$LG1))

For each map produced, there are 3 diagnostic plots. The first of these shows the differences between the pairwise estimates of recombination frequency and the effective estimate of recombination frequency based on the map itself (the multi-point estimate of the recombination frequency) . These differences are compared to the LOD score for each pairwise estimate. An overall estimate of the weighted root mean square error of these differences (weighted by the LOD scores) is printed on the plot. In general, we expect that if the difference between the expected and realised recombination frequency is high, the LOD should be low. If this is not the case, something went wrong in the final mapping step. The second plot shows the comparison between a marker’s map position and the recombination frequency estimates to all other markers. In gerenal we expect that small recombination frequency estimates come from nearby markers and should therefore be concentrated around the diagonal, which are shown as light green regions. Areas of light green off the diagonal suggest a poor map order. The final plot is similar to the previous except that LOD values are shown in place of recombination frequencies. Again, we expect a concentration of higher LOD scores around the diagonal. By default, LOD scores less than 5 are not shown to make things clearer, but this value can be varied using the lod.thresh argument.

## 17. Preferential pairing

Up to now, we have been working under the assumption of random bivalent pairing in the parents. However, it has been shown that in certain species the pairing behaviour is neither completely random nor completely preferential (as we have in allopolyploids) but something in-between, a condition termed segmental allopolyploidy [5]. A comprehensive study on this topic was recently performed in rose [6], which found evidence of disomic behaviour in certain regions of the genome, with tetrasomic behaviour everywhere else. This sort of mixed inheritance can complicate the analysis and in cases where this effect is particularly pronounced, it is probably unwise to ignore [7].

The polymapR package can currently accomodate preferential pairing in tetraploid species only, by correcting the recombination frequency estimates once the level of preferential pairing in a parental chromosome is known. In order to determine whether this is necessary, we can run some diagnostic tests using the marker data after an initial map has been produced which can inform our choice regarding whether to include a preferential pairing parameter or not (in a subsequent re-analysis).

First, a word of warning. There is a function to test for preferential pairing among pairs of closely-linked simplex x nulliplex markers in repulsion phase within polymapR. However, it is certainly also advised to use multi-point methods to test for preferential pairing (if possible). The only possible drawback is that multi-point methods (using all markers across a chromosome) generally assume uniform pairing behaviour across the length of a chromosome, which may not always occur [6]). A robust multi-point approach as described in [6] used identity-by-descent probabilities for the population, estimated using TetraOrigin [8], which estimates the most likely pairing behaviour per individual and can therefore reveal whether there are deviations from the assumption of random chromosomal pairing and recombination across the population.

The function test_prefpairing examines closely-linked repulsion-phase simplex x nulliplex marker pairs only, and is therefore the results of this function may not be as robust as a multi-point whole-chromosome approach. Be that as it may, it is probably the simplest check we can preform post-mapping for the possibility of unusual pairing behaviour. The method used here is described also in Bourke et al.[6], which is a development on ideas originally described by Qu and Hancock[9]. The idea is to look for signatures of preferential pairing in repulsion-phase pairs which map to the same locus.

We first need to define a minimum distance by which to consider markers essentially at the same locus. This is somewhat arbitrary - in our study of tetraploid rose, we used pairs of duplicated individuals to determine an approximate error rate which, along with considerations of the size of the population and rate of missing values, led us to conclude that distances less than ~1 cM were close enough to be informative (and for which we could assume 0 recombinations). The smaller the distance we choose here the better, although by choosing a very small distance we are limiting the number of repulsion-pairs available. By default in test_prefpairing we use a value of 0.5 cM. This can be increased or decreased as required using the min_cM argument.

P1.prefPairing <- test_prefpairing(dosage_matrix = ALL_dosages,
maplist = integrated.maplist,
LG_hom_stack = LGHomDf_P1_1,
min_cM = 1, #changed from default of 0.5 cM
ploidy = 4)

head(P1.prefPairing)

The output shows the repulsion-phase marker pairs tested. The most important column is probably the column P_value.adj, which gives the FDR-adjusted P-values from a Binomial test of the hypothesis that the disomic repulsion estimate of recombination frequency differs significantly from $$\frac{1}{3}$$ (actually it is a one-sided test, so we are testing that r disom falls below $$\frac{1}{3}$$). In cases where this is significant (with P_value.adj < 0.01, say), we might have some evidence for preferential pairing (but from that marker pair only!). The output of test_prefpairing also lists the markers, their positions and on which homologues they have been mapped. There is also an estimate of the pairwise preferential pairing parameter pref.p. Note that there are two possible (maximum-likelihood) estimators of a preferential pairing parameter in the case of a pair of simplex x nulliplex markers, depending on whether the marker alleles reside on homologous chromosomes, or homoeologous chromosomes (i.e. whether the pair are contained within a “subgenome”, or are straddled across “subgenomes”, to borrow the language of allopolyploids for convenience). If there is strong evidence of preferential pairing from multiple marker pairs, the approach we recommend is to take the average of the pref.p values for a particular chromosomal linkage group, which becomes one of the parameters p1 or p2 in the linkage function.

Here, we have purely random pairing in our parents, so the estimates of pref.p should not be considered meaningful. The preferential parameter p is defined such that $$0 < p < \frac{2}{3}$$. Supposing this were not the case, and we wanted to estimate the strength of preferential pairing on LG1 of parent 1, we would do something like the following:

mean(P1.prefPairing[P1.prefPairing$P_value.adj < 0.01 & P1.prefPairing$LG_a == 1,]$pref.p) Note that again we require first that there be significant evidence of disomic behaviour before including the estimate. If there is evidence of preferential pairing, it is probably a good idea to go back and reduce the min_cM argument even further, since we really want this value to be as close to 0 as possible for an accurate estimate of a preferential pairing parameter. Finally, to re-run the linkage analysis with this estimate for parent 1 of linkage group 1, we would first have to subset our marker dosage data corresponding to that linkage group only (since we do not need to re-analyse any other chromosome!), and then re-run the linkage analysis. Note that in cases with extreme levels of preferential pairing, the clustering into separate homologues can also be affected. Supposing we had estimated a value of 0.25 for the preferential pairing parameter we could proceed as follows: lg1_markers <- unique(c(rownames(marker_assignments_P1[marker_assignments_P1[,"Assigned_LG"] == 1,]), rownames(marker_assignments_P2[marker_assignments_P2[,"Assigned_LG"] == 1,]))) filtered_data_lg1 <- filtered_data[rownames(filtered_data) %in% lg1_markers,] all_linkages_list_P1_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1[rownames(marker_assignments\$P1) %in% lg1_markers,],
dosage_matrix = filtered_data_lg1,
target_parent = "P1",
other_parent = "P2",
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "preferential",
prefPars = c(0.25,0), #just for example!
LG_number = 1 #interested in just 1 chm.
)

## 18. QTL analysis

Currently it is not possible to conduct QTL analysis within polymapR, although extensions in this direction are planned (either within the package or as a separate package). However, for tetraploid populations the TetraploidSNPMap [10] software provides the possibility of not just generating linkage maps but performing IBD-based interval QTL mapping as well. The creation of input files for TetraploidSNPMap is simple once phased mapfiles have been created with the create_phased_maplist function (see section 14.2).

To generate these files, use the write.TSNPM function as follows:

write.TSNPM(phased.maplist = phased.maplist,ploidy=4)

## 19. Concluding remarks

We hope you have been able to successfully create an integrated map using the methods described here. We anticipate polymapR will improve over the coming years (in terms of applicability as well as finding and removing bugs). Feedback on performance including bug reporting is most welcome; please refer to the package information on CRAN for details on who to contact regarding maintenance issues.

## 20. References

1. Voorrips RE, Gort G, Vosman B: Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC bioinformatics 2011, 12:172.

2. Preedy KF, Hackett CA: A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theoretical and Applied Genetics 2016, 129:2117–2132.

3. Endelman JB, Plomion C: LPmerge: An R package for merging genetic maps by linear programming. Bioinformatics 2014, 30:1623–1624.

4. Voorrips R: MapChart: software for the graphical presentation of linkage maps and QTLs. Journal of heredity 2002, 93:77–78.

5. Stebbins GL: Types of polyploids: their classification and significance. Advances in genetics 1947, 1:403–429.

6. Bourke PM et al: Partial preferential chromosome pairing is genotype dependent in tetraploid rose. The Plant Journal 2017, 90:330–343.

7. Bourke PM, Voorrips RE, Kranenburg T, Jansen J, Visser RGF, Maliepaard C: Integrating haplotype-specific linkage maps in tetraploid species using SNP markers. Theoretical and Applied Genetics 2016, 129:2211–2226.

8. Zheng C, Voorrips RE, Jansen J, Hackett CA, Ho J, Bink MC: Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids. Genetics 2016, 203:119–131.

9. Qu L, Hancock J: Detecting and mapping repulsion-phase linkage in polyploids with polysomic inheritance. Theoretical and Applied Genetics 2001, 103:136–143.

10. Hackett CA, Boskamp B, Vogogias A, Preedy KF, Milne I: TetraploidSNPMap: Software for linkage analysis and qtl mapping in autotetraploid populations using snp dosage data. Journal of Heredity 2017, 108:438–442.