# Marker-based Evaluations

#### 2017-05-23

Marker-based evaluations are an alternative or an addition to pedigree-based methods. Pedigree-based methods make it possible to estimate the expectations of population-specific and individual-specific parameters. The realized values, however, deviate from these expectations due to Mendelian segregation. In an increasing number of breeds, pedigree-based methods are being replaced by or combined with marker-based methods to overcome this limitation. In any case, the goal is to select animals for breeding in a way that accelerates genetic gain, restricts the rate of inbreeding, and maintains the genetic originality of the breed.

All genomic evaluations included in this package are based on haplotype segments. Every individual has two haplotypes: a paternal and a maternal one. A haplotype segment is usually present in many individuals because the individuals have a common ancestor from which the segment originates. It is, however, also possible that two segments are identical by chance. The more markers a segment contains, the less likely it is that two copies are identical by chance. Therefore, a part of a chromosome is considered a segment only if it contains at least minSNP consecutive markers (with minSNP>=15).

If a segment lies in a region with a low recombination rate or a high marker density, however, two segments could be identical even though they originate from an ancestor who lived long before breed separation. Therefore, segments must also have a minimum length, which may be measured in centiMorgan (unitL="cM") or mega base pairs (unitL="Mb"). The minimum length of a segment is often set to minL=2.0 Mb.

The proportion of the genome captured by a segment is usually obtained from its length in mega base pairs (unitP="Mb") as the length of the segment divided by the length of the genome.
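The three criteria above (identity over consecutive markers, a minimum marker count, and a minimum length) can be illustrated with a minimal base R sketch. It is not the algorithm used by the package: the helper `shared_segments`, the toy haplotypes, the marker positions, and the genome length `genomeL` are all hypothetical and chosen only for illustration.

```r
# Toy haplotypes at 12 markers and their positions in Mb (hypothetical data)
hap1 <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1)
hap2 <- c(0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
Mb   <- c(0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.6, 3.0, 3.5, 4.1, 4.6, 5.0)

# Hypothetical helper: find runs of identical alleles and keep those with
# at least minSNP markers and a length of at least minL (in Mb)
shared_segments <- function(h1, h2, pos, minSNP, minL) {
  runs  <- rle(h1 == h2)               # runs of matching / non-matching markers
  end   <- cumsum(runs$lengths)        # index of the last marker of each run
  start <- end - runs$lengths + 1      # index of the first marker of each run
  seg   <- data.frame(start = start, end = end,
                      nSNP = runs$lengths,
                      len  = pos[end] - pos[start])
  seg[runs$values & seg$nSNP >= minSNP & seg$len >= minL, ]
}

# Small thresholds are used here only because the toy data set is tiny
seg <- shared_segments(hap1, hap2, Mb, minSNP = 4, minL = 1.0)

# Proportion of the genome captured by the segments (unitP = "Mb")
genomeL <- 5.0                         # assumed genome length in Mb
prop    <- sum(seg$len) / genomeL
```

Raising `minL` to 2.0 in this toy example discards the shorter of the two shared runs, which shows how the minimum length guards against short segments that may be identical by chance.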

This section covers evaluations based on marker data for individuals, for a single population, and for multiple breeds.

Note that the calculation of breeding values is not included as there already exist R packages for this purpose.

### Data Preparation

#### Genotype File Format

In all genotype files, markers must be in rows and individuals in columns. This makes it possible to process one marker at a time without holding the whole file in memory. For all functions included in this package

• there is one file for each chromosome,
• genotypes must be phased,
• each file has a header and no row names,
• cells are separated by blank spaces,
• the number of rows is equal to the number of markers from the respective chromosome,
• the markers are in the same order as in the map,
• there can be some extra columns on the left hand side containing no genotype data,
• the text within columns must have no white spaces,
• the first rows may contain some comments,
• the alleles of an individual are separated by a character, e.g. A/B, 0/1, A|B, A B, or 0 1,
• the same two symbols must be used for all markers,
• the IDs of the individuals are used as column names and must have no white spaces,
• if the blank space is used as separator (i.e. A B, or 0 1), then the ID of each individual must be repeated in the header, so that the number of column names is equal to the number of columns.

Example 1:

Note that the marker names are ignored when the file is processed.
I id 6415 6415 2636 2636
M ARS-BFGL-NGS-16466 0 0 1 0
M ARS-BFGL-NGS-98142 0 1 0 1
M ARS-BFGL-NGS-114208 0 0 1 0

Example 2:

There can be some comments in the first lines
and some extra columns on the left hand side.
Column1 Column2 Column3 6415 2636
M NA dfdf 0/0 1/0
X NA sdfg 0/1 0/1
N NA fgjh 0/0 1/0
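To make the layout of Example 1 concrete, the following base R sketch splits its content into two haplotype matrices. It only illustrates the file layout and is not how the package reads genotype files; the variable names and the choice to hold the file content in a character vector are assumptions made for this illustration.

```r
# Content of Example 1, stored as text for illustration
txt <- c("I id 6415 6415 2636 2636",
         "M ARS-BFGL-NGS-16466 0 0 1 0",
         "M ARS-BFGL-NGS-98142 0 1 0 1",
         "M ARS-BFGL-NGS-114208 0 0 1 0")

nExtra <- 2                                  # extra columns on the left hand side
cells  <- strsplit(txt, " ", fixed = TRUE)   # cells are separated by blank spaces
header <- cells[[1]]
body   <- do.call(rbind, cells[-1])          # one row per marker

# Keep only the genotype columns and convert the alleles to integers
hap <- body[, -seq_len(nExtra), drop = FALSE]
storage.mode(hap) <- "integer"
colnames(hap) <- header[-seq_len(nExtra)]    # each ID appears twice in the header

ids  <- unique(colnames(hap))
hap1 <- hap[, seq(1, ncol(hap), by = 2), drop = FALSE]  # first haplotype of each individual
hap2 <- hap[, seq(2, ncol(hap), by = 2), drop = FALSE]  # second haplotype of each individual
```

Because the blank space is used as allele separator, each individual occupies two columns, which is why its ID must be repeated in the header.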

#### Marker Map Format

For all functions reading from genotype files, a marker map must be provided in argument map. This is a data frame or data table with columns including

Name: marker name

Chr: chromosome number, and possibly

Mb: position on the chromosome in Mega base pairs, and

cM: position in centiMorgan.

The order of the markers must be the same as in the genotype files.
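A marker map with these columns could be set up and checked as sketched below. The helper `check_map` is hypothetical and not part of the package. The check that positions do not decrease within a chromosome is only a plausible sanity test added here as an assumption; the text above requires only that the marker order matches the genotype files.

```r
# A small map using the first three markers of the example data
map <- data.frame(
  Name = c("ARS-BFGL-NGS-16466", "ARS-BFGL-NGS-98142", "ARS-BFGL-NGS-114208"),
  Chr  = c(1, 1, 1),
  Mb   = c(0.267940, 0.471078, 0.533815),
  stringsAsFactors = FALSE
)

# Hypothetical helper: verify required columns and, as an assumed sanity
# check, that positions in Mb do not decrease within a chromosome
check_map <- function(map) {
  stopifnot(all(c("Name", "Chr") %in% colnames(map)))
  ok <- TRUE
  if ("Mb" %in% colnames(map)) {
    ok <- all(tapply(map$Mb, map$Chr, function(x) !is.unsorted(x)))
  }
  ok
}

check_map(map)

# Number of markers per chromosome, which should equal the number of
# rows in the corresponding genotype file
table(map$Chr)
```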

### Example Data Set

All evaluations are demonstrated using the cattle data contained in the package. Breed names, years of birth, simulated breeding values, simulated sexes, and herds are provided in data frame Cattle.

library("optiSel")
data(Cattle)
phen <- Cattle
head(phen)
##                           Indiv Born  Breed         BV    Sex herd
## 276000101676415 276000101676415 1991 Angler -1.0706066   male <NA>
## 276000108612636 276000108612636 1994 Angler -0.3362574 female    2
## 276000102372349 276000102372349 1986 Angler -2.0735649 female    1
## 276000102379430 276000102379430 1987 Angler  1.5968307   male <NA>
## 276000108826036 276000108826036 1994 Angler  1.0023969   male <NA>
## 276000111902076 276000111902076 1998 Angler -0.2426676   male <NA>

The data frame contains information on 4 breeds:

table(phen$Breed)
##
##    Angler Fleckvieh  Holstein   Rotbunt
##       268       100       100       100

The “Angler” is an endangered German cattle breed, which had been upgraded with Red Holstein (also called “Rotbunt”). The Rotbunt cattle are a subpopulation of the “Holstein” breed. The “Fleckvieh” or Simmental breed is unrelated to the Angler. The marker map is:

data(map)
head(map)
##                                    Name Chr Position cM       Mb
## ARS-BFGL-NGS-16466   ARS-BFGL-NGS-16466   1   267940  0 0.267940
## ARS-BFGL-NGS-98142   ARS-BFGL-NGS-98142   1   471078  0 0.471078
## ARS-BFGL-NGS-114208 ARS-BFGL-NGS-114208   1   533815  0 0.533815
## ARS-BFGL-NGS-65067   ARS-BFGL-NGS-65067   1   883895  0 0.883895
## ARS-BFGL-BAC-32722   ARS-BFGL-BAC-32722   1   929617  0 0.929617
## ARS-BFGL-BAC-34682   ARS-BFGL-BAC-34682   1   950841  0 0.950841

This small example data set contains only genotypes from the first parts of the first two chromosomes:

tapply(map$Mb, map$Chr, max)
##        1        2
## 40.42745 33.79876

Consequently the results obtained for specific individuals will be rather inaccurate. The genotypes are included in the following files:

dir     <- system.file("extdata", package="optiSel")
GTfiles <- file.path(dir, paste("Chr", unique(map$Chr), ".phased", sep=""))

### Individual Specific Parameters

#### Inbreeding Coefficients

The inbreeding coefficient of an individual is the probability that two alleles chosen at random from the maternal and paternal haplotypes belong to identical segments. This parameter estimates the extent to which the individual may suffer from inbreeding depression and predicts the homogeneity of its offspring. It can be calculated with

Animal <- segInbreeding(GTfiles, map, minSNP=20, minL=1.0)
head(Animal)
##                           Indiv       Inbr
## 276000101676415 276000101676415 0.04986142
## 276000108612636 276000108612636 0.00000000
## 276000102372349 276000102372349 0.00000000
## 276000102379430 276000102379430 0.02484842
## 276000108826036 276000108826036 0.00000000
## 276000111902076 276000111902076 0.00000000

#### Kinship

The segment based kinship between two individuals is the probability that two alleles randomly chosen from both individuals belong to segments which are identical in both individuals. A matrix containing the kinship between all pairs of individuals can be computed with function segIBD:

segKIN <- segIBD(GTfiles, map, minSNP=20, minL=1.0)
segKIN[1:3,1:3]
##                 276000101676415 276000108612636 276000102372349
## 276000101676415     0.524930710     0.004341751     0.000000000
## 276000108612636     0.004341751     0.500000000     0.003575048
## 276000102372349     0.000000000     0.003575048     0.500000000
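The diagonal of the kinship matrix printed above is consistent with the usual relationship between the kinship of an individual with itself and its inbreeding coefficient, $f_{ii} = (1 + F_i)/2$. The short check below uses only the values printed in this section; that this relationship holds for every individual is an assumption that is merely supported by these three values.

```r
# Inbreeding coefficients as printed by head(Animal)
Inbr <- c("276000101676415" = 0.04986142,
          "276000108612636" = 0.00000000,
          "276000102372349" = 0.00000000)

# Diagonal of segKIN as printed above
selfKinPrinted <- c(0.524930710, 0.500000000, 0.500000000)

# Kinship of an individual with itself, computed from its inbreeding coefficient
selfKin <- (1 + Inbr) / 2

# Largest absolute deviation between computed and printed values
max(abs(selfKin - selfKinPrinted))
```

An individual without detectable inbreeding therefore has a kinship of 0.5 with itself, which is why most diagonal elements equal 0.5.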

The R code below displays the kinship between the female with ID 276000102372349 and all genotyped Angler males that have a breeding value larger than 2.0.

Males  <- phen$Indiv[phen$Sex=="male" & phen$Breed=="Angler" & phen$BV>2.0]
segKIN[rownames(segKIN) %in% Males, "276000102372349", drop=FALSE]
##                 276000102372349
## 276000120061822     0.004997202
## 276000120949468     0.008381874
## 276000121243787     0.000000000
## 276000121507437     0.004041023
## 276000121590755     0.000000000

In general, the males that have lowest kinship with the female should be favoured for mating. In this case, however, all kinships are low, so this criterion can be neglected.

#### Genetic Distances

There are several possibilities to compute a dissimilarity matrix from a similarity matrix. One possibility which seems especially suitable for multidimensional scaling is to define the dissimilarity of individuals $$i$$ and $$j$$ as $D_{ij}=(-\log(b + (1-b)f_{ij}))^a,$ whereby the term $$b + (1-b)f_{ij}$$ adjusts the kinship between individuals $$i$$ and $$j$$ for non-detectable ancestral inbreeding, the function $$g(x)=(-\log(x))^a$$ maps the adjusted kinships from the interval [0,1] to non-negative real numbers, and parameter $$a$$ may be chosen such that the stress value is minimized. This can be done with function sim2dis, whereby b=baseF:

D     <- sim2dis(segKIN, a=6.0, baseF=0.03, method=1)
color <- c(Angler="red", Rotbunt="green", Fleckvieh="blue", Holstein="black")
col   <- color[phen[rownames(D), "Breed"]]
Res   <- cmdscale(D)
plot(Res, pch=18, col=col, main="Multidimensional Scaling", cex=0.5, xlab="",ylab="", asp=1)
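The dissimilarity formula given above can also be evaluated directly in base R. The sketch below applies it to the 3 x 3 corner of the kinship matrix printed earlier. The helper `kin2dis` is hypothetical and is not the implementation of sim2dis, whose options (such as method) are not reproduced here. Setting the diagonal to zero is an assumption made so that the result is a valid input for cmdscale.

```r
ids <- c("276000101676415", "276000108612636", "276000102372349")
f <- matrix(c(0.524930710, 0.004341751, 0.000000000,
              0.004341751, 0.500000000, 0.003575048,
              0.000000000, 0.003575048, 0.500000000),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical helper implementing D_ij = (-log(b + (1-b) f_ij))^a
kin2dis <- function(f, a, b) {
  D <- (-log(b + (1 - b) * f))^a
  diag(D) <- 0        # assumption: an individual has zero distance to itself
  D
}

D <- kin2dis(f, a = 6.0, b = 0.03)

# Pairs with higher kinship receive smaller dissimilarities
D[1, 2] < D[1, 3]
```

Pairs with zero kinship all receive the same maximum dissimilarity, $(-\log b)^a$, so the parameter b bounds the distances from above.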

#### Haplotype Frequencies

Artificial selection and the substantial genetic drift in populations with small effective sizes have increased the frequencies of various haplotype segments in commercial breeds. Although these segments may contribute to the economic value of a breed, their presence in an endangered breed decreases the conservation value of the breed because they are so common in the species that their conservation does not need to be subsidized.

For individuals with genetic contributions from several breeds, each marker belongs to a haplotype segment that originates from a specific breed. Gene flow is usually from commercial breeds to the endangered breeds but not vice-versa. Thus, if a sufficiently long segment from an endangered breed can also be found in a commercial breed, then it can be concluded that the segment is not native in the endangered breed. Instead, the segment is assigned to the breed in which it has maximum frequency. This can be done with function haplofreq.

Below, the frequency of each segment from haplotype 2 of Angler 276000101676415 in Rotbunt cattle is plotted with function plot (red area).

Haplo <- haplofreq(GTfiles, phen, map, thisBreed="Angler", refBreeds="Rotbunt",   minSNP=20, minL=1.0)
plot(Haplo, ID="276000101676415", hap=2)

It can be concluded that the first chromosome originates from Rotbunt cattle or from a closely related breed. In contrast, the second chromosome does not originate from Rotbunt cattle.

This evaluation can be done simultaneously for several reference breeds. Below, the frequencies of each Angler haplotype segment within Rotbunt, Holstein, and Fleckvieh are computed, the results are combined into the single R object Haplo with function freqlist, and plotted with function plot. The red area shows the frequency of each haplotype segment in Rotbunt cattle, whereas the black line shows the maximum frequency the segment has in one of the evaluated reference breeds.

Haplo <- freqlist(
haplofreq(GTfiles, phen, map, thisBreed="Angler", refBreeds="Rotbunt",   minSNP=20, minL=1.0),
haplofreq(GTfiles, phen, map, thisBreed="Angler", refBreeds="Holstein",  minSNP=20, minL=1.0),
haplofreq(GTfiles, phen, map, thisBreed="Angler", refBreeds="Fleckvieh", minSNP=20, minL=1.0)
)

plot(Haplo, ID=1, hap=2, refBreed="Rotbunt")

Hence, most segments from Chromosome 2 have very low frequency in other breeds, so they can be classified to be native for Angler.

The classification of haplotype segments can be done with a single call to function haplofreq. In this case, argument refBreeds is either the vector with breeds to be used as reference breeds, or the default refBreeds="others" is used, in which case all breeds with genotypes are used as reference breeds, except thisBreed.

Haplo <- haplofreq(GTfiles, phen, map, thisBreed="Angler", refBreeds="others", ubFreq=0.01, minL=2.5)

The result is a list. Component freq is a matrix containing the maximum frequency each haplotype segment has in one of the reference breeds.

Haplo$freq[1:10,1:3]
##                     276000101676415 276000101676415 276000108612636
## ARS-BFGL-NGS-16466             0.25               0               0
## ARS-BFGL-NGS-98142             0.25               0               0
## ARS-BFGL-NGS-114208            0.25               0               0
## ARS-BFGL-NGS-65067             0.25               0               0
## ARS-BFGL-BAC-32722             0.25               0               0
## ARS-BFGL-BAC-34682             0.25               0               0
## ARS-BFGL-NGS-3964              0.25               0               0
## ARS-BFGL-NGS-98203             0.25               0               0
## ARS-BFGL-BAC-2376              0.25               0               0
## ARS-BFGL-BAC-31722             0.25               0               0

Component match is a matrix containing for each segment the first letter of the name of the breed in which the segment has maximum frequency. If the frequency of the segment is smaller than ubFreq=0.01 in all reference breeds, then the segment is classified to be native and coded as 1.

Haplo$match[1:10,1:3]
##                     276000101676415 276000101676415 276000108612636
## ARS-BFGL-NGS-16466  "R"             "1"             "1"
## ARS-BFGL-NGS-98142  "R"             "1"             "1"
## ARS-BFGL-NGS-114208 "R"             "1"             "1"
## ARS-BFGL-NGS-65067  "R"             "1"             "1"
## ARS-BFGL-BAC-32722  "R"             "1"             "1"
## ARS-BFGL-BAC-34682  "R"             "1"             "1"
## ARS-BFGL-NGS-3964   "R"             "1"             "1"
## ARS-BFGL-NGS-98203  "R"             "1"             "1"
## ARS-BFGL-BAC-2376   "R"             "1"             "1"
## ARS-BFGL-BAC-31722  "R"             "1"             "1"
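The classification rule described above can be written down in a few lines of base R. The sketch works on a hypothetical matrix with one frequency per segment and reference breed. It is a simplified illustration of the rule and not the implementation of haplofreq, which evaluates the genotype files marker by marker; all frequencies below are invented.

```r
# Hypothetical frequencies of three Angler segments in the reference breeds
freq <- rbind(Rotbunt   = c(0.25, 0.000, 0.004),
              Holstein  = c(0.10, 0.000, 0.002),
              Fleckvieh = c(0.00, 0.000, 0.000))
colnames(freq) <- c("seg1", "seg2", "seg3")

ubFreq <- 0.01

# Maximum frequency of each segment over all reference breeds
maxFreq <- apply(freq, 2, max)

# Breed in which each segment has its maximum frequency
best <- rownames(freq)[apply(freq, 2, which.max)]

# Segments below ubFreq in all reference breeds are native (coded "1"),
# all others receive the first letter of the breed with maximum frequency
code <- ifelse(maxFreq < ubFreq, "1", substr(best, 1, 1))
code
```

In this toy example the third segment occurs in two reference breeds, but its frequency stays below ubFreq in each of them, so it is still classified as native.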

If individuals are genotyped for many markers, then the working memory could become a limitation. This can be avoided by writing the results to files. Results will be written to files if argument w.dir is defined as the name of a directory. In this case function haplofreq returns a data frame with file names:

wdir  <- file.path(tempdir(), "HaplotypeEval")
wfile <- haplofreq(GTfiles, phen, map, thisBreed="Angler", minSNP=20, minL=1.0, w.dir=wdir)

#### Breed Composition

Mating decisions should not only depend on the breeding value of the male and the kinship between male and female, but also on the genetic contribution of the male from foreign breeds. Many endangered breeds have been graded up with commercial high-yielding breeds. These increasing migrant contributions displace the original genetic background of the endangered breed, decrease the genetic contribution from native ancestors, and reduce the conservation value of the breed. The breed composition of individuals can be estimated with function segBreedComp.

Comp  <- segBreedComp(Haplo$match, map)
head(Comp[,-1])
##                    native            F          H          R
## 276000101676415 0.6820668 0.0000000000 0.06420559 0.25372765
## 276000108612636 0.5952110 0.0392385937 0.23581096 0.12973943
## 276000102372349 0.8953642 0.0647239060 0.02188978 0.01802207
## 276000102379430 0.5127613 0.0004341256 0.18018789 0.30661670
## 276000108826036 0.3338797 0.0031128187 0.23518540 0.42782203
## 276000111902076 0.3158735 0.0000000000 0.24816876 0.43595772

The average breed composition of Angler cattle is

Average <- apply(Comp[,-1],2,mean)
round(Average, 3)
## native      F      H      R
##  0.477  0.020  0.237  0.266

Since Red Holstein is a subpopulation of Holstein cattle, their contributions should be added. Thus, the average contribution of Angler cattle from Holstein is 0.503, the contribution from Fleckvieh is only 0.02, and the native contribution is 0.477. This is in good accordance with pedigree-based results.

#### Kinship at Native Segments

Since animals with low migrant contributions tend to be related, the inbreeding level could increase considerably when introgressed genetic material is removed from the population. This could be avoided by restricting the increase in kinship at native haplotype segments in the population. Matrix segKINatN containing the kinship of individuals at native haplotype segments can be calculated from the results of function segIBDatN:

fD <- segIBDatN(GTfiles, phen, map, thisBreed="Angler", ubFreq=0.01, minL=1.0)
segKINatN <- fD$segIBDandN/fD$segN
segKINatN[c(2,4,5), c(2,4,5)]
##                 276000108612636 276000102379430 276000108826036
## 276000108612636      0.65929193      0.09438978        0.000000
## 276000102379430      0.09438978      0.88219967        0.000000
## 276000108826036      0.00000000      0.00000000        0.758341

### Population Specific Parameters

#### Genetic Diversity

The genetic diversity of a population is the probability that two alleles chosen at random from the population do not belong to identical segments. It is one minus the average segment based kinship of the individuals.
Thus, it can be computed as

keep <- phen$Indiv[phen$Breed=="Angler"]
1 - mean(segKIN[keep, keep])
## [1] 0.9431438

The diversity of this population is high due to historic introgression with other breeds.

#### Kinship and Diversity at Native Segments

The kinship at native segments in the population is the probability that two alleles chosen at random from the population belong to identical segments, given that the segments originate from native founders. Since it is defined as a conditional probability, it can be computed as the ratio of two means. The kinship at native segments is

mean(fD$segIBDandN)/mean(fD$segN)
## [1] 0.06695171

The genetic diversity at native segments is one minus the kinship at native segments. Thus, it can be calculated as

1 - mean(fD$segIBDandN)/mean(fD$segN)
## [1] 0.9330483

The diversity at native segments is high. This could have several reasons:

• breeds that have been used for introgression are missing in the data set, so contributions from these breeds were wrongly classified as native and contribute to the diversity at native segments,
• the minimum length of haplotype segments is too high or the marker density is too low, so that short introgressed segments cannot be classified to be non-native,
• the diversity at native segments is indeed high.

In this case, an introgressed breed, the Norwegian Red, is missing in the data set. A high diversity at native segments is important if a goal of the breeding program is to remove introgressed genetic material from the population. Without maintenance of a high diversity at native segments, inbreeding coefficients will soon rise to an unreasonable level.

### Multi-Breed Specific Parameters

Most evaluations for multiple breeds with segment based methods require high density marker genotypes that enable the detection of short haplotype segments that originate from common ancestors who lived before breed separation. Hence, the examples shown below are only illustrative.
#### Kinships Within and Between Breeds

Average segment based kinships between and within breeds can be computed with function opticomp:

segKIN  <- segIBD(GTfiles, map, minSNP=20, minL=1.0)
Breed   <- phen[rownames(segKIN),"Breed"]
CoreSet <- opticomp(segKIN, Breed)
round(CoreSet$f, 3)
##           Angler Fleckvieh Holstein Rotbunt
## Angler     0.057     0.006    0.046   0.046
## Fleckvieh  0.006     0.073    0.005   0.005
## Holstein   0.046     0.005    0.112   0.094
## Rotbunt    0.046     0.005    0.094   0.111

It can be seen that the average kinship within breeds is lowest in Angler cattle and highest in Holstein and Rotbunt cattle, whose values are nearly equal. The kinship between Holstein and Rotbunt is almost as high as the kinships within the breeds, so both breeds are closely related. In contrast, Fleckvieh is only distantly related to all other breeds included in the data set.
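The idea behind such a table of average kinships can be sketched in base R by averaging blocks of a kinship matrix. The 4 x 4 matrix and the breed labels below are invented. Whether opticomp treats the kinship of an individual with itself in exactly this way is not stated above, so including the diagonal in the within-breed means is an assumption of this sketch.

```r
# Hypothetical kinship matrix of four individuals from two breeds
ids <- paste0("ind", 1:4)
K <- matrix(c(0.50, 0.10, 0.00, 0.02,
              0.10, 0.50, 0.02, 0.00,
              0.00, 0.02, 0.55, 0.20,
              0.02, 0.00, 0.20, 0.50),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))
breed  <- c(ind1 = "A", ind2 = "A", ind3 = "B", ind4 = "B")
breeds <- unique(breed)

# Average kinship between (and within) breeds: the mean of the block of K
# that belongs to the individuals of breed l (rows) and breed b (columns)
f <- sapply(breeds, function(b) {
  sapply(breeds, function(l) mean(K[breed == l, breed == b]))
})
f
```

The diagonal of `f` holds the average kinships within breeds and the off-diagonal elements hold the average kinships between breeds.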

#### Genetic Distances Between Breeds

Genetic distances between breeds can be computed from the kinships between breeds. There are various possibilities to define genetic distances and the method of choice depends on the intended use. The distance between two breeds $$b$$ and $$l$$, defined as $\Delta_{bl} = \sqrt{\frac{f_{bb}+f_{ll}}{2}-f_{bl}},$ where $$f_{bl}$$ is the average kinship between breeds $$b$$ and $$l$$, can be considered an estimate of the expected differences in population means for a neutral polygenic trait (Wellmann, Bennewitz, and Meuwissen 2014). It can be obtained as

round(CoreSet$Dist, 3)
##           Angler Fleckvieh Holstein Rotbunt
## Angler     0.000     0.242    0.196   0.193
## Fleckvieh  0.242     0.000    0.295   0.295
## Holstein   0.196     0.295    0.000   0.130
## Rotbunt    0.193     0.295    0.130   0.000

#### Prioritizing Breeds for Conservation

Since resources available for conservation are limited, prioritizing breeds for conservation is of high importance to halt the erosion of genetic diversity observed in livestock species. This requires estimating the conservation values of breeds. In the core set approach, a hypothetical subdivided population is considered, consisting of individuals from various breeds. This population is called the core set. The contributions of each breed to the core set are determined such that the diversity of the core set is maximized. The conservation value of a particular breed measures how much the diversity decreases if the breed is removed from the core set. Function opticomp can be used to compute the contributions of the breeds to a core set with maximum diversity.

CoreSet <- opticomp(segKIN, Breed)
CoreSet$bc
##     Angler  Fleckvieh   Holstein    Rotbunt
## 0.47908531 0.42425265 0.04434691 0.05231513

The Rotbunt cattle have only a small contribution to the core set, as their genes are already present in Angler and Holstein cattle. For this core set the diversity is

CoreSet$value
## [1] 0.9656585

Removing the Angler cattle from the core set

CoreSet <- opticomp(segKIN, Breed, ub=c(Angler=0))
CoreSet$bc
##    Angler Fleckvieh  Holstein   Rotbunt
## 0.0000000 0.5910616 0.1876452 0.2212932

increases the Rotbunt and Holstein contributions, and decreases the diversity of the core set:

CoreSet$value
## [1] 0.9548693
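The distances and core set diversities reported in this section can be reproduced approximately from the rounded kinship matrix and the breed contributions printed above. The sketch below assumes that the diversity of a core set with contributions $c$ is one minus its average kinship, $1 - c^\top f c$. This follows the definition of diversity used in this document, but it is not taken from the opticomp source code.

```r
breeds <- c("Angler", "Fleckvieh", "Holstein", "Rotbunt")

# Average kinships within and between breeds, as printed (rounded) above
f <- matrix(c(0.057, 0.006, 0.046, 0.046,
              0.006, 0.073, 0.005, 0.005,
              0.046, 0.005, 0.112, 0.094,
              0.046, 0.005, 0.094, 0.111),
            nrow = 4, byrow = TRUE, dimnames = list(breeds, breeds))

# Distance between breeds b and l: sqrt((f_bb + f_ll)/2 - f_bl)
Dist <- sqrt(pmax(outer(diag(f), diag(f), "+") / 2 - f, 0))
round(Dist, 3)

# Breed contributions to the core set with and without Angler, as printed above
bc  <- c(Angler = 0.47908531, Fleckvieh = 0.42425265,
         Holstein = 0.04434691, Rotbunt = 0.05231513)
bc0 <- c(Angler = 0, Fleckvieh = 0.5910616,
         Holstein = 0.1876452, Rotbunt = 0.2212932)

# Assumed diversity of a core set: one minus its average kinship
div  <- 1 - drop(t(bc)  %*% f %*% bc)
div0 <- 1 - drop(t(bc0) %*% f %*% bc0)
c(withAngler = div, withoutAngler = div0)
```

The results are close to the values shown above. Small differences are expected because the kinships entered here are rounded to three decimals.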

### References

Wellmann, R., J. Bennewitz, and T. H. Meuwissen. 2014. “A Unified Approach to Characterize and Conserve Adaptive and Neutral Genetic Diversity in Subdivided Populations.” Genetical Research (Camb) 96.