This package contains a collection of various low-level tools which may be of interest for general re-use. These functions were accumulated over a number of years while treating high-throughput data from biomedical applications. In addition, these functions are used/integrated in more specialized functions dedicated to specific applications in other packages (eg wrProteo or wrGraph). To get started, we need to load the package "wrMisc" available from CRAN.
# If not already installed, you'll have to install the package first.
install.packages("wrMisc")
# Now you can start this vignette by
browseVignettes("wrMisc") # and then select the html output
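The version number displayed below was presumably obtained along these lines (a minimal sketch using the standard utils function):
library(wrMisc)
packageVersion("wrMisc")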
## [1] '1.3.1'
In high-throughput experiments in biology (like transcriptomics, proteomics etc.) many different features get measured a number of times (in different samples, like patients or time-points of a disease). The resulting data typically contain many (independent) rows (eg >1000 different genes or proteins whose abundance was measured) and much fewer columns, which may get further organized in groups of replicates. As R is a versatile language, multiple options exist for assessing the global characteristics of such data, some being more efficient from a computational point of view. In order to allow fast treatment of very large data-sets some tools have been re-designed for optimal performance.
Many measurement techniques applied in a high-throughput manner suffer from limited precision. This means the same measurement taken twice in a row (ie repeated on the same subject) will very likely not give an identical result. For this reason it is common practice to make replicate measurements to i) estimate mean (ie representative) values and ii) assess the factors contributing to the variability observed. Briefly, technical replicates represent the case where multiple read-outs of the very same sample are generated and the resulting variability is associated with technical issues during the process of taking measures. Biological replicates represent independent samples and therefore reflect the variability a given parameter may have in a certain population of individuals. With the tools presented here, both technical and biological replicates can be dealt with. In several cases the interpretation of the resulting numbers should nevertheless take the experimental setup into account.
Let’s make a simple matrix as toy data:
grp1 <- rep(LETTERS[1:3],c(3,4,3))
sampNa1 <- paste0(grp1,c(1:3,1:4,1:3))
set.seed(2016); dat1 <- matrix(round(c(runif(50000)+rep(1:1000,50)),3),ncol=10,
dimnames=list(NULL,sampNa1))
dim(dat1)
## [1] 5000 10
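The first rows of the toy data (shown below) may be inspected with head():
head(dat1)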
## A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
## [1,] 1.180 1.640 1.199 1.118 1.425 1.745 1.253 1.554 1.303 1.856
## [2,] 2.143 2.237 2.730 2.693 2.603 2.293 2.542 2.452 2.148 2.776
## [3,] 3.842 3.155 3.191 3.520 3.686 3.408 3.409 3.871 3.345 3.588
## [4,] 4.134 4.394 4.982 4.320 4.380 4.888 4.965 4.462 4.250 4.647
## [5,] 5.478 5.472 5.488 5.570 5.626 5.765 5.551 5.016 5.659 5.139
## [6,] 6.121 6.294 6.718 6.890 6.999 6.316 6.542 6.119 6.763 6.487
Now let's estimate the standard deviation (sd) for every row:
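The code producing the output below is not echoed here; a sketch assuming wrMisc's rowSds() and a timing comparison against apply() (the object names sd1 and sd2 are assumptions):
head(rowSds(dat1))
system.time(sd1 <- rowSds(dat1))
system.time(sd2 <- apply(dat1, 1, sd))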
## [1] 0.2583693 0.2426026 0.2477899 0.3089102 0.2307722 0.3124493
## user system elapsed
## 0 0 0
## user system elapsed
## 0.07 0.02 0.08
On most systems the equivalent calculation using apply() will run much slower.
Note, there is a minor issue with rounding :
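One possible way to see this (the exact comparison used is not shown; sd1 and sd2 refer to the sketch above):
table(sd1 == sd2)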
##
## FALSE TRUE
## 1 4999
Similarly we can easily calculate the CV (coefficient of variation) for every row using rowCVs:
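A sketch of the corresponding calls (assuming rowCVs() from wrMisc; the object names and the apply-based comparison are assumptions):
system.time(cv1 <- rowCVs(dat1))
system.time(cv2 <- apply(dat1, 1, sd) / rowMeans(dat1))
head(cv1)
head(cv2)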
## user system elapsed
## 0 0 0
## user system elapsed
## 0.08 0.00 0.08
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
Note, these calculations will be very efficient as long as the number of rows is much higher (>>) than the number of columns.
Now, let’s assume our data is contains 3 initial samples measured as several replicates (already defined in grp1). Similarly, we can also calculate the sd or CV for each line while splitting into groups of replicates (rowGrpSds
and rowGrpCV
):
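A sketch of the calls presumably behind the output below (the object name sd1Gr is taken from the check further down):
grp1
system.time(sd1Gr <- rowGrpSds(dat1, grp1))
head(sd1Gr)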
## [1] "A" "A" "A" "B" "B" "B" "B" "C" "C" "C"
## user system elapsed
## 0.01 0.00 0.02
## A B C
## [1,] 0.260269732 0.27074758 0.2768917
## [2,] 0.315291928 0.17144557 0.3140531
## [3,] 0.386666523 0.13115989 0.2632534
## [4,] 0.434443706 0.33521970 0.1986530
## [5,] 0.008082904 0.09672297 0.3413156
## [6,] 0.307168249 0.31475851 0.3230934
# Let's check the results of the first line :
sd1Gr[1,] == c(sd(dat1[1,1:3]),sd(dat1[1,4:7]),sd(dat1[1,8:10]))
## A B C
## TRUE TRUE TRUE
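And the per-group CV (a sketch; the object name cv1Gr is an assumption):
system.time(cv1Gr <- rowGrpCV(dat1, grp1))
head(cv1Gr)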
## user system elapsed
## 0.02 0.00 0.02
## A B C
## [1,] 0.194279471 0.19545033 0.17625186
## [2,] 0.133034569 0.06769147 0.12773308
## [3,] 0.113859400 0.03741279 0.07309886
## [4,] 0.096471585 0.07227288 0.04461104
## [5,] 0.001475162 0.01718603 0.06474939
## [6,] 0.048163108 0.04707197 0.05004286
The function na.omit() from the package stats also keeps a trace of all omitted instances. This can be penalizing in terms of memory usage when handling very large vectors with a high content of NAs (eg >10000 NAs). If you don't need to document precisely which elements got eliminated, the function naOmit() may offer smoother functioning for such very large objects.
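A sketch of the comparison shown below (the toy vector is reconstructed from the display and is an assumption):
vec1 <- c(11:13, NA, 10, NA)
str(naOmit(vec1))
str(na.omit(vec1))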
## num [1:4] 11 12 13 10
## num [1:4] 11 12 13 10
## - attr(*, "na.action")= 'omit' int [1:2] 4 6
If you need to find the closest neighbour(s) of a numeric vector, the function minDiff() will tell you the distance ("dif", "ppm" or "ratio") and index ("best") of the closest neighbour. In case of multiple shortest distances the index of the first one is reported, and the column "nbest" will display a value of >1.
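The data used to produce the table below are not shown; a hypothetical sketch of such a call would be:
set.seed(2017); x1 <- round(runif(22, 0, 11), 1)   # hypothetical values
head(x1)
minDiff(x1)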
## [1] 10.2 6.4 5.7 3.9 8.7 8.7
## index value dif rat ncur nbest best
## [1,] 1 10.2 -0.2 0.981 1 1 19
## [2,] 2 6.4 0.4 1.070 1 1 15
## [3,] 3 5.7 0.3 0.950 2 1 15
## [4,] 4 3.9 0.2 1.050 1 1 10
## [5,] 5 8.7 0.5 1.060 2 1 18
## [6,] 6 8.7 0.5 1.060 2 1 18
## [7,] 7 1.4 0.1 1.080 1 1 13
## [8,] 8 5.3 0.3 1.060 4 1 17
## [9,] 9 5.7 0.3 0.950 2 1 15
## [10,] 10 3.7 -0.2 0.949 1 1 4
## [11,] 11 7.7 -0.5 0.939 1 1 18
## [12,] 12 1.0 -0.3 0.769 1 1 13
## [13,] 13 1.3 -0.1 0.929 1 1 7
## [14,] 14 5.3 0.3 1.060 4 1 17
## [15,] 15 6.0 0.3 1.050 1 2 9
## [16,] 16 4.9 -0.1 0.980 1 1 17
## [17,] 17 5.0 0.1 1.020 1 1 16
## [18,] 18 8.2 0.5 1.060 1 1 11
## [19,] 19 10.4 0.2 1.020 1 1 1
## [20,] 20 9.3 0.6 1.070 1 2 6
## [21,] 21 5.3 0.3 1.060 4 1 17
## [22,] 22 5.3 0.3 1.060 4 1 17
When you look at the first line, the value of 10.2 has one single closest value, 10.4, which is located in line number 19 (the column 'best' gives the index of the best match); line number 19 in turn points back to line number 1. You can also see that some elements (like 5.7) occur multiple times (lines 3 and 9); multiple occurrences are counted in the column 'ncur'. This is why the column 'nbest' for line 15 (value 6.0) indicates that its closest value occurs twice.
When input from different places gets collected and combined into a list, this may give a collection of different types of data. The function partUnlist() will preserve multi-column elements as they are (and just bring down one level):
bb <- list(fa=gl(2,2),c=31:33,L2=matrix(21:28,ncol=2),li=list(li1=11:14,li2=data.frame(41:44)))
partUnlist(lapply(bb,.asDF2))
## $fa
## V1
## 1 1
## 2 1
## 3 2
## 4 2
##
## $c
## V1
## 1 31
## 2 32
## 3 33
##
## $L2
## V1 V2
## 1 21 25
## 2 22 26
## 3 23 27
## 4 24 28
##
## $li1
## [1] 11 12 13 14
##
## $li2
## X41.44
## 1 41
## 2 42
## 3 43
## 4 44
This won’t be possible using unlist().
## $fa1
## [1] 1
##
## $fa2
## [1] 1
##
## $fa3
## [1] 2
##
## $fa4
## [1] 2
##
## $c1
## [1] 31
##
## $c2
## [1] 32
To make such data uniform, ie to obtain a list with one column only for each list-element, the function asSepList() provides help:
bb <- list(fa=gl(2,2),c=31:33, L2=matrix(21:28,ncol=2), li=list(li1=11:14,li2=data.frame(41:44)))
asSepList(bb)
## $fa
## [1] 1 1 2 2
##
## $c
## [1] 31 32 33
##
## $li1
## [1] 11 12 13 14
##
## $li2
## [1] 41 42 43 44
##
## $V1
## [1] 21 22 23 24
##
## $V2
## [1] 25 26 27 28
When a matrix (or data.frame) gets split into a list, like in the example using by(), such lists can be joined again using lrbind in an rbind-like fashion.
dat2 <- matrix(11:34, ncol=3, dimnames=list(letters[1:8], LETTERS[1:3]))
lst2 <- by(dat2, rep(1:3,c(3,2,3)), as.matrix)
lst2
## INDICES: 1
## A B C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## ------------------------------------------------------------
## INDICES: 2
## A B C
## d 14 22 30
## e 15 23 31
## ------------------------------------------------------------
## INDICES: 3
## A B C
## f 16 24 32
## g 17 25 33
## h 18 26 34
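The rbind-like re-assembly shown below is presumably obtained via:
lrbind(lst2)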
## A B C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## d 14 22 30
## e 15 23 31
## f 16 24 32
## g 17 25 33
## h 18 26 34
When list-elements have the same name, their content (of named numeric or character vectors) may get fused using fuseCommonListElem() according to the names of the list-elements:
val1 <- 10 +1:26
names(val1) <- letters
(lst1 <- list(c=val1[3:6], a=val1[1:3], b=val1[2:3] ,a=val1[12], c=val1[13]))
## $c
## c d e f
## 13 14 15 16
##
## $a
## a b c
## 11 12 13
##
## $b
## b c
## 12 13
##
## $a
## l
## 22
##
## $c
## m
## 23
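A sketch of the calls behind the next two displays:
names(lst1)               # some list-element names occur multiple times
fuseCommonListElem(lst1)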
## [1] "c" "a" "b" "a" "c"
## $c
## c d e f m
## 13 14 15 16 23
##
## $a
## a b c l
## 11 12 13 22
##
## $b
## b c
## 12 13
The function listBatchReplace() works similarly to sub() and allows searching & replacing exact matches to a character string along all elements of a list.
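A sketch of the example below (the input list is reconstructed from the display; passing the search- and replacement-value positionally is an assumption about the argument order):
(lst1 <- list(aa=1:4, bb=c("abc","efg","abhh","effge"), cc=c("abdc","efg","efgh")))
listBatchReplace(lst1, "efg", "EFG")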
## $aa
## [1] 1 2 3 4
##
## $bb
## [1] "abc" "efg" "abhh" "effge"
##
## $cc
## [1] "abdc" "efg" "efgh"
## $aa
## [1] "1" "2" "3" "4"
##
## $bb
## [1] "abc" "EFG" "abhh" "effge"
##
## $cc
## [1] "abdc" "EFG" "efgh"
Named numeric or character vectors can be organized into lists using listGroupsByNames(), based on their names (only the part before any extension starting with a point gets considered). Of course, other separators may be defined using the argument sep.
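A sketch reconstructing the example below (the named vector is an assumption based on the display):
ser1 <- 1:7
names(ser1) <- c("AA","BB","AA.1","CC","AA.b","BB.e","A")
listGroupsByNames(ser1)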
## $AA
## AA AA.1 AA.b
## 1 3 5
##
## $BB
## BB BB.e
## 2 6
##
## $CC
## CC
## 4
##
## $A
## A
## 7
If no names are present, the content of the vector itself will be used as name :
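For example (a sketch; the input vector is assumed from the display below):
listGroupsByNames((1:10)/5)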
## -> listGroupsByNames : no names found in 'x' !!
## $`0`
## 0 0 0 0
## 0.2 0.4 0.6 0.8
##
## $`1`
## 1 1 1 1 1
## 1.0 1.2 1.4 1.6 1.8
##
## $`2`
## 2
## 2
In the view of object-oriented programming several methods produce results integrated into lists or S3-objects (eg limma). The function filterList() aims to facilitate filtering all elements of lists or S3-objects. List-elements with an inappropriate number of lines will be ignored.
set.seed(2020); dat1 <- round(runif(80),2)
list1 <- list(m1=matrix(dat1[1:40],ncol=8),m2=matrix(dat1[41:80],ncol=8),other=letters[1:8])
rownames(list1$m1) <- rownames(list1$m2) <- paste0("line",1:5)
# Note: the list-element list1$other has a length different to that of filt. Thus, it won't get filtered.
filterList(list1, list1$m1[,1] >0.4) # filter according to 1st column of $m1 ...
## $m1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
##
## $m2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
##
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
## $m1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
## line5 0.14 0.62 0.41 0.27 0.88 0.56 0.00 0.22
##
## $m2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
## line5 0.62 0.17 0.83 0.49 0.86 0.17 0.53 0.72
##
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
## A B C
## a 1 5 9
## b 2 6 10
## c 3 7 11
## d 4 8 12
## List of 3
## $ A: Named num [1:4] 1 2 3 4
## ..- attr(*, "names")= chr [1:4] "A.a" "A.b" "A.c" "A.d"
## $ B: Named num [1:4] 5 6 7 8
## ..- attr(*, "names")= chr [1:4] "B.a" "B.b" "B.c" "B.d"
## $ C: Named num [1:4] 9 10 11 12
## ..- attr(*, "names")= chr [1:4] "C.a" "C.b" "C.c" "C.d"
Let’s get stared with a little toy-array:
(arr1 <- array(c(6:4,4:24), dim=c(4,3,2), dimnames=list(c(LETTERS[1:4]),
paste("col",1:3,sep=""),c("ch1","ch2"))))
## , , ch1
##
## col1 col2 col3
## A 6 5 9
## B 5 6 10
## C 4 7 11
## D 4 8 12
##
## , , ch2
##
## col1 col2 col3
## A 13 17 21
## B 14 18 22
## C 15 19 23
## D 16 20 24
Now we can obtain the CV (coefficient of variation) by splitting along the 3rd dimension (ie this is equivalent to an apply along the 3rd dimension) using arrayCV:
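The calls are not echoed; a sketch, where the argument name byDim for choosing the dimension is an assumption (the second, unnamed display presumably results from a variant of the same call):
arrayCV(arr1, byDim=3)   # byDim is an assumed argument name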
## ch1 ch2
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
## [,1] [,2]
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
Similarly we can split along any other dimension, eg the 2nd dimension :
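A sketch (again assuming the argument is called byDim):
arrayCV(arr1, byDim=2)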
## col1 col2 col3
## A 0.5210260 0.7713892 0.5656854
## B 0.6698906 0.7071068 0.5303301
## C 0.8187552 0.6527140 0.4991342
## D 0.8485281 0.6060915 0.4714045
This procedure is similar to (re-)organizing an initial array into clusters; here we split along a user-defined factor/vector. If a clustering-algorithm produces the cluster assignments, this function can be used to organize the input data accordingly using cutArrayInCluLike.
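A sketch with a hypothetical cluster assignment matching the display below (passing the cluster vector as second argument is an assumption):
clu <- setNames(c(2,1,2,1), LETTERS[1:4])   # hypothetical cluster membership for lines A-D
cutArrayInCluLike(arr1, clu)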
## $`2`
## , , ch1
##
## col1 col2 col3
## A 6 5 9
## C 4 7 11
##
## , , ch2
##
## col1 col2 col3
## A 13 17 21
## C 15 19 23
##
##
## $`1`
## , , ch1
##
## col1 col2 col3
## B 5 6 10
## D 4 8 12
##
## , , ch2
##
## col1 col2 col3
## B 14 18 22
## D 16 20 24
Let’s cut by filtering along the 3rd dimension for all lines where column ‘col2’ is >7, and then display only the content of columns ‘col1’ and ‘col2’ (using filt3dimArr
):
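A sketch of the call; the argument names (displCrit, filtCrit, filtVal) are assumptions based on the description:
filt3dimArr(arr1, displCrit=c("col1","col2"), filtCrit="col2", filtVal=7)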
## [[1]]
## col1 col2
## 4 8
##
## [[2]]
## col1 col2
## A 13 17
## B 14 18
## C 15 19
## D 16 20
Semantics: Please note that there are two ways of interpreting the term 'unique':
In its regular sense, 'unique' describes an event which occurs only once and thus does not occur/happen anywhere else.
The command unique() makes a vector with some redundant entries 'unique', ie in the resulting vector all values occur only once. However, the initially repeated values are still present, and you cannot tell any more which ones were repeated initially! Thus, the result of unique() does not tell which values were 'unique' in the first place.
In some applications (eg proteomics) initial identifiers (IDs) may occur multiple times in the data and we frequently need to isolate events/values that occur only once, following the first meaning of 'unique'. This package provides functions to easily distinguish values occurring just once (ie unique) from those occurring multiple times. Furthermore, there are functions to rename/remove/combine replicated elements.
The function table() (from the package base) is very useful to get some insights when working with smaller objects, but may be slow to handle very large objects. As mentioned, unique() will make everything unique, and afterwards you won't know any more which entries were unique in the first place! The function duplicated() (also from package base) tells us which entries are repeated.
## tr
## li0 li2 li3 n
## 1 2 2 5
## [1] "li0" "n" NA "li2" "li3"
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## a b c d e f g h i j k l
## 11 12 13 14 15 16 NA 14 13 12 NA 14
findRepeated() (from this package) will return the position/index (and content/value) of repeated elements. However, the output in form of a list is not very convenient to the human eye.
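A sketch (the vector dat is reconstructed from the display above and is an assumption):
dat <- c(a=11, b=12, c=13, d=14, e=15, f=16, g=NA, h=14, i=13, j=12, k=NA, l=14)
findRepeated(dat)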
## $`12`
## [1] 2 10
##
## $`13`
## [1] 3 9
##
## $`14`
## [1] 4 8 12
firstOfRepeated() tells the index of the first instance of repeated elements, which elements you need to keep to make the vector 'unique', and which elements get stripped off when making it unique. Please note that NAs (no matter whether they occur once or more times) are automatically placed in the part suggested to be removed.
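A sketch of the calls presumably behind the next displays:
firstOfRepeated(dat)
dat[firstOfRepeated(dat)$indUniq]   # truly unique values plus 1st of repeated, NAs dropped
unique(dat)                         # unique() does not tell which values were repeated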
## $indRepeated
## 12 13 14
## 2 3 4
##
## $indUniq
## a b c d e f
## 1 2 3 4 5 6
##
## $indRedund
## g h i j k l
## 7 8 9 10 11 12
## a b c d e f
## 11 12 13 14 15 16
## [1] 11 12 13 14 15 16 NA
If necessary a counter will be added to non-unique entries; thus no individual values get eliminated, and the length and order of the resulting object remain the same.
This is of importance when assigning rownames to a data.frame: assigning redundant values/text as rownames of a data.frame will result in an error!
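The display below presumably comes from correctToUnique() (the second variant apparently used a different separator and left NAs untouched); a sketch:
correctToUnique(dat)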
## a b c d e f g h i j k
## "11" "12_1" "13_1" "14_1" "15" "16" "NA_1" "14_2" "13_2" "12_2" "NA_2"
## l
## "14_3"
## a b c d e f g h i j k
## "11" "12.1" "13.1" "14.1" "15" "16" NA "14.2" "13.2" "12.2" NA
## l
## "14.3"
First, the truly unique values are reported and then the first occurrence of repeated elements is given; NA instances get ignored.
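A minimal sketch of how this might be called on the toy vector (the exact call and arguments behind the displays below are not shown; the use of nonAmbiguousNum() is an assumption):
unique(dat)
nonAmbiguousNum(dat)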
## [1] 11 12 13 14 15 16 NA
## a e f amb_b amb_c
## 11 15 16 12 13
## $unique
## a e f
## 11 15 16
##
## $ambig
## amb_b amb_c
## 12 13
Here, it is supposed that you want to join 2 or more matrices describing different properties of the same collection of individuals (as rows). Common column-names are interpreted as indicating that their respective information should be combined (either as average or as sum).
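A hypothetical reconstruction of the two matrices displayed below:
(ma1 <- matrix(1:6, ncol=3, dimnames=list(1:2, c("C","B","A"))))
(ma2 <- matrix(11:16, ncol=3, dimnames=list(1:2, c("C","D","E"))))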
## C B A
## 1 1 3 5
## 2 2 4 6
## C D E
## 1 11 13 15
## 2 12 14 16
## now we can join 2 or more matrixes
cbindNR(ma1, ma2, summarizeAs="mean") # average of both columns 'C'
## -> cbindNR : treating 5 different (types of) columns : C B A D E
## -> cbindNR : sorting columns of output
## A B C D E
## 1 5 3 6 13 15
## 2 6 4 7 14 16
This resembles the functioning of unique(), but applies to a user-specified column of the matrix.
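This presumably refers to firstLineOfDat(); a sketch, where both the toy matrix and the argument name refCol are assumptions:
(mat1 <- matrix(c(1:6, rep(1:3, 1:3)), ncol=2, dimnames=list(letters[1:6], LETTERS[1:2])))
firstLineOfDat(mat1, refCol=2)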
## A B
## a 1 1
## b 2 2
## c 3 2
## d 4 3
## e 5 3
## f 6 3
## A B
## a 1 1
## b 2 2
## d 4 3
This function was rather designed for dealing with character input; it allows concatenating all columns and removing redundant lines.
mat2 <- matrix(c("e","n","a","n","z","z","n","z","z","b",
"","n","c","n","","","n","","","z"), ncol=2)
firstOfRepLines(mat2, out="conc")
## [1] "e" "nn" "ac" "z" "bz"
## [1] 1 2 3 5 10
This function also includes a counter of redundant instances encountered (for the 1st column specified).
## xA xB xC
## 1 a h A
## 2 b h B
## 3 c f C
## 4 d e D
## 5 e f E
## xA xB xC nSamePep concID
## 1 a h A 2 C//E
## 3 c f C 2 A//B
## 4 d e D 1 D
## xA xB xC
## 1 a h A
## 3 c f C
## 4 d e D
## xA xB xC
## 1 a h A
## 3 c f C
## 4 d e D
mat2 <- cbind(no=as.character(1:20), seq=sample(LETTERS[1:15], 20, repl=TRUE),
ty=sample(c("full","Nter","inter"),20,repl=TRUE), ambig=rep(NA,20), seqNa=1:20)
(mat2uniq <- get1stOfRepeatedByCol(mat2, sortBy="seq", sortSupl="ty"))
## no seq ty ambig seqNa
## [1,] "6" "M" "Nter" NA "6"
## [2,] "11" "C" "inter" NA "11"
## [3,] "12" "N" "Nter" NA "12"
## [4,] "17" "J" "full" NA "17"
## [5,] "18" "A" "full" NA "18"
## [6,] "19" "O" "Nter" NA "19"
## [7,] "7" "B" "Nter" "TRUE" "_7"
## [8,] "10" "D" "full" "TRUE" "_10"
## [9,] "8" "E" "full" "TRUE" "_8"
## [10,] "9" "F" "full" "TRUE" "_9"
## [11,] "13" "G" "Nter" "TRUE" "_13"
## [12,] "3" "H" "Nter" "TRUE" "_3"
##
## A B C D E F G H J M N O
## 1 1 1 1 1 1 1 1 1 1 1 1
# This will return all first repeated (may be >1) but without further sorting
# along column 'ty' nor marking in column 'ambig'.
mat2[which(duplicated(mat2[,2],fromLast=FALSE)),]
## no seq ty ambig seqNa
## [1,] "5" "H" "Nter" NA "5"
## [2,] "8" "E" "full" NA "8"
## [3,] "9" "F" "full" NA "9"
## [4,] "13" "G" "Nter" NA "13"
## [5,] "14" "D" "full" NA "14"
## [6,] "15" "D" "full" NA "15"
## [7,] "16" "B" "Nter" NA "16"
## [8,] "20" "F" "full" NA "20"
## A B
## 1 1 1
## amb_3 3 2
## amb_6 6 3
Here is another example; ambiguous entries will be marked by an '_':
set.seed(2017); mat3 <- matrix(c(1:100,round(rnorm(200),2)),ncol=3,
dimnames=list(1:100,LETTERS[1:3]));
head(mat3U <- nonAmbiguousMat(mat3, by="B", na="_", uniqO=FALSE), n=15)
## A B C
## 81 81 -2.59 -0.14
## 93 93 -2.02 -0.03
## 7 7 -1.96 0.52
## 4 4 -1.76 0.84
## _74 74 -1.65 0.30
## 55 55 -1.59 1.25
## 52 52 -1.58 -0.24
## 15 15 -1.43 -0.60
## 98 98 -1.34 0.41
## 63 63 -1.33 0.26
## 19 19 -1.13 0.70
## 41 41 -1.06 -0.56
## _56 56 -1.03 -1.07
## 94 94 -0.98 -0.02
## 95 95 -0.97 0.08
## A B C
## 1 1 1.43 0.02
## 2 2 -0.08 1.38
## 3 3 0.74 -0.07
## 4 4 -1.76 0.84
## 5 5 -0.07 -0.97
## 6 6 0.45 -1.97
lst2 <- list(aa_1x=matrix(1:12,nrow=4,byrow=TRUE),ab_2x=matrix(24:13,nrow=4,byrow=TRUE))
combineReplFromListToMatr(lst2)
## $a
## 1
## [1,] 1
## [2,] 4
## [3,] 7
## [4,] 10
## [5,] 2
## [6,] 5
## [7,] 8
## [8,] 11
## [9,] 3
## [10,] 6
## [11,] 9
## [12,] 12
##
## $b
## 2
## [1,] 24
## [2,] 21
## [3,] 18
## [4,] 15
## [5,] 23
## [6,] 20
## [7,] 17
## [8,] 14
## [9,] 22
## [10,] 19
## [11,] 16
## [12,] 13
mat4 <- matrix(rep(c(1,1:3,3,1),2),ncol=2,dimnames=list(letters[1:6],LETTERS[1:2]))
nonRedundLines(mat4)
## A B
## a 1 1
## c 2 2
## d 3 3
## f 1 1
# input: c and dd are repeated :
filtSizeUniq(list(A="a",B=c("b","bb","c"),D=c("dd","d","ddd","c")),filtUn=TRUE,minSi=NULL)
## -> filtSizeUniq : 2 out of 8 peptides redundant
## $A
## A
## "a"
##
## $B
## B.1 B.2
## "b" "bb"
##
## $D
## D.1 D.2 D.3
## "dd" "d" "ddd"
# here a,b,c and dd are repeated :
filtSizeUniq(list(A="a",B=c("b","bb","c"),D=c("dd","d","ddd","c")),ref=c(letters[c(1:26,1:3)],
"dd","dd","bb","ddd"),filtUn=TRUE,minSi=NULL)
## -> filtSizeUniq : 8 out of 8 peptides redundant
## $A
## character(0)
##
## $B
## character(0)
##
## $D
## character(0)
t3 <- data.frame(ref=rep(11:15,3),tx=letters[1:15],
matrix(round(runif(30,-3,2),1),nc=2), stringsAsFactors=FALSE)
# First we split the data.frame in list
by(t3,t3[,1],function(x) x)
## t3[, 1]: 11
## ref tx X1 X2
## 1 11 a 0.4 -1.1
## 6 11 f 0.6 1.0
## 11 11 k 0.1 1.2
## ------------------------------------------------------------
## t3[, 1]: 12
## ref tx X1 X2
## 2 12 b 2.0 -0.4
## 7 12 g -0.3 1.8
## 12 12 l -1.4 0.3
## ------------------------------------------------------------
## t3[, 1]: 13
## ref tx X1 X2
## 3 13 c 0.8 -1.6
## 8 13 h 0.8 -2.4
## 13 13 m 0.9 1.8
## ------------------------------------------------------------
## t3[, 1]: 14
## ref tx X1 X2
## 4 14 d 1.7 0.7
## 9 14 i -1.6 -2.6
## 14 14 n 1.4 -1.1
## ------------------------------------------------------------
## t3[, 1]: 15
## ref tx X1 X2
## 5 15 e -0.9 0.4
## 10 15 j -1.7 0.0
## 15 15 o -1.2 -1.8
## [,1] [,2] [,3] [,4]
## 11 11 "a" 0.1 1.2
## 12 12 "b" -0.3 1.8
## 13 13 "c" 0.8 -2.4
## 14 14 "d" -1.6 -2.6
## 15 15 "e" -1.2 -1.8
## -> makeNRedMatr : Common summarization method 'mean', run as batch
## -> makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'mean', 'mean', 'mean' and 'mean' yielding 4 cols
## ID ref tx X1 X2 nRedLi
## 11 11 11 a 0.3666667 0.3666667 3
## 12 12 12 b 0.1000000 0.5666667 3
## 13 13 13 c 0.8333333 -0.7333333 3
## 14 14 14 d 0.5000000 -1.0000000 3
## 15 15 15 e -1.2666667 -0.4666667 3
## -> makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'maxAbsOfRef' and col 'X1' yielding 4 cols
## ref tx X1 X2 nRedLi
## 11 11 "a" 0.6 1 3
## 12 12 "b" 2 -0.4 3
## 13 13 "c" 0.9 1.8 3
## 14 14 "d" 1.7 0.7 3
## 15 15 "e" -1.7 0 3
matr <- matrix(c(letters[1:6],"h","h","f","e",LETTERS[1:5]), ncol=3,
dimnames=list(letters[11:15],c("xA","xB","xC")))
combineRedBasedOnCol(matr, colN="xB")
## xA xB xC
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e" "e" "E"
## xA xB xC
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e" "e" "E"
x <- 1
dat1 <- matrix(1:10,ncol=2)
rownames(dat1) <- letters[c(1:3,2,5)]
## as.data.frame(dat1) ... would result in an error
convMatr2df(dat1)
## ID X1 X2
## a a 1 6
## b_1 b 2 7
## c c 3 8
## b_2 b 4 9
## e e 5 10
## ID a b c
## 1 1 0.5 A 1
## 2 2 1.0 B 2
## 3 3 1.5 C 3
tmp <- data.frame(a=as.character((1:3)/2), b=LETTERS[1:3],c=1:3, stringsAsFactors=FALSE)
convMatr2df(tmp)
## ID a b c
## 1 1 0.5 A 1
## 2 2 1.0 B 2
## 3 3 1.5 C 3
## ID a b
## 1 1 0.5 1
## 2 2 1.0 2
## 3 3 1.5 3
set.seed(2013)
datT2 <- matrix(round(rnorm(200)+3,1), ncol=2, dimnames=list(paste("li",1:100,sep=""),
letters[23:24]))
# (mimick) some short and longer names for each line
inf2 <- cbind(sh=paste(rep(letters[1:4],each=26),rep(letters,4),1:(26*4),sep=""),
lo=paste(rep(LETTERS[1:4],each=26), rep(LETTERS,4), 1:(26*4), ",",
rep(letters[sample.int(26)],4), rep(letters[sample.int(26)],4), sep=""))[1:100,]
## We'll use this to test :
head(datT2,n=10)
## w x
## li1 2.9 3.7
## li2 3.8 3.3
## li3 2.3 3.3
## li4 4.4 3.1
## li5 4.5 1.8
## li6 0.4 2.4
## li7 3.7 3.3
## li8 3.3 4.0
## li9 5.0 4.1
## li10 1.6 1.1
## let's assign to each pair of x & y values a 'cluster' (column _clu_, the column _combInf_ tells us which lines/indexes are in this cluster)
head(combineOverlapInfo(datT2, disThr=0.03), n=10)
## w x combInf clu isComb
## li1 2.9 3.7 1+16+22+47+91 1 TRUE
## li2 3.8 3.3 2+7+48+54 2 TRUE
## li3 2.3 3.3 3+66+92 3 TRUE
## li4 4.4 3.1 4 52 FALSE
## li5 4.5 1.8 5 53 FALSE
## li6 0.4 2.4 6 54 FALSE
## li7 3.7 3.3 2+7+48+54 2 TRUE
## li8 3.3 4.0 8+100 4 TRUE
## li9 5.0 4.1 9 55 FALSE
## li10 1.6 1.1 10 56 FALSE
## it is also possible to rather display names (eg gene or protein-names) instead of index values
head(combineOverlapInfo(datT2, suplI=inf2[,2], disThr=0.03), n=10)
## w x combInf clu isComb
## li1 2.9 3.7 AA1+AP16+AV22+BU47+DM91 1 TRUE
## li2 3.8 3.3 AB2+AG7+BV48+CB54 2 TRUE
## li3 2.3 3.3 AC3+CN66+DN92 3 TRUE
## li4 4.4 3.1 AD4,ww 52 FALSE
## li5 4.5 1.8 AE5,aj 53 FALSE
## li6 0.4 2.4 AF6,nl 54 FALSE
## li7 3.7 3.3 AB2+AG7+BV48+CB54 2 TRUE
## li8 3.3 4.0 AH8+DV100 4 TRUE
## li9 5.0 4.1 AI9,ic 55 FALSE
## li10 1.6 1.1 AJ10,ee 56 FALSE
dat <- 11:19
names(dat) <- letters[c(6:3,2:4,8,3)]
## Here the names are not unique.
## Thus, the values can be binned by their (non-unique) names and a representative value calculated.
## Let's make a 'datUniq' with the mean of each group of values :
datUniq <- round(tapply(dat, names(dat),mean),1)
## now we propagate the mean values to the full vector
getValuesByUnique(dat, datUniq)
## f e d c b c d h c
## 11.0 12.0 15.0 16.3 15.0 16.3 15.0 18.0 16.3
cbind(ini=dat,firstOfRep=getValuesByUnique(dat, datUniq),
indexUniq=getValuesByUnique(dat, datUniq,asIn=TRUE))
## ini firstOfRep indexUniq
## f 11 11.0 5
## e 12 12.0 4
## d 13 15.0 3
## c 14 16.3 2
## b 15 15.0 1
## c 16 16.3 2
## d 17 15.0 3
## h 18 18.0 6
## c 19 16.3 2
For example, if you wish to create group-labels considering the eye- and hair-color of a small group of students (supposing a sort of controlled vocabulary was used), the function combineByEitherFactor() will help. So basically, this is an empiric segmentation-approach for two categorical variables. Please note that with large data-sets and very disperse data this approach will not provide great results. In the example below we'll attempt to 'cluster' according to columns nn and qq; the resulting cluster number can be found in column grp.
nn <- rep(c("a","e","b","c","d","g","f"),c(3,1,2,2,1,2,1))
qq <- rep(c("m","n","p","o","q"),c(2,1,1,4,4))
nq <- cbind(nn,qq)[c(4,2,9,11,6,10,7,3,5,1,12,8),]
## Here we consider 2 columns 'nn' and 'qq' when trying to regroup common values
## (eg value 'a' from column 'nn' and value 'o' from 'qq')
combineByEitherFactor(nq,1,2,nBy=FALSE)
## nn qq grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
The argument nBy simply allows adding an additional column with the group/cluster-number.
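For example:
combineByEitherFactor(nq, 1, 2, nBy=TRUE)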
## nn qq grp nGrp
## m2 "a" "m" "1" "3"
## q2 "f" "q" "3" "4"
## q1 "d" "q" "3" "4"
## o2 "b" "o" "2" "4"
## p "e" "p" "4" "1"
## o4 "c" "o" "2" "4"
## q4 "g" "q" "3" "4"
## n "a" "n" "1" "3"
## m1 "a" "m" "1" "3"
## q3 "g" "q" "3" "4"
## o1 "b" "o" "2" "4"
## o3 "c" "o" "2" "4"
## Not running further iterations works faster, but you may not reach 'convergence' immediately
combineByEitherFactor(nq,1,2,nBy=FALSE)
## nn qq grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
## another example
mm <- rep(c("a","b","c","d","e"),c(3,4,2,3,1))
pp <- rep(c("m","n","o","p","q"),c(2,2,2,2,5))
combineByEitherFactor(cbind(mm,pp),1,2, con=FALSE, nBy=TRUE)
## -> combineByEitherFactor : did not reach convergence at 2nd pass
## mm pp grp nGrp
## m1 "a" "m" "1" "4"
## m2 "a" "m" "1" "4"
## n1 "a" "n" "1" "4"
## n2 "b" "n" "1" "4"
## o1 "b" "o" "2" "4"
## o2 "b" "o" "2" "4"
## p1 "b" "p" "2" "4"
## p2 "c" "p" "2" "4"
## q1 "c" "q" "3" "5"
## q2 "d" "q" "3" "5"
## q3 "d" "q" "3" "5"
## q4 "d" "q" "3" "5"
## q5 "e" "q" "3" "5"
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## va simil
## [1,] 4.000035 0
## [2,] 5.000070 0
## [3,] 6.000105 0
## [4,] 7.000140 1
## [5,] 7.000175 1
## [6,] 7.000210 1
## [7,] 7.000245 1
## [8,] 7.000280 1
## [9,] 8.000315 0
## [10,] 9.000350 0
## [11,] 10.000385 0
The search for similar values may be performed as absolute distance or as 'ppm' (as is eg usual in proteomics when comparing measured and theoretically expected masses).
aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,15.9,13.5,15.7,14.1,5)
(cloMa <- findCloseMatch(x=aA, y=cC, com="diff", lim=0.5, sor=FALSE))
## $x2
## y4
## 0.5
##
## $x3
## y4 y6
## -0.5 0.5
##
## $x4
## y6 y8
## -0.5 0.1
##
## $x6
## y1 y5 y7
## 0.2 -0.1 -0.3
The result of findCloseMatch() is a list organized by each 'x', listing all instances of 'y' found within the distance tolerance given by lim. Using closeMatchMatrix() the result obtained above can be presented in a format more convenient for the human eye.
# all matches (of 2d arg) to/within limit for each of 1st arg ('x'); 'y' ..to 2nd arg = cC
# first let's display only one single closest/best hit
(maAa <- closeMatchMatrix(cloMa, aA, cC, lim=TRUE)) #
## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 8 14.1 -0.1 -7092.2 2 1 1
## [5,] 6 16 5 15.9 0.1 6289.3 3 1 1
Using the argument limitToBest=FALSE we can display all distances within the limits imposed; some values/points may then occur multiple times. For example, value number 4 of 'cC' (=12.5) or value number 3 of 'aA' (=13) now occur multiple times…
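A sketch of the call behind the table below:
(maAa <- closeMatchMatrix(cloMa, aA, cC, limitToBest=FALSE))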
## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 6 13.5 0.5 37037.0 2 0 0
## [5,] 4 14 8 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 16 7 15.7 0.3 19108.0 3 0 0
## [7,] 6 16 5 15.9 0.1 6289.3 3 1 1
## [8,] 6 16 1 16.2 -0.2 -12346.0 3 0 0
(maAa <- closeMatchMatrix(cloMa, cbind(valA=81:87,aA), cbind(valC=91:99,cC), colM=2,
colP=2,lim=FALSE))
## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA aA id.meas valC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 82 12 4 94 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 83 13 4 94 12.5 0.5 40000.0 2 1 2
## [3,] 3 83 13 6 96 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 84 14 6 96 13.5 0.5 37037.0 2 0 0
## [5,] 4 84 14 8 98 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 86 16 7 97 15.7 0.3 19108.0 3 0 0
## [7,] 6 86 16 5 95 15.9 0.1 6289.3 3 1 1
## [8,] 6 86 16 1 91 16.2 -0.2 -12346.0 3 0 0
## .. xxidentToMatr2a
## .. xxidentToMatr2c
## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## .. xxidentToMatr2d
## .. xxidentToMatr2e
## .. xxidentToMatr2f
## id.pred aA valA id.meas measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 82 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 83 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 83 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 84 6 13.5 0.5 37037.0 2 0 0
## [5,] 4 14 84 8 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 16 86 7 15.7 0.3 19108.0 3 0 0
## [7,] 6 16 86 5 15.9 0.1 6289.3 3 1 1
## [8,] 6 16 86 1 16.2 -0.2 -12346.0 3 0 0
a2 <- aA; names(a2) <- letters[1:length(a2)]; c2 <- cC; names(c2) <- letters[10+1:length(c2)]
(cloM2 <- findCloseMatch(x=a2, y=c2, com="diff", lim=0.5, sor=FALSE))
## $b
## n
## 0.5
##
## $c
## n p
## -0.5 0.5
##
## $d
## p r
## -0.5 0.1
##
## $f
## k o q
## 0.2 -0.1 -0.3
(maA2 <- closeMatchMatrix(cloM2, predM=cbind(valA=81:87,a2),
measM=cbind(valC=91:99,c2), colM=2, colP=2, lim=FALSE, asData=TRUE))
## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp isMin nBest
## b b 82 12 n 94 12.5 -0.5 -40000.0 1 1 1
## c_1 c 83 13 n 94 12.5 0.5 40000.0 2 1 2
## c_2 c 83 13 p 96 13.5 -0.5 -37037.0 2 1 2
## d_1 d 84 14 p 96 13.5 0.5 37037.0 2 0 0
## d_2 d 84 14 r 98 14.1 -0.1 -7092.2 2 1 1
## f_1 f 86 16 q 97 15.7 0.3 19108.0 3 0 0
## f_2 f 86 16 o 95 15.9 0.1 6289.3 3 1 1
## f_3 f 86 16 k 91 16.2 -0.2 -12346.0 3 0 0
(maA2 <- closeMatchMatrix(cloM2, cbind(id=names(a2),valA=81:87,a2), cbind(id=names(c2),
valC=91:99,c2), colM=3, colP=3, lim=FALSE, deb=FALSE))
## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp
## b "b" "82" "12" "n" "94" "12.5" "-0.5" "-40000" "1"
## c "c" "83" "13" "n" "94" "12.5" "0.5" "40000" "2"
## c "c" "83" "13" "p" "96" "13.5" "-0.5" "-37037" "2"
## d "d" "84" "14" "p" "96" "13.5" "0.5" "37037" "2"
## d "d" "84" "14" "r" "98" "14.1" "-0.0999999999999996" "-7092.2" "2"
## f "f" "86" "16" "q" "97" "15.7" "0.300000000000001" "19108" "3"
## f "f" "86" "16" "o" "95" "15.9" "0.0999999999999996" "6289.3" "3"
## f "f" "86" "16" "k" "91" "16.2" "-0.199999999999999" "-12346" "3"
## isMin nBest
## b "1" "1"
## c "1" "2"
## c "1" "2"
## d "0" "0"
## d "1" "1"
## f "0" "0"
## f "1" "1"
## f "0" "0"
aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,12.6,15.9,14.1)
aZ <- matrix(c(aA,aA+20), ncol=2, dimnames=list(letters[1:length(aA)],c("aaA","aZ")))
cZ <- matrix(c(cC,cC+20), ncol=2, dimnames=list(letters[1:length(cC)],c("ccC","cZ")))
findCloseMatch(cC,aA,com="diff",lim=0.5,sor=FALSE)
## $x1
## y6
## -0.2
##
## $x4
## y2 y3
## -0.5 0.5
##
## $x5
## y3
## 0.4
##
## $x6
## y6
## 0.1
##
## $x7
## y4
## -0.1
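The two tables below presumably come from findSimilFrom2sets(), run in both directions; a sketch (the argument values are assumptions, the abbreviated argument names are taken from the call shown further below):
findSimilFrom2sets(aA, cC, comp="diff", lim=0.5)
findSimilFrom2sets(cC, aA, comp="diff", lim=0.5)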
## aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## cC predMatr[, ] aA measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 1 16.2 6 16 0.2 12500.0 1 1 1
## [2,] 4 12.5 2 12 0.5 41667.0 2 1 2
## [3,] 4 12.5 3 13 -0.5 -38462.0 2 1 2
## [4,] 5 12.6 3 13 -0.4 -30769.0 1 1 1
## [5,] 6 15.9 6 16 -0.1 -6250.0 1 1 1
## [6,] 7 14.1 4 14 0.1 7142.9 1 1 1
## aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## xxfindSimilFrom2sets2
## .. xxidentToMatr2a
## .. xxidentToMatr2c
## .. xxidentToMatr2d
## .. xxidentToMatr2e
## .. xxidentToMatr2f
## aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## [7,] 7 17 1 16.2 0.8 49383.0 1 1 1
## aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 2 1 1
## [2,] 2 12 5 12.6 -0.6 -47619.0 3 0 0
## [3,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [4,] 3 13 5 12.6 0.4 31746.0 3 1 1
## [5,] 3 13 7 14.1 -1.1 -78014.0 3 0 0
## [6,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [7,] 5 15 7 14.1 0.9 63830.0 3 1 2
## [8,] 5 15 6 15.9 -0.9 -56604.0 3 1 2
## [9,] 5 15 1 16.2 -1.2 -74074.0 2 0 0
## [10,] 6 16 6 15.9 0.1 6289.3 3 0 0
## [11,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## [12,] 7 17 6 15.9 1.1 69182.0 2 1 0
## [13,] 7 17 1 16.2 0.8 49383.0 2 1 2
# below: find fewer 'best matches' since search window larger (ie more good hits compete !)
findSimilFrom2sets(aA,cC,comp="ppm",lim=9e4,bestO=TRUE)
## aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 2 1 1
## [2,] 3 13 5 12.6 0.4 31746.0 3 1 1
## [3,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [4,] 5 15 7 14.1 0.9 63830.0 3 1 2
## [5,] 5 15 6 15.9 -0.9 -56604.0 3 1 2
## [6,] 7 17 6 15.9 1.1 69182.0 2 1 0
## [7,] 7 17 1 16.2 0.8 49383.0 2 1 2
When you have already identified the closest neighbour of a set of values, you may want to re-organize/fuse such pairs to a given number of total clusters.
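This is what fusePairs() does; a sketch, where the pair-matrix is reconstructed from the display and the argument maxFuse (limiting the cluster size) is an assumption:
(daPa <- matrix(c(1:5, 8, 2:6, 9), ncol=2))
fusePairs(daPa, maxFuse=4)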
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
## [4,] 4 5
## [5,] 5 6
## [6,] 8 9
## 1 2 3 4 4 5 6 8 9
## 1 1 1 1 2 2 2 3 3
When visualizing larger data-sets in an x&y space one may find many points overlapping when their values are almost the same. The function elimCloseCoord() aims to reduce a bivariate data-set to 'non-overlapping' points, somewhat similar to human perception.
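A sketch with hypothetical data resembling the display below (3 repeats of 5 nearly identical points, two of them slightly shifted):
da1 <- cbind(rep(c(0,99,198,297,396), 3), c(0:4, 0:4, 0, 1.1, 2, 3, 4.5))
head(da1)
elimCloseCoord(da1)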
## [,1] [,2]
## [1,] 0 0
## [2,] 99 1
## [3,] 198 2
## [4,] 297 3
## [5,] 396 4
## [6,] 0 0
## -> elimCloseCoord : reducing 'x' from 15 to 7 lines
## [,1] [,2]
## 1 0 0.0
## 2 99 1.0
## 3 198 2.0
## 4 297 3.0
## 5 396 4.0
## 12 99 1.1
## 15 396 4.5
Some programs produce a series of csv files, where a large experiment/data-set was split into multiple files. The function readCsvBatch() was designed for reading multiple csv files of exactly the same layout and joining their content. As output, a list with the content of each file can be produced (one matrix per file), or the data may be fused into an array, as shown below.
path1 <- system.file("extdata",package="wrMisc")
fiNa <- c("pl01_1.csv","pl01_2.csv","pl02_1.csv","pl02_2.csv")
datAll <- readCsvBatch(fiNa,path1)
## -> readCsvBatch : ready to start reading 4 files; expecting header : TRUE
## -> readCsvBatch : csv-format used 'Europe' with 96 lines & 3 cols
## -> readCsvBatch : ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpCSKQHa/Rinst495424913e00/wrMisc/extdata/pl01_1.csv
## -> readCsvBatch -> .removeEmptyCol : columns no 1, 2 were considered empty and removed
## -> readCsvBatch : ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpCSKQHa/Rinst495424913e00/wrMisc/extdata/pl01_2.csv
## -> readCsvBatch : ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpCSKQHa/Rinst495424913e00/wrMisc/extdata/pl02_1.csv
## -> readCsvBatch : ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpCSKQHa/Rinst495424913e00/wrMisc/extdata/pl02_2.csv
## num [1:96, 1:4, 1] 158808 174272 183176 175752 49272 ...
## - attr(*, "dimnames")=List of 3
## ..$ : chr [1:96] "A01" "B01" "C01" "D01" ...
## ..$ : chr [1:4] "1_1.csv" "1_2.csv" "2_1.csv" "2_2.csv"
## ..$ : chr "StainA"
When setting the first argument fileNames to NULL, you can read all files of a given path.
Sometimes we may get confronted with data which look like 'incomplete' tables, where some rows do not contain as many elements/columns as other rows. Files with this type of data pose a problem for read.table() (from the utils package). The function readVarColumns() (from this package) was designed to provide help in such cases. Basically each line is read and parsed separately; the user should check that the proper separator is used.
The example below lists people's names in different locations; some locations have more persons. Sometimes exporting such data will generate shorter lines for locations with fewer elements (here 'London'), and no additional separators will get added (to mark the empty fields) towards the end. The function readVarColumns() provides help to read such data when the content (and separators) of the last columns are missing.
path1 <- system.file("extdata",package="wrMisc")
fiNa <- "Names1.tsv"
datAll <- readVarColumns(fiName=file.path(path1,fiNa), sep="\t")
## -> readTsv2 : setting 'refCo' to 'Location'
## chr [1:2, 1:5] "Paris" "London" "Caroline" "James" "Marie" "Stella" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "Paris" "London"
## ..$ : chr [1:5] "Location" "Names" "Names_2" "Names_3" ...
The main reason for normalization is to remove variability in the data which is not directly linked to the (original/biological) concept of a given experiment. High-throughput data from real-world measurements may easily contain various deformations due to technical reasons, eg slight temperature variations, electromagnetic interference, instability of reagents etc. In particular, transferring constant amounts of liquids/reagents in highly repeated steps over large experiments is often very challenging; small variations of the amounts of liquid (or similar) are typically addressed by normalization. However, applying aggressive normalization to the data also brings a considerable risk of losing some of the effects one intended to study. At some point it may be better to eliminate a few samples or branches of an experiment to avoid too invasive an intervention. This shows that quality control can be tightly linked to decisions about data-normalization. In conclusion, normalization may be far more challenging than simply running some algorithms.
In general the user has to assume/define some hypothesis to justify intervention. Sometimes specific elements of an experiment are known not to be affected and can therefore be used to normalize the rest. Eg, if you observe the growth of trees in a forest, big blocks of rock on the ground can be assumed not to change their location. So one could use them as alignment-marks to superpose pictures taken at slightly different positions.
The hypothesis of no global changes is very common: during the course of many biological experiments (eg change of nutrient) one assumes that only a small portion of the elements measured (eg the abundance of all different gene-products) change, since many processes of a living cell like growth, replication and interaction with neighbour-cells are assumed not to be affected. So, if one assumes that there are no global changes, one normalizes the input-data in a way that the average or median across each experiment will give the same value. In analogy, if one takes photographs on a partially cloudy day, most cameras will adjust light settings (sun or clouds) so that global luminosity stays the same. However, if too many of the measured elements are affected, this normalization approach will lead to (additional) loss of information.
I also suggest including graphical representation(s) of the data, helping to visualize the variability in the various samples of a given experiment and how different normalization procedures affect these outcomes.
Before jumping into normalization it may be quite useful to filter the data first. The overall idea is that most high-throughput experiments do produce some non-meaningful data (artefacts), and it may be wise to remove such 'bad' data first as they may affect normalization (in particular extreme values). A special case of such problematic data concerns NA-values.
Frequent NA-values may represent another potential issue. With NA-values there is no general optimal advice. To get started, you should try to investigate how and why NA-values occurred, to check whether there is a special 'meaning' to them. For example, on some measurement systems values below the detection limit may simply be reported as NAs. If the lines of your data represent different quantified features (eg proteins), then lines with mostly NA-values represent features that may not be well exploited anyway. Therefore one frequently tries to filter away such lines of 'bad' data. Of course, if there is a column (sample) with an extremely high content of NAs, one should also investigate what might be particular about this column (sample), to see if one might be better off eliminating the entire column.
The function presenceFilt() allows eliminating lines containing too many NA-values.
dat1 <- matrix(1:56,ncol=7)
dat1[c(2,3,4,5,6,10,12,18,19,20,22,23,26,27,28,30,31,34,38,39,50,54)] <- NA
dat1; presenceFilt(dat1,gr=gl(3,3)[-(3:4)],maxGr=0)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 9 17 25 33 41 49
## [2,] NA NA NA NA NA 42 NA
## [3,] NA 11 NA NA 35 43 51
## [4,] NA NA NA NA 36 44 52
## [5,] NA 13 21 29 37 45 53
## [6,] NA 14 NA NA NA 46 NA
## [7,] 7 15 NA NA NA 47 55
## [8,] 8 16 24 32 40 48 56
## 1-2 1-3 2-3
## [1,] TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE TRUE TRUE
## [4,] FALSE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE
## [7,] TRUE TRUE FALSE
## [8,] TRUE TRUE TRUE
## -> presenceFilt : correcting 'maxGrpMiss' for group(s) 1 and 2 due to ratMaxNA=0.1
## 1-2
## [1,] TRUE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] TRUE
## [6,] FALSE
## [7,] FALSE
## [8,] TRUE
## 1-2
## [1,] TRUE
## [2,] FALSE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
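The messages and output below presumably come from cleanReplicates(), which replaces (single) outlier values among replicates by NA; a sketch with hypothetical data reconstructed from the display (the argument nOutl is an assumption):
mat9 <- matrix(c(19,20,30, 18,19,28, 16,14,35), ncol=3)
cleanReplicates(mat9, nOutl=1)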
## -> cleanReplicates : rownames of 'x' either NULL or not unique, replacing by row-numbers
## -> cleanReplicates : removing 1 entries in lines 2
## [,1] [,2] [,3]
## 1 19 18 16
## 2 20 19 NA
## 3 30 28 35
Please note that imputing NA-values represents another option instead of filtering; multiple other packages address this in detail. All decisions about which approach to use should be data-driven.
In biological high-throughput data columns typically represent different samples, which may be organized as replicates. During high-throughput experiments thousands of (independent) elements are measured (eg abundance of gene-products), they are represented by rows. Note, that some experiments may produce a considerable amount of missing data (NAs) which require special attention (dedicated developments exist in other R-packages eg in wrProteo).
My general advice is to first carefully look where such missing data is observed and to pay attention to replicate measurements where a given element was once measured with a real numeric value and once as missing information (NA). Of course, graphical representations (PCA, MA-plots, etc) are frequently extremely important for identifying abnormalities and potential problems. The package wrGraph also offers complementary options useful in the context of normalization.
set.seed(2015); rand1 <- round(runif(300)+rnorm(300,0,2),3)
dat1 <- cbind(ser1=round(100:1+rand1[1:100]),ser2=round(1.2*(100:1+rand1[101:200])-2),
ser3=round((100:1+rand1[201:300])^1.2-3))
dat1 <- cbind(dat1,ser4=round(dat1[,1]^seq(2,5,length.out=100)+rand1[11:110],1))
dat1[dat1 <1] <- NA
## Let's get a quick overview of the data
summary(dat1)
## ser1 ser2 ser3 ser4
## Min. : 2.00 Min. : 1.00 Min. : 1.0 Min. : 37.5
## 1st Qu.: 26.75 1st Qu.: 28.00 1st Qu.: 50.0 1st Qu.: 67210.0
## Median : 49.50 Median : 59.00 Median :109.0 Median : 332524.0
## Mean : 51.14 Mean : 58.79 Mean :115.1 Mean : 542279.4
## 3rd Qu.: 76.25 3rd Qu.: 89.50 3rd Qu.:173.5 3rd Qu.: 925759.5
## Max. :100.00 Max. :121.00 Max. :263.0 Max. :2123191.7
## NA's :1
## some selected lines (indeed, the 4th column appears always much higher)
dat1[c(1:5,50:54,95:100),]
## ser1 ser2 ser3 ser4
## [1,] 100 121 251 10000.6
## [2,] 100 117 244 11500.2
## [3,] 99 120 263 12948.1
## [4,] 99 120 242 14885.1
## [5,] 97 114 236 16382.3
## [6,] 51 60 111 892534.1
## [7,] 48 58 109 812490.4
## [8,] 49 55 108 982907.4
## [9,] 50 56 107 1188787.7
## [10,] 45 47 102 915343.6
## [11,] 3 6 5 206.4
## [12,] 5 2 7 2570.0
## [13,] 8 1 3 27125.8
## [14,] 6 4 5 6975.9
## [15,] 3 3 1 237.0
## [16,] 2 1 NA 37.5
Our toy data may be normalized by a number of different criteria. In real applications the nature of the data and the type of deformation detected/expected will largely help decide which normalization might be the 'best' choice. Here we'll first try normalizing by the mean, ie all columns will be forced to end up with the same column-mean. The trimmed mean does not consider values at the extremes (as outliers are frequently artefacts and display extreme values). When restricting even more strongly which values to consider, one will eventually end up with the median (3rd method used below).
no1 <- normalizeThis(dat1, refGrp=1:3, meth="mean")
no2 <- normalizeThis(dat1, refGrp=1:3, meth="trimMean", trim=0.4)
no3 <- normalizeThis(dat1, refGrp=1:3, meth="median")
no4 <- normalizeThis(dat1, refGrp=1:3, meth="slope", quantFa=c(0.2,0.8))
It is suggested to verify normalization results by plots. Note, that Box plots may not be appropriate in some cases (eg multimodal distributions), for more details you may consider using Violin-Plots from packages vioplot or wrGraph, another option might be a (cumulated) frequency plot (eg in package wrGraph).
You can see clearly that the 4th data-set has a problem of range. So we'll see if some proportional normalization may help to make it more comparable to the other ones.
If you are not familiar with the way data is handled in the Bioconductor package limma and you would like to use some of its tools for running moderated t-tests, moderTest2grp provides easy access:
set.seed(2017); t8 <- matrix(round(rnorm(1600,10,0.4),2),ncol=8,
dimnames=list(paste("l",1:200),c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3 # augment lines 3:6 for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3 # augment lines 5:8 for AA2&BB2 (c,d,g,h should be found)
t4 <- log2(t8[,1:4]/t8[,5:8])
fit4 <- moderTest2grp(t4,gl(2,2))
## now we'll use limma's topTable() function to look at the 'best' results
library(limma)
topTable(fit4,coef=1,n=5) # effect for 3,4,7,8
## logFC AveExpr t P.Value adj.P.Val B
## l 7 -0.4975806 -0.2436786 -8.712092 7.989390e-17 1.597878e-14 30.668381
## l 4 0.4020373 0.1890232 7.039234 8.460981e-12 8.460981e-10 17.723883
## l 8 -0.3735170 -0.2259811 -6.539873 1.883448e-10 1.255632e-08 14.392733
## l 3 0.3508834 0.1488240 6.143585 1.950186e-09 9.750928e-08 11.923522
## l 27 -0.1348878 -0.1011609 -2.361738 1.866790e-02 6.886965e-01 -3.878176
## -> moderTest2grp : testing alternative hypothesis: true difference in means is less than 0 (ie focus on 101 results with A less than B)
## logFC AveExpr t P.Value adj.P.Val B
## l 7 -0.4975806 -0.2436786 -8.712092 3.994695e-17 7.989390e-15 30.668381
## l 4 0.4020373 0.1890232 7.039234 1.000000e+00 1.000000e+00 17.723883
## l 8 -0.3735170 -0.2259811 -6.539873 9.417239e-11 9.417239e-09 14.392733
## l 3 0.3508834 0.1488240 6.143585 1.000000e+00 1.000000e+00 11.923522
## l 27 -0.1348878 -0.1011609 -2.361738 9.333949e-03 6.222633e-01 -3.878176
If you want to make multiple pair-wise comparisons, use moderTestXgrp:
grp <- factor(rep(LETTERS[c(3,1,4)],c(2,3,3)))
set.seed(2017); t8 <- matrix(round(rnorm(208*8,10,0.4),2),ncol=8,
dimnames=list(paste(letters[],rep(1:8,each=26),sep=""),paste(grp,c(1:2,1:3,1:3),sep="")))
t8[3:6,1:2] <- t8[3:6,1:2] +3 # augment lines 3:6 (c-f)
t8[5:8,c(1:2,6:8)] <- t8[5:8,c(1:2,6:8)] -1.5 # lower lines
t8[6:7,3:5] <- t8[6:7,3:5] +2.2 # augment lines
## expect to find C/A in c,d,g, (h)
## expect to find C/D in c,d,e,f
## expect to find A/D in f,g,(h)
test8 <- moderTestXgrp(t8,grp)
head(test8$p.value,n=8)
## A-C A-D C-D
## a1 8.736828e-02 6.776543e-02 9.397304e-01
## b1 4.384118e-01 5.400019e-01 8.205610e-01
## c1 1.094834e-19 6.344497e-01 2.571471e-21
## d1 2.671725e-13 9.915692e-01 2.858699e-13
## e1 1.802454e-03 2.413137e-08 9.735465e-16
## f1 3.188362e-01 2.527208e-32 2.226490e-22
## g1 1.166242e-29 6.410057e-33 5.484445e-01
## h1 1.141181e-05 1.943795e-05 5.674938e-01
To get an introduction into local false discovery rate estimations you may read Strimmer 2008. An overview of R packages in this context was prepared by the strimmerlab. A convenient way to get lfdr values is presented in pVal2lfdr. Note that the toy-example used below is too small for estimating really meaningful lfdr values. For this reason the function fdrtool() from package fdrtool may issue warnings.
set.seed(2017); t8 <- matrix(round(rnorm(160,10,0.4),2),ncol=8,dimnames=list(letters[1:20],
c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3 # augment lines 3:6 (c-f) for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3 # augment lines 5:8 (e-h) for AA2&BB2 (c,d,g,h should be found)
head(pVal2lfdr(apply(t8,1,function(x) t.test(x[1:4],x[5:8])$p.value)))
## Warning in fdrtool::fdrtool(z, statistic = "pvalue", plot = FALSE, verbose
## = !silent): There may be too few input test statistics for reliable FDR
## calculations!
## a b c d e f
## 1.0000000 0.5753562 0.5753562 1.0000000 1.0000000 1.0000000
path1 <- matrix(c(17,19,18,17, 4,4,2,3),ncol=2,
dimnames=list(c("A/B/C/D","A/B/G/D","A/H","A/H/I"),c("sumLen","n")))
contribToContigPerFrag(path1)
## sumLe n.frag len.rat
## A 19 4 1.000
## B 19 4 1.000
## C 17 4 0.895
## D 19 4 1.000
## G 19 4 1.000
## H 18 2 0.947
## I 17 3 0.895
If you have a set of fragments from a common ancestor and the fragments' start- and end-sites are marked by index-positions (integers), you can make a simple graphical display:
frag1 <- cbind(beg=c(2,3,7,13,13,15,7,9,7, 3,3,5), end=c(6,12,8,18,20,20,19,12,12, 4,5,7))
rownames(frag1) <- letters[1:nrow(frag1)]
simpleFragFig(frag1)
Now we can make a matrix telling whether some fragments start or end at exactly the same position.
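This is presumably done with countSameStartEnd():
countSameStartEnd(frag1)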
## beg end beg.n beg.rat end.n end.rat
## a 2 6 NA NA NA NA
## b 3 12 3 0.2500 3 0.2500
## c 7 8 3 0.2500 NA NA
## d 13 18 2 0.1667 NA NA
## e 13 20 2 0.1667 2 0.1667
## f 15 20 NA NA 2 0.1667
## g 7 19 3 0.2500 NA NA
## h 9 12 NA NA 3 0.2500
## i 7 12 3 0.2500 3 0.2500
## j 3 4 3 0.2500 NA NA
## k 3 5 3 0.2500 NA NA
## l 5 7 NA NA NA NA
The function pasteC allows adding quotes and separating the last element by specific text (eg 'and').
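For example:
pasteC(1:4)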
## [1] "1, 2, 3 and 4"
By default most color-gradients end with a color very close to the beginning.
set.seed(2015); dat1 <- round(runif(15),2)
plot(1:15,dat1,pch=16,cex=2,col=colorAccording2(dat1),main="Color gradient according to value in y")
# Here we modify the span of the color gradient
plot(1:15,dat1,pch=16,cex=2,col=colorAccording2(dat1,nStartO=0,nEndO=4,revCol=TRUE),main="blue to red")
# It is also possible to work with scales of transparency
plot(1:9,pch=3)
points(1:9,1:9,col=transpGraySca(st=0,en=0.8,nSt=9,trans=0.3),cex=42,pch=16)
For this purpose you may use convColorToTransp
.
col0 <- c("#998FCC","#5AC3BA","#CBD34E","#FF7D73")
col1 <- convColorToTransp(col0,alph=0.7)
layout(1:2)
pie(rep(1,length(col0)), col=col0, main="no transparency")
pie(rep(1,length(col1)), col=col1, main="new transparency")
There are many ways of creating reports. If you simply want to combine a few plots into a pdf, the function tableToPlot may be helpful to add a small table (eg an overview of points/samples/files used in other plots of the same pdf). This function prints tables in the current graphical output/window (which may be a pdf-device).
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] limma_3.44.3 wrMisc_1.3.1
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.2 magrittr_1.5 htmltools_0.5.0 tools_4.0.2
## [5] yaml_2.2.1 fdrtool_1.2.15 stringi_1.4.6 rmarkdown_2.3
## [9] knitr_1.29 stringr_1.4.0 digest_0.6.25 xfun_0.16
## [13] rlang_0.4.7 evaluate_0.14