Package MKomics includes a collection of functions that I found and find useful for the analysis of omics data. It includes:
similarity plots based on correlation and median absolute deviation (MAD)
adjusting colors for heatmaps
aggregate technical replicates
calculate pairwise fold-changes and log fold-changes
compute one- and two-way ANOVA
simplified interface to package ‘limma’ (Ritchie et al. 2015) for moderated t-test and one-way ANOVA
Hamming and the Levenshtein (edit) distance of strings as well as optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties (Merkl and Waack 2009).
The package requires Bioconductor package limma, which can be installed via
## Install package BiocManager
install.packages("BiocManager")
## Use BiocManager to install limma
BiocManager::install("limma")
We first load the package.
The functions to compute the moderated t-test and one-way ANOVA were motivated by the fact that my students had problems to understand and correctly adapt the code of the limma package (Ritchie et al. 2015).
We generate some artificial data and compute moderated t-tests.
## mean 2.5% 97.5% t p.value adj.p.value B
## 1 0.8167034 0.3347307 1.298676 3.342451 1.000055e-03 1.250069e-03 -1.2889029
## 2 0.8597711 0.3777984 1.341744 3.518711 5.425967e-04 9.043279e-04 -0.6914649
## 3 0.8024507 0.3204780 1.284423 3.284121 1.217895e-03 1.353217e-03 -1.4798576
## 4 1.1286441 0.6466713 1.610617 4.619104 7.092114e-06 2.364038e-05 3.7323676
## 5 0.7657581 0.2837853 1.247731 3.133952 1.998232e-03 1.998232e-03 -1.9559897
## 6 1.2950566 0.8130838 1.777029 5.300166 3.189578e-07 1.594789e-06 7.0697754
## 7 1.0246341 0.5426613 1.506607 4.193431 4.210188e-05 8.420376e-05 1.8791782
## 8 1.5096517 1.0276790 1.991624 6.178421 3.856033e-09 3.856033e-08 12.0500350
## 9 0.8280052 0.3460325 1.309978 3.388705 8.537765e-04 1.219681e-03 -1.1350943
## 10 1.0959262 0.6139534 1.577899 4.485202 1.258857e-05 3.147142e-05 3.1301177
## Two-sample test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
g2 <- factor(c(rep("group 1", 10), rep("group 2", 10)))
mod.t.test(X, group = g2)
## difference in means 2.5% 97.5% t p.value adj.p.value
## 1 0.30099317 -0.5318079 1.1337943 0.71317027 0.4766640 0.9665915
## 2 -0.13637650 -0.9691776 0.6964246 -0.32312914 0.7469725 0.9665915
## 3 0.49260925 -0.3401919 1.3254104 1.16718354 0.2446801 0.8156005
## 4 -0.15266680 -0.9854679 0.6801343 -0.36172723 0.7179801 0.9665915
## 5 0.05546448 -0.7773366 0.8882656 0.13141699 0.8955922 0.9665915
## 6 0.03152278 -0.8012783 0.8643239 0.07468976 0.9405445 0.9665915
## 7 0.69311916 -0.1396820 1.5259203 1.64226977 0.1022800 0.8156005
## 8 0.17595066 -0.6568505 1.0087518 0.41689577 0.6772514 0.9665915
## 9 0.54524220 -0.2875589 1.3780433 1.29189154 0.1980510 0.8156005
## 10 0.01770155 -0.8150996 0.8505027 0.04194188 0.9665915 0.9665915
## B mean of group 1 mean of group 2
## 1 -4.608912 0.07083562 0.371828786
## 2 -4.619655 0.13562592 -0.000750575
## 3 -4.586223 -0.30436677 0.188242477
## 4 -4.618953 0.18497579 0.032308986
## 5 -4.621971 0.08494719 0.140411667
## 6 -4.622282 1.11828347 1.149806246
## 7 -4.550748 0.67404426 1.367163425
## 8 -4.617811 0.53111339 0.707064052
## 9 -4.578072 0.92210638 1.467348578
## 10 -4.622384 0.48495033 0.502651884
## Paired two-sample test
subjID <- factor(rep(1:10, 2))
mod.t.test(X, group = g2, paired = TRUE, subject = subjID)
## mean of differences 2.5% 97.5% t p.value adj.p.value
## 1 0.30099317 -0.5606128 1.1625992 0.69402426 0.4894545 0.9675331
## 2 -0.13637650 -0.9979825 0.7252295 -0.31445430 0.7539038 0.9675331
## 3 0.49260925 -0.3689968 1.3542152 1.13584893 0.2590354 0.8634514
## 4 -0.15266680 -1.0142728 0.7089392 -0.35201618 0.7256490 0.9675331
## 5 0.05546448 -0.8061415 0.9170705 0.12788893 0.8985222 0.9675331
## 6 0.03152278 -0.8300832 0.8931288 0.07268461 0.9422184 0.9675331
## 7 0.69311916 -0.1684868 1.5547252 1.59818084 0.1135079 0.8634514
## 8 0.17595066 -0.6856553 1.0375567 0.40570365 0.6859233 0.9675331
## 9 0.54524220 -0.3163638 1.4068482 1.25720898 0.2119307 0.8634514
## 10 0.01770155 -0.8439044 0.8793075 0.04081589 0.9675331 0.9675331
## B mean of group 1 mean of group 2
## 1 -4.608863 0.07083562 0.371828786
## 2 -4.618525 0.13562592 -0.000750575
## 3 -4.588455 -0.30436677 0.188242477
## 4 -4.617893 0.18497579 0.032308986
## 5 -4.620608 0.08494719 0.140411667
## 6 -4.620887 1.11828347 1.149806246
## 7 -4.556550 0.67404426 1.367163425
## 8 -4.616866 0.53111339 0.707064052
## 9 -4.581125 0.92210638 1.467348578
## 10 -4.620978 0.48495033 0.502651884
We generate some artifical data and compute moderated one-way ANOVAs.
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5)))
mod.oneway.test(X, gr)
## topTableF is obsolete and will be removed in a future version of limma. Please considering using topTable instead.
## A1 vs B2 A1 vs C3 A1 vs D4 B2 vs C3 B2 vs D4 C3 vs D4
## 1 1.0974330 0.2271669 0.26827976 -0.87026609 -0.8291532 0.04111286
## 2 0.0769194 0.9622020 -0.61252965 0.88528264 -0.6894491 -1.57473169
## 3 -0.3105477 -0.7130963 -0.58266984 -0.40254866 -0.2721221 0.13042651
## 4 -0.5832304 -0.2440290 -0.03386782 0.33920142 0.5493626 0.21016117
## 5 -0.9579128 -0.4501003 -0.61874146 0.50781250 0.3391713 -0.16864116
## 6 -0.6013595 -0.5109767 -0.15342834 0.09038278 0.4479311 0.35754833
## 7 0.3471695 -0.3493145 -0.68975433 -0.69648400 -1.0369239 -0.34043986
## 8 -0.9277388 -0.5691081 -0.71053196 0.35863064 0.2172068 -0.14142382
## 9 -0.5998638 -1.5488157 -0.14153255 -0.94895185 0.4583313 1.40728311
## 10 1.5527067 0.5179562 0.99934740 -1.03475050 -0.5533593 0.48139118
## grand mean F p.value adj.p.value
## 1 0.22133220 1.4119467 0.23709800 0.5868380
## 2 0.06743767 2.5678204 0.05255479 0.1751826
## 3 -0.05806215 0.6094415 0.60879068 0.7253749
## 4 0.10864239 0.4386138 0.72537489 0.7253749
## 5 0.11267943 0.9691456 0.40611786 0.5868380
## 6 1.13404486 0.5002131 0.68212276 0.7253749
## 7 1.02060384 1.2185746 0.30111975 0.5868380
## 8 0.61908872 0.9594030 0.41078662 0.5868380
## 9 1.19472748 2.9871234 0.02980893 0.1751826
## 10 0.49380111 2.6903326 0.04456696 0.1751826
We can also perform a moderated one-way ANOVA with repeated measures.
X <- rbind(matrix(rnorm(6*18), nrow = 6, ncol = 18),
matrix(rnorm(6*18, mean = 1), nrow = 6, ncol = 18))
gr <- factor(c(rep("T1", 6), rep("T2", 6), rep("T3", 6)))
subjectID <- factor(c(rep(1:6, 3)))
mod.oneway.test(X, gr, repeated = TRUE, subject = subjectID)
## grand mean F p.value adj.p.value
## 1 0.439887277 0.24258038 0.91426233 0.9856427
## 2 -0.348176298 0.25294001 0.90800757 0.9856427
## 3 -0.001648741 0.19504576 0.94108489 0.9856427
## 4 0.380034581 0.64717729 0.62882454 0.9856427
## 5 0.116088830 0.19890913 0.93903209 0.9856427
## 6 -0.215112459 0.13298495 0.97031807 0.9856427
## 7 1.305439170 0.08992079 0.98564266 0.9856427
## 8 0.927363547 2.87518053 0.02147716 0.2577260
## 9 1.057394093 1.60523754 0.16983981 0.6793592
## 10 0.985162868 1.66804543 0.15425972 0.6793592
## 11 0.780801913 0.48820193 0.74443860 0.9856427
## 12 0.787626019 0.45961349 0.76543996 0.9856427
As a optional post-hoc analysis after mod.oneway.test one can use pairwise moderated t-tests (only unpaired). One should carefully think about the adjustment of p values in this context.
## T1-T2 T1-T2: t T1-T2: p T1-T2: adj.p T1-T2: B T1-T3
## 1 0.2342401 0.4014278 0.688581325 0.82059173 -5.803960 -0.05487846
## 2 0.2423784 0.4153749 0.678362198 0.82059173 -5.798701 0.24537369
## 3 -0.1966000 -0.3369222 0.736568028 0.82059173 -5.825944 -0.13384174
## 4 -0.2086677 -0.3576033 0.721059196 0.82059173 -5.819314 -0.26788464
## 5 -0.1845143 -0.3162105 0.752209088 0.82059173 -5.832189 0.17817385
## 6 -0.2977183 -0.5102134 0.610527064 0.82059173 -5.758182 -0.34726718
## 7 -0.3016336 -0.5169231 0.605844436 0.82059173 -5.755000 -0.01197801
## 8 1.6357283 2.8032215 0.005614332 0.06737198 -2.251098 1.58693190
## 9 0.4583369 0.7854726 0.433209202 0.82059173 -5.593554 0.46639256
## 10 -0.2211433 -0.3789833 0.705146931 0.82059173 -5.812045 -0.19783574
## 11 0.4886687 0.8374536 0.403448173 0.82059173 -5.554613 0.54861003
## 12 0.1275920 0.2186602 0.827162496 0.82716250 -5.856273 0.33974584
## T1-T3: t T1-T3: p T1-T3: adj.p T1-T3: B T2-T3 T2-T3: t
## 1 -0.09404770 0.925175869 0.98364550 -5.806707 -0.289118564 -0.495475544
## 2 0.42050798 0.674615985 0.98260953 -5.730101 0.002995244 0.005133086
## 3 -0.22937064 0.818841277 0.98260953 -5.786748 0.062758227 0.107551609
## 4 -0.45908601 0.646726303 0.98260953 -5.714626 -0.059216945 -0.101482754
## 5 0.30534457 0.760456541 0.98260953 -5.768222 0.362688176 0.621555112
## 6 -0.59512745 0.552505716 0.98260953 -5.649223 -0.049548837 -0.084914080
## 7 -0.02052726 0.983645501 0.98364550 -5.810548 0.289655548 0.496395797
## 8 2.71959689 0.007176506 0.08611807 -2.437802 -0.048796409 -0.083624610
## 9 0.79927800 0.425182677 0.98260953 -5.519404 0.008055682 0.013805387
## 10 -0.33904005 0.734974762 0.98260953 -5.758320 0.023307578 0.039943250
## 11 0.94017779 0.348386345 0.98260953 -5.407634 0.059941321 0.102724150
## 12 0.58223779 0.561135089 0.98260953 -5.656144 0.212153838 0.363577617
## T2-T3: p T2-T3: adj.p T2-T3: B
## 1 0.6208688 0.9959101 -4.606091
## 2 0.9959101 0.9959101 -4.609593
## 3 0.9144712 0.9959101 -4.609428
## 4 0.9192802 0.9959101 -4.609446
## 5 0.5350208 0.9959101 -4.604082
## 6 0.9324241 0.9959101 -4.609490
## 7 0.6202208 0.9959101 -4.606078
## 8 0.9334478 0.9959101 -4.609493
## 9 0.9890005 0.9959101 -4.609590
## 10 0.9681826 0.9959101 -4.609570
## 11 0.9182963 0.9959101 -4.609442
## 12 0.7166001 0.9959101 -4.607707
Often we are in a situation that we want to compare more than two groups pairwise.
In the analysis of omics data, the FC or logFC are important measures of effect size and are often used in combination with (adjusted) p values (Shi et al. 2006).
x <- rnorm(100) ## assumed as log-data
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
## modified FC
pairwise.fc(x, g)
## a vs b a vs c a vs d b vs c b vs d c vs d
## 1.444915 1.232566 1.347226 -1.172281 -1.072511 1.093025
## a vs b a vs c a vs d b vs c b vs d c vs d
## 1.4449145 1.2325662 1.3472261 0.8530375 0.9323915 1.0930253
## a vs b a vs c a vs d b vs c b vs d c vs d
## 0.5309842 0.3016652 0.4299920 -0.2293190 -0.1009922 0.1283268
The function returns a modified FC. That is, if the FC is smaller than 1 it is transformed to -1/FC. One can also use other functions than the mean for the aggregation of the data.
## a vs b a vs c a vs d b vs c b vs d c vs d
## 0.6128841 0.2182627 0.4992445 -0.3946214 -0.1136396 0.2809818
In case of omics experiments it is often the case that technical replicates are measured and hence it is part of the preprocessing of the raw data to aggregate these technical replicates. This is the purpose of function repMeans.
M <- matrix(rnorm(100), ncol = 5)
FL <- matrix(rpois(100, lambda = 10), ncol = 5) # only for this example
repMeans(x = M, flags = FL, use.flags = "max", ndups = 5, spacing = 4)
## $exprs
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1.18155923 0.2353980 0.9606907 0.1573533 -1.93850470
## [2,] 1.37000399 -1.5160676 0.1781902 -0.4929374 0.12071933
## [3,] -0.54174319 0.8405398 -0.2997625 -0.8341882 0.86364843
## [4,] -0.07335999 -0.2858454 0.3183903 0.8993541 -0.03515489
##
## $flags
## [,1] [,2] [,3] [,4] [,5]
## [1,] 11 14 12 14 16
## [2,] 13 13 13 21 13
## [3,] 11 15 18 14 17
## [4,] 13 15 13 14 11
Functions oneWayAnova and twoWayAnova return a function that can be used to perform a 1- or 2-way ANOVA, respectively.
## [1] 0.3936899
x <- matrix(rnorm(12*10), nrow = 10)
## 2-way ANOVA with interaction
af1 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2))
## p values
apply(x, 1, af1)
## [,1] [,2] [,3] [,4] [,5] [,6]
## cov1 0.7189770 0.1199344 0.7797292 0.3399445 0.4791519 0.3287401
## cov2 0.1248696 0.1246937 0.2553811 0.5648113 0.6118778 0.0463841
## interaction 0.3941863 0.1489906 0.9090044 0.1403097 0.1257246 0.8367084
## [,7] [,8] [,9] [,10]
## cov1 0.4046772 0.8125137 0.01638546 0.7791182
## cov2 0.2884822 0.7388120 0.85514661 0.8902903
## interaction 0.2453559 0.7667181 0.20801934 0.2081726
## 2-way ANOVA without interaction
af2 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2),
interaction = FALSE)
## p values
apply(x, 1, af2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## cov1 0.7150619 0.1423638 0.7661585 0.3758909 0.5173798 0.29988569 0.4157757
## cov2 0.1172529 0.1476506 0.2264692 0.5948831 0.6432147 0.03443076 0.2988795
## [,8] [,9] [,10]
## cov1 0.8018355 0.01789566 0.7880695
## cov2 0.7242262 0.86114929 0.8948589
In the analysis of omics data correlation and absolute correlation distance matrices are often used during quality control. Function corDist can compute the Pearson, Spearman, Kendall or Cosine sample correlation and absolute correlation as well as the minimum covariance determinant or the orthogonalized Gnanadesikan-Kettenring correlation and absolute correlation implemented in package robustbase (Maechler et al. 2019).
## 1 2 3 4
## 2 0.2875716
## 3 0.9914823 1.0146980
## 4 0.9961413 0.9955740 1.0016289
## 5 1.0063069 0.9774429 1.0012735 1.0165059
## 1 2 3 4
## 2 1.0801350
## 3 0.9244866 1.0314483
## 4 1.0515473 0.9741859 1.1059056
## 5 1.0247986 0.9784685 0.9892507 0.9815645
## 1 2 3 4
## 2 0.9298225
## 3 0.9496162 1.0243463
## 4 1.0301798 0.9683927 1.0295911
## 5 0.9421068 0.9974223 1.1032783 0.8426071
In case of outliers the MAD (median absolute deviation = median of the absolute deviations from the median) may be useful as measure of scale.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 3.118364 3.006511 2.974456 2.878586
## [2,] 3.118364 0.000000 2.617500 2.612594 2.474732
## [3,] 3.006511 2.617500 0.000000 3.278506 3.096261
## [4,] 2.974456 2.612594 3.278506 0.000000 2.899905
## [5,] 2.878586 2.474732 3.096261 2.899905 0.000000
First, we can plot a similarity matrix based on correlation.
M <- matrix(rnorm(1000), ncol = 20)
colnames(M) <- paste("Sample", 1:20)
M.cor <- cor(M)
corPlot(M.cor, minCor = min(M.cor), labels = colnames(M))
There is a second implementation based on package (Gu, Eils, and Schlesner 2016).
Next, we can use the MAD.
Nowadays there are better solutions e.g. provided by Bioconductor package ComplexHeatmap (Gu, Eils, and Schlesner 2016). This is my solution to get a better coloring of heatmaps.
## generate some random data
data.plot <- matrix(rnorm(100*50, sd = 1), ncol = 50)
colnames(data.plot) <- paste("patient", 1:50)
rownames(data.plot) <- paste("gene", 1:100)
data.plot[1:70, 1:30] <- data.plot[1:70, 1:30] + 3
data.plot[71:100, 31:50] <- data.plot[71:100, 31:50] - 1.4
data.plot[1:70, 31:50] <- rnorm(1400, sd = 1.2)
data.plot[71:100, 1:30] <- rnorm(900, sd = 1.2)
nrcol <- 128
## Load required packages
library(RColorBrewer)
myCol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(nrcol))
heatmap(data.plot, col = myCol, main = "standard colors")
In Bioinformatics the (pairwise and multiple) alignment of strings is an important topic. For this one can use the distance or similarity between strings. The Hamming and the the Levenshtein (edit) distance are implemented in function stringDist (Merkl and Waack 2009).
## GACGGATTATG
## GATCGGAATAG 3
## GACGGATTATG
## GATCGGAATAG 3
In case of the Levenshtein (edit) distance, the respective scoring and traceback matrices are attached as attributes to the result.
## gap G A T C G G A A T A G
## gap 0 1 2 3 4 5 6 7 8 9 10 11
## G 1 0 1 2 3 4 5 6 7 8 9 10
## A 2 1 0 1 2 3 4 5 6 7 8 9
## C 3 2 1 1 1 2 3 4 5 6 7 8
## G 4 3 2 2 2 1 2 3 4 5 6 7
## G 5 4 3 3 3 2 1 2 3 4 5 6
## A 6 5 4 4 4 3 2 1 2 3 4 5
## T 7 6 5 4 5 4 3 2 2 2 3 4
## T 8 7 6 5 5 5 4 3 3 2 3 4
## A 9 8 7 6 6 6 5 4 3 3 2 3
## T 10 9 8 7 7 7 6 5 4 3 3 3
## G 11 10 9 8 8 7 7 6 5 4 4 3
## gap G A T C G G A A T A
## gap "start" "i" "i" "i" "i" "i" "i" "i" "i" "i" "i"
## G "d" "m" "i" "i" "i" "m/i" "m/i" "i" "i" "i" "i"
## A "d" "d" "m" "i" "i" "i" "i" "m/i" "m/i" "i" "m/i"
## C "d" "d" "d" "mm" "m" "i" "i" "i" "i" "i" "i"
## G "d" "d/m" "d" "d/mm" "d/mm" "m" "m/i" "i" "i" "i" "i"
## G "d" "d/m" "d" "d/mm" "d/mm" "d/m" "m" "i" "i" "i" "i"
## A "d" "d" "d/m" "d/mm" "d/mm" "d" "d" "m" "m/i" "i" "m/i"
## T "d" "d" "d" "m" "d/mm/i" "d" "d" "d" "mm" "m" "i"
## T "d" "d" "d" "d/m" "mm" "d" "d" "d" "d/mm" "m" "mm/i"
## A "d" "d" "d/m" "d" "d/mm" "d/mm" "d" "d/m" "m" "d" "m"
## T "d" "d" "d" "d/m" "d/mm" "d/mm" "d" "d" "d" "m" "d"
## G "d" "d/m" "d" "d" "d/mm" "m" "d/m" "d" "d" "d" "d/mm"
## G
## gap "i"
## G "m/i"
## A "i"
## C "i"
## G "m/i"
## G "m/i"
## A "i"
## T "i"
## T "mm/i"
## A "i"
## T "mm"
## G "m"
The characters in the trace-back matrix reflect insertion of a gap in string y (d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i).
The function stringSim computes the optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties (Merkl and Waack 2009). Scoring and trace-back matrix are again attached as attributes to the results.
## GACGGATTATG
## GATCGGAATAG 6
## gap G A T C G G A A T A G
## gap 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11
## G -1 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9
## A -2 0 2 1 0 -1 -2 -3 -4 -5 -6 -7
## C -3 -1 1 1 2 1 0 -1 -2 -3 -4 -5
## G -4 -2 0 0 1 3 2 1 0 -1 -2 -3
## G -5 -3 -1 -1 0 2 4 3 2 1 0 -1
## A -6 -4 -2 -2 -1 1 3 5 4 3 2 1
## T -7 -5 -3 -1 -2 0 2 4 4 5 4 3
## T -8 -6 -4 -2 -2 -1 1 3 3 5 4 3
## A -9 -7 -5 -3 -3 -2 0 2 4 4 6 5
## T -10 -8 -6 -4 -4 -3 -1 1 3 5 5 5
## G -11 -9 -7 -5 -5 -3 -2 0 2 4 4 6
## gap G A T C G G A A T A
## gap "start" "i" "i" "i" "i" "i" "i" "i" "i" "i" "i"
## G "d" "m" "i" "i" "i" "m/i" "m/i" "i" "i" "i" "i"
## A "d" "d" "m" "i" "i" "i" "i" "m/i" "m/i" "i" "m/i"
## C "d" "d" "d" "mm" "m" "i" "i" "i" "i" "i" "i"
## G "d" "d/m" "d" "d/mm" "d" "m" "m/i" "i" "i" "i" "i"
## G "d" "d/m" "d" "d/mm" "d" "d/m" "m" "i" "i" "i" "i"
## A "d" "d" "d/m" "d/mm" "d" "d" "d" "m" "m/i" "i" "m/i"
## T "d" "d" "d" "m" "d/i" "d" "d" "d" "mm" "m" "i"
## T "d" "d" "d" "d/m" "mm" "d" "d" "d" "d/mm" "m" "mm/i"
## A "d" "d" "d/m" "d" "d/mm" "d" "d" "d/m" "m" "d" "m"
## T "d" "d" "d" "d/m" "d/mm" "d" "d" "d" "d" "m" "d"
## G "d" "d/m" "d" "d" "d/mm" "m" "d/m" "d" "d" "d" "d/mm"
## G
## gap "i"
## G "m/i"
## A "i"
## C "i"
## G "m/i"
## G "m/i"
## A "i"
## T "i"
## T "mm/i"
## A "i"
## T "mm"
## G "m"
## GACGGATTATG
## GATCGGAATAG 6
## gap G A T C G G A A T A G
## gap 0 0 0 0 0 0 0 0 0 0 0 0
## G 0 1 0 0 0 1 1 0 0 0 0 1
## A 0 0 2 1 0 0 0 2 1 0 1 0
## C 0 0 1 1 2 1 0 1 1 0 0 0
## G 0 1 0 0 1 3 2 1 0 0 0 1
## G 0 1 0 0 0 2 4 3 2 1 0 1
## A 0 0 2 1 0 1 3 5 4 3 2 1
## T 0 0 1 3 2 1 2 4 4 5 4 3
## T 0 0 0 2 2 1 1 3 3 5 4 3
## A 0 0 1 1 1 1 0 2 4 4 6 5
## T 0 0 0 2 1 0 0 1 3 5 5 5
## G 0 1 0 1 1 2 1 0 2 4 4 6
## gap G A T C G
## gap "start" "i" "i" "i" "i" "i"
## G "d" "m" "i/stop" "stop" "stop" "m"
## A "d" "d/stop" "m" "i" "i/stop" "d/stop"
## C "d" "stop" "d" "mm" "m" "i"
## G "d" "m" "d/i/stop" "d/mm/stop" "d" "m"
## G "d" "m" "mm/i/stop" "stop" "d/stop" "d/m"
## A "d" "d/stop" "m" "i" "i/stop" "d"
## T "d" "stop" "d" "m" "i" "i"
## T "d" "stop" "d/stop" "d/m" "mm" "mm/i"
## A "d" "stop" "m" "d" "d/mm" "mm"
## T "d" "stop" "d/stop" "m" "i" "d/mm/i/stop"
## G "d" "m" "i/stop" "d" "mm" "m"
## G A A T A G
## gap "i" "i" "i" "i" "i" "i"
## G "m" "i/stop" "stop" "stop" "stop" "m"
## A "d/mm/stop" "m" "m/i" "i/stop" "m" "d/i/stop"
## C "i/stop" "d" "mm" "mm/i/stop" "d/stop" "mm/stop"
## G "m/i" "i" "d/mm/i/stop" "mm/stop" "stop" "m"
## G "m" "i" "i" "i" "i/stop" "m"
## A "d" "m" "m/i" "i" "m/i" "i"
## T "d" "d" "mm" "m" "i" "i"
## T "d" "d" "d/mm" "m" "mm/i" "mm/i"
## A "d/mm/i/stop" "d/m" "m" "d" "m" "i"
## T "mm/stop" "d" "d" "m" "d" "mm"
## G "m/i" "d/i/stop" "d" "d" "d/mm" "m"
The entry stop indicates that the minimum similarity score has been reached.
Finally, the function traceBack computes an optimal global or local alignment based on a trace back matrix as provided by function stringDist or stringSim (Merkl and Waack 2009).
x <- "GACGGATTATG"
y <- "GATCGGAATAG"
## Levenshtein distance
d <- stringDist(x, y)
## optimal global alignment
traceBack(d)
## pairwise global alignment
## x' "GA-CGGATTATG"
## y' "GATCGGAATA-G"
## pairwise global alignment
## x' "GA-CGGATTATG"
## y' "GATCGGAATA-G"
## Optimal local alignment score
d <- stringSim(x, y, global = FALSE)
## optimal local alignment
traceBack(d, global = FALSE)
## pairwise local alignment
## x' "GA-CGGATTA"
## y' "GATCGGAATA"
## R version 4.1.0 Patched (2021-07-20 r80661)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 20.1
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /home/kohlm/RTOP/Rbranch/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 MKomics_0.7
##
## loaded via a namespace (and not attached):
## [1] ComplexHeatmap_2.8.0 highr_0.9 bslib_0.2.5.1
## [4] DEoptimR_1.0-9 compiler_4.1.0 jquerylib_0.1.4
## [7] iterators_1.0.13 tools_4.1.0 digest_0.6.27
## [10] clue_0.3-59 jsonlite_1.7.2 evaluate_0.14
## [13] png_0.1-7 rlang_0.4.11 foreach_1.5.1
## [16] yaml_2.2.1 parallel_4.1.0 xfun_0.25
## [19] cluster_2.1.2 stringr_1.4.0 knitr_1.33
## [22] S4Vectors_0.30.0 GlobalOptions_0.1.2 sass_0.4.0
## [25] IRanges_2.26.0 stats4_4.1.0 grid_4.1.0
## [28] robustbase_0.93-8 R6_2.5.0 GetoptLong_1.0.5
## [31] rmarkdown_2.10 limma_3.48.1 magrittr_2.0.1
## [34] BiocGenerics_0.38.0 codetools_0.2-18 htmltools_0.5.1.1
## [37] matrixStats_0.60.0 shape_1.4.6 circlize_0.4.13
## [40] colorspace_2.0-2 stringi_1.7.3 doParallel_1.0.16
## [43] crayon_1.4.1 rjson_0.2.20 Cairo_1.5-12.2
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
Maechler, Martin, Peter Rousseeuw, Christophe Croux, Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke, Manuel Koller, Eduardo L. T. Conceicao, and Maria Anna di Palma. 2019. Robustbase: Basic Robust Statistics. http://robustbase.r-forge.r-project.org/.
Merkl, Rainer, and Stephan Waack. 2009. Bioinformatik Interaktiv: Grundlagen, Algorithmen, Anwendungen (German Edition). Wiley-Blackwell.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Shi, L., L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, et al. 2006. “The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.” Nat. Biotechnol. 24 (9): 1151–61.