Getting started with moc.gapbk

Overview

The moc.gapbk package implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) proposed by Parraga-Alava et al. (2018). The algorithm combines an NSGA-II search engine with intensification and diversification strategies.

It receives two distance matrices and produces a set of non-dominated clustering solutions. The second matrix is typically used to encode a-priori biological knowledge (for example, semantic similarity between genes).
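
To make "non-dominated" concrete, the sketch below filters a small table of objective values down to the rows that no other row beats on both objectives. The helper `non_dominated()` and the toy values are hypothetical and are not part of moc.gapbk; the sketch also assumes that both objectives are minimised.

```r
# Hypothetical helper (not part of moc.gapbk): return the indices of the rows
# of a two-column objective matrix that no other row dominates.
# Assumption: both objectives are to be minimised.
non_dominated <- function(obj) {
  obj <- as.matrix(obj)
  keep <- vapply(seq_len(nrow(obj)), function(i) {
    # Row j dominates row i if it is no worse on every objective
    # and strictly better on at least one.
    dominated_by <- vapply(seq_len(nrow(obj)), function(j) {
      all(obj[j, ] <= obj[i, ]) && any(obj[j, ] < obj[i, ])
    }, logical(1))
    !any(dominated_by)
  }, logical(1))
  which(keep)
}

# Made-up objective values for four candidate solutions.
toy <- rbind(c(3.1, 4.8),
             c(3.4, 3.1),
             c(3.5, 4.9),
             c(4.0, 3.0))

non_dominated(toy)
```

In this toy table the third row is worse than the first on both objectives, so it is the only one filtered out.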

Basic usage

library(moc.gapbk)

set.seed(2025)

# Toy data: 50 objects (e.g. genes) described by 20 features (e.g. samples).
x <- matrix(stats::runif(50 * 20, min = -5, max = 10),
            nrow = 50, ncol = 20)

# Two distance matrices over the same set of objects.
# Here we use amap if available (correlation distance is biologically
# common), and fall back to base R otherwise so the vignette knits
# under any configuration.
if (requireNamespace("amap", quietly = TRUE)) {
  d1 <- as.matrix(amap::Dist(x, method = "euclidean"))
  d2 <- as.matrix(amap::Dist(x, method = "correlation"))
} else {
  d1 <- as.matrix(stats::dist(x, method = "euclidean"))
  d2 <- as.matrix(stats::dist(x, method = "manhattan"))
}

res <- moc.gapbk(dmatrix1 = d1,
                 dmatrix2 = d2,
                 num_k = 3,
                 generation = 5,
                 pop_size = 6)

Pareto-front population

res$population contains the medoids that survived the last generation, together with the values of the two objective functions, the Pareto ranking and the crowding distance.

head(res$population)
#>   V1 V2 V3     obj1     obj2 paretoranking crowding
#> 1  1 28  9 3.060216 4.821277             1      Inf
#> 2  1 28  3 3.357799 3.090347             1      Inf
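
As a small illustration of working with this table, the sketch below keeps the first front, orders it by the first objective, and pulls out the medoid columns. It runs on a hand-built copy of the two rows printed above; with a real run, replace the mock with `pop <- res$population`. It assumes that the columns named V1, V2, ... hold the medoids as row indices of the distance matrices, which the output suggests but which should be checked against the package documentation.

```r
# Mock of the two rows printed above (stand-in for res$population).
pop <- data.frame(V1 = c(1, 1), V2 = c(28, 28), V3 = c(9, 3),
                  obj1 = c(3.060216, 3.357799),
                  obj2 = c(4.821277, 3.090347),
                  paretoranking = c(1, 1),
                  crowding = c(Inf, Inf))

# Keep only the first Pareto front and order it by the first objective.
front <- pop[pop$paretoranking == 1, ]
front <- front[order(front$obj1), ]

# Assumption: columns V1, V2, ... hold the medoids of each solution.
medoid_cols <- grep("^V[0-9]+$", names(front), value = TRUE)
medoids <- as.matrix(front[, medoid_cols])

# One row per solution, one column per cluster.
medoids
```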

Cluster assignments per solution

res$matrix.solutions is a data frame with one row per object and one column per non-dominated solution; each column holds the cluster assignments produced by that solution.

head(res$matrix.solutions)
#>   1 2
#> 1 1 1
#> 2 1 1
#> 3 3 3
#> 4 1 1
#> 5 1 1
#> 6 1 1

Convenient per-solution vectors

res$clustering exposes the same information as a list of named integer vectors, ready to be passed to validation indices, plotting helpers, etc.

str(res$clustering[[1]])
#>  Named int [1:50] 1 1 3 1 1 1 3 3 3 3 ...
#>  - attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
table(res$clustering[[1]])
#> 
#>  1  2  3 
#> 24  6 20
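
One simple use of these vectors is to check how much two non-dominated solutions agree. The sketch below cross-tabulates two partitions and computes the Rand index in base R. The vectors `a` and `b` are made-up stand-ins for `res$clustering[[1]]` and `res$clustering[[2]]`, and `rand_index()` is a hypothetical helper, not a function of moc.gapbk.

```r
# Made-up stand-ins for res$clustering[[1]] and res$clustering[[2]].
a <- c(1, 1, 3, 1, 1, 1, 3, 3, 2, 2)
b <- c(1, 1, 3, 1, 2, 1, 3, 3, 2, 2)

# Contingency table: rows are clusters of a, columns are clusters of b.
table(a, b)

# Hypothetical helper: Rand index, the share of object pairs that both
# partitions treat the same way (together in both or apart in both).
rand_index <- function(p, q) {
  stopifnot(length(p) == length(q))
  same_p <- outer(p, p, "==")
  same_q <- outer(q, q, "==")
  pairs <- upper.tri(same_p)  # count each unordered pair once
  mean(same_p[pairs] == same_q[pairs])
}

rand_index(a, b)
```

The index does not depend on how clusters are labelled, so it can be used even when two solutions number the same groups differently.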

Tips for biological applications

In bioinformatics workflows, dmatrix1 is usually a distance derived from numerical expression profiles (for example, correlation or Euclidean distance on log-expression values), while dmatrix2 is a distance derived from a-priori biological knowledge (for example, semantic similarity between Gene Ontology terms). The Xie-Beni validity index is computed independently on each matrix, and the two resulting values, reported as obj1 and obj2 in res$population, are the objective functions optimised by the NSGA-II engine.
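
Semantic similarity tools usually return similarities rather than distances, and their gene order may not match the expression data. The sketch below shows one way to prepare such a matrix. The gene names and similarity values are made up, and the conversion 1 - similarity is a common convention assumed here for similarities in [0, 1], not something prescribed by moc.gapbk.

```r
# Made-up expression profiles for four hypothetical genes.
genes <- c("g1", "g2", "g3", "g4")
set.seed(1)
expr <- matrix(stats::rnorm(4 * 6), nrow = 4, dimnames = list(genes, NULL))
d1 <- as.matrix(stats::dist(expr, method = "euclidean"))

# Made-up semantic similarities in [0, 1], stored in a different gene order.
sim <- matrix(c(1.0, 0.8, 0.2, 0.1,
                0.8, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.7,
                0.1, 0.2, 0.7, 1.0),
              nrow = 4, dimnames = list(rev(genes), rev(genes)))

# Assumption: turn similarity into distance as 1 - similarity.
d2 <- 1 - sim

# Reorder rows and columns so both matrices describe the same objects
# in the same order.
d2 <- d2[rownames(d1), colnames(d1)]

# Sanity checks before passing d1 and d2 as dmatrix1 and dmatrix2.
stopifnot(identical(rownames(d1), rownames(d2)),
          isSymmetric(d2),
          all(diag(d2) == 0))
```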

Backward compatibility

Versions before 0.2.0 exported the function as moc.gabk (without the p). That name is kept as a deprecated alias that emits a warning; new code should call moc.gapbk directly.

References

Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining 11(1), 1-16. https://doi.org/10.1186/s13040-018-0178-4