The CliquePercolation package entails a number of functions related to the clique percolation community detection algorithm for undirected, unweighted, and weighted networks (as described in Palla, Derényi, Farkas, & Vicsek, 2005; Farkas, Ábel, Palla, & Vicsek, 2007).
The package entails functions for…
This document provides an introduction to this workflow with some random example data sets.
Interconnected entities can be represented as networks (Barabási, 2011). Each network entails two sets, namely nodes, which are the entities, and edges, which are the connections between entities. Many networks are undirected such that edges simply connect two nodes with each other without any directionality in the edges. For instance, nodes could represent people and edges could represent whether they are friends or not. Or nodes could represent cities and edges could represent whether there is a street connecting the cities. Such networks, in which edges are either present or absent, are called unweighted. In other cases, nodes could represent people and edges could represent the frequency with which these people meet. Or nodes could represent questionnaire items and edges could represent the correlations of these items. In such a case, the edges are not only present or absent, but may count the frequency or strength of interactions. Such networks are called weighted. The edges in the network can further be directed, such that edges can point from one node to another. For instance, the internet is a network of webpages in which a directed edge could represent a hyperlink from one webpage to another. Here, only undirected networks are discussed.
Analyzing the structure of such networks is a key task across sciences. One structural feature that is often investigated is the identification of strongly connected subgraphs in the network, which is called community detection (Fortunato, 2010). Most community detection algorithms thereby put each node in only one community. However, it is likely that nodes are often shared by a number of communities. This could occur, for instance, when a friend is part of multiple groups. One community detection algorithm that is aimed at identifying overlapping communities is the clique percolation algorithm, which has been developed for unweighted (Palla et al., 2005) and weighted networks (Farkas et al., 2007).
The clique percolation algorithm for unweighted networks proceeds as follows:

1. Identify all \(k\)-cliques in the network, that is, all fully connected subgraphs of \(k\) nodes.
2. Consider two \(k\)-cliques adjacent if they share \(k - 1\) nodes.
3. Put all \(k\)-cliques that can be reached from each other through sequences of adjacent \(k\)-cliques into one community.
This also showcases that the clique percolation algorithm can lead to nodes that are shared by communities. In the current example, node \(d\) is part of both the green and the pink community. Nodes \(a\), \(b\), and \(c\) are part of only the green community and nodes \(e\), \(f\), and \(g\) are part of only the pink community. Importantly, clique percolation can also lead to nodes that are part of no community. These are called isolated nodes, in the current example, node \(h\).
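Conceptually, the algorithm is easy to sketch outside the package. The following minimal Python illustration (the package itself is written in R) implements the two core steps for unweighted networks: enumerating \(k\)-cliques and merging \(k\)-cliques that share \(k - 1\) nodes. The edge list is an assumption chosen to reproduce the example described above: the green community arises from the 3-cliques \(a\)–\(b\)–\(c\) and \(a\)–\(b\)–\(d\), the pink community from \(d\)–\(e\)–\(f\) and \(e\)–\(f\)–\(g\), and \(h\) is attached by a single edge only.

```python
from itertools import combinations

def k_cliques(adj, k):
    """Enumerate all k-cliques (fully connected subsets of k nodes)."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def clique_percolation(adj, k):
    """Union k-cliques that are adjacent (share k - 1 nodes) into communities."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))
    def find(i):                      # union-find over the cliques
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return sorted(communities.values(), key=sorted)

# hypothetical edge list matching the example: green a-b-c-d, pink d-e-f-g, h isolated
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("e", "g"), ("f", "g"), ("g", "h")]
adj = {n: set() for e in edges for n in e}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(clique_percolation(adj, k=3))
# node d appears in both communities; h belongs to none (isolated)
```

Note that node \(h\) never enters a 3-clique, so it drops out of the partition entirely, which is exactly what the algorithm means by an isolated node.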
One way to plot the community partition on the original network would be to color the nodes according to the communities they belong to. Shared nodes have multiple colors and isolated nodes remain white.
For weighted networks, the algorithm has just one intermediate additional step. Specifically, after identifying the \(k\)-cliques, they are considered further only if their Intensity exceeds a specified threshold \(I\). The Intensity of a \(k\)-clique is defined as the geometric mean of the edge weights, namely
\[ I_C = \Bigg(\prod_{i<j;\,i,j\in C} w_{ij}\Bigg)^{\frac{2}{k(k-1)}} \] where \(C\) is the clique, \(i\) and \(j\) are the nodes, \(w\) is the edge weight, and \(k\) is the clique size. For instance, a 3-clique with edge weights 0.1, 0.2, and 0.3 would have
\[ I_C = (0.1 \cdot 0.2 \cdot 0.3)^{\frac{2}{3(3-1)}} \approx 0.18 \]
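As a quick check of this arithmetic, the Intensity can be computed in a few lines (a plain Python sketch with a hypothetical helper name, not part of the package):

```python
from math import prod

def intensity(weights, k):
    """Intensity of a k-clique: geometric mean of its edge weights,
    i.e., (product of the w_ij) ** (2 / (k * (k - 1)))."""
    return prod(weights) ** (2 / (k * (k - 1)))

print(round(intensity([0.1, 0.2, 0.3], k=3), 2))  # 0.18, as in the example above
```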
To show the influence of \(I\), here is the network from the previous example, but this time I added edge weights.
If \(I = 0.17\), only two cliques (\(a\)–\(b\)–\(c\), \(a\)–\(b\)–\(d\)) would survive the threshold. If only these are further considered, the clique percolation method would find only one community (the formerly green community) and four nodes would be isolated (\(e\), \(f\), \(g\), \(h\)). If, however, \(I = 0.09\), all cliques would survive the threshold, leading to the same community partition as for the unweighted network.
The program that the developers of the clique percolation algorithm designed, CFinder, adds yet another intermediate step to this procedure, even though this intermediate step is not described in Farkas et al. (2007). Specifically, the Intensity threshold \(I\) is applied not only to the \(k\)-cliques but additionally to the overlap of the \(k\)-cliques. For instance, in the case of the 3-cliques in our example, the overlap of the 3-cliques \(a\)–\(b\)–\(c\) and \(a\)–\(b\)–\(d\) is the edge \(a\)–\(b\). If \(I = 0.17\), the two 3-cliques would survive the threshold. Subsequently, it is checked whether the \(a\)–\(b\) edge also exceeds the threshold. In the current case, it would not, because the edge weight is only 0.1. Given that the edge does not exceed the threshold, the two 3-cliques are not considered adjacent. Therefore, instead of putting them in the same community, the two 3-cliques are considered to be two separate communities with two shared nodes in CFinder.
It is challenging to optimize \(k\) and \(I\) for the clique percolation algorithm. For unweighted networks, only an optimal \(k\) needs to be found, as \(I\) is not used. Hence, the guidelines described next for optimizing \(k\) and \(I\) in weighted networks also apply to unweighted networks, albeit only for \(k\). Palla et al. (2005) and Farkas et al. (2007) provide guidelines based on ideas from percolation theory, which are best illustrated for weighted networks. Specifically, for each \(k\), a large \(I\) will lead to the exclusion of most if not all \(k\)-cliques from further consideration. The other extreme, a very low \(I\), will lead to the inclusion of most if not all \(k\)-cliques, which then often combine into a giant component to which all nodes belong. The optimal \(I\) for each \(k\) lies just above the point at which this giant component emerges. Percolation theory holds that at this point, the size distribution of the communities follows a power-law, that is, it is as broad as possible. An indicator of this point is the ratio of the largest to the second largest community size: \(I\) and \(k\) are optimal if the largest community is twice as large as the second largest.
Notably, applying this threshold requires a large number of communities, otherwise this point might not be very stable. For somewhat smaller, yet still rather large networks, Farkas et al. (2007) propose to instead check the variance of the community sizes after excluding the largest community. The formula they provide is
\[ \chi = \sum_{n_\alpha \ne n_{max}} \frac{n_\alpha^2}{\Big(\sum_{n_\beta \ne n_\alpha,\, n_{max}} n_\beta\Big)^2} \]
where \(\chi\) is the variance, the outer sum runs over all community sizes \(n_\alpha\) except that of the largest community, and the inner sum runs over all community sizes \(n_\beta\) excluding both \(n_\alpha\) and the largest community. For instance, if there are four communities with 9, 7, 6, and 3 nodes, respectively, then one would exclude the largest community and determine the variance of the remaining three by
\[ \chi = \frac{7^2}{(6+3)^2} + \frac{6^2}{(7+3)^2} + \frac{3^2}{(7+6)^2} \approx 1.02 \]
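The same computation in a short, package-independent Python sketch (the function name is hypothetical; note that the denominator for each community excludes both that community and the largest one):

```python
def chi(sizes):
    """Variance-like chi of community sizes: drop the largest community, then sum
    n_a^2 / (sum of the other remaining sizes)^2 over the remaining communities."""
    rest = sorted(sizes, reverse=True)[1:]   # exclude the largest community
    return sum(n ** 2 / (sum(rest) - n) ** 2 for n in rest)

print(round(chi([9, 7, 6, 3]), 2))  # 1.02, as in the example above
```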
The logic behind \(\chi\) is as follows. When \(I\) is very high, there will be many equally small communities of size \(k\). When \(I\) is very low, there will be one gigantic community and many equally small ones. Thus, after excluding the largest community, \(\chi\) will be small for both very high and very low \(I\). In contrast, when the community size distribution is broad (i.e., close to a power-law), the variance of the community sizes (after excluding the largest community) will be higher. Thus, the point of maximal variance is another way to optimize \(I\) for different \(k\).
Nevertheless, a stable estimate of \(\chi\) also requires a moderate number of communities. If the network is very small, such that only a few communities can be expected, the \(\chi\) threshold is also unreliable. For instance, networks that represent psychological constructs such as attitudes (Dalege et al., 2016) or emotions (Lange, Dalege, Borsboom, van Kleef, & Fischer, 2020) typically entail only up to 25 nodes. Moreover, it would be undesirable to have, for instance, two communities of which one is twice as large as the other. Instead, equally sized communities are preferable. An alternative way to optimize \(I\) for different \(k\) in very small networks is to rely on the entropy of the community partition, for instance, an entropy measure based on Shannon Information (Shannon, 1948). The idea is to find the most surprising community partition, defined by a low probability of knowing to which community a randomly picked node belongs. If there is only one community, surprisingness would be zero, because it would be certain that a randomly picked node belongs to this community. If there are, for instance, two communities of sizes 15 and 3, the partition would still not be very surprising, because a randomly picked node will most likely belong to the larger community. Surprisingness is higher, however, when the communities are equal in size. Entropy based on Shannon Information captures this idea in the equation
\[ Entropy = -\sum_{i=1}^N p_i \cdot \log_2 p_i \]
where \(N\) is the number of communities and \(p_i\) is the probability of being in community \(i\). For the four community sizes above (9, 7, 6, 3), the result would be
\[ \begin{aligned} Entropy &= -\Bigg(\Big(\frac{9}{9+7+6+3} \cdot \log_2 \frac{9}{9+7+6+3}\Big) \\ &\quad + \Big(\frac{7}{9+7+6+3} \cdot \log_2 \frac{7}{9+7+6+3}\Big) \\ &\quad + \Big(\frac{6}{9+7+6+3} \cdot \log_2 \frac{6}{9+7+6+3}\Big) \\ &\quad + \Big(\frac{3}{9+7+6+3} \cdot \log_2 \frac{3}{9+7+6+3}\Big)\Bigg) \\ &\approx 1.91 \end{aligned} \]
As pointed out, with only one community, this equation equals zero. When calculating entropy, isolated nodes are treated as another pseudo-community and shared nodes are split equally among the communities they belong to (e.g., a node shared by two communities is added as 0.5 nodes to each of them). As such, entropy favors equally sized communities with few isolated nodes in very small networks. This is exactly what would be desired. The \(I\) that has the maximum entropy for the respective \(k\) optimizes \(k\) and \(I\).
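A compact Python sketch of the entropy computation (the helper and its `n_isolated` argument are hypothetical illustrations, not the package's implementation; isolated nodes, if any, are counted as one extra pseudo-community as described above):

```python
from math import log2

def partition_entropy(sizes, n_isolated=0):
    """Shannon entropy of a community partition; p_i is the share of nodes in
    community i. Isolated nodes are treated as one additional pseudo-community."""
    if n_isolated > 0:
        sizes = list(sizes) + [n_isolated]
    total = sum(sizes)
    return -sum((n / total) * log2(n / total) for n in sizes)

print(round(partition_entropy([9, 7, 6, 3]), 2))  # 1.91, as in the example above
assert partition_entropy([25]) == 0               # a single community is maximally unsurprising
```

Equal community sizes maximize this quantity, which is why the entropy threshold favors balanced partitions with few isolated nodes.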
Because in very small networks a specific number of communities can occur by chance alone, permutation tests may complement the calculation of entropy. Specifically, the edges of the network are randomly shuffled a number of times and, for each of these random permutations, the highest entropy across a range of \(I\) values is determined separately for each \(k\). Over all permutations, this analysis leads to a distribution of entropy values for each \(k\). Only entropy values that exceed the upper bound of the confidence interval of the distribution for a given \(k\) can be considered more surprising than would be expected by chance alone. For large and very large networks, the entropy threshold should, however, not be used. As it favors equally sized communities, it would penalize the broad power-law distribution of community sizes that is preferred for larger networks.
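The permutation step itself only reshuffles the edges of the network. A minimal sketch of one such permutation for a symmetric weights matrix (plain Python with a hypothetical helper; in the package, cpPermuteEntropy handles the full loop and the confidence intervals):

```python
import random

def permute_edges(W, rng):
    """Return a copy of a symmetric weights matrix (list of lists) with its
    upper-triangular entries (i.e., the edges) randomly reshuffled."""
    n = len(W)
    vals = [W[i][j] for i in range(n) for j in range(i + 1, n)]
    rng.shuffle(vals)
    P = [[0.0] * n for _ in range(n)]
    it = iter(vals)
    for i in range(n):
        for j in range(i + 1, n):
            P[i][j] = P[j][i] = next(it)   # keep the matrix symmetric
    return P

rng = random.Random(4186)   # fixed seed for reproducibility, mirroring the example
W = [[0.0, 0.1, 0.0],
     [0.1, 0.0, 0.3],
     [0.0, 0.3, 0.0]]
P = permute_edges(W, rng)   # same multiset of edge weights, new positions
```

Repeating this many times and recomputing the entropy each time yields the null distribution against which the observed entropy is compared.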
More recently, Santiago, Soares, Quintero, and Jamieson (2022) used simulations to compare the performance of different thresholds for optimizing \(k\) and \(I\). They argued convincingly that even entropy may often fail to detect the best community partition. Their simulations instead supported the conclusion that thresholds based on fuzzy modularity outperform entropy. Details are available in their manuscript.
Finally, for unweighted networks, the optimization of \(k\) is similar. Yet, given the limited range of possible \(k\) values (e.g., 3 to 6), many thresholds will not be very sensitive. Nevertheless, a small \(k\) (e.g., \(k = 3\)) will often result in the emergence of a giant component to which most of the nodes belong, whereas increasing \(k\) (e.g., \(k = 6\)) will often lead to a large number of smaller communities. \(k\) is optimal when there is no giant component for very large networks, when \(\chi\) is maximal for large networks, or when entropy is maximal for very small networks.
In sum, the clique percolation algorithm proceeds by identifying \(k\)-cliques and by putting adjacent \(k\)-cliques (i.e., \(k\)-cliques that share \(k - 1\) nodes) into a community. For weighted networks, the Intensity of the \(k\)-cliques must further exceed \(I\); if not, they are removed from consideration. Finally, in the CFinder program, the \(k - 1\) overlap of \(k\)-cliques must also exceed \(I\); if not, the \(k\)-cliques are considered separate communities rather than a single one. Optimal \(k\) and \(I\) can be determined with the ratio test for very large networks, \(\chi\) for large networks, or entropy for very small networks.
The CliquePercolation package entails functions to conduct the clique percolation community detection algorithms for unweighted and weighted networks and to interpret the results. In what follows, an example is used to explain the workflow suggested by the package.
CliquePercolation accepts networks that were analyzed with the qgraph package (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012). qgraph accepts output from various different packages such as igraph (Csárdi & Nepusz, 2006), bootnet (Epskamp & Fried, 2018), or lavaan (Rosseel, 2012). For details see the documentation of qgraph. Moreover, I will use the Matrix (Bates & Maechler, 2019) package for the example below. Thus, we need to load the following packages in R to run the example
library(CliquePercolation) #version 0.3.0
library(qgraph) #version 1.6.5
library(Matrix) #version 1.2-18
First, I will create an example network. When you want to apply the clique percolation algorithm, you will already have a network. The network can be a qgraph object. Alternatively, you can also provide a symmetric matrix with rows and columns representing the nodes and each entry representing the edge connecting the respective nodes (i.e., an adjacency matrix or a weights matrix). The network I am using is weighted, with edge weights drawn from a random distribution with mean of 0.3 and a standard deviation of 0.1.
# create qgraph object; 150 nodes with letters as names; 1/7 of edges different from zero
W <- matrix(c(0), nrow = 150, ncol = 150, byrow = TRUE)
name.vector <- paste(letters[rep(seq(from = 1, to = 26), each = 26)],
                     letters[seq(from = 1, to = 26)], sep = "")[1:nrow(W)]
rownames(W) <- name.vector
colnames(W) <- name.vector
set.seed(4186)
W[upper.tri(W)] <- sample(c(rep(0,6),1), length(W[upper.tri(W)]), replace = TRUE)
rand_w <- stats::rnorm(length(which(W == 1)), mean = 0.3, sd = 0.1)
W[which(W == 1)] <- rand_w
W <- Matrix::forceSymmetric(W)
W <- qgraph::qgraph(W, theme = "colorblind", layout = "spring", cut = 0.4)
To run the clique percolation algorithm for weighted networks, we initially need to optimize \(k\) and \(I\). To this end, the cpThreshold function can be used. For a specific network (i.e., a qgraph object or a matrix), it determines the ratio of the largest to second largest community, \(\chi\), entropy, and/or (signed) fuzzy modularity over a range of \(k\) and \(I\) values. It does so by running the clique percolation algorithm for either unweighted (method = "unweighted") or weighted networks (method = "weighted"). The algorithm as implemented in CFinder can also be requested (method = "weighted.CFinder"). For each \(k\) and \(I\) combination, cpThreshold determines the respective threshold and saves the results in a data frame, specifying the respective \(k\), \(I\), the number of communities, the number of isolated nodes, and the requested thresholds. Note that the ratio threshold can be determined only when there are at least two communities and \(\chi\) can be determined only when there are at least three communities. Otherwise, the function will return NA values in these cases. Entropy and (signed) fuzzy modularity can always be determined.

In the current case, as the network is rather large, I will request only the ratio and \(\chi\) thresholds. I will do so for \(k\) values of 3 and 4 and \(I\) values of 0.40 to 0.01 in steps of 0.005. Even smaller steps might in many cases be preferred, given that even small changes of 0.005 can lead to rather different (and potentially optimal) values. However, to save computation time, I took these larger steps. The range of \(I\) values was chosen based on the fact that my mean edge weight was set to 0.3 when generating the network. Thus, \(I = 0.40\) should allow me to find a broad community size distribution. However, to be sure, one could start by setting the highest tested value of \(I\) to the maximum edge weight in the network. This is recommended by Farkas et al. (2007). But then, very high \(I\) values will rarely lead to the identification of any \(k\)-clique. The k.range and I.range are entered as vectors of numbers. The requested thresholds are also entered as a vector of strings. Depending on the network, this function may need lots of time to run (the code below ran for 18 minutes on my laptop). To see how long it takes, a progress bar is implemented.

The cpThreshold function is used as follows
thresholds <- cpThreshold(W, method = "weighted", k.range = c(3,4),
                          I.range = c(seq(0.40, 0.01, by = -0.005)),
                          threshold = c("largest.components.ratio","chi"))
We can now access the data frame that is saved in the object thresholds

thresholds
To decide in favor of optimal \(I\) values for each \(k\), we now check at which \(I\) the ratio threshold jumps abruptly to a high level, thereby crossing the ratio 2. The \(I\) just before the jump should ideally also be the point of the highest \(\chi\). For \(k = 3\), the ratio first crossed 2 at \(I = 0.35\) with \(\chi\) being rather large. This solution has 45 communities and nine isolated nodes. For \(k = 4\), the ratio first crossed 2 at \(I = 0.27\). The \(\chi\) is not very large and subsequently the nodes never reach a stage at which they are all part of a single community, most likely because \(k\) is too large to allow that to happen. This already indicates that \(k = 4\) is probably too high. This solution has 62 communities and 29 isolated nodes. Hence, the number of isolated nodes is also rather high.

Still, we can now use these optimized values to run the clique percolation method. For this, we apply the cpAlgorithm function. Specifically, the cpAlgorithm function takes a network (i.e., a qgraph object or a matrix) and runs the clique percolation algorithm for unweighted (method = "unweighted") or weighted networks (method = "weighted"). The algorithm as implemented in CFinder can also be requested (method = "weighted.CFinder"). Additionally, we enter the optimal \(k\) and \(I\) values determined via cpThreshold. Thus, we run cpAlgorithm twice, namely
cpk3I.35 <- cpAlgorithm(W, k = 3, method = "weighted", I = 0.35)
cpk4I.27 <- cpAlgorithm(W, k = 4, method = "weighted", I = 0.27)
These objects have the class cpAlgorithm. They are lists, entailing the results of the clique percolation algorithm. When printing the objects in the console, they offer a brief summary of the results. For instance

cpk3I.35
#>
#> Results of clique percolation community detection algorithm
#>
#> --------------------
#>
#> User-specified Settings
#>
#> method = weighted
#> k = 3
#> I = 0.35
#>
#> --------------------
#>
#> Results
#>
#> Number of communities: 45
#> Number of shared nodes: 94
#> Number of isolated nodes: 9
#>
#> --------------------
#>
#> For details, use summary() (see ?summary.cpAlgorithm).
Using summary, it is possible to get more detailed information about the communities, shared nodes, and isolated nodes.
summary(cpk3I.35)
By default, this function will present the communities, shared nodes, and isolated nodes with labels as identifiers of the nodes. It is also possible to use the numbers as identifiers of the nodes or to restrict the output. For instance, when only the shared nodes with numbers as identifiers of the nodes should be requested
summary(cpk3I.35, details = "shared.nodes.numbers")
Alternatively, it is possible to access all information directly from the objects. Each element can be requested via the $ operator. For instance, the list of communities with the node numbers as identifiers of the nodes for the \(k = 3\) solution can be requested by typing

cpk3I.35$list.of.communities.numbers
As an example, community 45 entails nodes 9, 12, and 33. The same results with the labeled nodes can be requested via
cpk3I.35$list.of.communities.labels
The nodes 9, 12, and 33 in community 45 are named “ai”, “al”, and “bg”.
To decide in favor of one solution, we can investigate the community size distribution. This should be maximally broad, similar to a power-law. To look at the community size distribution, we can use another function, namely cpCommunitySizeDistribution. It takes the list of communities generated by cpAlgorithm and turns it into a distribution plot. As the plot may not please you visually, the function also returns a data frame with the frequency data if you assign the results to an object. Thus, you are free to plot these data in any way you want. The default line color of the distribution is red, but could be changed via the color.line argument.
cpCommunitySizeDistribution(cpk3I.35$list.of.communities.numbers)
cpCommunitySizeDistribution(cpk4I.27$list.of.communities.numbers)
Both distributions look very much like a power-law, given that very large community sizes occur rarely. To actually test whether the distribution for \(k = 3\) fits a power-law, cpCommunitySizeDistribution has another argument, test.power.law. It tests the fit via the fit_power_law function of the igraph package (Csárdi & Nepusz, 2006). The following code runs the test

fit_pl_k3I.35 <- cpCommunitySizeDistribution(cpk3I.35$list.of.communities.numbers, test.power.law = TRUE)
The object fit_pl_k3I.35 now includes the output of fit_power_law, which can be accessed via

fit_pl_k3I.35$fit.power.law
#> $continuous
#> [1] FALSE
#>
#> $alpha
#> [1] 2.72733
#>
#> $xmin
#> [1] 3
#>
#> $logLik
#> [1] -88.19429
#>
#> $KS.stat
#> [1] 0.1599964
#>
#> $KS.p
#> [1] 0.1995392
The Kolmogorov-Smirnov Test is not significant, \(p = .20\), indicating that the hypothesis of a power-law cannot be rejected. This finding is in line with the notion that the distribution follows a power-law with \(\alpha = 2.73\). Thus, we decide in favor of \(k = 3\) and \(I = .35\) as the optimal parameters for clique percolation.
Oftentimes, researchers are interested in plotting the solution of the clique percolation algorithm in another network, such that a node represents a community, and the edges between nodes represent the number of nodes that two communities share (i.e., the community graph). Plotting this network can be done with the cpCommunityGraph function. It also takes the list of communities from cpAlgorithm and turns it into the community network. To showcase the relative sizes of the communities, the node sizes can be plotted proportional to the size of the largest community (node.size.method = "proportional"). To do this, we set the size of the largest node via max.node.size and the others are scaled accordingly. If the proportional visualization is not preferred, we can also plot all nodes with equal size (node.size.method = "normal"). As the community network is plotted via qgraph, one can also add all other arguments available in qgraph to make the graph more pleasing. Moreover, next to plotting the community network, the function also returns the weights matrix of the community graph if the results are assigned to an object. This matrix could then be used for other purposes.
The current community network can be plotted as follows, by additionally using a spatial arrangement and edge coloring from qgraph.
commnetwork <- cpCommunityGraph(cpk3I.35$list.of.communities.numbers,
                                node.size.method = "proportional",
                                max.node.size = 20,
                                theme = "colorblind", layout = "spring", repulsion = 0.4)
The weights matrix can be accessed via
commnetwork$community.weights.matrix
As was already clear from the community size distribution, most communities are rather small, while community 5 is very large. Yet there is overlap among many communities, that, in the case of an actual network, should then be interpreted thematically.
For very small networks, the above procedures may not all apply. First, the ratio and \(\chi\) thresholds cannot be determined or are not very informative for very small networks. Second, a community network may not be the most informative representation of the results, if there are only very few communities to begin with.
To showcase the somewhat altered workflow with very small networks, we will use a very small, unweighted network.
W <- matrix(c(0,1,1,1,0,0,0,0,0,0,
              0,0,1,1,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,1,1,0,0,0,0,
              0,0,0,0,0,1,0,0,0,0,
              0,0,0,0,0,0,1,1,1,0,
              0,0,0,0,0,0,0,1,1,0,
              0,0,0,0,0,0,0,0,1,0,
              0,0,0,0,0,0,0,0,0,1,
              0,0,0,0,0,0,0,0,0,0), nrow = 10, ncol = 10, byrow = TRUE)
W <- forceSymmetric(W)
rownames(W) <- letters[seq(from = 1, to = nrow(W))]
colnames(W) <- letters[seq(from = 1, to = nrow(W))]
W <- qgraph(W, edge.width = 4)
Again, we will first use cpThreshold. As the network is unweighted, we need to specify only the k.range but not the I.range. As the network is very small, we will request entropy instead of the ratio and \(\chi\).

thresholds.small <- cpThreshold(W, method = "unweighted", k.range = c(3,4),
                                threshold = "entropy")
You can now access the data frame that is saved in the object thresholds.small

thresholds.small
#>   k Number.of.Communities Number.of.Isolated.Nodes Entropy.Threshold
#> 1 3 3 1 1.8567796
#> 2 4 1 6 0.9709506
I display it here because, as compared to the previous example, it is short. For \(k = 3\), there are three communities and one isolated node with an entropy of 1.86. For \(k = 4\), there is one community and six isolated nodes with an entropy of 0.97.
Now, we can estimate whether this entropy is higher than would already be expected by chance. The permutation test for very small networks is implemented in the function cpPermuteEntropy. For a specific network (i.e., a qgraph object or a matrix) and a cpThreshold object, cpPermuteEntropy creates n permutations of the network, performs cpThreshold for each, extracts the highest entropy for each \(k\), and determines the confidence interval of the entropy for each \(k\). n = 100 and the 95% confidence interval are the default settings. Larger n are certainly desired but can lead to large computation time. A progress bar again indicates how long it will take. Given that permutations are random, I will set a seed to allow reproducing the results. As the function relies on parallel computation, it is not possible to set the seed outside the function. Instead, it is necessary to set the seed argument inside the function. By default, the function uses half of the computational power of the computer, but it is possible to set the number of used cores with the ncores argument. The function can be used as follows

permute <- cpPermuteEntropy(W, cpThreshold.object = thresholds.small,
                            n = 100, interval = 0.95,
                            ncores = 2, seed = 4186)
#> Permutating...
The resulting object has the class cpPermuteEntropy. The printing function shows the results, following a repetition of the specified settings. First, it shows the confidence intervals for each \(k\). Second, the function also takes the original cpThreshold object specified in cpThreshold.object and removes all rows that do not exceed the upper bound of the confidence interval.

permute
#>
#> Confidence intervals for entropy values of random permutations of original network
#>
#> --------------------
#>
#> User-specified Settings
#>
#> n = 100
#> interval = 0.95
#> CFinder = FALSE
#> ncores = 2
#> seed = 4186
#>
#>
#> --------------------
#>
#> Confidence intervals
#>
#> k 95% CI lower 95% CI upper
#> 3 1.152 1.276
#> 4 0.086 0.227
#>
#>
#> --------------------
#>
#> Extracted rows from cpThreshold object
#>
#> k Number.of.Communities Number.of.Isolated.Nodes Entropy.Threshold
#> 3 3 1 1.857
#> 4 1 6 0.971
The resulting object is also a list with the two respective objects, accessible via
permute$Confidence.Interval
permute$Extracted.Rows
In the current example, both solutions, for \(k = 3\) and \(k = 4\), exceed the upper bound of the confidence interval and are thus still in the data frame.
Now we can choose an optimal \(k\). Even though both exceed the upper bound of the confidence interval, the number of isolated nodes for \(k = 4\) is rather high, leading us to accept \(k = 3\) as the optimal \(k\).
Using this optimal \(k\), we run cpAlgorithm

cpk3 <- cpAlgorithm(W, k = 3, method = "unweighted")
By inspecting the results with

cpk3
#>
#> Results of clique percolation community detection algorithm
#>
#> --------------------
#>
#> User-specified Settings
#>
#> method = unweighted
#> k = 3
#>
#> --------------------
#>
#> Results
#>
#> Number of communities: 3
#> Number of shared nodes: 2
#> Number of isolated nodes: 1
#>
#> --------------------
#>
#> For details, use summary() (see ?summary.cpAlgorithm).
summary(cpk3)
#>
#> --------------------
#> Communities (labels as identifiers of nodes)
#> --------------------
#>
#> Community 1 : f g h i
#> Community 2 : d e f
#> Community 3 : a b c d
#>
#>
#> --------------------
#> Shared nodes (labels as identifiers of nodes)
#> --------------------
#>
#> d f
#>
#>
#> --------------------
#> Isolated nodes (labels as identifiers of nodes)
#> --------------------
#>
#> j
we see that we indeed have three communities, one entailing nodes \(a\), \(b\), \(c\), and \(d\), one entailing nodes \(d\), \(e\), and \(f\), and one entailing nodes \(f\), \(g\), \(h\), and \(i\). We further see that nodes \(d\) and \(f\) are shared nodes and that node \(j\) is isolated.
We could have also requested the fuzzymod threshold for such a small network using
thresholds.small.fuzzymod <- cpThreshold(W, method = "unweighted", k.range = c(3,4),
                                         threshold = c("entropy","fuzzymod"))
We can again inspect the results with
thresholds.small.fuzzymod
#>   k Number.of.Communities Number.of.Isolated.Nodes Entropy.Threshold  FuzzyMod
#> 1 3 3 1 1.8567796 0.3211111
#> 2 4 1 6 0.9709506 0.1033333
showing that fuzzy modularity favors the same \(k\). One advantage, however, is that we do not need to run any permutation test for fuzzy modularity.
Finally, we would like to plot the results of the clique percolation algorithm. However, using the community graph for a network with only three communities would not be very informative. In the current case, the community network would have three nodes with two edges. For very small networks, an alternative may be reasonable, namely plotting the community partition directly on the original network that was analyzed. This can be achieved with the cpColoredGraph function. It takes the community partition from cpAlgorithm and assigns qualitatively different colors to the communities with the help of the colorspace package (Zeileis et al., 2019). It is based on the HCL color space, generating palettes of colors that are rather balanced and visually pleasing. Isolated nodes will be displayed white. Shared nodes will be divided in multiple colors, one for each community they belong to. Original qgraph arguments can be used to make the network appear as before.

colored.net1 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               edge.width = 4)
The function also returns the colors assigned to the communities and the colors assigned to the nodes, if the results are assigned to an object. The colors assigned to the nodes are a list with vectors for each node. If a node is shared, all its colors are in the vector. These results can be accessed by
colored.net1$colors.communities
colored.net1$colors.nodes
The function used for generating qualitatively different colors in
the package colorspace is called
qualitative_hcl
. It generates colors of a range of hue
values, holding chroma and luminance constant.
cpColoredGraph
uses the recommended defaults from
qualitative_hcl
. However, these defaults can be
overwritten, if different colors are desired. For instance, to plot the
graph from before with colors in the range from yellow to blue, lower
chroma, and higher luminance, the following code could be used
colored.net2 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               h.cp = c(50, 210), c.cp = 70, l.cp = 70,
                               edge.width = 4)
Oftentimes, the clique percolation algorithm will be run on a
network for which there are predefined sets of nodes. For instance, it
could be that in the example network, nodes \(a\), \(b\), \(c\), \(d\), \(e\), and \(f\) are from one set (e.g., one
questionnaire) and nodes \(g\), \(h\), \(i\), and \(j\) are from another set (e.g., another
questionnaire). Such sets of nodes can be taken into account, when
plotting the network. For this, the list.of.sets
argument
needs to be specified, a list of vectors in which each vector codes one
set of nodes. If this is done, cpColoredGraph
assigns
qualitatively different colors to the sets. Then, it checks whether the
identified communities are pure (i.e., they entail nodes from only one
set) or they are mixed (i.e., they entail nodes from multiple sets). For
pure communities of each respective set, it takes the assigned color and
fades it toward white, using colorspace’s
sequential_hcl
. It does so such that there are as many
non-white colors as there are pure communities of a set. Larger
communities will be assigned stronger colors.
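As an illustrative sketch (the hue value below is made up; cpColoredGraph derives it from the color assigned to the set), fading two pure communities of a set could look like

library(colorspace)

#palette of length 2 + 1; the last (whitest) color is dropped so that
#white remains reserved for isolated nodes
pal <- sequential_hcl(3, h = 245)
pal[1:2] #the strongest color goes to the larger community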
If a community is mixed with nodes from different sets, the colors of these sets will be mixed proportionally to the number of nodes from each set. For instance, if a community entails two nodes from one set and one node from another set, the colors will be mixed 2:1. For this, the colors are mixed subtractively (how paint mixes), which is difficult to implement. The mixing is done with an algorithm taken from Scott Burns (see http://scottburns.us/subtractive-color-mixture/ and http://scottburns.us/fast-rgb-to-spectrum-conversion-for-reflectances/). In short, colors are translated into reflectance curves. This is achieved by taking optimized reflectance curves of red, green, and blue and integrating them according to the RGB value of the color. The reflectance curves of the colors that have to be mixed are averaged via the weighted geometric mean. The resulting reflectance curve is translated back to RGB. According to substantive tests performed by Scott Burns, the mixing works nicely and is computationally efficient. Yet, it may not always produce precise results.
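The idea can be sketched in a few lines of R. The helper below is hypothetical and simplifies the procedure by taking the weighted geometric mean directly over RGB channels rather than over full reflectance curves, so it only approximates the mixing actually performed in the package:

#mix hex colors via the weighted geometric mean of their RGB channels
#(a crude stand-in for the reflectance-curve mixing described above)
mix_colors <- function(hex, weights) {
  channels <- col2rgb(hex) / 255      #3 x n matrix, one column per color
  channels[channels == 0] <- 1e-4     #geometric mean breaks down at zero
  w <- weights / sum(weights)
  mixed <- apply(channels, 1, function(ch) prod(ch^w))
  rgb(mixed[1], mixed[2], mixed[3])
}

#two nodes from the first set, one from the second -> weights 2:1
mix_colors(c("#E16A86", "#6A8B91"), weights = c(2, 1))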
In the current example, with the list of sets mentioned above, two communities will be pure (\(a\)–\(b\)–\(c\)–\(d\), \(d\)–\(e\)–\(f\)) and one will be mixed (\(f\)–\(g\)–\(h\)–\(i\)). The two pure communities will hence be faded from the assigned color towards white, making the smaller community (\(d\)–\(e\)–\(f\)) lighter. The mixed community will be mixed 3:1, as there are three nodes from one set and one node from the other set. This could be achieved with the following code (using the first coloring).
#define list.of.sets
list.of.sets1 <- list(c("a","b","c","d","e","f"), c("g","h","i","j"))

colored.net3 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               list.of.sets = list.of.sets1,
                               edge.width = 4)
Now you can access another element that cpColoredGraph
returns, namely a vector with the colors assigned to the sets.

colored.net3$basic.colors.sets
Note that the colors are not identical to the colors in the network, as they were faded for two communities and mixed for another community. The actual colors assigned to the sets look like this
The community entailing \(f\), \(g\), \(h\), and \(i\) is a mix of both colors.
Moreover, the fading for pure communities always generates the faded
palette for the number of pure communities of a set plus 1 (the plus one
prevents that one community becomes white, which is reserved for
isolated nodes). As such, if there are different numbers of pure
communities for different sets in a network, or if across different
networks there are different numbers of pure communities for the same
set, their luminance values are not directly comparable. To make them
comparable, we can assign a scale to the fading, such that we specify
how many faded colors the set palettes should have. This can be achieved
by specifying the set.palettes.size
argument. For instance,
if it is set to 5, then for all pure communities of all sets,
cpColoredGraph
will generate five faded colors. Larger
communities will get the stronger colors, and so forth. As such, if there are
more pure communities of one set than of another set in the same network,
the set with more pure communities will contain lighter colors than the
set with fewer. Moreover, if one set of nodes produces more
communities in one network than in another, its nodes will overall be
lighter in the network in which it produced more communities.
In our example network, there were two pure communities of one set.
Per default cpColoredGraph
generated two faded non-white
colors and assigned them to the communities. Thus, one community is
rather strongly colored and the other is almost white. This exaggerates
their difference, even though they are rather similar in size. Scaling them such
that all pure communities are assigned colors from a set of five colors
can be achieved as follows
colored.net4 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               list.of.sets = list.of.sets1, set.palettes.size = 5,
                               edge.width = 4)
The larger community kept its color, but the smaller community \(d\)–\(e\)–\(f\) is now somewhat darker.
One disadvantage of the color mixing is that sometimes two separate communities will have the same mixed color. Their similar coloring makes sense, as both communities are structurally similar when defining a priori sets of nodes. However, it can be confusing when two different communities turn out to have the same color. For instance, imagine that nodes \(a\), \(d\), \(e\), \(f\), and \(g\) are in one set, whereas nodes \(b\), \(c\), \(h\), \(i\), and \(j\) are in the other set. Then, the communities \(a\)–\(b\)–\(c\)–\(d\) and \(f\)–\(g\)–\(h\)–\(i\) both have two nodes each from both sets, leading to the same mixed color as verified by
#define list.of.sets
list.of.sets2 <- list(c("a","d","e","f","g"), c("b","c","h","i","j"))

colored.net5 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               list.of.sets = list.of.sets2,
                               edge.width = 4)
The colors of the two large communities are indeed identical, which can be seen when inspecting the colors of the nodes.
colored.net5$colors.nodes
#> [[1]]
#> [1] "#6A8B91"
#>
#> [[2]]
#> [1] "#6A8B91"
#>
#> [[3]]
#> [1] "#6A8B91"
#>
#> [[4]]
#> [1] "#E16A86" "#6A8B91"
#>
#> [[5]]
#> [1] "#E16A86"
#>
#> [[6]]
#> [1] "#6A8B91" "#E16A86"
#>
#> [[7]]
#> [1] "#6A8B91"
#>
#> [[8]]
#> [1] "#6A8B91"
#>
#> [[9]]
#> [1] "#6A8B91"
#>
#> [[10]]
#> [1] "#ffffff"
To avoid situations in which mixed communities end up with the same
color, we can specify the avoid.repeated.mixed.colors
argument. When set to TRUE, the function identifies communities that received
the same color and resolves them by introducing small random changes in
the mixing of the colors. As the procedure relies on randomness, we have
to set a seed to reproduce results.
set.seed(4186)
colored.net6 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               list.of.sets = list.of.sets2,
                               avoid.repeated.mixed.colors = TRUE,
                               edge.width = 4)
The colors still look almost identical, underlining that the communities are structurally similar. However, the nodes do have different colors, as verified by
colored.net6$colors.nodes
#> [[1]]
#> [1] "#698C91"
#>
#> [[2]]
#> [1] "#698C91"
#>
#> [[3]]
#> [1] "#698C91"
#>
#> [[4]]
#> [1] "#E16A86" "#698C91"
#>
#> [[5]]
#> [1] "#E16A86"
#>
#> [[6]]
#> [1] "#6A8B91" "#E16A86"
#>
#> [[7]]
#> [1] "#6A8B91"
#>
#> [[8]]
#> [1] "#6A8B91"
#>
#> [[9]]
#> [1] "#6A8B91"
#>
#> [[10]]
#> [1] "#ffffff"
Furthermore, we may want to assign our own colors to the communities,
which is possible by assigning a vector of Hex colors to the
own.colors
argument. It is possible to assign colors to the
communities (when list.of.sets = NULL
) or to the a priori
sets (when list.of.sets
is not NULL
). For
instance, if I want to assign the colors red, green, and blue to the
three communities, I use the following code
colored.net6 <- cpColoredGraph(W, list.of.communities = cpk3$list.of.communities.labels,
                               own.colors = c("#FF0000","#00FF00","#0000FF"),
                               edge.width = 4)
One final feature is part of cpColoredGraph
. In each
case, there is a palette of qualitatively different colors generated,
either for the elements in list.of.communities
(when
list.of.sets = NULL
) or the elements in
list.of.sets
(when list.of.sets
is not
NULL
). Zeileis et al. (2019)
argue that the palettes generated with qualitative_hcl
in
colorspace can produce palettes of at most six
colors that people can still distinguish visually.
For instance, let’s say we have a somewhat larger network with 11
communities. Plotted with qualitative colors generated via
qualitative_hcl
, the solution would look as follows
#generate network with 11 communities
W <- matrix(c(0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow = 25, ncol = 25, byrow = TRUE)
W <- forceSymmetric(W)
rownames(W) <- letters[seq(from = 1, to = nrow(W))]
colnames(W) <- letters[seq(from = 1, to = nrow(W))]
W <- qgraph(W, DoNotPlot = TRUE)
#run the clique percolation method
cpk3.large <- cpAlgorithm(W, k = 3, method = "unweighted")
#plot the network colored according to community partition (with qualitative_hcl)
colored.net.large1 <- cpColoredGraph(W, list.of.communities = cpk3.large$list.of.communities.labels,
                                     edge.width = 4, layout = "spring", repulsion = 0.9)
#> Warning in cpColoredGraph(W, list.of.communities = cpk3.large$list.of.communities.labels, : There are more than 6 communities. The colors might be hard to distinguish.
#> Set larger.six = TRUE to get maximally different colors.
#> Yet, colors might be visually a little less pleasing.
The warning message already mentions the problem and foreshadows a
solution. The colors become harder to distinguish, making it difficult
to identify the communities. To generate a larger set of qualitatively
different colors that are maximally different, the
createPalette
function from the package
Polychrome (Coombes, Brock,
Abrams, & Abruzzo, 2019) can be used. It generates maximally
different colors from HCL space. When specifying
larger.six = TRUE
(default is FALSE
) in
cpColoredGraph
, the qualitatively different colors for the
communities or sets are generated with createPalette
. The
rest of the procedure is identical. As createPalette
relies
partly on a random procedure, I will set a random seed to allow
reproducing the results. In the current example, the network would look
as follows
set.seed(4186)
colored.net.large2 <- cpColoredGraph(W, list.of.communities = cpk3.large$list.of.communities.labels,
                                     larger.six = TRUE,
                                     edge.width = 4, layout = "spring", repulsion = 0.9)
Indeed, the communities are easier to identify. However, given that the goal is to generate maximally different colors, the display may be visually less pleasing and balanced.
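For reference, the underlying palette call looks roughly like the following. The seed colors are purely illustrative; cpColoredGraph chooses them internally:

library(Polychrome)
set.seed(4186)

#eleven maximally different colors, grown from a few seed colors in HCL space
palette11 <- createPalette(11, seedcolors = c("#E16A86", "#009ADE", "#50A315"))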
Note that for larger networks with more than six communities,
plotting results with cpColoredGraph
may not be preferable
anyway. Instead, plotting the community network via
cpCommunityGraph
might be more informative.