##Introduction In this vignette, we give a brief introduction to the denoiSeq package for differential expression analysis of RNA-Seq and similar data. The statistical details of the methods used can be found in the associated publication.
The denoiSeq package is based on a model for PCR sequencing developed by @ndifon2012chromatin. The model generates, in a bottom-up manner, a probability distribution for the final copy number of a gene, which is in the form of a superposition of the negative binomial and the binomial distributions. Throughout this text, the word gene is used synonymously with any genomic attribute such as exons, transcripts, peptide sequences, etc. The derived distribution has three main parameters, i.e \(N, p\) and \(f\), which represent the initial gene amount before amplification, the amplification efficiency and the dilution rate, respectively. We then use Bayesian inference to estimate the model parameters. Uniform priors are chosen for parameters \(p\) and \(f\) and a non informative prior is chosen for \(N\).
The package only performs a two group comparison and assumes that the user has the data in the form of an \(m\) by \(n\) matrix of integers that contains counts from both groups. We prefer that the matrix have atleast the rownames but this is not necessary. Each \(i,j\) element of this matrix corresponds to counts for gene \(i\) in sample \(j\), whereby \(j\) has been derived from one of the two experimental groups (conditions), i.e A and B. For an \(m\) by \(n\) matrix, inference aims at estimating the three sets of parameters, i.e \(p\), \(f\) and \(N_i\)'s (\(2m\) in total because we are considering 2 conditions with the same \(m\) genes in each). In the analysis, the counts from different experimental samples are first normalised using estimated size factors [@wolf] in order to make them comparable. We then perform differential analysis to determine if there is a signal of a significant difference between the counts for each gene (the same row) in the two conditions. For this kind of analysis, the primary parameters of interest are \(N_{iA}\) and \(N_{iB}\), for each gene \(i\).
##Preparation
We begin the analysis by creating a readsData object to handle the count data. This object has six slots namely; counts, geneNames, replicates, initValues, stepSize and output. The first four slots are specified before the analysis and the latter (output) is used to store the results of the Bayesian inference and is set by the results
function.
Briefly, the counts
slot contains the matrix of counts; geneNames
contains the names of the genes; replicates
contains a list of two elements (\(A\) and \(B\)), each of which is a vector of indices for the columns that belong to each respective condition; initValues
contains the initial values; and stepSize
contains the step sizes for sampling all parameters to be estimated.
Before we can create the readsData object, we first load the package.
library(denoiSeq)
This package is distributed with two datasets, i.e ERCC and simdat. The ERCC dataset contains RNA-seq data of biological replicates characterized during the ENCODE project [@djebali2012landscape] and simdat contains simulated data. In this first example, we will use the simdat dataset. This dataset contains 750 genes with 5 experimental samples for each condition, summarised as a 750 by 10 integer matrix. The first 428 genes are not differentially expressed between the two conditions whereas the last 322 genes are. The genes counts were generated in accordance to the probability distribution derived by @ndifon2012chromatin. We can examine the counts of the first and last 6 genes of this dataset using the head
and tail
commands in R respectively.
head(simdat)
tail(simdat)
Because Bayesian inference is computationally intensive, users are advised to perform pre-filtering before creating the readsData object, in order to save time. An example would be
simdat <- simdat[ rowSums(simdat) > 0, ],
to eliminate the non-expressed genes. <!–since as has been reported [@bullard2010evaluation], it is difficult to detect accurately a signal of difference in such genes.–> This also helps to increase the computational efficiency of the analysis. <!–Threshold values of 30 for large datasets (> 400 genes) and 15 for smaller datasets (<=100 genes) were chosen based on analysis of existing annotated real datasets.–>
From the output of head()
above, we notice that the simdat count matrix has no gene names and so we have to explicitly specify the geneNames slot. We begin by creating gene names below:
genenames <- paste("gene_", 1:750, sep = "_").
We then create a readsData object:
RD <- new("readsData", counts = simdat, geneNames = genenames)
With the exception of output, all the other remaining slots are set to default values. See the class documentation for more details using ?readsData
. In the event that the object has already been formed, the default values can be altered using the class methods setReplicates
, setInitValues
and setStepSizes
respectively.
##Bayesian inference
Once the readsData is in place, we implement Bayesian inference by calling the denoiseq
function, specifying the number of steps to be used for posterior sampling, and a smaller number steps to be used for step size tuning.
steps <- 30 #30 steps are just for illustration here. Atleast 5000 steps are adequate.
tuningSteps <- 10
BI <- denoiseq(RD, steps, tuningSteps)
The denoiSeq
function then returns the same readsData object, but with a filled output slot.
In this example, the above chunk of code ran for about 1.5 minutes on a core i5, 8GB memory PC. We note that 30 steps is a very small number of iterations to obtain adequately accurate results. This piece of code is just meant for instruction purposes. At least 5000 steps should be adequate in this example, with about a third of those used for step size tuning (takes over an hour on the same computer). A smaller dataset is used in the case study section below.
##Differential analysis
The results of the inference are used to determine differential expression. This is achieved by calling the results
function, which returns a data frame with three columns, i.e the log2 fold change (log2FC), the standard error of the log2 fold change (lgfcSE) and the test static (ROPE_propn), and with as many rows as the number of genes.
The results
function has the same arguments as denoiseq
, with an additional rope_limits
argument which is defaulted to 0.5. This parameter represents the upper bound of the log (to base 2) of the fold change that we consider non-significant.
rez <- results(BI, steps, tuningSteps)
head(rez)
##Other functions
For diagnostics about the behavior of the posterior samples generated, the package has the getSamplesOf
function that takes as argument, the name of a model parameter (i.e \(p,f\),or the gene name for any of the \(N_i\)'s) and returns a vector of posterior samples for that parameter. The other arguments are a readsData
object with a populated output slot, the number of steps used in sampling and the experimental condition being considered. Instead of a gene name, the function can also take the row index of the counts matrix. In the following example, we extract the posterior samples for the parameter \(N_{50}\) (i.e the initial count of the \(50 ^{th}\) gene) from condition \(B\).
N_50 <- getSamplesOf(BI, "gene_50", steps, condition = "B")
With these posterior samples, diagnostics such as evaluating the acceptance rate, creating a history plot, and determining the autocorrelation between posterior parameter samples can be performed, as illustrated below:
acceptance_rate <- length(unique(gene_50)) / steps
plot(gene_50, type="l", main = "History plot of N_50")
AL <- acf(gene_50, lag.max = 30, type = c("correlation"), plot = T)
If one wishes, one can also have a look at the tuned step sizes using tunedStepSize
. This function takes the readsData object returned by denoiseq
as its only argument.
SS <- tunedStepSize(BI)
The returned value can then be assigned to the stepSize
slot for use in another run of Gibbs sampling implemented with denoiseq
.
##Case study: Analysis of the ERCC dataset In this case study, we will use the ERCC dataset which contains a mixture of 92 spike-in synthetic oligonucleotides that are mixed into experimental samples A and B at four mixing ratios: ½, 2/3, 1 and 4. The genes with a mixing ratio of 1:1 (rows 24 to 46), represent the only genes with non-significant expression differences between conditions A and B. There are five experimental samples for each condition.
###Data preview
dim(ERCC)
## [1] 92 10
head(ERCC, 4)
A_1 | A_2 | A_3 | A_4 | A_5 | B_1 | B_2 | B_3 | B_4 | B_5 | |
---|---|---|---|---|---|---|---|---|---|---|
ERCC-00130 | 220103 | 182768 | 140678 | 268448 | 181856 | 66630 | 71108 | 77412 | 65282 | 64374 |
ERCC-00004 | 45385 | 32634 | 28388 | 52615 | 37544 | 12005 | 12531 | 14706 | 11119 | 11394 |
ERCC-00136 | 6650 | 4778 | 4185 | 7700 | 5882 | 1749 | 1786 | 2238 | 1694 | 2377 |
ERCC-00108 | 16725 | 14520 | 10626 | 20273 | 11323 | 4903 | 5544 | 5792 | 4963 | 3258 |
###Pre-filtering We pre-filtered the dataset by selecting rows with a rowsums above 0.
ERCC <- ERCC[rowSums(ERCC) > 0, ]
dim(ERCC)
## [1] 92 10
###Creating the readsData object Since the count matrix has row names, these were automatically assigned to the geneNames slot.
reps <- list(A=c(1,2,3,4,5), B=c(6,7,8,9,10))#specifying the columns for each condition
m <- dim(ERCC)[1]
initvalues <- list(N_A = rep(1, m), N_B = rep(1, m), p = 0.0001, f = 0.01)
stepsizes <- list(stepsizeN_A = rep(1, m), stepsizeN_B = rep(1, m)
, stepsize_p = 5e+07, stepsize_f = 1e3)
RD2 <- new("readsData", counts = ERCC, replicates = reps, initValues
= initvalues, stepSizes = stepsizes)
###Bayesian inference
The tuningSteps argument has been set to the default value = floor(steps / 3)
.
steps <- 100#100 steps are just for illustration here. Atleast 5000 steps are adequate.
BI2 <- denoiseq(RD2, steps)
NOTE: A more adequate value for the number of steps is 5000 and this took about 16 minutes on the same PC.
###Differential expression analysis In this case, we will explicitly set the value for the 'rope_limits' argument.
rope = 0.5
rez2 <- results(BI2, steps, rope_limit = rope)
head(rez2)
log2FC(B / A) | lfcSE | ROPE_propn | |
---|---|---|---|
ERCC-00130 | 0.2263791 | 0.0026065 | 0.0000000 |
ERCC-00004 | 0.2139779 | 0.0033292 | 0.0000000 |
ERCC-00136 | 0.2386529 | 0.0078052 | 0.0000000 |
ERCC-00108 | 0.2478648 | 0.0049706 | 0.0000000 |
ERCC-00116 | 0.3540129 | 0.0149003 | 0.0149254 |
ERCC-00092 | 0.2835223 | 0.0113460 | 0.0000000 |
A threshold value of the statistic is then chosen to determine significance. One can be determined by arranging the ROPE_propn column in ascending order and selecting the most differentially expressed genes. Values ranging from 0.00075 to 0.4 were obtained for the Youden's index [@youden1950index] for a number of annotated datasets. The Youden's index is a statistic that captures the performance of a binary classifier. It is defined for all points of an ROC curve and its maximum value may be used as a criterion for selecting the optimum cut-off point when a diagnostic test gives a numeric rather than a binary result.
#Re-ordering according to most differentially expressed
rez3 <- rez2[with(rez2, order(ROPE_propn)), ]
head(rez3, 10)
In this case, we chose a value of 0.38.
#Determine significant genes
sgf <- rez2[rez2$ROPE_propn < 0.38, ]
head(sgf)
dim(sgf)
Posterior samples of parameter \(N\) for gene \(ERCC-00051\) in condition A are obtained and examined below, for the perfomance of Gibbs sampling.
samples <- getSamplesOf(BI2, "ERCC-00051", steps)
acceptance_rate <- length(unique(samples)) / steps
acceptance_rate
## [1] 0.56
The history plot and autocorrelation function can then be plotted as shown below.
plot(samples, type="l", main = "History plot of ERCC-00051")
AL <- acf(samples, main = "Auto lag plot of ERCC-00051", lag.max = 30,
type = c("correlation"), plot = T)
##session info
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: AIMS Desktop 2017.1
##
## locale:
## [1] LC_CTYPE=en_ZA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_ZA.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_ZA.UTF-8 LC_MESSAGES=en_ZA.UTF-8
## [7] LC_PAPER=en_ZA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_ZA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] denoiSeq_0.1.1 knitr_1.15.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.1 magrittr_1.5 rprojroot_1.2 htmltools_0.3.5
## [5] tools_3.3.3 Rcpp_0.12.10 stringi_1.1.2 rmarkdown_1.6
## [9] highr_0.6 stringr_1.2.0 digest_0.6.12 evaluate_0.10
##References