mzipmed
R
packagemzipmed_data
overviewmzip()
function for analysis with zero-inflated count variablesIn this document we explain the main features of the
mzipmed
package through examples. This package performs
mediation analysis when either the outcome or mediator are a count
variable with many zeroes, called a zero-inflated count. Specifically,
this package merges the concepts of the counterfactual approach to
mediation and the marginalized zero-inflated Poisson (MZIP) model to
obtain mediation effects when there are excess zeroes. Additional
information on statistical methodology is provided elsewhere (Long et
al. 2014, Vanderweele 2016)
Mediation analysis is used to explore how a variable called a mediator helps explain the relationship between some exposure and an outcome. Many popular mediation method such as the difference and product methods collapse when there are interaction effects or the outcome is not a continuous variable. The counterfactual approach to mediation involves finding mediation effects through closed-form expressions that is extendable to many different classes of models and can account for interaction effects while minimizing computation speed. If we let \(Y\) be our outcome, \(M\) be our mediator, \(X\) be our exposure, and \(C\) be a vector of covariates. The counterfactual approach involves fitting two regression models, one for the outcome and one for the mediator.
\[g(E[Y|X=x,M=m,\textbf{C=c}])=g(E[Y|x,m,c])=\tau_0+\tau_1x+\tau_2m+\mathbf{ \tau_4'c}\] \[h(E[M|X=x,\textbf{C=c}])=h(E[M|x,c])=\theta_0+\theta_1x++\mathbf{ \theta_4'c}\] Where \(g\) and \(h\) are link functions associated with the outcome and mediator model respectively and \(\tau\) and \(\theta\) are the parameter estimates associated with the outcome and mediator model respectively. Using parameter estimates from these equations mediation effects can be estimated including the total effect (TE) or the overall effect of the exposure on the outcome, natural direct effect (NDE) or the effect of the exposure on the outcome not operating through the mediator, and natural indirect effect (NIE) or the effect operating through the mediator. Basic formulas for these equations are as follows \[NDE=\sum_m(E[Y|x,m,c]-E[Y|x^*,m,c])P(m|x^*,c)\] \[NIE=\sum_m(E[Y|x,m,c])(P(m|x,c)-P(m|x^*,c))\] Where \(x^*\) is a fixed value of the exposure to be compared to \(x\). The total effect can then be computed as the sum of NDE and NIE on a difference scale and the product of the NDE and NIE if the outcome is on a ratio scale.
Standard count regression models, such as Poisson, give biased parameter estimates and popular count models like the zero-inflated Poisson model do not directly provide population average parameter estimates. In response the marginalized zero-inflated Poisson (MZIP) model was proposed. MZIP is a mixture distribution that models the probability of being an excess zero from a Bernoulli distribution with probability \(\psi_i\) and the overall mean of the distribution from a Poisson distribution with mean \(\nu_i\), where \(\nu_i\) is the product of the mean of the non-excess zeroes, \(\mu_i\), and \(1-\psi_i\). MZIP is fit by \[logit(\psi_i)=\mathbf{Z_i'\gamma}\] \[log(\nu_i)=\mathbf{Z_i'\alpha}\] Where \(\alpha\) and \(\gamma\) are \((p\times 1)\) column vectors of parameters associated with the overall population mean and probability of being an excess zero models. \(\mathbf{Z_i}\) is a \((p\times 1)\) vector of covariates for the i-th individual for the excess zero and overall population mean components of MZIP. Exponentiating \(\alpha\) parameters yield incidence density or incidence rate ratio interpretations equivalent to Poisson model interpretations.
mzipmed_data
overviewIncluded in this package is a simulated dataset for examples called
mzipmed_data
. This dataset includes 500 observations of 10
variables. These variables are
X
is a binary exposure simulated from a Bernoulli
distribution with p=0.5C1
is a covariate that follows a standard normal
distributionC2
is a second covariate from a ~Beta(2,2)ziM
is a zero-inflated count mediator simulated from
the predictors X
,C1
, and C2
lmM
is a continuous mediator based on the predictors
with error term ~N(0,4)binM
is a binary mediator based on the predictors
simulated with a Bernoulli distributionlmY
is a continuous outcome simulated from
ziM
and the predictors with error term ~N(0,4)binY
is a binary outcome simulated from
ziM
and the predictors with a Bernoulli distributionziY1
is a zero-inflated count outcome simulated from
lmM
and the predictorsziY2
is a another zero-inflated count outcome, but
instead uses binM
and the predictorsmzip()
function for analysis with zero-inflated count
variablesMZIP can be performed with the mzip()
function. For
example, assume we would like to observe how our X
is
associated with ziY1
from mzipmed_data
. First
we need to load the mzipmed
package
library(mzipmed)
Next we can use the mzip()
function to conduct this
analysis. First we need to specify the y
argument which in
our case y=mzipmed_data$ziY1
. For the covariates we specify
the pred
argument. For a single predictor we could use
pred=mzipmed_data$X
. Suppose we adjust for C1
and C2
then we bind all the predictors into a vector using
the cbind()
function in base R,
e.g. pred=cbind(mzipmed_data$X,mzipmed_data$C1,mzipmed_data$C2)
.
If we want our covariates named in the outcome we need to specify within
the cbind()
function,
e.g. pred=cbind(X=mzipmed_data$X,C1=mzipmed_data$C1,C2=mzipmed_data$C2)
otherwise they will be named X1,X2,… in order they are included. The
final argument is print=FALSE
, if print=TRUE
parameter estimates, standard errors, p-values, and risk ratios will be
printed into the R console. The default is TRUE. In additional an offset
variable can be specified using the offset
argument,
e.g. offset=variable
. No offset is included in this
dataset. Using this example
=mzip(y=mzipmed_data$ziY1, pred=cbind(X=mzipmed_data$X,C1=mzipmed_data$C1,C2=mzipmed_data$C2), offset=NULL, print=TRUE)
test#> [1] "Overall Mean Estimates from MZIP"
#> Variable Alpha_Estimate SE P_Value Robust_SE Robust_P_Value
#> 1 intercept -0.520 0.200 0.0092 0.237 0.0282
#> 2 X 0.249 0.150 0.0974 0.209 0.2340
#> 3 C1 0.064 0.069 0.3605 0.073 0.3845
#> 4 C2 0.886 0.322 0.0059 0.366 0.0153
#> [1] "Excess Zero Estimates from MZIP"
#> Variable Gamma_Estimate SE P_Value Robust_SE Robust_P_Value
#> 1 intercept 0.515 0.259 0.0469 0.264 0.0508
#> 2 X 0.288 0.199 0.1495 0.205 0.1614
#> 3 C1 0.040 0.085 0.6436 0.081 0.6273
#> 4 C2 -0.095 0.410 0.8172 0.399 0.8124
#> [1] "Overall Mean Incidence Rate Ratio Estimates"
#> Variable incrate_ratio Lower_CI Upper_CI RobLower_CI RobUpper_CI
#> 1 X 1.283 0.956 1.722 0.851 1.934
#> 2 C1 1.066 0.930 1.221 0.923 1.230
#> 3 C2 2.426 1.291 4.558 1.185 4.967
We find no evidence that X
is significantly associated
with the outcome and we obtain a relative risk estimate of 1.28. The
robust standard errors are often used if over-dispersion is a concern as
they typically provide wider confidence intervals. In the
test
output one can also find covariance matrices,
parameter estimates, confidence intervals, etc.
Assume we want to examine how a continuous mediator explains the
relationship between an exposure and a zero-inflated count mediator. We
would fit the following models \[logit(\psi_i|x,m,c)=\gamma_0+\gamma_1x+\gamma_2m+\mathbf{\gamma_4'c}\]
\[log(\nu_i|x,m,c)=\alpha_0+\alpha_1x+\alpha_2m+\mathbf{\alpha_4'c}\]
\[E(M|x,c)=\theta_0+\theta_1x+\mathbf{\theta_4'c}\]
The overall mean outcome model is on a incidence rate ratio (IRR) scale
so NDE and NIE estimates will be as well and can be computed by \[IRR^{NDE}=e^{\alpha_1(x-x^*)}\] \[IRR^{NIE}=e^{\alpha_2\theta_1(x-x^*)}\] To
conduct this mediation analysis in the mzipmed
package we
would use the zioutlmmed()
function. This functions only
allows for a single outcome, mediator, and exposure at a time. Using the
mzipmed
data let \(X=X\)
and \(M=lmM\) and our outcome is
ziY1
. We specify these variables with the following
arguments
outcome=mzipmed_data$ziY1
,mediator=mzipmed_data$lmM
,
exposure=mzipmed_data$X
, and for the covariates/confounders
confounder=cbind(mzipmed_data$C1,mzipmed_data$C2
). Standard
errors can be computed using bootstrapping which is robust, but
computationally intensive or via delta method for instantaneous variance
estimation. For delta method use error="Delta"
which is the
default and for bootstrap use error="Boot"
. Robust standard
errors can be requested within the delta method to be used for concerns
of over-dispersion by specifying robust=TRUE
; FALSE is the
default. We also need to input values of the exposure to compare to
using the X
and Xstar
arguments, for a
dummy-coded binary exposure and by default we let X=1
and
Xstar=0
. These can be changed as needed. For bootstrap
standard error you can specify the number of iterations, by default
there are 1000 repetitions, n=1000
. An offset variable for
the ZI count outcome can be specified with the zioff
argument using the same syntax as the
outcome
,exposure
arguments. As an example
=zioutlmmed(outcome=mzipmed_data$ziY1,mediator=mzipmed_data$lmM,exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2), error="Delta", robust=FALSE,X=1,Xstar=0,n=1000, zioff=NULL)
ziout#> Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR) 1.021 0.147 0.766 1.361
#> Natural Indirect Effect (IRR) 1.163 0.068 1.018 1.328
#> Total Effect (IRR) 1.187 0.162 0.865 1.630
#> Proportion Mediated 0.889
Printed into the R console are the incidence rate ratios of the NDE,
NIE, and TE along with their standard errors and confidence intervals
and the proportion mediated. From the ziout data we see no evidence of a
significant direct effect of the exposure on the outcome as the
incidence rate ratio includes 1 in the confidence interval, but do see
that a significant effect operating through the mediator since 1 is not
in the confidence interval. Additional information in the ziout in the R
global environment include all information included in the outcome model
using mzip()
and the mediator model using lm()
from the stats
package in base R.
The mzip
package also includes a function for mediation
with zero-inflated count outcomes and binary mediators. For this
procedure a MZIP model is fit for the outcome and a logistic regression
is fit for the mediator model using the glm()
function in
base R. The aforementioned function is called
zioutbinmed()
. The NDE is the same as with continuous
mediators, but the NIE is now conditional on covariates
\[IRR^{NIE}=\frac{([1+e^{\theta_0+\theta_1x^*+\mathbf{\theta_2'c}}][1+e^{\alpha_2
\theta_0+\theta_1x+\mathbf{\theta_2'c}}])}{([1+e^{\theta_0+\theta_1x+\mathbf{\theta_2'c}}][1+e^{\alpha_2\theta_0+\theta_1x^*+\mathbf{\theta_2'c}}])}\]
Therefore fixed values of the covariates are required. The input for the
zioutbinmed()
function is the same as in the previous
section, but also includes an argument to specify these covariates
called C
. By default, C=NULL
each covariate
will be fixed at its mean value to approximate marginal estimates, but
can be changed using for example C=c(0,2)
. The remainder of
the input and output for this function is the same as with the a
continuous mediator and we now use
outcome=mzipmed_data$ziY2
for the outcome variable.
=zioutbinmed(outcome=mzipmed_data$ziY2,mediator=mzipmed_data$binM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2), error="Delta", robust=FALSE,X=1, Xstar=0, n=1000, C=NULL, zioff=NULL)
zioutb#> Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR) 1.267 0.138 0.966 1.661
#> Natural Indirect Effect (IRR) 1.008 0.011 0.987 1.029
#> Total Effect (IRR) 1.277 0.138 0.974 1.674
#> Proportion Mediated 0.037
Assume we have a continuous outcome, \(Y\), and a zero-inflated count mediator
\(M\). We fit a linear regression
outcome model and a MZIP mediator model as follows: \[E(Y|x,m,c)=\beta_0+\beta_1x+\beta_2m+\mathbf{\beta_4'c}\]
\[logit(\psi_i(M)|x,c)=\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}\]
\[log(\nu_i(M)|x,c)=\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}\]
Because the outcome is on a difference scale, so will the NDE and NIE
estimates which can be derived by \[\beta_1(x-x^*)\] \[\beta_2[e^{\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}}-e^{\alpha_0+\alpha_1x^*+\mathbf{\alpha_4'c}}]\]
For this mediation analysis the lmoutzimed()
function could
be used. For example with the mzipmed_data
we assume the
same set of predictors as in the section with ZI outcomes, but now let
mediator=mzipmed_data$ziM
and
outcome=mzipmed_data=lmY
. Input required is the same as in
the section for ZI outcomes with the C
argument shown in
the ZI outcomes with binary mediators example and this function uses the
lm()
function for the outcome variable and the
mzip()
function for the mediator model with all effects
being on a difference scale.Covariates and confounders can be specified
in the confounder
argument.Standard errors can be computed
using delta method, error="Delta"
or via bootstrap
error="Boot"
. For delta method robust standard errors can
be used with the robust=TRUE
argument. For bootstrap the
number of iterations can be specified with the n
argument
with 1000 being the default. An offset for the zero-inflated count
mediator model can be specified with the zioff
argument.
For example,
=lmoutzimed(outcome=mzipmed_data$lmY,mediator=mzipmed_data$ziM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2),error="Delta", robust=FALSE,X=1,Xstar=0,C=NULL, zioff=NULL)
zimed#> Estimate SE Lower CI Upper CI
#> Natural Direct Effect 1.388 0.366 0.670 2.106
#> Natural Indirect Effect 0.072 0.049 -0.025 0.169
#> Total Effect 1.460 0.367 0.741 2.179
#> Proportion Mediated 0.049
After running the function we can see in the R console that the NDE
is statistically significant as the mean difference does not include 0
in the confidence interval, but the NIE is not significant because it
includes 0. Additional information on the individual outcome and
mediator models can be found in the zimed
data now in the
global environment.
Note: when using this functions extension with an interaction an
additional argument is required if an offset is specified. This argument
called OFF
requires a fixed value of the offset variable to
derive effects. If none are specified then the mean of the offset is
used. If no offset is specified then this argument is no needed.
For binary outcomes we do not use a logistic regression outcome model
because of the non-collapsability of logit-links makes deriving closed
form expression of NDE and NIE impossible so in this package we use a
Poisson outcome model with a log-link with robust covariance structure
using the vcovHC()
function in the sandwich
package. Assuming the same input as in the section for continuous
outcomes, but now outcome=mzipmed_data$binY
and NDE and NIE
estimates will now be on a ratio scale because of the Poisson outcome
model. The outcome model will be
\[log[P(Y_i=1|x,m,c)]=\tau_0+\tau_1x+\tau_2m+\mathbf{\tau_4'c}\]
The same MZIP mediator model can be used as shown in the previous section. Risk ratios of NDE and NIE can be derived as displayed below.
\[RR^{NDE}=e^{\tau_1(x-x^*)}\] \[RR^{NIE}=\frac{(1+e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}})(e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}}+e^{(e^{\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}+log(1+e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}})})(e^{\tau_2}-1)}}{(1+e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}})(e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}}+e^{(e^{\alpha_0+\alpha_1x^*+\mathbf{\alpha_4'c}+log(1+e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}})})(e^{\tau_2}-1)}}\]
For this scenario we will use the binoutzimed()
function. For example,
=binoutzimed(outcome=mzipmed_data$binY,mediator=mzipmed_data$ziM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2),error="Delta", robust=FALSE,X=1,Xstar=0,C=NULL, zioff=NULL, OFF=NULL, rare=FALSE)
zimedb#> Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (RR) 1.325 0.104 1.081 1.624
#> Natural Indirect Effect (RR) 1.011 0.010 0.991 1.032
#> Total Effect (RR) 1.340 0.103 1.094 1.641
#> Proportion Mediated 0.044
If the outcome is a rare binary variable then odds ratios will
approximate risk ratios which may provide better fit for the data. To
accommodate this the functions for binary outcomes also include an
additional argument, rare
. By default
rare=FALSE
the robust Poisson model is used, but if
rare=TRUE
a logistic regression is fit for the outcome
model. In addition, if an offset is specified then a fixed value of the
offset variable is required in deriving mediation effects. This can be
done using the OFF
argument and should be a numeric value.
If this argument is left NULL
then the mean of the offset
variable is used. If no offset is used then this argument is not needed.
The rest of the input and output are all the same as in the section with
continuous outcomes, except effects being on a ratio scale. For the
zimedb example we find a significant NDE as 1 is not included in the
risk ratio confidence interval, but fail to observe a significant
NIE.
The mzipmed
package extends all the previous functions
to cases where you are considering if there is an exposure-mediator
interaction. In addition to the NDE and NIE additional quantities can be
computed including the controlled direct effect (CDE), pure indirect
effect (PIE), interactive reference effect (INTref), and interaction
mediation effect (INTmed) which are the exposure-outcome relationship
due to neither mediation nor interaction, only mediation, only
interaction, and the tandem effect of mediation and interaction
respectively. The interaction effects are for additive interactions,
multiplicative interactions can be assessed directly in the outcome
model. Also computed is the proportion eliminated (PE) which answers the
question ‘if we were to fix the mediator to some value how much of the
exposure differences in the outcome could be eliminated?’.
The functions to do this are as follows: for a continuous outcome and
zero-inflated mediator use lmoutzimedint
, for a binary
outcome and ZI mediator use binoutzimedint
, for a ZI
outcome and continuous mediator use zioutlmmedint
, and for
a ZI outcome with a binary mediator use zioutbinmedint
. We
will skip model expression and formulas for effects, but they are
available upon request. Instead we will show an example using
zioutlmmedint
. These functions have the same input as their
non-interaction counterparts, but also require an additional argument in
M
which is a fixed value of the mediator used to compute
CDE, INTref, INTmed, and PE. By default, M=NULL
the mean
value of the mediator is used, but can be altered e.g. M=0
.
Running the function will print the NDE, NIE, and TE in R console like
before as well as the other 4 effects, the overall additive interaction
effect, and PM, PE, and the proportion attributable to interaction.
=zioutlmmedint(outcome=mzipmed_data$ziY1,mediator=mzipmed_data$lmM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2),error="Delta", robust=FALSE,X=1,Xstar=0,M=NULL,C=NULL)
zimout#> Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR) 1.105 0.241 0.689 1.772
#> Natural Indirect Effect (IRR) 1.186 0.079 1.017 1.384
#> Total Effect (IRR) 1.311 0.266 0.778 2.208
#>
#> 4-way Decomposition
#> Controlled Direct Effect (IRR) 0.990 0.148 0.741 1.323
#> Pure Indirect Effect (IRR) 1.145 0.063 1.011 1.296
#> Interactive Reference Effect 0.132 0.053 0.028 0.235
#> Interactive Mediation Effect 0.071 0.073 -0.072 0.213
#> Total Additive Interaction 0.202 0.107 -0.007 0.411
#>
#> Proportions
#> Proportion Mediated 0.662
#> Proportion Eliminated 1.032
#> Proportion from Interaction 0.562
From this example we fail to see a significant NDE, but do have a significant mediation effect based on incidence rate ratio interpretations. The interaction effects are on a difference scale regardless of the scale of the outcome variable so because 0 is included in the total additive interaction confidence interval we can conclude that we do not have evidence of a significant additive interaction effect between the exposure and mediator on the outcome. Investigating further into the zimout data in the global environment proportions attributable to each of the decomposed effects are provided. Because the interaction was not significant, it would be recommended to use the function that does not include the interaction term. It is always useful to check for an interaction effect initially.
This package provides functions for conducting mediation analysis for zero-inflated count variables using MZIP. This document serves as a brief overview of the package. For more information feel free to contact the package maintainer directly.
Long DL, Preisser JS, Herring AH, Golin CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Stat Med. 2014;33(29):5151-5165. doi:10.1002/sim.6293
VanderWeele T. Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press; 2016.