User Guide for the R Package: marlod

1. Introduction

This vignette provides instructions on how to perform marginal modeling for exposure data with repeated measures and non-detects. Environmental exposure and biomonitoring data from environmental and occupational studies often exhibit right-skewed distributions and left-censored. Left censoring occurs when laboratory instruments have a limit of detection (LOD) below which measurements are not provided.

To conduct regression models for repeated measures data with non-detects, it is necessary to install the R package:

library(marlod)
#> Thank you for using marlod.
#>   Please cite:
#>   Chen, IC. marlod: an R package to model environmental exposure and biomonitoring
#>   data with repeated measurements and values below the limit of detection.
#>   J Expo Sci Environ Epidemiol 2025.
#>   when publishing results obtained from the package.

After building and installing the package, you can view the vignette by using the command:

browseVignettes(package = 'marlod')

2. Substitution Methods

To impute values that are below the LOD, the function Fillin() can be used. This section demonstrates the function using an example from Ganser and Hewett (2010). The available substitution methods for this function include: “None”, “LOD”, “LOD2”, “LODS2”, and “QQplot”. A detailed description of each substitution method can be found in the Reference Manual of the Documentation.

y <- c(0,0,0,3.06,4.41,7.23,8.29,9.52,19.94,20.25) 

## Limit of detection (LOD) = 3
lod <- 3

Fillin(y, lod, n, tp, "BetaMean")
#>  [1]  1.762431  1.762431  1.762431  3.060000  4.410000  7.230000  8.290000
#>  [8]  9.520000 19.940000 20.250000
Fillin(y, lod, n, tp, "BetaGM")
#>  [1]  1.482424  1.482424  1.482424  3.060000  4.410000  7.230000  8.290000
#>  [8]  9.520000 19.940000 20.250000

Consider the same example of a longitudinal dataset comprising five subjects, each measured at two time points. The analysis requires the number of subjects (n) and the number of time points (tp). The available substitution methods include “MIWithID” and “MIWithIDRM”.

y <- c(0,0,0,3.06,4.41,7.23,8.29,9.52,19.94,20.25) 
lod <- 3
n <- 5
tp <- 2

#Multiple imputation method with one covariate using id (order of subjects)
Fillin(y, lod, n, tp, "MIWithID")
#> X 
#> DL 
#> Z 
#> K = 5
#>  [1]  1.821762  1.808160  2.610667  3.060000  4.410000  7.230000  8.290000
#>  [8]  9.520000 19.940000 20.250000

#Multiple imputation method with two covariates using id and visit (order of time points)
#Fillin(y, lod, n, tp, "MIWithIDRM")

3. Outcome Data: Normal or Log-Normal Distribution

Conventional statistical analyses often assume that responses or outcome data follow a normal distribution. If the underlying distribution is log-normal, data transformation using the natural logarithm required. To analyze such data, marginal models utilizing estimation methods such as generalized estimating equations (GEE), quadratic inference functions (QIF), and generalized method of moments (GMM) are employed. These methods incorporate the imputation of measurements below the LOD using single and multiple value imputation techniques (Chen et al., 2024).

3-1. Simulated Dataset 15

The 15th dataset from a simulation study includes 100 subjects (sample size is 100), each with three repeated measurements. The independent variables or covariates (x1 and x2) are simulated from a Bernoulli distribution with a parameter value of \(p = 0.5\) and a uniform distribution \(U(0, 1)\), respectively. Correlated errors for repeated measures models are accounted for and assumed to follow a multivariate normal distribution, \(MVN(0, R(\alpha))\). A first-order autoregressive (AR-1) correlation structure with a correlation parameter of \(\alpha = 0.7\) is incorporated. The true values of 1, 1, and 1 correspond to the marginal intercept and two slopes.

data(simdata15)
head(simdata15)
#>           y int x1        x2 id visit
#> 1  3.879122   1  1 0.7528836  1     1
#> 2  4.008937   1  1 0.7528836  1     2
#> 3  4.380099   1  1 0.7528836  1     3
#> 4  1.847881   1  0 0.8511011  2     1
#> 5  1.806600   1  0 0.8511011  2     2
#> 6 -0.146508   1  0 0.8511011  2     3

3-2. Marginal Modeling (Mean Regression Model)

To conduct regression analysis using the 15th simulated dataset, assume the LOD is 2; thus, any measurements below 2 would be imputed using substitution approaches, such as “QQplot”. For repeated measures in a cluster study, an “exchangeable” correlation structure is used, while “AR-1” is appropriate for longitudinal datasets where subjects are measured over time. The example below employs the function Modified.GEE() for the GEE method. The atomic vector “typed” specifies the types of time-dependent covaraites, with the length of the vector equal to the number of regression parameters, excluding the intercept. For time-independent covariates or those in a cluster study, “1” is assigned, thus “c(1,1)” is used for “typed”. The maximum number of iterations is set to 1,000.

id=as.matrix(as.vector(t(simdata15$id)))
y=as.matrix(as.vector(t(simdata15$y)))
x1=as.matrix(as.vector(t(simdata15$x1)))
x2=as.matrix(as.vector(t(simdata15$x2)))
x=cbind(x1,x2)

## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%).
lod=2

## Intercept is not included in the "x" and "typed".
## Modified.GEE(id, y, x, lod, substitue, corstr, typetd, maxiter)
Modified.GEE(id, y, x, lod, "QQplot", "AR-1", c(1,1), 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).

Parameter	Estimate	Standard Error	Wald Statistic	p-value	TECM	CIC
Typical Asymptotic SE
0	1.41889291637309	0.117296637893661	12.0966205157511	0	0.0826889924640298	6.63582649603307
1	0.70842894519093	0.120271120631658	5.89026643694925	5.55713504102329e-08	0	0
2	1.14012150968324	0.233378123963833	4.88529726059467	4.07412042346955e-06	0	0
Bias-Corrected SE (MD)
0	1.41889291637309	0.121569957897582	11.671410773774	0	0.0891285922808028	7.1239067842054
1	0.70842894519093	0.124580417761904	5.68651926135669	1.36804927164391e-07	0	0
2	1.14012150968324	0.242547020447278	4.70062055423626	8.57351781924365e-06	0	0
Bias-Corrected SE (KC)
0	1.41889291637309	0.119412406036419	11.8822906552972	0	0.0858454535090291	6.87523734400911
1	0.70842894519093	0.122403870291727	5.78763517446403	8.76355334966661e-08	0	0
2	1.14012150968324	0.237914739625835	4.79214323364871	5.94118008701017e-06	0	0
Bias-Corrected SE (AVG)
0	1.41889291637309	0.120496011092954	11.7754347509356	0	0.0858454535090291	6.99957206410726
1	0.70842894519093	0.12349693913641	5.73640893567755	1.0986445742045e-07	0	0
2	1.14012150968324	0.240242045091072	4.74572013092478	7.15943746354419e-06	0	0
Estimated correlation	0.664821305215864	0	0	0	0	0
Estimated dispersion	0.511033918398329	0	0	0	0	0

The QIF method is performed using the function Modified.GEE(), which typically demonstrates efficiency advantages over GEE in large sample sizes.

## Gets initial estimates for the QIF approach through independence structure
initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian)
beta_initial=as.matrix(initial$coefficients)

## Intercept is not included in the "x" and "typed".
## Modified.QIF(id, y, x, lod, substitue, corstr, beta, typetd, maxiter)
Modified.QIF(id, y, x, lod, "QQplot", "exchangeable", beta_initial, c(1,1), 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).

	Parameter	Estimate	Standard Error	Wald Statistic	p-value	TECM	CIC
	Typical Asymptotic SE
(Intercept)	0	1.43745433084147	0.163267610463611	8.80428351195755	5.10702591327572e-14	0.12389623547046	7.60882967081079
x1	1	0.71367264050939	0.148937691261059	4.79175307785898	5.95052545859787e-06	0	0
x2	2	1.12391823394195	0.27396621500796	4.10239720218529	8.51430721415802e-05	0	0
	Bias-Corrected SE (MD)
(Intercept)	0	1.43745433084147	0.123035028606317	11.6832933443775	0	0.0911928228535921	5.86921377221324
x1	1	0.71367264050939	0.126351479859632	5.6483124796182	1.61730083991785e-07	0	0
x2	2	1.12391823394195	0.245133653598839	4.5849201749395	1.3554024130169e-05	0	0
	Bias-Corrected SE (KC)
(Intercept)	0	1.43745433084147	0.12084505122269	11.895020245327	0	0.0878152589014619	5.66351137004654
x1	1	0.71367264050939	0.124140764428436	5.74889838801342	1.0398148631019e-07	0	0
x2	2	1.12391823394195	0.240417975832862	4.67485107986806	9.49951150097661e-06	0	0
	Bias-Corrected SE (AVG)
(Intercept)	0	1.43745433084147	0.121944956167059	11.7877309240427	0	0.0878152589014619	5.76636257112989
x1	1	0.71367264050939	0.125250999707778	5.69793967452918	1.30116281304993e-07	0	0
x2	2	1.12391823394195	0.242787264112316	4.6292306066846	1.13820322262814e-05	0	0

Another estimation method, GMM, is employed using the function Modified.GMM().

## Gets initial estimates for the GMM approach through independence structure
initial=glm(y ~ x1 + x2, data=simdata15, family=gaussian)
beta_initial=as.matrix(initial$coefficients)

## Intercept is not included in the "x" and "typed".
## Modified.GMM(id, y, x, lod, substitue, beta, maxiter)
Modified.GMM(id, y, x, lod, "QQplot", beta_initial, 1000)
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).

	Parameter	Estimate	Standard Error	Wald Statistic	p-value	TECM	CIC
	Typical Asymptotic SE
(Intercept)	0	1.43745433084149	0.11869769948294	12.1102122206513	0	0.0845688751173853	5.46551267185539
x1	1	0.713672640509396	0.12197480319244	5.85098415271417	6.6182999303166e-08	0	0
x2	2	1.12391823394191	0.235800506023648	4.76639449547743	6.5895608287736e-06	0	0
	Bias-Corrected SE (MD)
(Intercept)	0	1.43745433084149	0.123035028606323	11.683293344377	0	0.0911928228536039	5.86921377221358
x1	1	0.713672640509396	0.126351479859629	5.64831247961839	1.61730083991785e-07	0	0
x2	2	1.12391823394191	0.245133653598862	4.58492017493892	1.3554024130169e-05	0	0
	Bias-Corrected SE (KC)
(Intercept)	0	1.43745433084149	0.120845051222696	11.8950202453265	0	0.0878152589014733	5.66351137004685
x1	1	0.713672640509396	0.124140764428433	5.74889838801362	1.0398148631019e-07	0	0
x2	2	1.12391823394191	0.240417975832885	4.67485107986746	9.49951150097661e-06	0	0
	Bias-Corrected SE (AVG)
(Intercept)	0	1.43745433084149	0.121944956167065	11.7877309240423	0	0.0878152589014733	5.76636257113022
x1	1	0.713672640509396	0.125250999707775	5.69793967452937	1.30116281304993e-07	0	0
x2	2	1.12391823394191	0.242787264112338	4.62923060668401	1.13820322262814e-05	0	0

3-3. Time-Dependent Covariates

To handle time-dependent covariates, we use the 58th dataset from a simulation study, which includes 100 subjects (sample size is 100), each with three measurements collected over time. Detailed model mechanism are described in the setting II for type III time-dependent covariate on page 90 of Lai and Small (2007).

data(simdata58)
head(simdata58)
#>             y int         x1 id visit
#> 1  2.13315183   1 -0.1061468  1     1
#> 2 -0.93398126   1 -0.1406264  1     2
#> 3  0.38660420   1 -0.2905621  1     3
#> 4  0.70977961   1  0.1533462  2     1
#> 5 -0.07722418   1  0.3018084  2     2
#> 6  0.71894327   1 -0.1685652  2     3

Failure to account for time-dependency in covariates can lead to inefficient regression parameter estimation. The function Selected.GEE() allows the selection of a type of time-dependent covariate through a marginal mean regression model using the GEE estimation method for longitudinal data with values below the LOD. The selection approach applied to choose a type of time-dependency is the empirical mean squared error (MSE) minimization criterion (Chen and Westgate, 2017, 2019). Given the longitudinal nature of the dataset, “AR-1” is used as the working correlation structure.

id=as.matrix(as.vector(t(simdata58$id)))
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))

## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05

## Intercept is not included in the "x".
## Selected.GEE(id, y, x, lod, substitue, corstr, maxiter)
Selected.GEE(id, y, x1, lod, "MIWithID", "AR-1", 1000)
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5
#> Selected type of time-dependent covariate
#> 3
#> Results based on the empirical MSE minimization criterion (EMMC)

Type	EMMC
1	0.0015242
2	0.0014020
3	0.0011396

Once the type of time-dependent covaraite is selected, i.e., in this case, type 3, it can be updated in the “typed” vector of the function Modified.GEE().

id=as.matrix(as.vector(t(simdata58$id)))
y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))

Modified.GEE(id, y, x1, lod, "MIWithID", "AR-1", c(3), 1000)
#> X 
#> DL 
#> Z 
#> K = 5
#> Results include typical asymptotic standard error (SE) estimates and bias-corrected SE estimates that use covariance inflation corrections.
#> Covariance inflation corrections are based on the corrections of Mancl and DeRouen (MD) (2001), and Kauermann and Carroll (KC) (2001).
#> Last covariance inflation correction averages the above corrections (AVG) (Ford and Westgate, 2017, 2018).

Parameter	Estimate	Standard Error	Wald Statistic	p-value	TECM	CIC
Typical Asymptotic SE
0	0.452821002051275	0.0373341050873522	12.1288832554521	0	0.0024966420688547	2.18478785302503
1	0.250525250480249	0.0332085330326595	7.54400232716889	2.35171881968199e-11	0	0
Bias-Corrected SE (MD)
0	0.452821002051275	0.0378475410095741	11.9643440491082	0	0.00258446114934623	2.26294778050196
1	0.250525250480249	0.0339414906696044	7.3810915648617	5.17350606799027e-11	0	0
Bias-Corrected SE (KC)
0	0.452821002051275	0.0375896330058092	12.0464331743089	0	0.00253997288454896	2.22334596836551
1	0.250525250480249	0.0335707070976699	7.46261464649452	3.48875373035185e-11	0	0
Bias-Corrected SE (AVG)
0	0.452821002051275	0.0377188074439185	12.0051781256484	0	0.00253997288454896	2.24314687443373
1	0.250525250480249	0.0337566079746794	7.42151731205237	4.25589563590734e-11	0	0
Estimated correlation	0.141636045224919	0	0	0	0	0
Estimated dispersion	0.365171591232733	0	0	0	0	0

The QIF method can be performed using the function Selected.QIF(), and the selected type can be then updated in the “typed” vector of the function Modified.QIF().

## Gets initial estimates for the QIF approach through independence structure
initial=glm(y ~ x1, data=simdata58, family=gaussian)
beta_initial=as.matrix(initial$coefficients)

## Intercept is not included in the "x" and "typed".
## Selected.QIF(id, y, x, lod, substitue, corstr, beta, maxiter)
Selected.QIF(id, y, x1, lod, "MIWithID", "AR-1", beta_initial, 1000)
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5 
#> X 
#> DL 
#> Z 
#> K = 5
#> Selected type of time-dependent covariate
#> 3
#> Results based on the empirical MSE minimization criterion (EMMC)

Type	EMMC
1	0.0028805
2	0.0024740
3	0.0010988

4. Outcome Data: Unknown Distribution

When original or transformed data do not follow a known distribution, modeling the conditional mean of the outcome variable may not be ideal, as the estimated mean and standard deviation can be sensitive to large values. In such cases, quantile regression serves as an alternative analytical method. It does not assume an underlying distribution, offers advantages for skewed data, and is robust to outliers. Using regression models, substitution or fill-in approaches such as single and multiple value imputation techniques can be employed to impute measurements below the LOD (Chen et al., 2021).

4-1. Marginal Modeling (Quantile Regression Model)

To conduct regression analysis using the 15th simulated dataset, where the LOD is set to 2, any measurements below 2 are imputed using substitution approaches such as “LOD2”. The median or 50th quantile (\(\tau = 0.5\)) is used. In cluster studies, an “exchangeable” structure is applied to the working correlation, while an “AR-1” structure is used for longitudinal datasets where subjects are measured over time.

y=as.matrix(as.vector(t(simdata15$y)))
x1=as.matrix(as.vector(t(simdata15$x1)))
x2=as.matrix(as.vector(t(simdata15$x2)))
x=cbind(matrix(1,length(x1),1),x1,x2)

## LOD = 2 is equivalent to detection proportion = 56.3% (censoring proportion = 43.7%).
lod=2

## Median or 50th quantile is given.
tau=0.5

## Intercept is included in the "x" but not in the "typed".
## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data)
Quantile.FWZ(y, x, lod, "LOD2", tau, "exchangeable", c(1,1), simdata15)
#> Quantile level and correlation structure are provided in the column of Intercept.
#>                   Intercept     X2      X3
#> Quantile Level          0.5      0       0
#> Structure      exchangeable      0       0
#> Estimate             0.7361 1.4998  0.7495
#> Standard Error       0.1007  0.353  0.7583
#> 95% CI Lower         0.5363 0.7993 -0.7553
#> 95% CI Upper         0.9359 2.2003  2.2543
#> P-value                   0      0  0.3254

4-2. Time-Dependent Covariates

To handle time-dependent covariates, we use the 58th simulated dataset, which includes 100 subjects (sample size is 100), each with three repeated measurements. The “AR-1” working correlation structure is embedded in the function Quantile.select.FWZ() because time-dependent covariates are typically associated with longitudinal studies. A detailed description of the selection process can be found in Chen et al. (2024).

y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
x=cbind(matrix(1,length(x1),1),x1)

## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05

## 95th quantile is given.
tau=0.95

## Intercept is included in the "x".
## Quantile.select.FWZ(y, x, lod, substitue, tau, data)
Quantile.select.FWZ(y, x, lod, "LOD2", tau, simdata58)
#>                         beta.0 beta.1
#> Quantile Level          0.0000 0.9500
#> Type of Time-Dependency 0.0000 2.0000
#> Estimate                1.5715 0.5074
#> Standard Error          0.0932 0.0365
#> 95% CI Lower            1.3866 0.4350
#> 95% CI Upper            1.7564 0.5798
#> P-value                 0.0000 0.0000

The selection criterion identifies type 2 as the appropriate covariate type. The quantile regression result with the updated covariate type of time-dependency is displayed. Additionally, this selected type of time-dependent covariate (in this case, type 2) can also be updated in the “typed” parameter of the function Quantile.FWZ(), enabling analysts to conduct multivariable analysis.

y=as.matrix(as.vector(t(simdata58$y)))
x1=as.matrix(as.vector(t(simdata58$x1)))
x=cbind(matrix(1,length(x1),1),x1)

## LOD = 0.05 is equivalent to detection proportion = 50.7% (censoring proportion = 49.3%).
lod=0.05

## 95th quantile is given.
tau=0.95

## Intercept is included in the "x" but not in the "typed".
## Quantile.FWZ(y, x, lod, substitue, tau, corstr, typetd, data)
Quantile.FWZ(y, x, lod, "LOD2", tau, "AR-1", c(2), simdata58)
#> Quantile level and correlation structure are provided in the column of Intercept.
#>                Intercept     X2
#> Quantile Level      0.95      0
#> Structure           AR-1      0
#> Estimate          1.5685  0.507
#> Standard Error    0.0907 0.0363
#> 95% CI Lower      1.3885  0.435
#> 95% CI Upper      1.7485  0.579
#> P-value                0      0

5. References

Chen IC, Bertke SJ, Curwin BD. Quantile regression for exposure data with repeated measures in the presence of non-detects. J Expo Sci Environ Epidemiol. 2021; 31(6): 1057–1066.
Chen IC, Bertke SJ, Estill CF. Compare the marginal effects for environmental exposure and biomonitoring data with repeated measurements and values below the limit of detection. J Expo Sci Environ Epidemiol. 2024; 34: 1018–1027.
Chen IC, Bertke SJ, Dahm MM. Quantile regression for longitudinal data with values below the limit of detection and time-dependent covariates – application to modeling carbon nanotube and nanofiber exposures. Ann Work Expo Health. 2024; 68(8): 846–858.
Chen IC, Westgate PM. Improved methods for the marginal analysis of longitudinal data in the presence of time-dependent covariates. Statistics in Medicine. 2017; 36(16): 2533–2546.
Chen IC, Westgate PM. A novel approach to selecting classification types for time-dependent covariates in the marginal analysis of longitudinal data. Statistical Methods in Medical Research. 2019; 28(10-11): 3176–3186.
Ganser GH, Hewett P. An accurate substitution method for analyzing censored data. J Occup Environ Hyg. 2010; 7(4): 233–44.
Lai TL, Small DS. Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method-of-moments approach. Journal of the Royal Statistical Society: Series B. 2007; 69(1): 79–99.