Ex. 1 - Background questionnaire generation

Yuan-Ling Liaw and Waldir Leoncio

library(lsasim)

packageVersion("lsasim")

[1] '2.1.6'

questionnaire_gen(n_obs, cat_prop = NULL, n_vars = NULL, n_X = NULL, n_W = NULL, cor_matrix = NULL,
    cov_matrix = NULL, c_mean = NULL, c_sd = NULL, theta = FALSE, family = NULL, full_output = FALSE,
    verbose = TRUE)

The function questionnaire_gen generates correlated continuous and ordinal data that resemble background questionnaire data. The only required argument is n_obs; all other arguments are optional. The main arguments are:

n_obs: the number of observations (e.g., test takers).
cat_prop: a list of vectors where each vector contains the cumulative proportions for each category of a given item.
n_vars: the number of variables, including the continuous (X) and the ordinal (W) covariates as well as the latent trait (theta).
n_X: the number of continuous (X) variables.
n_W: the number of ordinal (W) variables.
cor_matrix: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between continuous and ordinal variables, and polychoric correlations between ordinal variables.
cov_matrix: a covariance matrix, formatted like cor_matrix.
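
Because cat_prop holds cumulative rather than marginal proportions, it can help to convert between the two when reading or writing such a list. The following is a minimal base R sketch; the object names props, marginal, and n_cats are illustrative and not part of lsasim.

```r
# Illustrative cumulative proportions for three items (hypothetical values)
props <- list(
  1,               # a single proportion of 1: one category only
  c(0.25, 1),      # binary item
  c(0.2, 0.8, 1)   # three-category item
)

# Marginal proportion of each category = successive differences
marginal <- lapply(props, function(p) diff(c(0, p)))

# Number of categories implied by each vector
n_cats <- lengths(props)

marginal
n_cats
```

Each vector must end in 1, since the last cumulative proportion covers all categories.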

The arguments c_mean and c_sd are scaling parameters (means and standard deviations) for the continuous variables. If the logical argument theta is TRUE, the latent trait is generated as the first continuous variable and labeled ‘theta’. If family is "gaussian", the data are generated from a multivariate normal distribution; otherwise, they are generated from the polychoric correlation matrix.

If the logical argument full_output is TRUE, the output is a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data. The output produced with full_output = TRUE will be addressed in future tutorials.

Here we specify only n_obs = 100 and use a multivariate normal distribution. With this seed, the generated data contain one continuous variable and four ordinal covariates with 2, 4, 3, and 5 categories, respectively.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  6 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ q1     : num  -0.6178 1.0299 -0.12 0.0624 1.4585 ...
 $ q2     : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
 $ q3     : Factor w/ 4 levels "1","2","3","4": 2 4 2 2 4 2 4 1 2 4 ...
 $ q4     : Factor w/ 3 levels "1","2","3": 2 1 2 2 1 1 3 3 2 2 ...
 $ q5     : Factor w/ 5 levels "1","2","3","4",..: 2 1 3 2 2 1 5 4 1 3 ...
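
As a complement to str(), the number of categories and the observed category proportions of the ordinal items can be tabulated directly. This is a small sketch using only base R on top of the call shown above; is_ord, n_levels, and obs_props are illustrative names.

```r
library(lsasim)

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")

# Ordinal items are returned as factors; continuous ones as numeric
is_ord <- vapply(bg, is.factor, logical(1))

# Number of categories per ordinal item
n_levels <- vapply(bg[is_ord], nlevels, integer(1))
n_levels

# Observed proportion of each category, item by item
obs_props <- lapply(bg[is_ord], function(x) prop.table(table(x)))
obs_props
```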

In addition to n_obs = 100, we set the logical argument theta = TRUE. An additional continuous variable is generated and labeled theta. The latent trait is always placed first among the generated variables, immediately after subject.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, theta = TRUE, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  7 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  -1.611 -0.388 1.105 1.618 1.663 ...
 $ q1     : num  -0.8859 -0.0742 0.9164 -0.7751 -0.3396 ...
 $ q2     : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 2 2 2 2 ...
 $ q3     : Factor w/ 4 levels "1","2","3","4": 4 1 1 2 4 4 1 4 4 3 ...
 $ q4     : Factor w/ 3 levels "1","2","3": 1 3 2 3 1 2 2 1 1 2 ...
 $ q5     : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 3 1 4 2 1 1 5 ...

We specify n_vars = 4 without fixing the item types. Four variables of different types are generated: one continuous (1-category) item, one 5-category item, one 2-category item, and one 4-category item.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  5 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ q1     : num  0.146 0.83 1.137 0.271 1.115 ...
 $ q2     : Factor w/ 5 levels "1","2","3","4",..: 5 1 5 1 4 5 5 3 1 5 ...
 $ q3     : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 2 1 1 1 ...
 $ q4     : Factor w/ 4 levels "1","2","3","4": 4 4 3 4 2 4 4 4 4 1 ...

In addition to n_vars = 4, we set the logical argument theta = TRUE. Three item types are generated: two continuous (1-category) items, namely the latent trait and one covariate, one 5-category item, and one 2-category item. Note that when theta = TRUE, the first continuous variable generated is always labeled theta.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, theta = TRUE, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  5 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  -0.666 -0.937 -2.229 0.931 -1.438 ...
 $ q1     : num  -0.353 1.405 1.17 -0.91 0.352 ...
 $ q2     : Factor w/ 5 levels "1","2","3","4",..: 4 1 4 4 4 2 5 2 5 5 ...
 $ q3     : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...

We generate one latent trait and three continuous variables by specifying theta = TRUE and n_X = 3. We also set n_W = 0; otherwise, a random number of ordinal variables is generated, as the second call below (without n_W) shows.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  5 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  -0.763 -0.822 -0.404 -1.955 0.981 ...
 $ q1     : num  0.444 -0.513 2.046 1.441 -0.733 ...
 $ q2     : num  0.0349 0.7822 -0.1954 0.9954 -0.203 ...
 $ q3     : num  -0.3048 -0.3757 1.8951 1.1954 0.0676 ...

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  10 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  0.2258 -0.1851 -0.0877 0.0436 0.05 ...
 $ q1     : num  -0.609 -0.356 0.308 -1.88 -1.009 ...
 $ q2     : num  0.954 0.161 1.266 -1.268 0.797 ...
 $ q3     : num  0.444 0.229 -0.285 -0.659 1.169 ...
 $ q4     : Factor w/ 2 levels "1","2": 1 1 1 2 1 2 1 2 2 1 ...
 $ q5     : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 2 4 1 3 2 1 ...
 $ q6     : Factor w/ 3 levels "1","2","3": 2 2 2 1 2 3 2 1 1 2 ...
 $ q7     : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 5 2 2 3 5 5 3 ...
 $ q8     : Factor w/ 4 levels "1","2","3","4": 4 3 3 2 3 1 4 3 3 4 ...
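
To see that the number of ordinal variables is indeed drawn at random when n_W is left unspecified, one can repeat the call under several seeds and count the factor columns. The helper count_ordinal and the choice of seeds are hypothetical and serve only as an illustration; verbose = FALSE is used to suppress messages.

```r
library(lsasim)

# Hypothetical helper: number of ordinal (factor) columns for a given seed
count_ordinal <- function(seed) {
  set.seed(seed)
  bg <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE,
                          family = "gaussian", verbose = FALSE)
  sum(vapply(bg, is.factor, logical(1)))
}

seeds <- c(1, 2, 3, 4388)
counts <- vapply(seeds, count_ordinal, integer(1))
data.frame(seed = seeds, n_ordinal = counts)
```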

We can also specify cat_prop = list(1, 1, 1, 1) to generate one latent trait and three continuous covariates; a single cumulative proportion of 1 marks a variable as continuous. The length of cat_prop corresponds to the number of generated variables (including the latent trait and the continuous variables in this case).

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1), theta = TRUE, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  5 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  -0.763 -0.822 -0.404 -1.955 0.981 ...
 $ q1     : num  0.444 -0.513 2.046 1.441 -0.733 ...
 $ q2     : num  0.0349 0.7822 -0.1954 0.9954 -0.203 ...
 $ q3     : num  -0.3048 -0.3757 1.8951 1.1954 0.0676 ...
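
The values printed here coincide with those of the earlier call that used n_X = 3 and n_W = 0, which suggests that the two specifications are interchangeable. A quick way to check this, sketched below with the illustrative names bg_counts and bg_props, is to generate both data sets under the same seed and compare them.

```r
library(lsasim)

# Specification via counts of continuous and ordinal variables
set.seed(4388)
bg_counts <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE,
                               family = "gaussian")

# Specification via cumulative proportions
set.seed(4388)
bg_props <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1),
                              theta = TRUE, family = "gaussian")

# TRUE if the two data frames agree; otherwise a description of the differences
all.equal(bg_counts, bg_props)
```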

We generate two ordinal variables without fixing the number of categories. With this seed, one 2-category variable and one 5-category variable are generated.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = 2, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  3 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ q1     : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 1 2 2 1 ...
 $ q2     : Factor w/ 5 levels "1","2","3","4",..: 1 4 5 3 5 4 2 1 1 1 ...

We generate one binary variable and three four-category variables by passing the number of categories of each ordinal variable to n_W as a list.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = list(2, 4, 4, 4), family = "gaussian")
str(bg)

'data.frame':   100 obs. of  5 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ q1     : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 1 1 1 2 ...
 $ q2     : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 4 4 1 4 ...
 $ q3     : Factor w/ 4 levels "1","2","3","4": 3 2 3 1 1 1 3 4 1 1 ...
 $ q4     : Factor w/ 4 levels "1","2","3","4": 2 1 1 2 4 4 4 4 3 4 ...

We generate five variables: one latent trait, two continuous covariates, and two binary covariates. The latent trait is scaled to a mean of 500 and a standard deviation of 100, while the two continuous covariates keep a mean of 0 and a standard deviation of 1.

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE,
    c_mean = c(500, 0, 0), c_sd = c(100, 1, 1), family = "gaussian")
str(bg)

'data.frame':   100 obs. of  6 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ theta  : num  515 612 578 476 437 ...
 $ q1     : num  0.0731 -0.8194 -0.8648 -0.1415 0.7484 ...
 $ q2     : num  -0.0166 1.4975 0.596 0.4905 0.482 ...
 $ q3     : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 2 1 2 ...
 $ q4     : Factor w/ 2 levels "1","2": 2 2 1 1 2 1 1 1 2 1 ...
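
One way to confirm the effect of c_mean and c_sd is to compare the sample means and standard deviations of the continuous columns with the requested values. The sketch below repeats the call above; num_cols and scaling_check are illustrative names. With only 100 observations, the sample statistics should be close to, but not exactly equal to, the targets.

```r
library(lsasim)

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE,
                        c_mean = c(500, 0, 0), c_sd = c(100, 1, 1),
                        family = "gaussian")

# Continuous columns are numeric; drop the subject identifier
is_num <- vapply(bg, is.numeric, logical(1))
num_cols <- setdiff(names(bg)[is_num], "subject")

# Sample mean and standard deviation of each continuous variable
scaling_check <- sapply(bg[num_cols], function(x) c(mean = mean(x), sd = sd(x)))
round(scaling_check, 2)
```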

We generate one continuous and two ordinal covariates, and we specify the covariance matrix among them. Setting c_mean = 2 shifts the continuous covariate so that its mean is 2. When cov_matrix is provided, c_sd is ignored.

set.seed(4388)
props <- list(1, c(0.25, 1), c(0.2, 0.8, 1))
yw_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.8, 0.5, 0.8, 1), nrow = 3)
bg <- questionnaire_gen(n_obs = 100, cat_prop = props, cov_matrix = yw_cov, c_mean = 2, family = "gaussian")
str(bg)

'data.frame':   100 obs. of  4 variables:
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
 $ q1     : num  1.878 3.746 2.938 2.386 0.768 ...
 $ q2     : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 2 2 1 1 ...
 $ q3     : Factor w/ 3 levels "1","2","3": 1 2 2 2 2 3 1 3 2 2 ...
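
Before passing a hand-written covariance matrix such as yw_cov, it may be worth verifying that it is symmetric and positive definite, since a matrix that fails either check is not a valid covariance matrix. The following base R sketch performs both checks; eig is an illustrative name. As all diagonal entries of yw_cov equal 1, this covariance matrix is also a correlation matrix.

```r
# Same matrix as above, laid out row by row for readability
yw_cov <- matrix(c(1,   0.5, 0.5,
                   0.5, 1,   0.8,
                   0.5, 0.8, 1), nrow = 3)

# A covariance matrix must be symmetric ...
isSymmetric(yw_cov)

# ... and positive definite, i.e. all eigenvalues strictly positive
eig <- eigen(yw_cov, symmetric = TRUE, only.values = TRUE)$values
all(eig > 0)

# With unit variances, converting to correlations leaves the matrix unchanged
all.equal(cov2cor(yw_cov), yw_cov)
```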