[1] '2.1.5'
questionnaire_gen(n_obs, cat_prop = NULL, n_vars = NULL, n_X = NULL, n_W = NULL, cor_matrix = NULL,
cov_matrix = NULL, c_mean = NULL, c_sd = NULL, theta = FALSE, family = NULL, full_output = FALSE,
verbose = TRUE)
The function questionnaire_gen generates correlated continuous and ordinal data that resemble background questionnaire data. The only required argument is n_obs; all other arguments are optional. The main arguments are:
n_obs: the number of observations (e.g., test takers).
cat_prop: a list of vectors, where each vector contains the cumulative proportions for each category of a given item.
n_vars: the number of variables, including the continuous (X) and the ordinal (W) covariates as well as the latent trait (theta).
n_X: the number of continuous (X) variables.
n_W: the number of ordinal (W) variables.
cor_matrix: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between continuous and ordinal variables, and polychoric correlations between ordinal variables.
cov_matrix: a covariance matrix, formatted as cor_matrix.
The arguments c_mean and c_sd are scaling parameters for continuous variables. If the logical argument theta is TRUE, the latent trait is generated as the first continuous variable and labeled ‘theta’. If family is "gaussian", the data are generated from a multivariate normal distribution; otherwise, the data are generated from the polychoric correlation matrix.
If the logical argument full_output is TRUE, the output is a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data. The output produced with full_output = TRUE will be addressed in future tutorials.
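To make the cat_prop format concrete, the sketch below builds cumulative proportions from marginal category proportions and then shows one common way such proportions can turn a standard normal variable into ordered categories. The marginal proportions and object names are made up for illustration, and the thresholding step is the usual normal-quantile approach, assumed here rather than taken from the lsasim source.

```r
# Hypothetical marginal proportions for a 3-category item
marginal <- c(0.2, 0.6, 0.2)

# cat_prop expects cumulative proportions, so the last value is 1
cum_prop <- cumsum(marginal)

# A continuous variable is written as a single category: the vector 1
props <- list(1, c(0.25, 1), cum_prop)

# Assumed discretization: cut a standard normal variable at normal quantiles
set.seed(1)
z <- rnorm(1000)
thresholds <- qnorm(cum_prop[-length(cum_prop)])
w <- cut(z, breaks = c(-Inf, thresholds, Inf), labels = FALSE)

# Observed shares should be close to the marginal proportions
round(prop.table(table(w)), 2)
```

A list such as props can then be passed as the cat_prop argument, as the later examples do.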
We specify only n_obs = 100 and use a multivariate normal distribution. The generated data contain one continuous variable and four ordinal covariates with two, four, three, and five categories, respectively.
'data.frame': 100 obs. of 6 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num -0.6178 1.0299 -0.12 0.0624 1.4585 ...
$ q2 : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 2 4 2 2 4 2 4 1 2 4 ...
$ q4 : Factor w/ 3 levels "1","2","3": 2 1 2 2 1 1 3 3 2 2 ...
$ q5 : Factor w/ 5 levels "1","2","3","4",..: 2 1 3 2 2 1 5 4 1 3 ...
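The call behind the output above is not shown. A plausible reconstruction is sketched here, assuming the seed (4388) and the family = "gaussian" setting used by the later examples; with a different seed, the number and type of generated variables may differ.

```r
library(lsasim)

# Seed assumed from the later examples; the original call is not shown
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")
str(bg)

# Number of categories per generated variable (0 for continuous columns)
sapply(bg[-1], nlevels)
```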
In addition to n_obs = 100, we specify the logical argument theta = TRUE. An additional continuous variable is generated and labeled theta. The latent trait is always placed first among the generated variables.
'data.frame': 100 obs. of 7 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -1.611 -0.388 1.105 1.618 1.663 ...
$ q1 : num -0.8859 -0.0742 0.9164 -0.7751 -0.3396 ...
$ q2 : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 2 2 2 2 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 4 1 1 2 4 4 1 4 4 3 ...
$ q4 : Factor w/ 3 levels "1","2","3": 1 3 2 3 1 2 2 1 1 2 ...
$ q5 : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 3 1 4 2 1 1 5 ...
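The code for the output above is missing from this section. Under the assumption that it follows the same pattern as the examples further down (seed 4388, family = "gaussian"), it may have looked like the sketch below, which also checks that theta is the first generated variable.

```r
library(lsasim)

# Seed and family are assumptions borrowed from the later examples
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, theta = TRUE, family = "gaussian")
str(bg)

# theta should directly follow the subject identifier
names(bg)[1:2]
```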
We specify n_vars = 4 without fixing the item types. Four different item types are generated: one 1-category item (continuous), one 2-category item, one 4-category item, and one 5-category item.
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num 0.146 0.83 1.137 0.271 1.115 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 5 1 5 1 4 5 5 3 1 5 ...
$ q3 : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 2 1 1 1 ...
$ q4 : Factor w/ 4 levels "1","2","3","4": 4 4 3 4 2 4 4 4 4 1 ...
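No call is shown for the n_vars = 4 output above. A sketch of what it might have been, assuming the seed and family setting of the other examples in this section:

```r
library(lsasim)

# Seed and family are assumptions; only n_vars = 4 is stated in the text
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, family = "gaussian")
str(bg)

# Four generated variables plus the subject identifier
ncol(bg)
```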
In addition to n_vars = 4, we specify the logical argument theta = TRUE. Three different item types are generated: two 1-category items (the latent trait and one continuous variable), one 2-category item, and one 5-category item. Note that when theta = TRUE, the first continuous variable generated is always labeled theta.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.666 -0.937 -2.229 0.931 -1.438 ...
$ q1 : num -0.353 1.405 1.17 -0.91 0.352 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 4 1 4 4 4 2 5 2 5 5 ...
$ q3 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...
We generate one latent trait and three continuous variables by specifying theta = TRUE and n_X = 3. We also set n_W = 0; otherwise, a random number of ordinal variables is generated, as in the second call below.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.763 -0.822 -0.404 -1.955 0.981 ...
$ q1 : num 0.444 -0.513 2.046 1.441 -0.733 ...
$ q2 : num 0.0349 0.7822 -0.1954 0.9954 -0.203 ...
$ q3 : num -0.3048 -0.3757 1.8951 1.1954 0.0676 ...
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 10 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num 0.2258 -0.1851 -0.0877 0.0436 0.05 ...
$ q1 : num -0.609 -0.356 0.308 -1.88 -1.009 ...
$ q2 : num 0.954 0.161 1.266 -1.268 0.797 ...
$ q3 : num 0.444 0.229 -0.285 -0.659 1.169 ...
$ q4 : Factor w/ 2 levels "1","2": 1 1 1 2 1 2 1 2 2 1 ...
$ q5 : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 2 4 1 3 2 1 ...
$ q6 : Factor w/ 3 levels "1","2","3": 2 2 2 1 2 3 2 1 1 2 ...
$ q7 : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 5 2 2 3 5 5 3 ...
$ q8 : Factor w/ 4 levels "1","2","3","4": 4 3 3 2 3 1 4 3 3 4 ...
We can also specify cat_prop = list(1, 1, 1, 1) to generate one latent trait and three continuous covariates. The length of cat_prop corresponds to the number of generated variables (including the latent trait and the continuous variables in this case).
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1), theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.763 -0.822 -0.404 -1.955 0.981 ...
$ q1 : num 0.444 -0.513 2.046 1.441 -0.733 ...
$ q2 : num 0.0349 0.7822 -0.1954 0.9954 -0.203 ...
$ q3 : num -0.3048 -0.3757 1.8951 1.1954 0.0676 ...
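Since a one-element vector in cat_prop marks a continuous variable, the makeup of a cat_prop list can be checked before any data are generated. The sketch below uses only base R; the helper name describe_cat_prop is hypothetical and not part of lsasim.

```r
# Hypothetical helper: number of categories implied by each cat_prop element
describe_cat_prop <- function(cat_prop) {
  n_cat <- lengths(cat_prop)
  data.frame(
    variable = seq_along(cat_prop),
    n_categories = n_cat,
    type = ifelse(n_cat == 1, "continuous", "ordinal"),
    stringsAsFactors = FALSE
  )
}

# Four single-category elements: all variables are continuous
describe_cat_prop(list(1, 1, 1, 1))

# One continuous, one binary, and one three-category variable
describe_cat_prop(list(1, c(0.25, 1), c(0.2, 0.8, 1)))
```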
We generate two ordinal variables without fixing the number of categories. In this run, one 2-category variable and one 5-category variable are generated.
'data.frame': 100 obs. of 3 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 1 2 2 1 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 1 4 5 3 5 4 2 1 1 1 ...
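The call for the two ordinal variables above is not shown. One way to request them, assuming the same seed and family as the neighboring examples, is to set n_X = 0 and give n_W as a count rather than a list:

```r
library(lsasim)

# Seed, family, and the exact arguments are assumptions; the original call is missing
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = 2, family = "gaussian")
str(bg)

# Both generated variables should be ordinal (factors)
sapply(bg[-1], is.factor)
```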
We generate one binary variable and three four-category variables by passing the number of categories of each ordinal variable as a list to n_W.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = list(2, 4, 4, 4), family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 1 1 1 2 ...
$ q2 : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 4 4 1 4 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 3 2 3 1 1 1 3 4 1 1 ...
$ q4 : Factor w/ 4 levels "1","2","3","4": 2 1 1 2 4 4 4 4 3 4 ...
We generate five variables: one latent trait, two continuous covariates, and two binary covariates. The latent trait is scaled to a mean of 500 and a standard deviation of 100.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE,
                        c_mean = c(500, 0, 0), c_sd = c(100, 1, 1), family = "gaussian")
str(bg)
'data.frame': 100 obs. of 6 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num 515 612 578 476 437 ...
$ q1 : num 0.0731 -0.8194 -0.8648 -0.1415 0.7484 ...
$ q2 : num -0.0166 1.4975 0.596 0.4905 0.482 ...
$ q3 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 2 1 2 ...
$ q4 : Factor w/ 2 levels "1","2": 2 2 1 1 2 1 1 1 2 1 ...
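The effect of c_mean and c_sd can be pictured as a linear rescaling of a standardized variable. The base R sketch below illustrates that idea for the latent trait; it is an assumption about the general approach, not a copy of how lsasim implements it.

```r
set.seed(1)
z <- rnorm(100)                # standardized latent trait

# Rescale to the reporting metric: mean 500, standard deviation 100
theta <- 500 + 100 * z

# Standardizing again recovers z, so the ranking of test takers is unchanged
all.equal((theta - 500) / 100, z)

# Sample mean and standard deviation on the new scale
c(mean = mean(theta), sd = sd(theta))
```

In a finite sample the mean and standard deviation will be near, but not exactly, 500 and 100.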
We generate one continuous and two ordinal covariates, and specify the covariance matrix among the continuous and ordinal variables. Specifying c_mean = 2 scales the continuous covariate to a mean of 2. When cov_matrix is provided, c_sd is ignored.
set.seed(4388)
props <- list(1, c(0.25, 1), c(0.2, 0.8, 1))
yw_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.8, 0.5, 0.8, 1), nrow = 3)
bg <- questionnaire_gen(n_obs = 100, cat_prop = props, cov_matrix = yw_cov, c_mean = 2, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 4 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num 1.878 3.746 2.938 2.386 0.768 ...
$ q2 : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 2 2 1 1 ...
$ q3 : Factor w/ 3 levels "1","2","3": 1 2 2 2 2 3 1 3 2 2 ...
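Before passing a matrix as cov_matrix, it can help to confirm that it is a valid covariance matrix and to see which standard deviations and correlations it implies. The checks below use only base R on the yw_cov matrix from the example.

```r
yw_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.8, 0.5, 0.8, 1), nrow = 3)

# A covariance matrix must be symmetric and positive definite
isSymmetric(yw_cov)
all(eigen(yw_cov, symmetric = TRUE)$values > 0)

# Standard deviations come from the diagonal, which is presumably why c_sd is ignored
sqrt(diag(yw_cov))

# Implied correlation matrix; equal to yw_cov here because all variances are 1
cov2cor(yw_cov)
```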