Simulation studies are often used in forensic genetics to study the behaviour of probabilistic genotyping software. In many cases, researchers use simplified models that ignore complexities such as peak height variability and stutters. The goal of this package is to make it straightforward to use probabilistic genotyping models in simulation studies.
For a general introduction to forensic genetics and probabilistic genotyping, see (Buckleton, Bright, and Taylor 2018). See also (Gill et al. 2021) for a recent review of probabilistic genotyping systems.
Below, we load allele frequencies provided with the package and demonstrate how mixtures can be sampled using a log normal model for peak heights (refer to (Taylor, Bright, and Buckleton 2013) for details on the model). The template parameters for each contributor are picked uniformly between 50 and 10,000. The degradation parameter is picked from a gamma distribution with shape 2.5 and scale 0.001.
<- read_allele_freqs(system.file("extdata","FBI_extended_Cauc.csv",
freqs package = "simDNAmixtures"))
data(gf)
<- list(min_template = 50., max_template = 10000.,
sampling_parameters degradation_shape = 2.5, degradation_scale = 1e-3)
After the setup, a single function call is sufficient to generate a set of samples. We set n=2
to generate two samples. The contributors to the sample are named U1
and U2
which means that each mixed sample has two unrelated contributors.
set.seed(1)
<- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
mixtures sampling_parameters = sampling_parameters,
model_settings = gf$log_normal_bwfw_settings,
sample_model = sample_log_normal_model)
The mixtures
object contains the simulation results. The sample_mixtures
function has an optional results_directory
argument. If the path to a directory is provided, then all simulation results will be persisted on disk in file formats that are readily imported in probabilistic genotyping software. We inspect the parameter_summary
property of the result of sample_mixtures
.
::kable(mixtures$parameter_summary[1:5]) knitr
SampleName | template1 | template2 | degradation1 | degradation2 |
---|---|---|---|---|
sample_0001_N2_9584_4870_bc | 9584.164 | 4869.979 | 0.0032058 | 0.004264 |
sample_0002_N2_6366_5585_ba | 6366.157 | 5584.630 | 0.0031578 | 0.001269 |
The first sample is in approximately a 2:1 ratio. We print the first 10 peaks of the profile below.
::kable(head(mixtures$samples[[1]]$mixture, 10)) knitr
Locus | Allele | Height | Size |
---|---|---|---|
D3S1358 | 13 | 216 | 113.25 |
D3S1358 | 14 | 4760 | 117.33 |
D3S1358 | 15 | 18947 | 121.40 |
D3S1358 | 16 | 107 | 125.48 |
vWA | 14 | 218 | 168.84 |
vWA | 15 | 4246 | 172.87 |
vWA | 16 | 3629 | 176.91 |
vWA | 17 | 6545 | 180.95 |
vWA | 18 | 6905 | 184.99 |
vWA | 19 | 97 | 189.02 |
We repeat the example above with a gamma distribution for peak heights instead of a log normal one. The mean peak height parameter mu
is chosen uniformly between 50 and 5,000. The variance parameter cv
parameter (\(\sigma\) in (Bleka, Storvik, and Gill 2016)) is chosen uniformly between 0.05 and 0.35. The degradation parameter is sampled from a Beta distribution with parameters 10 and 1, which means most profile will be only mildly degraded.
<- list(min_mu = 50., max_mu = 5e3,
gamma_sampling_parameters min_cv = 0.05, max_cv = 0.35,
degradation_shape1 = 10, degradation_shape2 = 1)
We now invoke the sample_mixtures
function again to sample two mixtures; both consisting of two unrelated contributors. We do not sample stutters.
set.seed(2)
<- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
mixtures sampling_parameters = gamma_sampling_parameters,
model_settings = gf$gamma_settings_no_stutter,
sample_model = sample_gamma_model)
We see that the first sample has \(\mu\) around 1,236 and is in a 65:35 ratio.
::kable(mixtures$parameter_summary[1:4]) knitr
SampleName | mixture_proportions1 | mixture_proportions2 | mu |
---|---|---|---|
sample_0001_N2_1236_65_35_aa | 0.6512712 | 0.3487288 | 1235.641 |
sample_0002_N2_2906_82_18_aa | 0.8158866 | 0.1841134 | 2905.758 |
The coefficient of variation for a full heterozygote is about 9% for the first sample.
::kable(mixtures$parameter_summary[c(1,5:7)]) knitr
SampleName | cv | degradation_beta1 | degradation_beta2 |
---|---|---|---|
sample_0001_N2_1236_65_35_aa | 0.0913336 | 0.8311816 | 0.8311816 |
sample_0002_N2_2906_82_18_aa | 0.3151883 | 0.8186755 | 0.8186755 |
We print the first 10 peaks of the profile below.
::kable(head(mixtures$samples[[1]]$mixture, 10)) knitr
Locus | Allele | Height | Size |
---|---|---|---|
D3S1358 | 15 | 1394 | 121.40 |
D3S1358 | 16 | 358 | 125.48 |
D3S1358 | 18 | 535 | 133.64 |
vWA | 14 | 928 | 168.84 |
vWA | 16 | 1077 | 176.91 |
vWA | 17 | 334 | 180.95 |
D16S539 | 11 | 1607 | 251.67 |
D16S539 | 12 | 341 | 255.70 |
CSF1PO | 10 | 920 | 298.34 |
CSF1PO | 12 | 456 | 306.26 |