library(bstrl)
We will be using the included data, geco_small, a small
simulated dataset with 7 files, 14 fields per file, known true
identities, and 3 errors per duplicated record. This variable contains a
list of data frames, each of which represents a file with 10 records.
length(geco_small)
#> [1] 7
head(geco_small[[1]])
#>           rec.id entity given.name    surname age occup extra1 extra2 extra3
#> 3   rec-0252-org    252   harrison  widdowson   a     4      6      7      5
#> 7   rec-0534-org    534     taylah     deakin   f     2     12      4      8
#> 10  rec-0605-org    605        jai      green   k     2      2     10      4
#> 73  rec-4150-org   4150    spencer      blake   g     5      4     12     11
#> 107 rec-5641-org   5641    makenzi   chandler   j     2      6      4      4
#> 112 rec-5932-org   5932      blake gilbertson   g     1     12     11      9
#>     extra4 extra5 extra6 extra7 extra8 extra9 extra10
#> 3        1      8      3      3      1      1       8
#> 7        7      3      5      9      2      2       6
#> 10       5     10      2      6      1      5       4
#> 73      10      5      6      7      5      1       5
#> 107      8     11      7     11      7     10       6
#> 112     12      4      4      7      8      7       3
A larger version of this dataset is also included as geco_30over_3err.
To perform linkage, we first have to define the fields we will be using and their data types. For this example we will only use a subset of the available fields.
# Names of the columns on which to perform linkage
fieldnames <- c("given.name", "surname", "age", "occup",
                "extra1", "extra2", "extra3", "extra4", "extra5", "extra6")
# How to compare each of the fields
types <- c("lv", "lv", # First name and last name use normalized edit distance
           "bi", "bi", "bi", "bi", "bi", "bi", "bi", "bi") # All others binary equal/unequal
# Break continuous difference measures into 4 levels using these split points
breaks <- c(0, 0.25, 0.5)
We will link using the fields given.name, surname, age, occup, and
extra1 through extra6. Given name and surname are compared using
normalized Levenshtein edit distance, with breaks at 0, 0.25, and 0.5.
In other words, exact equality is given its own level, then there are
three levels for increasing degrees of inequality. All other fields are
compared in a binary fashion.
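The discretization described above can be sketched in base R. This is only an illustration, not the code that bstrl or BRL runs internally: the helper names norm_lev and to_level are hypothetical, the distance is assumed to be normalized by the length of the longer string, and the intervals are assumed to be closed on the right.

```r
# Hypothetical helper: Levenshtein distance scaled to [0, 1] by the longer string.
norm_lev <- function(a, b) {
  utils::adist(a, b)[1, 1] / max(nchar(a), nchar(b))
}

breaks <- c(0, 0.25, 0.5)

# Assumption: intervals are closed on the right, so level 1 means exact equality.
to_level <- function(d) as.integer(cut(d, breaks = c(-Inf, breaks, Inf)))

d <- c(norm_lev("taylah", "taylah"),     # identical strings, distance 0
       norm_lev("harrison", "harison"),  # 1 edit over 8 characters
       norm_lev("jai", "kai"),           # 1 edit over 3 characters
       norm_lev("blake", "green"))       # very different strings
lv_levels <- to_level(d)
lv_levels
```

Under these assumptions the four pairs fall into levels 1 through 4 respectively, one for each interval defined by the three breaks.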
To perform streaming record linkage, we need an initial two-file
linkage to use as a base for streaming updates. This is accomplished
using the bipartiteRL function, which outsources the task to the BRL
package.
res.twofile <- bipartiteRL(geco_small[[1]], geco_small[[2]],
                           flds = fieldnames, types = types, breaks = breaks,
                           nIter = 600, burn = 100,
                           seed = 0)
It is important to pass the comparison details we created earlier to
this function as shown. These will be stored with the result and used in
future streaming updates. Further information on function parameters can
be found in the documentation by running help(bipartiteRL).
To perform PPRB updates, pass an existing result object and a new
data frame to the function PPRBupdate, along with parameters that
define how the sampling should proceed.
res.pprb3 <- PPRBupdate(res.twofile, geco_small[[3]], # Comparison details are stored with previous result
                        nIter = 600, burn = 100,
                        seed = 0,
                        refresh = 0.05)
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]
res.pprb4 <- PPRBupdate(res.pprb3, geco_small[[4]], # Comparison details are stored with previous result
                        nIter = 600, burn = 100,
                        seed = 0,
                        refresh = 0.05)
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]
Further information on function parameters can be found in the
documentation by running help(PPRBupdate).
To perform SMCMC updates, pass an existing result object and a new
data frame to the function SMCMCupdate, along with parameters that
define how the sampling should proceed.
First, SMCMC can work with a smaller number of iterations than PPRB.
To filter an existing result object by thinning it to every \(n^{th}\)
posterior sample, run the thinsamples function. SMCMC produces
independent samples from the posterior, so filter to the number of
independent samples that you would want from the posterior distribution
for estimating your parameters or quantities of interest.
filtered <- thinsamples(res.twofile, 50) # Don't need 500, 50 for demonstration
The filtered sample pool can be passed to a streaming update in the same way as any result object.
res.smcmc3 <- SMCMCupdate(filtered, geco_small[[3]],
                          nIter.jumping = 2, nIter.transition = 8,
                          # Either can be "LB", but increase the corresponding number of iterations
                          proposals.jumping = "component", proposals.transition = "component",
                          cores = 2) # Parallel execution
#> Random seed not set in parallel execution.
res.smcmc4 <- SMCMCupdate(res.smcmc3, geco_small[[4]],
                          nIter.jumping = 2, nIter.transition = 8,
                          # Either can be "LB", but increase the corresponding number of iterations
                          proposals.jumping = "component", proposals.transition = "component",
                          cores = 2) # Parallel execution
#> Random seed not set in parallel execution.
SMCMC uses two MCMC kernels, a jumping kernel and a transition kernel, which are performed on each member of the ensemble in parallel. The jumping kernel is used to initialize the links to the latest file, and the transition kernel is used to simultaneously update all parameters.
This results in two differences to the function’s parameters. Instead
of a single nIter parameter, SMCMCupdate has two: nIter.jumping and
nIter.transition. The function also has two parameters,
proposals.jumping and proposals.transition, which define the update
method for the link parameters. nIter.jumping can be smaller than
nIter.transition, since the transition kernel will continue to update
links to the latest file. Both values must be larger if either proposal
is set to “LB”, because locally balanced proposals mix more slowly than
component-wise proposals.
A result object is a list containing posterior samples of each parameter, along with supporting details.
names(res.pprb4)
#> [1] "Z" "m" "u" "files" "comparisons"
#> [6] "priors" "cmpdetails" "m.fc.pars" "u.fc.pars" "diagnostics"
The values of Z, m, and u are the model parameters. They are stored as
matrices where each column is a posterior sample and each row is a
component of the vector-valued parameters. The value of diagnostics is
used internally by some functions, and all other values store details
necessary to perform streaming updates when further files arrive.
# All have 500 columns (one per posterior sample); the number of rows depends on the parameter.
dim(res.pprb4$Z)
#> [1] 30 500
dim(res.pprb4$m)
#> [1] 24 500
dim(res.pprb4$u)
#> [1] 24 500
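The row counts can be reproduced with a little arithmetic. The sketch below is an interpretation of the output above rather than something taken from the package documentation: it assumes that each row of m and u corresponds to one agreement level of one field, and that each row of Z corresponds to one record in the files after the first.

```r
# Assumption: rows of m and u enumerate the agreement levels of every field.
types  <- c("lv", "lv", rep("bi", 8))  # comparison types defined earlier
breaks <- c(0, 0.25, 0.5)              # split points for the "lv" fields

# "lv" fields get length(breaks) + 1 levels; "bi" fields get 2 levels.
levels_per_field <- ifelse(types == "lv", length(breaks) + 1, 2)
n_levels <- sum(levels_per_field)      # 2 * 4 + 8 * 2

# Assumption: rows of Z enumerate the records in every file after the first.
file_sizes <- rep(10, 4)               # four files of 10 records linked so far
n_links <- sum(file_sizes[-1])         # 3 * 10

c(Z = n_links, m = n_levels, u = n_levels)
```

Under these assumptions the counts come out to 30 for Z and 24 for m and u, matching the dimensions printed above.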
The Z parameter contains posterior samples of links between records, so
it is the main parameter of interest in record linkage applications.
# The first post-burn MCMC sample of Z
Zexample <- res.pprb4$Z[,1]
Zexample
#>  [1]  3 12 13 14 15 16  7  9 10 20  1 22 13 24 14 26  6  8 29 20 31 32 33 34  5
#> [26] 36 18 38 39 30
The parameter \(Z\) contains one value per record, starting with the records in file 2. The value at an index gives the record to which the corresponding record is linked. For example,
Zexample[8]
#> [1] 9
indicates that the \(8^{th}\) record in file 2 is linked to the \(9^{th}\) record in file 1. These links can also lead to further links, such as
Zexample[27]
#> [1] 18
Zexample[8]
#> [1] 9
collectively defining a cluster of the \(7^{th}\) record in file 4, the \(8^{th}\) record in file 2 and the \(9^{th}\) record in file 1.
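The chain of links described above can be traced by hand with a short base R sketch. The helpers to_file_record and follow_links are hypothetical and not part of bstrl. The sketch assumes that records are numbered consecutively across files (10 per file) and that a record whose value in \(Z\) equals its own global index is unlinked, which is inferred from the printed sample rather than from the package documentation.

```r
# First posterior sample of Z, copied from the printed output above.
Zexample <- c(3, 12, 13, 14, 15, 16, 7, 9, 10, 20, 1, 22, 13, 24, 14,
              26, 6, 8, 29, 20, 31, 32, 33, 34, 5, 36, 18, 38, 39, 30)
file_sizes <- rep(10, 4)                 # four files of 10 records each
offsets <- c(0, cumsum(file_sizes))      # 0, 10, 20, 30, 40

# Hypothetical helper: convert a global record index to (file, record).
to_file_record <- function(g) {
  f <- findInterval(g - 1, offsets)
  c(file = f, record = g - offsets[f])
}

# Hypothetical helper: follow links from a global index back toward file 1.
# Assumption: a record whose Z value equals its own global index is unlinked.
follow_links <- function(Z, start, n1) {
  chain <- start
  g <- start
  while (g > n1 && Z[g - n1] != g) {
    g <- Z[g - n1]
    chain <- c(chain, g)
  }
  chain
}

# Record 7 of file 4 has global index 30 + 7 = 37.
chain <- follow_links(Zexample, start = 37, n1 = file_sizes[1])
t(sapply(chain, to_file_record))
```

Starting from global index 37, the chain visits indices 37, 18, and 9, that is, record 7 in file 4, record 8 in file 2, and record 9 in file 1.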
This is unwieldy for an increasing number of files or for larger files, so functions to process these links are provided.
# Create a list of length 500, with one streaming link object
# per posterior sample.
samples <- extractlinks(res.pprb4)
# Are record 9 in file 1 and record 7 in file 4 linked in the first posterior sample?
islinked(samples[[1]], file1=1, record1=9, file2=4, record2=7)
#> [1] TRUE
# In what proportion of posterior samples are record 9 in file 1 and record 7 in file 4 linked?
mean(sapply(samples, islinked, file1=1, record1=9, file2=4, record2=7))
#> [1] 1
# In what proportion of posterior samples are record 8 in file 1 and record 1 in file 2 linked?
mean(sapply(samples, islinked, file1=1, record1=8, file2=2, record2=1))
#> [1] 0
Locally balanced proposals are an alternative way for link vectors to be sampled by an MCMC sampler. Component-wise proposals draw values from the full conditional distribution of each component of a link vector, essentially updating the link of each record in each file sequentially. Locally balanced proposals instead perform an add, delete, or swap operation based on the target posterior probability. They move through the space of possible links between records more intelligently, at the cost of slower mixing (because fewer links can be updated in any one iteration). Locally balanced proposals have another advantage in that they can be blocked. Blocking allows a locally balanced proposal to consider only a small subset of records, reducing the time required at the cost of even slower mixing.
In streaming update functions such as SMCMCupdate, the
proposals.jumping and proposals.transition parameters can instruct the
sampler to use locally balanced proposals, and the blocksize parameter
can be used to enable blocking. Similar options are also available in
the multifileRL function.
Streaming updates can be mixed and matched. For example, file 3 can be incorporated using a PPRB update and file 4 can be incorporated using an SMCMC update.
The number of MCMC iterations does not need to be the same in subsequent streaming updates. This is especially useful when alternating between SMCMC updates (which can function well with small ensembles) and PPRB updates (which can quickly produce large numbers of samples).
As a filtering method, PPRB reduces the number of distinct values within its sample pool after repeated use. Eventually, the pool can degrade so much that estimates of quantities of interest become inaccurate. To counteract this, SMCMC updates can occasionally be used to refresh the diversity of available samples.
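The loss of diversity can be illustrated with a generic resampling sketch. This is not PPRB itself and uses no bstrl functions. It only assumes a pool of 500 labelled samples that is repeatedly redrawn from itself, standing in for any filtering step that can reuse existing values but never create new ones.

```r
set.seed(1)

pool <- seq_len(500)               # 500 initially distinct samples
n_rounds <- 5                      # number of successive filtering steps
n_distinct <- integer(n_rounds + 1)
n_distinct[1] <- length(unique(pool))

for (k in seq_len(n_rounds)) {
  # Redraw the pool from itself: values can be duplicated or dropped,
  # but no new value can ever appear.
  pool <- sample(pool, size = length(pool), replace = TRUE)
  n_distinct[k + 1] <- length(unique(pool))
}

n_distinct  # number of distinct values after each round
```

Because each round can only reuse values already in the pool, the count of distinct values can never increase, which is the kind of degradation that an occasional SMCMC update is meant to repair.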
This package also provides the option to perform non-streaming,
multi-file record linkage using a Gibbs sampler. This can be done using
the multifileRL function. This is significantly slower than streaming
record linkage and is only provided for educational purposes.
# Link three files from scratch, returning 500 posterior samples
res.threefile <- multifileRL(geco_small[1:3],
                             flds = fieldnames, types = types, breaks = breaks,
                             nIter = 600, burn = 100, # Number of iterations to run
                             proposals = "comp", # Change to "LB" for faster iterations, slower convergence
                             seed = 0,
                             refresh = 0.05) # Print progress every 5% of run.
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]