This vignette describes and explains the logic behind common ways of creating rule packs.
A rule is a function that converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether that unit satisfies a certain condition.
A rule pack is a function that combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away using dplyr and magrittr's pipe operator.
Some of ruler's functionality is powered by the keyholder package. It is highly recommended to use its supported functions during rule pack construction. All one- and two-table dplyr verbs applied to local data frames are supported and considered the most appropriate way to create rule packs.
As described in the vignette about the design process, a rule pack must have a type because outputs for different data units have different structures. For this reason ruler has a family of *_packs() constructors (where * stands for the name of the data unit): data_packs(), group_packs(), col_packs(), row_packs() and cell_packs().
To check whether the dimensions of mtcars obey some rules one can write the following dplyr pipeline:
library(dplyr)

# Each summary column is one rule about the data frame as a whole
mtcars %>% summarise(
  nrow_low = nrow(.) > 10,
  nrow_high = nrow(.) < 30,
  ncol = ncol(.) == 12
)
#> nrow_low nrow_high ncol
#> 1 TRUE FALSE FALSE
The output is a one-row data frame with one logical column per rule; the column names serve as rule names.
There is an easy way to transform this pipeline into a function that can be used for any data: mtcars should be replaced with the . character. To indicate that this function is a rule pack for the data unit 'data', it should be wrapped with data_packs().
The next code creates a list my_data_packs with one data rule pack named my_data_pack_1. That rule pack defines rules with names nrow_low, nrow_high, ncol.
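A minimal sketch of such a list, assuming dplyr and ruler are attached. The pack body is the pipeline shown earlier with mtcars replaced by the dot:

```r
library(dplyr)
library(ruler)

# A data rule pack: the earlier pipeline with `mtcars` replaced by `.`,
# wrapped in data_packs() so that ruler interprets its output as rules
# about the data as a whole
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)
```

Naming the element inside data_packs() is what gives the pack its name; the rule names come from the names of the summary columns.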
To check whether certain groups of rows of mtcars obey some rules one can write the following dplyr pipeline:
mtcars %>%
  group_by(vs, am) %>%
  summarise(any_cyl_6 = any(cyl == 6))
#> `summarise()` has grouped output by 'vs'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 × 3
#> # Groups: vs [2]
#> vs am any_cyl_6
#> <dbl> <dbl> <lgl>
#> 1 0 0 FALSE
#> 2 0 1 TRUE
#> 3 1 0 TRUE
#> 4 1 1 FALSE
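The output has the following structure: one or more grouping columns whose unique value combinations identify a group (vs and am in this case), plus one logical column per rule, named after the rule.
The next code creates a list with one nameless group rule pack (the name will be imputed during exposure). This pack contains one rule, any_cyl_6, which checks every group defined by the vs and am columns.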
my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)
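To see what this pack produces, it can be applied to mtcars with expose() and the validation report read with get_report(). The sketch below repeats the pack definition so that it runs on its own; group_report is a name introduced here for illustration only.

```r
library(dplyr)
library(ruler)

my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

# Apply the pack and extract the tidy validation report; by default the
# report lists only the units that break a rule
group_report <- mtcars %>%
  expose(my_group_packs) %>%
  get_report()

group_report
```

Each row of the report describes one group that breaks the rule any_cyl_6, and the var column identifies that group.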
Notes:
- … ungrouped.
- Grouping columns should be supplied via the .group_vars argument to distinguish them from non-grouping ones.
- If there are several grouping columns, the var column in the validation report is created by uniting them with the default separator, a dot (.). In this case the values will be 0.0, 0.1, 1.0, 1.1. To change the separator, supply it with the .group_sep argument.
To check whether certain columns of mtcars obey some rules one can write the following dplyr pipeline:
is_integerish <- function(x) {
  all(x == as.integer(x))
}
mtcars %>%
  summarise_if(is_integerish, list(mean_low = ~ mean(.) > 0.5))
#> cyl_mean_low hp_mean_low vs_mean_low am_mean_low gear_mean_low carb_mean_low
#> 1 TRUE TRUE FALSE FALSE TRUE TRUE
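The output is a one-row data frame of logical values whose column names combine the validated column name and the rule name, joined by _ (for example, cyl_mean_low).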
In general it is hard to automatically separate the output's column names into 'validated column name' and 'rule name' because the default separator, _, is commonly used in names. For this reason ruler has the function rules() with the following functionality:
- … rules()'s arguments.
- It adds the prefix ._. (Morse code for 'R') to rule names. Note that one can change this prefix with the .prefix argument.
The next code creates a list with two elements:
- A column rule pack named my_col_pack_1 which checks obedience of 'integerish' columns to the rule mean_low.
- A column rule pack which checks column vs against an unnamed rule (its name will be imputed as rule__1).
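A minimal sketch of such a list, assuming dplyr and ruler are attached. The list name my_col_packs and the threshold in the second pack are illustrative choices, not taken from the text above:

```r
library(dplyr)
library(ruler)

# Helper from the pipeline above: TRUE if every value is a whole number
is_integerish <- function(x) {
  all(x == as.integer(x))
}

my_col_packs <- col_packs(
  # Named pack: rule `mean_low` applied to every 'integerish' column
  my_col_pack_1 = . %>%
    summarise_if(is_integerish, rules(mean_low = mean(.) > 0.5)),
  # Nameless pack with one nameless rule for column `vs`;
  # the threshold 300 is only an illustration
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
```

Compared with the plain pipeline, the only change inside the scoped verbs is that list() is replaced with rules(), which lets ruler split the output's column names into column and rule parts.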
Note the use of the named argument in vars(vs = "vs"). This is the current way in dplyr's scoped variants of summarise and mutate to force using both column and function names in the output's column name.
To check whether certain rows of mtcars are not outliers one can write the following dplyr pipeline:
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}
mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
#> is_common_row_mean
#> Merc 280 TRUE
#> Merc 280C TRUE
#> Merc 450SE TRUE
#> Merc 450SL TRUE
#> Merc 450SLC TRUE
#> Cadillac Fleetwood FALSE
The output is a data frame of logical values with one column per rule (named after the rule) and one row per validated row.
A pipeline like the one above is quite common: for every row, compute some value based on all rows and then validate only some of them. However, in the validation report the column id should represent the row index in the original data frame, and this information is missing after applying slice().
This problem is solved by using the keyholder package. Its main purpose is to track information about rows while modifying a data frame. During exposure the pack is applied to the keyed version of the input data, with the key equal to the row index. Note that to use this feature one should create rule packs using a composition of functions supported by keyholder.
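The row tracking described above can be illustrated directly with keyholder. This is a sketch that assumes keyholder's use_id() and pull_key() functions; it mimics the idea of keying by row index and is not necessarily identical to what ruler does internally during exposure. The name kept_rows is introduced here for illustration.

```r
library(dplyr)
library(keyholder)

# use_id() stores the row index as a key named `.id`; keys travel with the
# rows through dplyr verbs that keyholder supports, such as slice()
kept_rows <- mtcars %>%
  use_id() %>%
  slice(10:15) %>%
  pull_key(.id)

kept_rows
```

The key survives slice(), so the original positions of the remaining rows can still be recovered afterwards.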
The next code creates a list with one row pack my_row_pack_1. It contains one rule is_common_row_mean that checks 6 rows (from 10 to 15) for not being an outlier (based on information from all rows) in terms of row means.
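A minimal sketch of such a list, assuming dplyr and ruler are attached. The list name my_row_packs is an illustrative choice; the pack body is the pipeline above with mtcars replaced by the dot:

```r
library(dplyr)
library(ruler)

# Helper from the pipeline above: standardise a numeric vector
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

my_row_packs <- row_packs(
  # Row means are computed on all rows; only rows 10 to 15 are validated
  my_row_pack_1 = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
```

Because the pack uses only dplyr verbs, the row index can be tracked through slice() during exposure.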
To check whether certain cells of mtcars are not outliers one can write the following dplyr pipeline:
mtcars %>%
  transmute_if(
    is_integerish,
    list(is_common = ~ abs(z_score(.)) < 1)
  ) %>%
  slice(20:24)
#> cyl_is_common hp_is_common vs_is_common am_is_common
#> Toyota Corolla FALSE FALSE FALSE FALSE
#> Toyota Corona FALSE TRUE FALSE TRUE
#> Dodge Challenger FALSE TRUE TRUE TRUE
#> AMC Javelin FALSE TRUE TRUE TRUE
#> Camaro Z28 FALSE FALSE TRUE TRUE
#> gear_is_common carb_is_common
#> Toyota Corolla TRUE FALSE
#> Toyota Corona TRUE FALSE
#> Dodge Challenger TRUE TRUE
#> AMC Javelin TRUE TRUE
#> Camaro Z28 TRUE TRUE
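The output is a data frame of logical values with one row per validated row and one column per combination of validated column and rule (for example, cyl_is_common).
Basically, a cell rule pack is a combination of column and row rule packs. It means:
- rules() should be used instead of a plain list in scoped variants of transmute().
- The pack should be composed of functions supported by keyholder.
The next code creates a list with one cell pack, my_cell_pack_1. It checks the cells of every integer-like column in rows 20-24 for not being an outlier within its column.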
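A minimal sketch of such a list, assuming dplyr and ruler are attached. The list name my_cell_packs is an illustrative choice, and the two helpers are repeated from the earlier pipelines so the block runs on its own:

```r
library(dplyr)
library(ruler)

# Helpers repeated from the earlier pipelines
is_integerish <- function(x) {
  all(x == as.integer(x))
}
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

my_cell_packs <- cell_packs(
  # rules() replaces list() so column and rule names can be separated;
  # slice() restricts validation to rows 20 to 24
  my_cell_pack_1 = . %>%
    transmute_if(is_integerish, rules(is_common = abs(z_score(.)) < 1)) %>%
    slice(20:24)
)
```

The z-scores are still computed on all rows of each column, since slice() is applied only after transmute_if().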