Rule Packs

This vignette describes common ways of creating rule packs and explains the logic behind them.

Overview

A rule is a function that converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether that unit satisfies a certain condition.

A rule pack is a function that combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away using dplyr and magrittr’s pipe operator.

Some of ruler’s functionality is powered by the keyholder package. It is highly recommended to use functions supported by keyholder when constructing rule packs. All one- and two-table dplyr verbs applied to local data frames are supported and are considered the most appropriate way to create rule packs.

As described in the vignette about the design process, a rule pack must have a type because outputs for different data units have different structures. For this reason ruler has a family of *_packs() constructors (where * stands for the name of the data unit):

They take functions defining packs (in pure form or inside a list at any depth) as arguments. It is recommended to name those arguments with the future pack names. If no name is supplied, it will be imputed during exposure.
They return a list of what should be rule packs of a certain type.

Data rule packs

To check whether the dimensions of mtcars obey some rules, one can write the following dplyr pipeline:

library(dplyr)
library(ruler)

mtcars %>% summarise(
  nrow_low = nrow(.) > 10,
  nrow_high = nrow(.) < 30,
  ncol = ncol(.) == 12
)
#>   nrow_low nrow_high  ncol
#> 1     TRUE     FALSE FALSE

The output has the following structure:

The number of rows equals one.
Column names define rule names.
Values indicate whether the data as a whole follows the rule.

There is an easy way to transform this pipeline into a function that can be used with any data: replace mtcars with the . placeholder. To indicate that this function is a rule pack for the data unit ‘data’, wrap it with data_packs().

The following code creates a list my_data_packs with one data rule pack named my_data_pack_1. That rule pack defines rules named nrow_low, nrow_high and ncol.

my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)
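
As a minimal sketch of how such a pack might be used, the code below first calls the pack directly as an ordinary function and then passes the list to ruler's expose() and get_report(). The variable names direct_output and data_report are illustrative only, and the note after the code assumes expose() keeps its default settings.

```r
library(dplyr)
library(ruler)

my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)

# A pack is still a function of a data frame, so it can be called directly
direct_output <- my_data_packs$my_data_pack_1(mtcars)

# Exposure applies the pack to the data; the report is a tidy data frame
data_report <- mtcars %>%
  expose(my_data_packs) %>%
  get_report()
```

With default settings the report is expected to list only the rules that were broken, here nrow_high and ncol, each attributed to my_data_pack_1.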

Group rule packs

To check whether certain groups of rows of mtcars obey some rules, one can write the following dplyr pipeline:

mtcars %>% group_by(vs, am) %>%
  summarise(any_cyl_6 = any(cyl == 6))
#> `summarise()` has grouped output by 'vs'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 × 3
#> # Groups:   vs [2]
#>      vs    am any_cyl_6
#>   <dbl> <dbl> <lgl>    
#> 1     0     0 FALSE    
#> 2     0     1 TRUE     
#> 3     1     0 TRUE     
#> 4     1     1 FALSE

The output has the following structure:

Some columns define group levels (vs and am in this case).
The number of rows equals the number of validated groups.
Names of non-grouping columns define rule names.
Values indicate whether the group as a whole follows the rule.

The following code creates a list with one nameless group rule pack (the name will be imputed during exposure). This pack contains one rule, any_cyl_6, which checks every group defined by the vs and am columns.

my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

Notes:

In this example the pack’s output is a grouped tibble. This doesn’t affect anything because the output is ungrouped during exposure.
Names of grouping columns should be supplied with the .group_vars argument to distinguish them from non-grouping ones.
The value for the var column in the validation report is created by uniting the grouping columns with the default separator ‘.’. In this case the values will be 0.0, 0.1, 1.0, 1.1. To change the separator, supply it with the .group_sep argument.
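
The effect of .group_sep can be sketched as follows. The separator "__" and the names my_group_packs_sep and group_report are arbitrary choices made for this illustration, and .groups = "drop" is added only to silence the grouping message from summarise().

```r
library(dplyr)
library(ruler)

my_group_packs_sep <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6), .groups = "drop"),
  .group_vars = c("vs", "am"),
  .group_sep = "__"  # illustrative separator replacing the default "."
)

group_report <- mtcars %>%
  expose(my_group_packs_sep) %>%
  get_report()

# Group labels in `var` are built from vs and am joined by the separator
unique(group_report$var)
```

Under these assumptions the var column should contain labels such as 0__0 rather than 0.0.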

Column rule packs

To check whether certain columns of mtcars obey some rules, one can write the following dplyr pipeline:

is_integerish <- function(x) {
  all(x == as.integer(x))
}

mtcars %>%
  summarise_if(is_integerish, list(mean_low = ~ mean(.) > 0.5))
#>   cyl_mean_low hp_mean_low vs_mean_low am_mean_low gear_mean_low carb_mean_low
#> 1         TRUE        TRUE       FALSE       FALSE          TRUE          TRUE

The output has the following structure:

The number of rows equals one.
Column names are constructed as ‘validated column name’ + ‘separator’ + ‘rule name’.
Values indicate whether the column as a whole follows the rule.

In general it is hard to automatically separate the output’s column names into ‘validated column name’ and ‘rule name’ because the default separator _ is commonly used in column names. For this reason ruler has the function rules() with the following functionality:

It imputes rule names that are not supplied. The format is ‘rule__{ind}’, where {ind} is the index of the function’s position within the arguments of rules().
It adds the rarely used prefix ._. (Morse code for ‘R’) to rule names. Note that one can change this prefix with the .prefix argument.
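
A short sketch of this naming behaviour, assuming rules() returns a named list of functions. The prefix "chk__" and the variable names default_rules and custom_rules are hypothetical and chosen only for illustration.

```r
library(ruler)

# One named rule and one unnamed rule (its name is imputed from its position)
default_rules <- rules(mean_low = mean(.) > 0.5, sd(.) > 1)
names(default_rules)

# The same named rule with a custom prefix instead of the default "._."
custom_rules <- rules(mean_low = mean(.) > 0.5, .prefix = "chk__")
names(custom_rules)
```

Inspecting the names this way is a quick check that the prefix used here matches the separator later expected during exposure.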

The following code creates a list with two elements:

A column rule pack my_col_pack_1, which checks whether ‘integerish’ columns obey the rule mean_low.
A nameless column rule pack, which checks whether column vs obeys an unnamed rule (its name will be imputed as rule__1). Note the use of a named argument in vars(vs = "vs"). This is the current way in dplyr’s scoped variants of summarise() and mutate() to force both column and function names into the output’s column names.

my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_if(
    is_integerish,
    rules(mean_low = mean(.) > 0.5)
  ),
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
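
To see what the prefix does to output names, the sketch below applies an equivalent pipeline directly and then exposes a one-element list of column packs. The names col_pack, col_names and col_report are illustrative and not part of ruler.

```r
library(dplyr)
library(ruler)

is_integerish <- function(x) {
  all(x == as.integer(x))
}

# The same pipeline as in my_col_pack_1, kept as a plain function
col_pack <- . %>% summarise_if(
  is_integerish,
  rules(mean_low = mean(.) > 0.5)
)

# Output names combine the column name, "_", the prefix and the rule name
col_names <- names(col_pack(mtcars))

col_report <- mtcars %>%
  expose(col_packs(my_col_pack_1 = col_pack)) %>%
  get_report()
```

During exposure the prefixed separator lets ruler split each output name back into a validated column (var) and a rule name (rule).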

Row rule packs

To check whether certain rows of mtcars are not outliers, one can write the following dplyr pipeline:

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
#>                    is_common_row_mean
#> Merc 280                         TRUE
#> Merc 280C                        TRUE
#> Merc 450SE                       TRUE
#> Merc 450SL                       TRUE
#> Merc 450SLC                      TRUE
#> Cadillac Fleetwood              FALSE

The output has the following structure:

The number of rows equals the number of checked rows.
Column names define rule names.
Values indicate whether the row as a whole follows the rule.

A pipeline like the one above is quite common: for every row compute some value based on all rows and then validate only some of them. However, in the validation report the id column should represent the row index in the original data frame, and this information is missing after applying slice().

This problem is solved by using the keyholder package. Its main purpose is to track information about rows while modifying a data frame. During exposure the pack is applied to a keyed version of the input data, with the key equal to the row index. Note that to use this feature one should create rule packs as compositions of functions supported by keyholder.

The following code creates a list with one row pack my_row_pack_1. It contains one rule, is_common_row_mean, that checks 6 rows (10 through 15) for not being outliers in terms of row means (based on information from all rows).

my_row_packs <- row_packs(
  my_row_pack_1 = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
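
The row tracking described above can be sketched outside of exposure by keying the data manually with keyholder's use_id() and reading the keys back with keys(). This only illustrates the idea and is not necessarily how ruler performs the step internally; keyed_output and kept_ids are illustrative names.

```r
library(dplyr)
library(keyholder)

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# use_id() stores the original row index as a key named .id
keyed_output <- mtcars %>%
  use_id() %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)

# The key survives transmute() and slice(), so original positions are known
kept_ids <- keys(keyed_output)$.id
```

Because every verb in the pipeline is supported by keyholder, the key still identifies rows 10 through 15 of the original data.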

Cell rule packs

To check whether certain cells of mtcars are not outliers, one can write the following dplyr pipeline:

mtcars %>% transmute_if(
  is_integerish,
  list(is_common = ~ abs(z_score(.)) < 1)
) %>%
  slice(20:24)
#>                  cyl_is_common hp_is_common vs_is_common am_is_common
#> Toyota Corolla           FALSE        FALSE        FALSE        FALSE
#> Toyota Corona            FALSE         TRUE        FALSE         TRUE
#> Dodge Challenger         FALSE         TRUE         TRUE         TRUE
#> AMC Javelin              FALSE         TRUE         TRUE         TRUE
#> Camaro Z28               FALSE        FALSE         TRUE         TRUE
#>                  gear_is_common carb_is_common
#> Toyota Corolla             TRUE          FALSE
#> Toyota Corona              TRUE          FALSE
#> Dodge Challenger           TRUE           TRUE
#> AMC Javelin                TRUE           TRUE
#> Camaro Z28                 TRUE           TRUE

The output has the following structure:

The number of rows equals the number of rows containing checked cells.
Column names are constructed as ‘validated column name’ + ‘separator’ + ‘rule name’.
Values indicate whether the cell follows the rule.

Basically, a cell rule pack is a combination of column and row rule packs. This means:

Using rules() instead of a plain list in scoped variants of transmute().
Using functions supported by keyholder.

The following code creates a list with one cell pack my_cell_pack_1. It checks cells of every integer-like column in rows 20-24 for not being outliers within their column.

my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
)
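
A possible way to inspect the result of such a pack is sketched below with expose() and get_report(), assuming default exposure settings. The name cell_report is illustrative.

```r
library(dplyr)
library(ruler)

is_integerish <- function(x) {
  all(x == as.integer(x))
}

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
)

# Each report row describes one cell: `var` is the column, `id` the row index
cell_report <- mtcars %>%
  expose(my_cell_packs) %>%
  get_report()
```

The id values in the report should refer to rows 20 through 24 of mtcars, since the row index is tracked through slice().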

Evgeni Chasnovski

2023-03-28
