The assertable package contains functions that allow users to easily:

* Confirm the number of rows and column names of a dataset
* Check the values of given variables (is not NA/infinite, or is less than, equal to, greater than, or contains a given value or set of values)
* Check whether the dataset contains all combinations of specified ID variables, and whether it has duplicates within those combinations
This vignette illustrates each of these operations and shows the different ways assertable can return informative output for vetting tabular data.
We will use the CO2 dataset, which has 84 rows and 5 columns of data from an experiment on the cold tolerance of plants.
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
assert_nrows checks that your dataset has the expected number of rows.
## [1] "All rows present"
## Error in assert_nrows(CO2, 80): Have 84 rows, expecting 80
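The calls behind the two results above are not shown in this text. Judging from the error message, they were presumably the ones sketched below. The `tryCatch` wrapper and the name `nrows_msg` are additions for illustration, used only so that the failing call does not halt the script.

```r
library(assertable)

# Passes: the built-in CO2 dataset has 84 rows
assert_nrows(CO2, 84)

# Fails: capture the error message instead of stopping the script
nrows_msg <- tryCatch(
  assert_nrows(CO2, 80),
  error = function(e) conditionMessage(e)
)
print(nrows_msg)
```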
assert_colnames ensures that all column names specified as colnames exist in the dataset, and also that all columns in the dataset exist in the colnames argument.
## [1] "All column names present"
## Error in assert_colnames(CO2, c("Plant", "Type", "Treatment", "conc", : These columns exist in colnames but not in your dataframe: other_uptake and these exist in your dataframe but not in colnames: uptake
If you only want to assert a subset of colnames and allow your dataset to have additional columns besides those specified in colnames, you can use the only_colnames=F option.
## [1] "All column names present"
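As a rough sketch, the three assert_colnames cases described above could be run as follows. The failing call is reconstructed from the error message. The two-column subset in the last call is an assumed example, and `colnames_msg` is a name introduced here for illustration.

```r
library(assertable)

# Exact match: every column must be listed, and every listed column must exist
assert_colnames(CO2, c("Plant", "Type", "Treatment", "conc", "uptake"))

# Mismatch: "other_uptake" is not a column of CO2, and "uptake" is not listed
colnames_msg <- tryCatch(
  assert_colnames(CO2, c("Plant", "Type", "Treatment", "conc", "other_uptake")),
  error = function(e) conditionMessage(e)
)
print(colnames_msg)

# Subset check (assumed example): extra columns in CO2 are allowed
assert_colnames(CO2, c("Plant", "Type"), only_colnames = FALSE)
```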
assert_values checks the values of one or more columns against a given test. The full list of tests it supports:

* not_na: All values must not be NA
* not_inf: All values must not be infinite
* lt: All values must be less than test_val
* lte: All values must be less than or equal to test_val
* gt: All values must be greater than test_val
* gte: All values must be greater than or equal to test_val
* equal: All values must be equal to test_val
* not_equal: All values must not equal test_val
* in: All values must be one of the values in test_val

Here, we check whether the conc and uptake columns of a new dataset, CO2_miss, contain any NA values.
CO2_miss <- CO2
CO2_miss[CO2_miss$Plant == "Qn2" & CO2_miss$conc == 175, "uptake"] <- NA
assert_values(CO2_miss, colnames=c("conc","uptake"), test="not_na")
## [1] "Variable conc passed not_na test"
## Plant Type Treatment conc uptake
## 9 Qn2 Quebec nonchilled 175 NA
## Error in assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "not_na"): 1 Rows for variable uptake are NA in the dataset above
If we run the same assert_values check on the original data, both columns pass.
## [1] "Variable conc passed not_na test"
## [1] "Variable uptake passed not_na test"
The not_inf test option checks for infinite values in the same way.
CO2_inf <- CO2
CO2_inf[CO2_inf$Plant == "Qn2" & CO2_inf$conc == 175, "uptake"] <- Inf
assert_values(CO2_inf, colnames=c("conc","uptake"), test="not_inf")
## [1] "Variable conc passed not_inf test"
## Plant Type Treatment conc uptake
## 9 Qn2 Quebec nonchilled 175 Inf
## Error in assert_values(CO2_inf, colnames = c("conc", "uptake"), test = "not_inf"): 1 Rows for variable uptake are infinite in the dataset above
Here, we can see different results for checking values of CO2 against single numeric thresholds.
## [1] "Variable uptake passed gt test"
## [1] "Variable conc passed lte test"
## Plant Type Treatment conc uptake
## 11 Qn2 Quebec nonchilled 350 41.8
## 12 Qn2 Quebec nonchilled 500 40.6
## 13 Qn2 Quebec nonchilled 675 41.4
## 14 Qn2 Quebec nonchilled 1000 44.3
## 17 Qn3 Quebec nonchilled 250 40.3
## 18 Qn3 Quebec nonchilled 350 42.1
## 19 Qn3 Quebec nonchilled 500 42.9
## 20 Qn3 Quebec nonchilled 675 43.9
## 21 Qn3 Quebec nonchilled 1000 45.5
## 35 Qc2 Quebec chilled 1000 42.4
## 42 Qc3 Quebec chilled 1000 41.4
## Error in assert_values(CO2, colnames = "uptake", test = "lt", 40): 11 Rows for variable uptake not less than the test value(s) in the dataset above
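The threshold checks above might look like the sketch below. Only the failing `lt` call (threshold 40) is recoverable from the error message. The thresholds 0 and 1000 for the passing calls are assumptions, chosen because they are consistent with the output shown. `lt_msg` and `n_violations` are names introduced here.

```r
library(assertable)

# Assumed thresholds: the text shows only that these two tests passed
assert_values(CO2, colnames = "uptake", test = "gt", test_val = 0)
assert_values(CO2, colnames = "conc", test = "lte", test_val = 1000)

# Fails for every row whose uptake is 40 or more; capture the message
lt_msg <- tryCatch(
  assert_values(CO2, colnames = "uptake", test = "lt", test_val = 40),
  error = function(e) conditionMessage(e)
)
print(lt_msg)

# Cross-check with base R: number of rows that violate uptake < 40
n_violations <- sum(CO2$uptake >= 40)
print(n_violations)
```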
Using the “in” option for test, we can assert that every value in the given colnames is one of the values in test_val, which can be a vector of any size.
## [1] "Variable Treatment passed in test"
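A minimal sketch of an “in” check on Treatment follows. The vector `allowed_treatments` is a name introduced here, and its contents are assumed to be the two treatment levels that appear in the data shown earlier.

```r
library(assertable)

# Assumed set of permitted values for the Treatment column
allowed_treatments <- c("nonchilled", "chilled")

# Passes when every Treatment value is a member of allowed_treatments
assert_values(CO2, colnames = "Treatment", test = "in", test_val = allowed_treatments)

# Equivalent base R condition, for comparison
all_allowed <- all(CO2$Treatment %in% allowed_treatments)
print(all_allowed)
```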
We can also test equality, to see whether values are equal or not equal to a given value.
## [1] "Variable Type passed not_equal test"
## [1] "Variable Type passed equal test"
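The calls behind these two results are not shown in this text, so the sketch below is one plausible pair rather than the original code. The comparison value "USA" and the Quebec-only subset `quebec` are assumptions chosen so that both tests pass.

```r
library(assertable)

# not_equal: "USA" is an assumed value that never appears in Type
assert_values(CO2, colnames = "Type", test = "not_equal", test_val = "USA")

# equal: restrict to one Type first, so every remaining value matches
quebec <- CO2[CO2$Type == "Quebec", ]
assert_values(quebec, colnames = "Type", test = "equal", test_val = "Quebec")
```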
assert_values can also compare your columns against vectors of the same length as the number of rows in your dataset. For example, here we compare the uptake variable against a newly-created new_uptake variable, which is equal to uptake * 2.
CO2_mult <- CO2
CO2_mult$new_uptake <- CO2_mult$uptake * 2
assert_values(CO2, colnames="uptake", test="lt", CO2_mult$new_uptake)
## [1] "Variable uptake passed lt test"
assert_values(CO2, colnames="uptake", test="equal", CO2_mult$new_uptake/2)
## [1] "Variable uptake passed equal test"
Above, assert_values confirms that uptake equals new_uptake / 2. Below, the “gt” assertion fails because no value of uptake is strictly greater than new_uptake / 2, whereas “gte” would have succeeded. Here, we use the display_rows = F option to simply display the row numbers rather than the actual rows that failed this assertion (in this case, it happens to be all the rows).
CO2_mult <- CO2
## new_uptake must be recreated here; without it, test_val is empty and
## assert_values stops with "Must specify test_val argument for comparison tests"
CO2_mult$new_uptake <- CO2_mult$uptake * 2
assert_values(CO2, colnames="uptake", test="gt", CO2_mult$new_uptake/2, display_rows=F)
You can combine assert_values calls to test columns against one another based on arbitrary lower/upper bounds; for example, the code below asserts that all values in the uptake column must be less than the value of conc, and that conc must not be more than 50 times the value of uptake.
## [1] "Variable uptake passed lt test"
## Plant Type Treatment conc uptake
## 77 Mc2 Mississippi chilled 1000 14.4
## 84 Mc3 Mississippi chilled 1000 19.9
## Error in assert_values(CO2, colnames = "uptake", test = "gt", CO2_mult$conc * : 2 Rows for variable uptake not more than the test value(s) in the dataset above
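The two column-against-column checks can be sketched as below. The exact expression used for the second test value is truncated in the error message, so `CO2$conc / 50` is an assumed equivalent of the stated bound (conc at most 50 times uptake, tested strictly with "gt"). `ratio_msg` and `n_ratio_violations` are names introduced here.

```r
library(assertable)

# uptake must be below conc in every row
assert_values(CO2, colnames = "uptake", test = "lt", test_val = CO2$conc)

# conc < 50 * uptake, rewritten as uptake > conc / 50 (assumed form)
ratio_msg <- tryCatch(
  assert_values(CO2, colnames = "uptake", test = "gt", test_val = CO2$conc / 50),
  error = function(e) conditionMessage(e)
)
print(ratio_msg)

# Cross-check with base R: rows where uptake does not exceed conc / 50
n_ratio_violations <- sum(CO2$uptake <= CO2$conc / 50)
print(n_ratio_violations)
```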
The na.rm option in assert_values is useful for numeric comparisons. Comparing a number against an NA value returns NA, which assert_values treats as a failed assertion. With na.rm=T, NA values are not counted as violations.
## [1] "Variable conc passed lt test"
## Plant Type Treatment conc uptake
## 9 Qn2 Quebec nonchilled 175 NA
## Error in assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "lt", : 1 Rows for variable uptake not less than the test value(s) in the dataset above
With na.rm=T, we can evaluate without marking the NA value for Qn2 as a failure.
## [1] "Variable conc passed lt test"
## [1] "Variable uptake passed lt test"
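The effect of na.rm can be sketched as follows. The threshold 1001 is an assumption (any value above the largest conc would behave the same way), since the text shows only the output. `na_msg` is a name introduced here.

```r
library(assertable)

# Recreate the dataset with one missing uptake value
CO2_miss <- CO2
CO2_miss[CO2_miss$Plant == "Qn2" & CO2_miss$conc == 175, "uptake"] <- NA

# Without na.rm, the NA row counts as a violation (threshold 1001 is assumed)
na_msg <- tryCatch(
  assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "lt", test_val = 1001),
  error = function(e) conditionMessage(e)
)
print(na_msg)

# With na.rm = TRUE, the NA row is ignored and both columns pass
assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "lt",
              test_val = 1001, na.rm = TRUE)
```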
assert_ids allows you to check whether your dataset is “square”, meaning that it contains all unique combinations of ID variables as specified in a named list of vectors (e.g. list(id1=c(1,2), id2=c("A","B"))).
The ultimate aim is to make sure that you have one row per unique combination of ID variables, and to return violations of this rule for easy vetting. Here, we try to figure out which combinations of variables uniquely identify the data, whether any combinations of ID variables are missing, and whether there are any duplicates in the data by ID variables. First, we get the levels of some potential ID variables.
plants <- as.character(unique(CO2$Plant))
treatments <- unique(CO2$Treatment)
concs <- unique(CO2$conc)
Let’s see if Plant alone is a unique identifier.
## Plant n_duplicates
## 1: Qn1 7
## 2: Qn2 7
## 3: Qn3 7
## 4: Qc1 7
## 5: Qc2 7
## 6: Qc3 7
## 7: Mn1 7
## 8: Mn2 7
## 9: Mn3 7
## 10: Mc1 7
## 11: Mc2 7
## 12: Mc3 7
## Error in assert_ids(CO2, ids): These combinations of id variables have n_duplicates duplicate observations per combination (84 total duplicates)
There are 7 duplicates for each plant because each plant has 7 different values of conc. Now, let’s try adding conc to the ID list.
## [1] "Data is identified by id_vars: Plant conc"
Our dataset is uniquely identified by Plant and conc!
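The two assert_ids calls above are not shown in this text. A sketch of what they presumably looked like follows, with the failing call wrapped in `tryCatch` so the script continues. `plant_msg` is a name introduced here.

```r
library(assertable)

plants <- as.character(unique(CO2$Plant))
concs <- unique(CO2$conc)

# Plant alone: each plant appears once per conc value, so this fails
plant_msg <- tryCatch(
  assert_ids(CO2, list(Plant = plants)),
  error = function(e) conditionMessage(e)
)
print(plant_msg)

# Plant and conc together identify each row exactly once
assert_ids(CO2, list(Plant = plants, conc = concs))
```

Because 12 plants times 7 concentrations gives 84 combinations and CO2 has 84 rows, the second call can only pass if every combination appears exactly once.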
Now, let’s see how assert_ids returns results when the dataset has duplicate values.
ids <- list(Plant=plants,conc=concs)
CO2_dups <- rbind(CO2,CO2[CO2$Plant=="Mc2" & CO2$conc < 300,])
assert_ids(CO2_dups, ids)
## Plant conc n_duplicates
## 1: Mc2 95 2
## 2: Mc2 175 2
## 3: Mc2 250 2
## Error in assert_ids(CO2_dups, ids): These combinations of id variables have n_duplicates duplicate observations per combination (6 total duplicates)
Here, we get the unique combinations of Plant and conc that had duplicate values. If we want a more detailed look at the duplicates, we can specify ids_only = F to return each observation in the original dataset that is a duplicate. This dataset will include the variables n_duplicates (the total number within the combination) and duplicate_id (the observation’s unique ID within the combination).
## Plant conc Type Treatment uptake n_duplicates duplicate_id
## 1: Mc2 95 Mississippi chilled 7.7 2 1
## 2: Mc2 95 Mississippi chilled 7.7 2 2
## 3: Mc2 175 Mississippi chilled 11.4 2 1
## 4: Mc2 175 Mississippi chilled 11.4 2 2
## 5: Mc2 250 Mississippi chilled 12.3 2 1
## 6: Mc2 250 Mississippi chilled 12.3 2 2
## Error in assert_ids(CO2_dups, ids, ids_only = F): These rows of data are all of the observations with duplicated id_vars, and have n_duplicates duplicate observations per combination of id_varnames (6 total duplicates)
This dataset can also be stored into an object by specifying the warn_only = T option, which can then be saved or used for further exploration.
## Warning in assert_ids(CO2_dups, ids, ids_only = F, warn_only = T): These
## rows of data are all of the observations with duplicated id_vars, and have
## n_duplicates duplicate observations per combination of id_varnames (6 total
## duplicates)
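A self-contained sketch of capturing the duplicate rows with warn_only follows. The call mirrors the one named in the warning above. `dup_rows` is an assumed object name.

```r
library(assertable)

plants <- as.character(unique(CO2$Plant))
concs <- unique(CO2$conc)
ids <- list(Plant = plants, conc = concs)

# Append three duplicated rows for plant Mc2
CO2_dups <- rbind(CO2, CO2[CO2$Plant == "Mc2" & CO2$conc < 300, ])

# warn_only = TRUE issues a warning instead of an error and returns the rows
dup_rows <- assert_ids(CO2_dups, ids, ids_only = FALSE, warn_only = TRUE)
print(dup_rows)
```

The returned object can then be written to disk or filtered like any other table.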
One behavior of assert_ids is that it stops at the first violation that it reaches. In the example below, the CO2_dups dataset does not contain a certain set of ID combinations and it also has duplicate rows. Since assert_ids first evaluates whether all ID combinations are present, it errors out on the ID combinations part but does not reach the step where it evaluates duplicates.
## Add a new fake level to plants, use as.character because the "new_plant" level
## doesn't mix well with the factor level
new_plants <- c(as.character(plants),"new_plant")
ids <- list(Plant=new_plants,conc=concs)
dup_rows <- assert_ids(CO2_dups, ids)
## Plant conc
## 1: new_plant 95
## 2: new_plant 175
## 3: new_plant 250
## 4: new_plant 350
## 5: new_plant 500
## 6: new_plant 675
## 7: new_plant 1000
## Error in assert_ids(CO2_dups, ids): The above combinations of id variables do not exist in your dataset
To evaluate both the existing-combinations and no-duplicate conditions using assert_ids, you can call it twice, with warn_only = T and with alternating toggles on the assert_* options. By capturing the output into objects, you can then output those results separately and stop execution of your script if either object is not NULL.
new_plants <- c(as.character(plants),"new_plant")
ids <- list(Plant=new_plants,conc=concs)
combos <- assert_ids(CO2_dups, ids, assert_dups = F, warn_only=T)
## Warning in assert_ids(CO2_dups, ids, assert_dups = F, warn_only = T): The
## following combinations of id variables do not exist in your dataset
## Warning in assert_ids(CO2_dups, ids, assert_combos = F, ids_only = F,
## warn_only = T): These rows of data are all of the observations with
## duplicated id_vars, and have n_duplicates duplicate observations per
## combination of id_varnames (6 total duplicates)
## Plant conc
## 1: new_plant 95
## 2: new_plant 175
## 3: new_plant 250
## 4: new_plant 350
## 5: new_plant 500
## 6: new_plant 675
## 7: new_plant 1000
## Plant conc Type Treatment uptake n_duplicates duplicate_id
## 1: Mc2 95 Mississippi chilled 7.7 2 1
## 2: Mc2 95 Mississippi chilled 7.7 2 2
## 3: Mc2 175 Mississippi chilled 11.4 2 1
## 4: Mc2 175 Mississippi chilled 11.4 2 2
## 5: Mc2 250 Mississippi chilled 12.3 2 1
## 6: Mc2 250 Mississippi chilled 12.3 2 2
## Error in eval(expr, envir, enclos): assert_ids failed, see above for results
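Put together, the two-pass pattern might look like the sketch below. The second call is taken from the warning text above. The names `dups` and `failed` are assumptions, and `message()` stands in for `stop()` so that the example runs to completion.

```r
library(assertable)

plants <- as.character(unique(CO2$Plant))
concs <- unique(CO2$conc)
CO2_dups <- rbind(CO2, CO2[CO2$Plant == "Mc2" & CO2$conc < 300, ])

# Include a plant that does not exist in the data
ids <- list(Plant = c(plants, "new_plant"), conc = concs)

# Pass 1: missing combinations only
combos <- assert_ids(CO2_dups, ids, assert_dups = FALSE, warn_only = TRUE)
# Pass 2: duplicates only, returning the full duplicated rows
dups <- assert_ids(CO2_dups, ids, assert_combos = FALSE, ids_only = FALSE,
                   warn_only = TRUE)

print(combos)
print(dups)

# Fail if either check returned violations; use stop() in a real script
failed <- !is.null(combos) || !is.null(dups)
if (failed) message("assert_ids failed, see above for results")
```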