The datacheckr::check_data() function takes two arguments: the data frame to check and a named list specifying the various conditions.
The names of the list elements specify the columns that need to appear in the data frame while the classes of the vectors specify the classes of the columns.
Thus, to specify that the data frame must contain a column called col1 of class integer, the call would be as follows.
library(datacheckr)
check_data2(mtcars, list(col1 = integer()))
## Error: mtcars must have column 'col1'
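As a rough illustration of what such a named list encodes, the base R sketch below reads the required column names and classes from a specification and compares them with a data frame. The helper describe_spec() is hypothetical and is not part of datacheckr; it only mimics the idea and makes no claim about how the package implements its checks.

```r
# Hypothetical helper (NOT part of datacheckr): for each element of a named
# list, report the class it asks for, whether the column exists, and whether
# the column's class matches.
describe_spec <- function(data, values) {
  nms <- names(values)
  data.frame(
    column = nms,
    required_class = unname(vapply(values, function(v) class(v)[1], character(1))),
    present = nms %in% names(data),
    class_ok = unname(vapply(nms, function(nm) {
      nm %in% names(data) &&
        identical(class(data[[nm]])[1], class(values[[nm]])[1])
    }, logical(1))),
    stringsAsFactors = FALSE
  )
}

# col1 is absent from mtcars; mpg is present and numeric.
res <- describe_spec(mtcars, list(col1 = integer(), mpg = numeric()))
print(res)
```

The names of the list carry the column names, and the (possibly empty) vectors carry the classes, which is why integer() with no values is enough to state a class requirement.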
To specify that the data frame cannot contain a column called mpg, the call is just
check_data2(mtcars, list(mpg = NULL))
## Error: values cannot include NULLs
and to specify that it can contain a column col1 whose values can be integer or numeric, the call would be
check_data1(mtcars, list(
  col1 = integer(),
  col1 = NULL,
  col1 = numeric()))
If a column is not named in the list then no checks are performed on it.
To specify that a column cannot include missing values, pass a single non-missing value.
check_data2(mtcars, list(mpg = 3))
check_data2(mtcars, list(mpg = -1))
To specify that it can include missing values, include an NA in the vector
check_data2(mtcars, list(mpg = c(NA, 9)))
and to specify that it can only include missing values use
check_data2(mtcars, list(mpg = as.numeric(NA)))
## Error: column mpg in mtcars can only include missing values
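For orientation, the three missing-value situations above correspond to simple base R predicates. The sketch below is only an analogy written with base functions; it does not call datacheckr and is not a description of its internals.

```r
# Base R view of the missing-value conditions on mtcars$mpg.
mpg <- mtcars$mpg

no_missing   <- !anyNA(mpg)        # TRUE: mpg contains no missing values
only_missing <- all(is.na(mpg))    # FALSE: mpg is not made up solely of NAs

# as.numeric(NA) is a length-one missing value that still has class numeric,
# so it can state "numeric, but only missing values".
na_spec <- as.numeric(NA)
class(na_spec)   # "numeric"
is.na(na_spec)   # TRUE
```

Because mpg has no missing values, a requirement that it contain only missing values fails, which is consistent with the error shown above.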
To indicate that the non-missing values must fall within a range, use two non-missing values (the following code tests that Count holds counts, i.e. non-negative integers).
data1 <- data.frame(
  Count = c(0L, 3L, 3L, 0L),
  LocationX = c(2000, NA, 2001, NA),
  Extra = TRUE)
check_data2(data1, list(Count = c(0L, .Machine$integer.max)))
As .Machine$integer.max is difficult to remember, the max_integer() wrapper function is provided so that the above code can be written as follows.
check_data2(data1, list(Count = c(0L, max_integer())))
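To make the range condition concrete, here is a minimal base R sketch of the same test on the Count values. It assumes the two values are treated as inclusive lower and upper bounds, which is an assumption made for this illustration rather than something taken from the package source.

```r
# Base R sketch of a range check (bounds assumed to be inclusive).
count  <- c(0L, 3L, 3L, 0L)
bounds <- c(0L, .Machine$integer.max)

# range() makes the check independent of the order in which bounds are given.
lims <- range(bounds)
in_range <- !is.na(count) & count >= lims[1] & count <= lims[2]

all(in_range)            # TRUE: every Count is a non-negative integer
.Machine$integer.max     # 2147483647, the largest representable integer in R
```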
If particular values are required then specify them as a vector of three or more non-missing values
check_data2(data1, list(Count = c(0L, 1L, 3L)))
check_data2(data1, list(Count = c(1L, 2L, 2L)))
## Error: column Count in data1 must only include the permitted values 1 and 2
The order of the values in an element is unimportant.
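The permitted-values condition can be pictured as a set membership test. The following base R sketch reproduces the two cases above with %in%; it is an illustration only and does not call datacheckr.

```r
# Base R sketch of the permitted-values condition.
count <- c(0L, 3L, 3L, 0L)

# First case: 0, 1 and 3 are permitted, so every Count value passes.
all(count %in% c(0L, 1L, 3L))        # TRUE

# Second case: repeating a value still describes the set {1, 2}.
permitted <- unique(c(1L, 2L, 2L))   # 1 2
all(count %in% permitted)            # FALSE

# The values that are not permitted.
offending <- setdiff(count, permitted)   # 0 3
```

Repeating a value is a convenient way to reach three elements when only two distinct values are permitted, as in the second call above.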
Numeric, Date and POSIXct vectors have exactly the same behaviour regarding ranges and specific values as illustrated above using integers.
With logical values, two non-missing values produce the same behaviour as three or more non-missing values. For example, to test for only TRUE values use
check_data2(data1, list(Extra = c(TRUE, TRUE)))
To specify that col1 must be a character vector use
check_data2(x, list(col1 = "b"))
while the following requires that the values match both character elements, which are treated as regular expressions
check_data2(x, list(col1 = c("^\\d", ".*")))
With three or more non-missing character elements, each value in col1 must match at least one of the elements, which are treated as regular expressions. Regular expressions are matched using grepl with perl=TRUE.
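The difference between matching both patterns and matching at least one of them can be seen directly with grepl and perl = TRUE. The sketch below uses a made-up character vector and the digit pattern ^\\d as an example; the variable names are illustrative and nothing here calls datacheckr.

```r
# Illustrative values: two start with a digit, one does not.
x <- c("1abc", "22", "abc")
patterns <- c("^\\d", ".*")

# One column of TRUE/FALSE per pattern, one row per value of x.
matches <- vapply(patterns,
                  function(p) grepl(p, x, perl = TRUE),
                  logical(length(x)))

# Two patterns: each value must match both of them.
match_all <- unname(rowSums(matches) == length(patterns))   # TRUE TRUE FALSE

# Three or more patterns: each value must match at least one of them.
match_any <- unname(rowSums(matches) > 0)                    # TRUE TRUE TRUE
```

Note that the backslash has to be doubled inside an R string, so the regular expression \d is written "\\d".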
To indicate that supp should be a factor use either of the following
check_data2(ToothGrowth, list(supp = factor()))
check_data2(ToothGrowth, list(supp = factor("blahblah")))
To specify that supp should be a factor that includes the factor levels OJ and VC (in any order), just pass two non-missing values
check_data2(ToothGrowth, list(supp = factor(c("VC", "OJ"))))
And to specify the actual factor levels that supp must have, pass three or more non-missing values
check_data2(ToothGrowth, list(supp = factor(c("VC", "OJ", "OJ"))))
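It may help to look at the levels that these factor specifications actually carry. The base R sketch below inspects them and compares them with ToothGrowth$supp. The comparisons are an interpretation of the two rules above (levels included versus levels matched exactly), not a description of the package's code.

```r
# factor() sorts its levels alphabetically by default, so the order in which
# the values are supplied does not change the levels.
spec_two   <- factor(c("VC", "OJ"))
spec_three <- factor(c("VC", "OJ", "OJ"))

levels(spec_two)            # "OJ" "VC"
levels(spec_three)          # "OJ" "VC"
levels(ToothGrowth$supp)    # "OJ" "VC"

# "Includes the levels": every level of the specification occurs in supp.
includes_levels <- all(levels(spec_two) %in% levels(ToothGrowth$supp))          # TRUE

# "Has exactly these levels": same levels in the same order.
exact_levels <- identical(levels(spec_three), levels(ToothGrowth$supp))          # TRUE
```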