The datacheckr::check_data() function takes two arguments: the data frame to check and a named list specifying the various conditions.

Checking Columns and Classes

The names of the list elements specify the columns that need to appear in the data frame while the classes of the vectors specify the classes of the columns.

Thus, to specify that the data frame (here mtcars) must contain a column called col1 of class integer, the call would be as follows.

library(datacheckr)
check_data2(mtcars, list(col1 = integer()))
## Error: mtcars must have column 'col1'

To specify that the data frame cannot contain a column called mpg, the call is just

check_data2(mtcars, list(mpg = NULL))
## Error: values cannot include NULLs

and to specify that it can contain a column called col1 which, if present, must hold integer or numeric values, the call would be

check_data1(mtcars, list(
  col1 = integer(), 
  col1 = NULL, 
  col1 = numeric()))

If a column is not named in the list then no checks are performed on it.
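
The idea that list names stand for required columns and vector classes stand for column classes can be sketched in base R. This is only an illustration of the rule, not how datacheckr implements it; has_column_of_class() is a hypothetical helper introduced for this example.

```r
# Hypothetical helper (not part of datacheckr): TRUE if `data` has a column
# called `name` whose class is the same as the class of `template`.
has_column_of_class <- function(data, name, template) {
  name %in% names(data) && identical(class(data[[name]]), class(template))
}

# mtcars ships with base R (datasets package).
no_col1     <- has_column_of_class(mtcars, "col1", integer())  # column is absent
mpg_numeric <- has_column_of_class(mtcars, "mpg", numeric())   # present, numeric
mpg_integer <- has_column_of_class(mtcars, "mpg", integer())   # present, wrong class
```

Only the columns named in the call are inspected, which mirrors the statement that unnamed columns are not checked.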

Checking Missing Values

To specify that a column cannot include missing values pass a single non-missing value.

check_data2(mtcars, list(mpg = 3))
check_data2(mtcars, list(mpg = -1))

To specify that it can include missing values include an NA in the vector

check_data2(mtcars, list(mpg = c(NA, 9)))

and to specify that it can only include missing values use

check_data2(mtcars, list(mpg = as.numeric(NA)))
## Error: column mpg in mtcars can only include missing values
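
As a rough base R equivalent of the two extremes described above (no missing values allowed versus only missing values allowed), the following sketch may help. It illustrates the conditions being tested and is not taken from datacheckr's source.

```r
mpg <- mtcars$mpg  # mtcars ships with base R and has no missing mpg values

# "Cannot include missing values": no element may be NA.
no_missing <- !anyNA(mpg)

# "Can only include missing values": every element must be NA.
only_missing <- all(is.na(mpg))
```

Here no_missing is TRUE and only_missing is FALSE, which is consistent with the error shown for the last call above.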

Checking Ranges

To indicate that the non-missing values must fall within a range use two non-missing values (the following code tests for counts).

data1 <- data.frame(
  Count = c(0L, 3L, 3L, 0L), 
  LocationX = c(2000, NA, 2001, NA), 
  Extra = TRUE)

check_data2(data1, list(Count = c(0L, .Machine$integer.max)))

As .Machine$integer.max is difficult to remember, the max_integer() wrapper function is provided so that the above code can be written as follows.

check_data2(data1, list(Count = c(0L, max_integer())))
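
The range rule can be sketched in base R as follows. within_range() is a hypothetical helper written for this illustration; it assumes the bounds are inclusive and that missing values are ignored, as the text above suggests.

```r
# Hypothetical helper (not part of datacheckr): TRUE if every non-missing
# value of `x` lies between the smaller and larger of the two bounds.
within_range <- function(x, bounds) {
  x <- x[!is.na(x)]
  all(x >= min(bounds) & x <= max(bounds))
}

counts <- c(0L, 3L, 3L, 0L)

valid_counts   <- within_range(counts, c(0L, .Machine$integer.max))
negative_count <- within_range(c(-1L, 2L), c(0L, .Machine$integer.max))
with_missing   <- within_range(c(2000, NA, 2001, NA), c(1999, 2002))
```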

Checking Specific Values

If particular values are required then specify them as a vector of three or more non-missing values

check_data2(data1, list(Count = c(0L, 1L, 3L)))
check_data2(data1, list(Count = c(1L, 2L, 2L)))
## Error: column Count in data1 must only include the permitted values 1 and 2

The order of the values in an element is unimportant.
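
A permitted-values check of this kind behaves much like a set membership test. The base R sketch below illustrates the idea with the same Count values; it is not datacheckr's own code.

```r
counts <- c(0L, 3L, 3L, 0L)

# Every value must be one of the permitted values.
allowed_013 <- all(counts %in% c(0L, 1L, 3L))

# 0 and 3 are not among the permitted values 1 and 2.
allowed_12 <- all(counts %in% c(1L, 2L, 2L))

# Reordering the permitted values does not change the result.
allowed_310 <- all(counts %in% c(3L, 1L, 0L))
```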

Checking Numeric, Date and POSIXct Vectors

Numeric, Date and POSIXct vectors have exactly the same behaviour regarding ranges and specific values as illustrated above using integers.

Checking Logical Vectors

With logical values, two non-missing values produce the same behaviour as three or more non-missing values. For example, to test for only TRUE values use

check_data2(data1, list(Extra = c(TRUE, TRUE)))

Checking Character Vectors

To specify that col1 must be a character vector use

check_data2(x, list(col1 = "b"))

while the following requires that the values match both character elements which are treated as regular expressions

check_data2(x, list(col1 = c("^\\d", ".*")))

whereas with three or more non-missing character elements, each value in col1 must match at least one of the elements, which are again treated as regular expressions. Regular expressions are matched using grepl with perl=TRUE.
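
The difference between "match every pattern" and "match at least one pattern" can be sketched with grepl directly. The values and patterns below are made up for illustration, and the code only mimics the rule described above rather than reproducing datacheckr's implementation.

```r
values   <- c("1a", "22", "b3")     # hypothetical column contents
patterns <- c("^\\d", "^b", "z$")   # hypothetical regular expressions

# Logical matrix: one row per value, one column per pattern.
hits <- vapply(
  patterns,
  function(p) grepl(p, values, perl = TRUE),
  logical(length(values))
)

# Each value matches at least one of the patterns.
matches_any <- all(rowSums(hits) > 0)

# Each value matches every pattern.
matches_all <- all(rowSums(hits) == length(patterns))
```

With these inputs every value matches exactly one pattern, so matches_any is TRUE while matches_all is FALSE.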

Checking Factors

To indicate that supp should be a factor use either of the following

check_data2(ToothGrowth, list(supp = factor()))
check_data2(ToothGrowth, list(supp = factor("blahblah")))

To specify that supp should be a factor that includes the factor levels OJ and VC (in any order) just pass two non-missing values

check_data2(ToothGrowth, list(supp = factor(c("VC", "OJ"))))

And to specify the actual factor levels that supp must have pass three or more non-missing values

check_data2(ToothGrowth, list(supp = factor(c("VC", "OJ", "OJ"))))
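
The two factor conditions can be approximated in base R as shown below. The sketch assumes that "the actual factor levels" means the levels must be identical, including their order; that reading is an assumption and is not confirmed by the text above.

```r
# ToothGrowth ships with base R; supp is a factor with levels OJ and VC.
supp_levels <- levels(ToothGrowth$supp)

# "Includes the levels OJ and VC (in any order)".
includes_levels <- all(c("VC", "OJ") %in% supp_levels)

# "Has exactly these levels" (assumed meaning): compare against the levels
# of the template factor, which factor() sorts alphabetically.
template     <- factor(c("VC", "OJ", "OJ"))
exact_levels <- identical(supp_levels, levels(template))
```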