The skim() function summarizes data types contained within data frames. It comes with a set of default summary functions for a wide variety of data types, but this set is not comprehensive. Package authors can add support for skimming their specific data types in their packages, and they can provide different defaults in their own summary functions.
This example will illustrate this by creating support for the sf object produced by the “sf: Simple Features for R” package. For any object this involves two required elements and one optional element. Among these is creating methods for get_skimmers for the different objects within the package.
If you are adding skim support to a package, you will also need to add skimr to the list of imports. The code in this vignette was not evaluated when the vignette was rendered, because that would force installation of the sf package just for this example. To run it on your own, install sf and then run the following code.
library(skimr)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
## Reading layer `nc' from data source `/Users/elinwaring/Library/R/3.6/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
class(nc)
## [1] "sf" "data.frame"
Unlike the example of having a new type of data in a column of a simple data frame in the “Using skimr” vignette, this is a different type of object with special attributes.
In this object there is also a column of a class that does not have default skimmers. By default, skimr falls back to using the sfl for character variables.
skim(nc$geometry)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Name | nc$geometry |
Number of rows | 100 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
geometry | 0 | 1 | 232 | 1965 | 0 | 100 | 0 |
skimr has an opinionated list of functions for each class (e.g. numeric, factor) of data. The core package supports many commonly used classes, but there are many others. You can investigate these defaults by calling get_default_skimmer_names().
What if your data type isn’t covered by defaults? skimr usually falls back to treating the type as a character, which isn’t necessarily helpful. In this case, you’re best off adding your data type with skim_with().
Before we begin, we’ll be using the following custom summary statistic throughout. It’s a naive example, but covers the requirements of what we need.
funny_sf <- function(x) {
  length(x) + 1
}
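This function, like all summary functions used by skimr, has two notable features: it accepts a vector as its single argument, and it returns a scalar, that is, a vector of length one.
There are a lot of functions that fulfill these criteria, including the summary functions defined in the skimr package.
Not fulfilling the two criteria can lead to some very confusing behavior within skimr. Beware! An example of this issue is the base quantile() function: it returns one value per requested probability, so the default skimr percentiles are produced by calling quantile() five times, once per percentile.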
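A summary function for skimr should return a single value per column. As a minimal sketch in base R, the following contrasts a bare quantile() call with a wrapper that requests one probability at a time. The name p25_scalar is hypothetical and used only for illustration; it is not a skimr function.

```r
x <- c(2, 4, 4, 5, 7, 9, 10)

# quantile() with its default probabilities returns five values,
# so on its own it is not a scalar summary.
length(quantile(x))

# Requesting a single probability and dropping the name gives a
# vector of length one.
# p25_scalar is a hypothetical helper for this sketch.
p25_scalar <- function(x) {
  unname(quantile(x, probs = 0.25, na.rm = TRUE))
}

length(p25_scalar(x))
```

A function written this way can be passed to sfl() like any other scalar summary.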
Next, we create a custom skimming function. To do this, we need to think about the many specific classes of data in the sf package. The following example will build support for sfc_MULTIPOLYGON, but note that we’ll have to eventually think about sfc_LINESTRING, sfc_POLYGON, sfc_MULTIPOINT and others if we want to fully support sf.
skim_sf <- skim_with(
  sfc_MULTIPOLYGON = sfl(
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.)),
    funny = funny_sf
  )
)
## Creating new skimming functions for the following classes: sfc_MULTIPOLYGON.
## They did not have recognized defaults. Call get_default_skimmers() for more information.
The example above creates a new function, and you can call that function on a specific column with sfc_MULTIPOLYGON data to get the appropriate summary statistics.
skim_sf(nc$geometry)
Name | nc$geometry |
Number of rows | 100 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
sfc_MULTIPOLYGON | 1 |
________________________ | |
Group variables | None |
Variable type: sfc_MULTIPOLYGON
skim_variable | n_missing | complete_rate | n_unique | valid | funny |
---|---|---|---|---|---|
geometry | 0 | 1 | 100 | 100 | 101 |
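If sf is not installed, the same skim_with() pattern can be tried on a class that ships with base R. The sketch below is an illustration only: it assumes skimr is installed, applies the pattern to numeric columns, and uses a hypothetical function n_plus_one that mirrors funny_sf. It passes append = FALSE so that the custom statistics replace the numeric defaults rather than being added to them.

```r
library(skimr)

# Hypothetical summary function, analogous to funny_sf above.
n_plus_one <- function(x) {
  length(x) + 1
}

# append = FALSE replaces the default numeric skimmers with this list.
skim_numeric <- skim_with(
  numeric = sfl(n_plus_one = n_plus_one, n_unique = n_unique),
  append = FALSE
)

# A small toy data frame, made up for this sketch.
toy <- data.frame(x = c(1.5, 2.5, 2.5))
result <- skim_numeric(toy)

# Each statistic is stored in a column named <skim_type>.<statistic>.
result$numeric.n_plus_one
result$numeric.n_unique
```

Experimenting this way is a quick check that a summary function behaves as expected before wiring it to a new class.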
Creating a function that is a method of the skim_by_type generic for the data type allows skimming of an entire data frame that contains some columns of that type.
skim_by_type.sfc_MULTIPOLYGON <- function(mangled, columns, data) {
  skimmed <- dplyr::summarize_at(data, columns, mangled$funs)
  build_results(skimmed, columns, NULL)
}
skim_sf(nc)
Name | nc |
Number of rows | 100 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 12 |
sfc_MULTIPOLYGON | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
NAME | 0 | 1 | FALSE | 100 | Ala: 1, Ale: 1, All: 1, Ans: 1 |
FIPS | 0 | 1 | FALSE | 100 | 370: 1, 370: 1, 370: 1, 370: 1 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
AREA | 0 | 1 | 0.13 | 0.05 | 0.04 | 0.09 | 0.12 | 0.15 | 0.24 | ▆▇▆▃▂ |
PERIMETER | 0 | 1 | 1.67 | 0.48 | 1.00 | 1.32 | 1.61 | 1.86 | 3.64 | ▇▇▂▁▁ |
CNTY_ | 0 | 1 | 1985.96 | 106.52 | 1825.00 | 1902.25 | 1982.00 | 2067.25 | 2241.00 | ▇▆▆▅▁ |
CNTY_ID | 0 | 1 | 1985.96 | 106.52 | 1825.00 | 1902.25 | 1982.00 | 2067.25 | 2241.00 | ▇▆▆▅▁ |
FIPSNO | 0 | 1 | 37100.00 | 58.02 | 37001.00 | 37050.50 | 37100.00 | 37149.50 | 37199.00 | ▇▇▇▇▇ |
CRESS_ID | 0 | 1 | 50.50 | 29.01 | 1.00 | 25.75 | 50.50 | 75.25 | 100.00 | ▇▇▇▇▇ |
BIR74 | 0 | 1 | 3299.62 | 3848.17 | 248.00 | 1077.00 | 2180.50 | 3936.00 | 21588.00 | ▇▁▁▁▁ |
SID74 | 0 | 1 | 6.67 | 7.78 | 0.00 | 2.00 | 4.00 | 8.25 | 44.00 | ▇▂▁▁▁ |
NWBIR74 | 0 | 1 | 1050.81 | 1432.91 | 1.00 | 190.00 | 697.50 | 1168.50 | 8027.00 | ▇▁▁▁▁ |
BIR79 | 0 | 1 | 4223.92 | 5179.46 | 319.00 | 1336.25 | 2636.00 | 4889.00 | 30757.00 | ▇▁▁▁▁ |
SID79 | 0 | 1 | 8.36 | 9.43 | 0.00 | 2.00 | 5.00 | 10.25 | 57.00 | ▇▂▁▁▁ |
NWBIR79 | 0 | 1 | 1352.81 | 1976.00 | 3.00 | 250.50 | 874.50 | 1406.75 | 11631.00 | ▇▁▁▁▁ |
Variable type: sfc_MULTIPOLYGON
skim_variable | n_missing | complete_rate | n_unique | valid | funny |
---|---|---|---|---|---|
geometry | 0 | 1 | 100 | 100 | 101 |
Sharing these functions within a separate package requires an export. The simplest way to do this is with Roxygen.
#' Skimming functions for `sfc_MULTIPOLYGON` objects.
#' @export
skim_sf <- skim_with(
  sfc_MULTIPOLYGON = sfl(
    missing = n_missing,
    n = length,
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.)),
    funny = funny_sf
  )
)
## Creating new skimming functions for the following classes: sfc_MULTIPOLYGON.
## They did not have recognized defaults. Call get_default_skimmers() for more information.
#' A skim_by_type function for `sfc_MULTIPOLYGON` objects.
#' @export
skim_by_type.sfc_MULTIPOLYGON <- function(mangled, columns, data) {
  skimmed <- dplyr::summarize_at(data, columns, mangled$funs)
  skimr::build_results(skimmed, columns, NULL)
}
While this works within any package, there is an even better approach in this case. To take full advantage of skimr, we’ll dig a bit into its API.
skimr has a lookup mechanism, based on the function get_skimmers(), to find default summary functions for each class. This is based on the S3 class system. You can learn more about it in Advanced R.
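For readers new to S3, the dispatch idea behind this lookup can be sketched in base R alone. The generic describe_column() below is hypothetical and is not part of skimr; the objects are plain lists that only carry the same class vectors as sf geometry columns, so no sf code is involved.

```r
# Hypothetical generic for illustration; not a skimr function.
describe_column <- function(column) UseMethod("describe_column")

# Fallback used when no method matches any class of the object.
describe_column.default <- function(column) "default summary"

# Method for one specific class.
describe_column.sfc_MULTIPOLYGON <- function(column) "multipolygon summary"

# Stand-in objects with the class vectors of sf geometry columns.
fake_polygons <- structure(list(), class = c("sfc_MULTIPOLYGON", "sfc"))
fake_points <- structure(list(), class = c("sfc_POINT", "sfc"))

# Dispatches on the first class that has a method.
describe_column(fake_polygons)

# No method exists for sfc_POINT or sfc, so the default is used.
describe_column(fake_points)
```

Adding support for another class then amounts to defining one more method, without touching the generic.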
To export a new set of defaults for a data type, create a method for the generic function get_skimmers. Each of those methods returns an sfl, a skimr function list. This is the same list-like data structure used in the skim_with() example above. But note! There is one key difference. When adding a method for the generic, we also want to identify the skim_type in the sfl.
#' @importFrom skimr get_skimmers sfl n_unique
#' @export
get_skimmers.sfc_MULTIPOLYGON <- function(column) {
  sfl(
    skim_type = "sfc_MULTIPOLYGON",
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.)),
    funny = funny_sf
  )
}
The same strategy follows for other data types: create a method, return an sfl, and make sure that the skim_type is there.
#' @export
get_skimmers.sfc_POINT <- function(column) {
  sfl(
    skim_type = "sfc_POINT",
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.))
  )
}
Users of your package should load skimr to get the skim() function. Once loaded, a call to get_default_skimmer_names() will return defaults for your data types as well!
get_default_skimmer_names()
## $AsIs
## [1] "n_unique" "min_length" "max_length"
##
## $Date
## [1] "min" "max" "median" "n_unique"
##
## $POSIXct
## [1] "min" "max" "median" "n_unique"
##
## $character
## [1] "min" "max" "empty" "n_unique" "whitespace"
##
## $complex
## [1] "mean"
##
## $difftime
## [1] "min" "max" "median" "n_unique"
##
## $factor
## [1] "ordered" "n_unique" "top_counts"
##
## $list
## [1] "n_unique" "min_length" "max_length"
##
## $logical
## [1] "mean" "count"
##
## $numeric
## [1] "mean" "sd" "p0" "p25" "p50" "p75" "p100" "hist"
##
## $sfc_MULTIPOLYGON
## [1] "n_unique" "valid" "funny"
##
## $sfc_POINT
## [1] "n_unique" "valid"
##
## $ts
## [1] "start" "end" "frequency" "deltat" "mean"
## [6] "sd" "min" "max" "median" "line_graph"
This is a very simple example. For a package such as sf, the custom statistics will likely be much more complex. The flexibility of skimr allows you to manage that.
Thanks to Jakub Nowosad, Tiernan Martin, Edzer Pebesma and Michael Sumner for inspiring and helping with the development of this code.