The vapour package provides access to the basic read functions available in GDAL for both raster and vector data sources. The functions are deliberately lower-level than GDAL’s high-level data models and provide access to the component entities independently.
For vector data: all vector/feature read tasks can optionally apply a limit to the maximum number of features read or queried, and all allow execution of OGRSQL on a layer prior to data extraction. In combination with a SQL query, a bounding-box spatial filter can be applied via the extent argument (a sketch follows below).

For raster data: reads can target a window of the source at a chosen output size, taking advantage of GDAL’s dynamic resampling (discussed further below).

The workflows available are intended to support development of applications in R for these vector and raster data without being constrained to any particular data model.
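A minimal sketch of the vector read options described above; the limit_n argument name and the extent ordering (xmin, xmax, ymin, ymax) are assumptions here, so check the package documentation for the exact signatures:

library (vapour)
f <- "my_layers.shp"                           # any OGR-readable source (hypothetical)
g10 <- vapour_read_geometry (f, limit_n = 10)  # cap the number of features read (assumed name)
stmt <- paste0 ("SELECT * FROM ", vapour_layer_names (f)[1])
gbb <- vapour_read_geometry (f, sql = stmt,    # OGRSQL plus a bounding-box spatial filter
                             extent = c (-120, -100, 20, 40))  # order assumed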
The package can be installed from CRAN, and the development version can be installed from GitHub. You will need development tools for building R packages. On Linux and MacOS, building also requires an available GDAL installation; on Windows, the rOpenSci rwinlib tools are used and the required GDAL is downloaded automatically when building the package. This installation is self-contained, only affects the use of R, and can be used alongside other applications that use GDAL.
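For example (assuming the development version lives at hypertidy/vapour, per the discussion link below):

install.packages ("vapour")                    # released version from CRAN
# install.packages ("remotes")                 # if not already installed
remotes::install_github ("hypertidy/vapour")   # development version from GitHub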
The goal of vapour is to provide a basic GDAL API package for R. The key functions provide vector geometry or attributes, raster data, and raster metadata. The priority is to give low-level access to key functionality rather than comprehensive coverage of the library. The real advantage of vapour is the flexibility of a modular workflow, not outright efficiency.

A parallel goal is to be freed from the powerful but sometimes limiting high-level data models of GDAL itself: simple features and affine-based regular rasters composed of 2D slices. (GDAL may relax these limitations over time, but there will always be value in having modularity in an ecosystem of tools.)

GDAL’s dynamic resampling of arbitrary raster windows is also very useful for interactive tools on local data, and seems under-utilized in favour of less accessible online image services.
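A hedged sketch of that capability: this assumes vapour_raster_info() reports the source dimensions in a dimXY field, and that vapour_read_raster() accepts a six-value window of pixel offset, source window size, and output size - consult the package documentation for the exact forms.

library (vapour)
f <- "my_big_raster.tif"                 # any GDAL-readable raster (hypothetical)
info <- vapour_raster_info (f)           # 'dimXY' assumed to hold ncol, nrow
## read the full extent, resampled on the fly to 256 x 256 output pixels
v <- vapour_read_raster (f, band = 1,
                         window = c (0, 0, info$dimXY[1], info$dimXY[2], 256, 256))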
This partly draws on work done in the sf package and in the packages rgdal and rgdal2. I’m amazed that something as powerful and general as GDAL is still only available through these lenses, but recent improvements make things much easier to use and share. Specifically, Rcpp means that access to external libraries is simplified, making it easier to learn, to get started, and to make progress. The other part is that cross-platform support is now much better, with more consistency in the libraries available on the CRAN machines and in other contexts.
You might burn limbs off.
It’s possible to give problematic “SELECT” statements via the sql argument. Note that the geometry readers vapour_read_geometry, vapour_read_geometry_text, and vapour_read_extent will strip out the SELECT ... FROM clause and replace it with SELECT * FROM to ensure that the geometry is accessible, though the attributes are ignored. This means we can allow the user (or dplyr) to create any SELECT statement. The function vapour_read_geometry will return a list of NULLs in this case.

I haven’t tried this against a real database, and I’m not sure if we need AsBinary() around EWKB geoms, for example - but at any rate these can be ingested by sf.
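For example, a sketch against the shapefile used in the benchmark below (the WHERE clause is illustrative only):

library (vapour)
f <- "usa_parks/ne_10m_parks_and_protected_lands_area.shp"
layer <- vapour_layer_names (f)[1]
stmt <- paste0 ("SELECT FID FROM ", layer, " WHERE FID < 5")
a <- vapour_read_attributes (f, sql = stmt)  # only the selected field(s)
g <- vapour_read_geometry (f, sql = stmt)    # geometries still returned: the SELECT
                                             # clause is internally replaced by SELECT *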
The package documentation page gives an overview of the available functions. See the vignettes and documentation for examples (a work in progress).
The following concrete example illustrates the motivation for vapour, through a timing benchmark for one standard operation: extracting feature geometries from a data set within a user-defined bounding box. The data set is the one used throughout the book Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow, and can be obtained with the following code.
url <- file.path ("http://www.naturalearthdata.com/http//www.naturalearthdata.com",
                  "download/10m/cultural/ne_10m_parks_and_protected_lands.zip")
download.file (url = url, destfile = "USA_parks.zip")
unzip (zipfile = "USA_parks.zip", exdir = "usa_parks")
fname <- "usa_parks/ne_10m_parks_and_protected_lands_area.shp"
That last fname is the file we’re interested in, which contains polygons for all United States parks. We now construct a timing benchmark for three ways of extracting the data within the following pre-defined bounding box:
library (magrittr)
bb <- c (-120, 20, -100, 40)
names (bb) <- c ("xmin", "ymin", "xmax", "ymax")
bb_sf <- sf::st_bbox (bb, crs = sf::st_crs (4326)) %>%
    sf::st_as_sfc ()
First, we define functions to do the desired extraction using the sf package, comparing both st_crop and sf’s [ sub-selection operator:
f_sf1 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)  # full read: geometries + attributes
    suppressMessages (suppressWarnings (
        parks2 <- sf::st_crop (usa_parks, bb_sf)    # crop to the bounding box
    ))
}
f_sf2 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)
    suppressMessages (suppressWarnings (
        parks2 <- usa_parks [bb_sf, ]               # spatial sub-selection with [
    ))
}
Then three approaches using vapour, in each case extracting equivalent data to sf - that is, both geometries and attributes - yet simply leaving them separate here. The three approaches are:

1. WKB, sub-select after reading, and convert to sfc lists;
2. WKB pre-selected via an SQL statement, and convert to sfc lists; and
3. json (text), pre-selected via SQL, and convert with the geojsonsf package.

f_va1 <- function (fname) # read all then sub-select
{
    ext <- do.call (rbind, vapour_read_extent (fname))        # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &  # features whose extent
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])   # lies within bb
    g <- vapour_read_geometry (fname) [indx] %>%              # read all, then sub-select
        sf::st_as_sfc ()
    a <- lapply (vapour_read_attributes (fname), function (i) i [indx])
}
f_va2 <- function (fname) # read selection only via SQL
{
    ext <- do.call (rbind, vapour_read_extent (fname))
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry (fname, sql = stmt) %>%  # only matching features are read
        sf::st_as_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}
f_va3 <- function (fname) # convert json text via geojsonsf
{
    ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry_text (fname, textformat = "json", sql = stmt)
    g <- lapply (g, function (i) geojsonsf::geojson_sfc (i) [[1]]) %>%  # json -> sfg
        sf::st_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}
The benchmark timings - in particular the “relative” values - then illustrate the advantages of vapour:
rbenchmark::benchmark (
    f_sf1 (fname),
    f_sf2 (fname),
    f_va1 (fname),
    f_va2 (fname),
    f_va3 (fname),
    replications = 10)
#>           test replications elapsed relative user.self sys.self user.child sys.child
#> 1 f_sf1(fname)           10   0.249    3.458     0.249    0.001          0         0
#> 2 f_sf2(fname)           10   0.180    2.500     0.176    0.004          0         0
#> 3 f_va1(fname)           10   0.072    1.000     0.064    0.008          0         0
#> 4 f_va2(fname)           10   0.085    1.181     0.086    0.000          0         0
#> 5 f_va3(fname)           10   0.215    2.986     0.203    0.012          0         0
Reading geometries only, as opposed to sf’s reading of all geometries and attributes, affords a speed increase of about 25%, while utilizing OGRSQL to pre-select features offers an increase of around 75%.
Examples of packages that use vapour are in development: RGDALSQL and lazyraster. RGDALSQL aims to leverage the facilities of GDAL to provide data on-demand for many sources as if they were databases. lazyraster uses the level-of-detail facility of GDAL to read just enough resolution from a raster source using traditional window techniques.
Limitations, work-in-progress and other discussion are active here: https://github.com/hypertidy/vapour/issues/4
We’ve kept a record of a minimal GDAL wrapper package here: https://github.com/diminutive/gdalmin

Before those, I had worked on getting sp and dplyr to at least work together (https://github.com/dis-organization/sp_dplyrexpt), and recently rgdal was updated to allow tibbles to be used - something that spbabel and spdplyr really needed to avoid friction.

Early exploration of allowing non-geometry reads with rgdal was tried here: https://github.com/r-gris/gladr
Thanks to Edzer Pebesma and Roger Bivand and Tim Keitt for prior art that I crib and copy from. Jeroen Ooms helped the R community hugely by providing an accessible build process for libraries on Windows. Mark Padgham helped kick me over a huge obstacle in using C++ libraries with R. Simon Wotherspoon and Ben Raymond have endured ravings about wanting this level of control for many years.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.