Get started with vroom

The vroom package contains one main function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.

The most common type of delimited file is CSV (Comma Separated Values); these files typically have a .csv suffix.

library(vroom)

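This vignette covers the following topics: reading files (including multiple, compressed, and remote files), column selection, reading fixed width files, column types, and writing delimited files.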

Reading files

To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter will be guessed automatically if it is a common one. If the guessing fails, or you are using a less common delimiter, specify it with the delim parameter (e.g. delim = ",").

We have included an example CSV file in the vroom package for use in examples and tests. Access it with vroom_example("mtcars.csv").

# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] "/private/var/folders/dt/r5s12t392tb5sk181j3gs4zw0000gn/T/RtmpMIeGNk/Rinstcdb1feece36/vroom/extdata/mtcars.csv"

# Read the file. By default vroom guesses the delimiter automatically.
vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
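When the delimiter is not one of the common ones, the delim parameter can be given explicitly. The following is a minimal sketch; the "|" delimiter, the temporary file, and the variable names are illustrative choices rather than part of the package examples.

```r
library(vroom)

# Write a small pipe-delimited file to a temporary location
# (the "|" delimiter and the temporary path are illustrative assumptions)
pipe_file <- tempfile(fileext = ".txt")
vroom_write(head(mtcars), pipe_file, delim = "|")

# Read it back, stating the delimiter explicitly instead of relying on guessing
pipe_data <- vroom(pipe_file, delim = "|")
pipe_data
```

Stating the delimiter skips the guessing step, so a file whose first lines happen to contain several candidate characters is still split the way you intend.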

Reading multiple files

If you are reading a set of files which all have the same columns, you can pass the filenames directly to vroom() and it will combine them into one result.

First we will create some files to read by splitting the mtcars dataset by number of cylinders. (It is OK if you don’t currently understand this code.)

mt <- tibble::rownames_to_column(mtcars, "model")
purrr::iwalk(
  split(mt, mt$cyl),
  ~ vroom_write(.x, glue::glue("mtcars_{.y}.csv"), "\t")
)

We can then efficiently read them into one table by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "mtcars*csv")
files
#> mtcars_4.csv mtcars_6.csv mtcars_8.csv
vroom(files)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2 Merc 2…  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3 Merc 2…  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> # … with 29 more rows

Often the filename or the directory where the files are stored contains information. In this case the id parameter can be used to add an extra column to the result (here named path) containing the full path to each file.

vroom(files, id = "path")
#> Observations: 32
#> Variables: 13
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 13
#>   path    model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 mtcars… Dats…  22.8     4  108     93  3.85  2.32  18.6     1     1     4
#> 2 mtcars… Merc…  24.4     4  147.    62  3.69  3.19  20       1     0     4
#> 3 mtcars… Merc…  22.8     4  141.    95  3.92  3.15  22.9     1     0     4
#> # … with 29 more rows, and 1 more variable: carb <dbl>
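Once the path column is present, information encoded in the filename can be recovered with ordinary string handling. The sketch below is self-contained and assumes a hypothetical naming scheme (mtcars_<cyl>.tsv in a temporary directory); the column name cyl_from_name is likewise only illustrative.

```r
library(vroom)

# Write one file per cylinder count into a temporary directory
# (directory and filename pattern are assumptions made for this sketch)
out_dir <- tempfile("vroom-id-")
dir.create(out_dir)
for (cyl in unique(mtcars$cyl)) {
  vroom_write(
    mtcars[mtcars$cyl == cyl, ],
    file.path(out_dir, paste0("mtcars_", cyl, ".tsv"))
  )
}

# Read all files into one table, recording the source file of each row
files <- list.files(out_dir, full.names = TRUE)
combined <- vroom(files, id = "path")

# Recover the cylinder count encoded in each filename
combined$cyl_from_name <- as.integer(
  sub("^mtcars_([0-9]+)\\.tsv$", "\\1", basename(combined$path))
)
```

Here the recovered value duplicates the existing cyl column, which makes the result easy to check; in practice the filename would usually carry information that is not stored inside the files.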

Reading compressed files

vroom supports reading zip, gz, bz2 and xz compressed files automatically. Just pass the filename of the compressed file to vroom().

file <- vroom_example("mtcars.csv.gz")

vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

Reading remote files

vroom can also read files from the internet. Pass the URL of the file to vroom().

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
vroom(file)

It can even read gzipped files from the internet (although currently not the other compressed formats).

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv.gz"
vroom(file)

Column selection

vroom provides the same interface for column selection and renaming as dplyr::select(). This provides very flexible and readable selections.

file <- vroom_example("mtcars.csv.gz")

vroom(file, col_select = c(model, cyl, gear))
#> Observations: 32
#> Variables: 3
#> chr [1]: model
#> dbl [2]: cyl, gear
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_select = c(1, 3, 11))
#> Observations: 32
#> Variables: 3
#> chr [1]: model
#> dbl [2]: cyl, gear
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_select = starts_with("d"))
#> Observations: 32
#> Variables: 2
#> dbl [2]: disp, drat
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 2
#>    disp  drat
#>   <dbl> <dbl>
#> 1   160  3.9 
#> 2   160  3.9 
#> 3   108  3.85
#> # … with 29 more rows
vroom(file, col_select = list(car = model, everything()))
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   car       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

Reading fixed width files

A fixed width file can be a very compact representation of numeric data. Unfortunately, it’s also painful because you need to describe the length of every field. vroom aims to make it as easy as possible by providing a number of different ways to describe the field structure. Use vroom_fwf() in conjunction with one of the helper functions fwf_empty(), fwf_widths(), fwf_positions() or fwf_cols() to read the file.

fwf_sample <- vroom_example("fwf-sample.txt")
cat(readLines(fwf_sample))
#> John Smith          WA        418-Y11-4111 Mary Hartford       CA        319-Z19-4341 Evan Nolan          IL        219-532-c301
vroom_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#> Observations: 3
#> Variables: 4
#> chr [4]: first, last, state, ssn
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 4
#>   first last     state ssn         
#>   <chr> <chr>    <chr> <chr>       
#> 1 John  Smith    WA    418-Y11-4111
#> 2 Mary  Hartford CA    319-Z19-4341
#> 3 Evan  Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#> Observations: 3
#> Variables: 3
#> chr [3]: name, state, ssn
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 3
#>   name          state ssn         
#>   <chr>         <chr> <chr>       
#> 1 John Smith    WA    418-Y11-4111
#> 2 Mary Hartford CA    319-Z19-4341
#> 3 Evan Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))
#> Observations: 3
#> Variables: 2
#> chr [2]: name, ssn
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 2
#>   name          ssn         
#>   <chr>         <chr>       
#> 1 John Smith    418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan    219-532-c301
vroom_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
#> Observations: 3
#> Variables: 3
#> chr [3]: name, state, ssn
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 3
#>   name          state ssn         
#>   <chr>         <chr> <chr>       
#> 1 John Smith    WA    418-Y11-4111
#> 2 Mary Hartford CA    319-Z19-4341
#> 3 Evan Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_cols(name = c(1, 20), ssn = c(30, 42)))
#> Observations: 3
#> Variables: 2
#> chr [2]: name, ssn
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 2
#>   name          ssn         
#>   <chr>         <chr>       
#> 1 John Smith    418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan    219-532-c301

Column types

vroom guesses the data types of columns as they are read. However, sometimes it is necessary to change the type of one or more columns.

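The available specifications are (with single-letter abbreviations in quotes): col_logical() 'l', col_integer() 'i', col_double() 'd', col_character() 'c', col_factor() 'f', col_date() 'D', col_time() 't', col_datetime() 'T', col_number() 'n', col_skip() '_' or '-', and col_guess() '?'.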

You can tell vroom what column types to use with the col_types argument in a number of ways.

If you only need to override a single column, the most concise way is to use a named vector.

# read the 'hp' column as an integer
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i"))
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also skip reading the 'cyl' column
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also read the gears as a factor
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_", gear = "f"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

However you can also use the col_*() functions in a list.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

Writing delimited files

Use vroom_write() to write delimited files. The default delimiter is tab.

vroom_write(mtcars, "mtcars.tsv")

Writing CSV delimited files

Use delim = ',' to write CSV files.
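A minimal sketch of writing a CSV file follows. The output is written to a temporary directory here to avoid cluttering the working directory; the path and variable name are illustrative.

```r
library(vroom)

# Write mtcars as a comma-delimited file
# (the temporary output path is an assumption made for this sketch)
csv_file <- file.path(tempdir(), "mtcars.csv")
vroom_write(mtcars, csv_file, delim = ",")

# Inspect the first few lines of the written file
readLines(csv_file, n = 3)
```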

Writing compressed files

Output files are automatically compressed with gzip, bzip2 or xz if the filename ends in .gz, .bz2 or .xz.
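For example, the sketch below writes a gzip-compressed file simply by choosing a filename ending in .gz, then reads it back to confirm the round trip. The temporary path and variable names are illustrative.

```r
library(vroom)

# The .gz suffix is what triggers gzip compression on write
# (the temporary output path is an assumption made for this sketch)
gz_file <- file.path(tempdir(), "mtcars.tsv.gz")
vroom_write(mtcars, gz_file)

# Compressed files can be read back directly
round_trip <- vroom(gz_file)
```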

It is also possible to use other compressors, such as pigz (a parallel gzip implementation), lbzip2 (a parallel bzip2 implementation) or pixz (a parallel xz implementation), by using pipe() to create a pipe connection. The parallel versions can be considerably faster for large output files.
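A sketch of writing through a pipe connection is shown below. It assumes the pigz executable is installed and on the PATH, so the call is guarded and skipped otherwise; the output path is illustrative.

```r
library(vroom)

# Output path is an assumption made for this sketch
out_file <- file.path(tempdir(), "mtcars-pigz.tsv.gz")

if (nzchar(Sys.which("pigz"))) {
  # Stream the delimited output into pigz, which compresses it in parallel
  vroom_write(mtcars, pipe(paste("pigz >", shQuote(out_file))))
} else {
  message("pigz not found; skipping the parallel compression example")
}
```

The same pattern should apply to lbzip2 and pixz by substituting the command name and file suffix.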

Further reading

vignette("benchmarks") discusses the performance of vroom, how it compares to alternatives and how it achieves its results.