Use tidytransit to:
This package requires a working installation of sf.
# Once sf is installed, you can install from CRAN with:
install.packages('tidytransit')
# For the development version from GitHub:
# install.packages("devtools")
devtools::install_github("r-transit/tidytransit")
For some users, sf is impractical to install due to system-level dependencies. For these users, trread may work better. It is tidytransit without the geospatial (GDAL) tools.
GTFS feeds contain many linked tables that describe published transit schedules, including trips, stops, and routes. Below is a diagram of these tables and their relationships.
Source: Wikimedia, user -stk.
Since GTFS is a data standard, it has many uses that are not covered here. The summary page for the GTFS standard is a good resource.
GTFS works well with R given that the data structure is tabular.
GTFS data come packaged as a zip file of tables in text form. The main thing tidytransit does is consolidate the reading of all those tables into a single R object, which contains a list of the tables in each feed.
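To make the idea of "a zip file of tables read into one list" concrete, here is a rough base-R sketch. It is not how tidytransit implements read_gtfs; the helper names read_tables_from_dir and read_feed_zip are hypothetical, and the demo feed is a tiny hand-made directory rather than a real GTFS download.

```r
# Read every .txt table in a directory into a named list of data frames.
# All columns are read as character so that IDs such as "101N" stay intact.
read_tables_from_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  tables <- lapply(files, read.csv, colClasses = "character")
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# Unpack a GTFS zip into a temporary directory, then read its tables.
# (Hypothetical helper; shown for completeness, not exercised below.)
read_feed_zip <- function(zip_path) {
  exdir <- tempfile("gtfs_")
  utils::unzip(zip_path, exdir = exdir)
  read_tables_from_dir(exdir)
}

# Demonstration on a tiny hand-made feed directory (illustrative data).
demo_dir <- tempfile("demo_feed_")
dir.create(demo_dir)
writeLines(c("stop_id,stop_name", "103,238 St"),
           file.path(demo_dir, "stops.txt"))
writeLines(c("route_id,route_short_name", "1,1"),
           file.path(demo_dir, "routes.txt"))

feed <- read_tables_from_dir(demo_dir)
names(feed)
```

Each element of feed is one table, addressed by the name of the file it came from.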
Below we use the tidytransit read_gtfs function to read a feed from the NYC MTA into R.
The example below uses a feed included in the package, but you can also read directly from the New York City Metropolitan Transportation Authority (MTA), as shown in the commented code below.
You can also read from any other URL. This is useful because there are many sources for GTFS data, and often the best source is transit service providers themselves. See the next section on “Finding More GTFS Feeds” for more sources of feeds.
# nyc <- read_gtfs("http://web.mta.info/developers/data/nyct/subway/google_transit.zip")
local_gtfs_path <- system.file("extdata",
                               "google_transit_nyc_subway.zip",
                               package = "tidytransit")
nyc <- read_gtfs(local_gtfs_path,
                 local = TRUE)
Each of the source tables for the GTFS feed is now available in the nyc gtfs object.
For example, stops:
## # A tibble: 6 x 10
## stop_id stop_code stop_name stop_desc stop_lat stop_lon zone_id stop_url
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 101 <NA> Van Cort… <NA> 40.9 -73.9 <NA> <NA>
## 2 101N <NA> Van Cort… <NA> 40.9 -73.9 <NA> <NA>
## 3 101S <NA> Van Cort… <NA> 40.9 -73.9 <NA> <NA>
## 4 103 <NA> 238 St <NA> 40.9 -73.9 <NA> <NA>
## 5 103N <NA> 238 St <NA> 40.9 -73.9 <NA> <NA>
## 6 103S <NA> 238 St <NA> 40.9 -73.9 <NA> <NA>
## # … with 2 more variables: location_type <int>, parent_station <chr>
The tables available in each feed may vary. Below we print the names of all the tables that were read in for this feed.
## [1] "trips" "stop_times" "agency" "calendar"
## [5] "calendar_dates" "stops" "routes" "shapes"
## [9] "transfers"
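The calls that produced the outputs above are not shown in this text. Since the feed object is described as a list of tables, they were presumably along the lines of head(nyc$stops) and names(nyc). The sketch below uses a plain list as a stand-in for the feed object (an assumption, with a few stop IDs and names copied from the output above) to show the access pattern, including a check for a table that may be missing.

```r
# Stand-in for a feed object: a plain named list of tables.
# The data here are illustrative, not a real feed.
feed <- list(
  stops = data.frame(
    stop_id   = c("103", "103N", "103S"),
    stop_name = c("238 St", "238 St", "238 St"),
    stringsAsFactors = FALSE
  ),
  routes = data.frame(route_id = c("1", "2"), stringsAsFactors = FALSE)
)

# Inspect one table, then list every table that is present.
head(feed$stops, 2)
names(feed)

# Tables vary between feeds, so check for a table before using it.
has_shapes <- "shapes" %in% names(feed)
has_shapes
```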
Included in the tidytransit package is a data frame with a list of feed URLs, city names, and locations.
You can browse it as a data frame:
## id t loc_id loc_pid
## 1 prazska-integrovana-doprava/1106 PID GTFS 588 587
## 2 mpk-sa-w-krakowie/1105 MPK SA w Krakowie GTFS 713 434
## 3 mpk-sa-w-krakowie/1104 MPK SA w Krakowie GTFS 713 434
## 4 city-of-kajaani/1103 Kajaani GTFS 712 530
## 5 emtu/1099 EMTU GTFS 666 388
## 6 ctm-cagliari/1098 CTM Cagliari GTFS 710 78
## loc_t loc_n loc_lat loc_lng
## 1 Prague, Czechia Prague 50.07554 14.437800
## 2 Kraków, Poland Kraków 50.06465 19.944980
## 3 Kraków, Poland Kraków 50.06465 19.944980
## 4 Kajaani, Finland Kajaani 64.22218 27.727850
## 5 São Paulo, State of São Paulo, Brazil São Paulo -23.55052 -46.633309
## 6 Cagliari, Province of Cagliari, Italy Cagliari 39.22384 9.121661
## url_i
## 1 https://pid.cz/o-systemu/opendata/
## 2 www.mpk.krakow.pl
## 3 www.mpk.krakow.pl
## 4 http://dev.hsl.fi
## 5 http://www.emtu.sp.gov.br/emtu/home/home.htm
## 6 http://dati.regione.sardegna.it/dataset/quadri-orari-ctm
## url_d
## 1 <NA>
## 2 ftp://ztp.krakow.pl/GTFS_KRK_T.zip
## 3 ftp://ztp.krakow.pl/GTFS_KRK_A.zip
## 4 <NA>
## 5 <NA>
## 6 <NA>
Note that the url_d column gives a download URL for a feed, which can be used to read the feed for a given city into R. In the rows printed above, url_d is NA for several feeds.
For example:
Included in the transitfeeds table is a set of coordinates for each feed. This means you can filter feed sources by location, or map all of them, as below:
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
feedlist_sf <- st_as_sf(feedlist,
                        coords = c("loc_lng", "loc_lat"),
                        crs = 4326)
plot(feedlist_sf, max.plot = 1)
See the package reference for the transitfeeds data frame for more information on its metadata.
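Filtering feed sources by location needs nothing more than the coordinate columns. The sketch below is a minimal base-R example under stated assumptions: the data frame holds four rows copied from the listing above (names and coordinates only), and the bounding box is a hypothetical one that roughly covers Europe.

```r
# A few rows copied from the feed listing (names and coordinates only).
feeds <- data.frame(
  loc_n   = c("Prague", "Kajaani", "São Paulo", "Cagliari"),
  loc_lat = c(50.07554, 64.22218, -23.55052, 39.22384),
  loc_lng = c(14.437800, 27.727850, -46.633309, 9.121661),
  stringsAsFactors = FALSE
)

# Hypothetical bounding box roughly covering Europe (illustrative values).
in_box <- feeds$loc_lat >= 35 & feeds$loc_lat <= 72 &
  feeds$loc_lng >= -10 & feeds$loc_lng <= 40

# Keep only the feeds whose coordinates fall inside the box.
european_feeds <- feeds[in_box, ]
european_feeds$loc_n
```

The same logical-index pattern applies to the full table, since it carries the same loc_lat and loc_lng columns.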
When you set geometry=TRUE and frequency=TRUE, tidytransit attempts to convert the GTFS feed into simple features data frames and frequency/headway data frames when the data are imported. These data frames are added to the “gtfs” object under the “.” sub-list.
# Read in a GTFS feed.
# Here we use a feed included in the package, but you can also read directly
# from the New York City Metropolitan Transportation Authority using the following URL:
# nyc <- read_gtfs("http://web.mta.info/developers/data/nyct/subway/google_transit.zip")
local_gtfs_path <- system.file("extdata",
                               "google_transit_nyc_subway.zip",
                               package = "tidytransit")
nyc <- read_gtfs(local_gtfs_path,
                 local = TRUE,
                 geometry = TRUE,
                 frequency = TRUE)
## Calculating route and stop headways.
Note that these are estimated headways and route geometries, and the quality of their estimation depends on many factors, including the GTFS feed structure. In some cases, these functions may fail to estimate frequencies or spatial features at all, or with an acceptable level of accuracy. We have an open issue for benchmarking the quality of these estimates.
Below we list the table names added.
## [1] "stops_sf" "routes_sf" "stops_frequency"
## [4] "routes_frequency"
View the headways along routes as a data frame. routes_frequency is added to the list of gtfs data frames read in by read_gtfs when frequency=TRUE. By default, frequency is calculated for service that happens every weekday from 6 am to 10 pm. See the reference for the get_route_frequency function for other options (e.g. weekends, other times of day).
## # A tibble: 6 x 5
## route_id median_headways mean_headways st_dev_headways stop_count
## <chr> <int> <int> <dbl> <int>
## 1 1 5 5 0.15 76
## 2 2 7 51 135. 120
## 3 3 8 8 0.08 68
## 4 4 6 115 205. 77
## 5 5 9 110 271. 102
## 6 5X 48 48 0 29
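In the table above, some routes have a mean headway far above the median (route 2 shows a median of 7 and a mean of 51). One plausible reading is that a few stops with very long headways pull the mean up. The sketch below illustrates that effect with hypothetical stop-level data. The column names are borrowed from the printed table, but this is not the package's implementation, and the package may define stop_count differently.

```r
# Hypothetical stop-level headways (minutes) for two made-up routes.
stop_headways <- data.frame(
  route_id = c("A", "A", "A", "A", "B", "B", "B"),
  headway  = c(5, 5, 6, 240, 8, 8, 9),
  stringsAsFactors = FALSE
)

# Summarise one route's stops into a single row.
summarise_route <- function(d) {
  data.frame(
    route_id        = d$route_id[1],
    median_headways = median(d$headway),
    mean_headways   = mean(d$headway),
    st_dev_headways = sd(d$headway),
    stop_count      = nrow(d),
    stringsAsFactors = FALSE
  )
}

# Split by route, summarise each piece, and bind the rows back together.
route_summary <- do.call(
  rbind,
  lapply(split(stop_headways, stop_headways$route_id), summarise_route)
)
route_summary
```

Route A has one outlying stop, so its mean (64) is far above its median (5.5), while route B's mean and median nearly agree. For skewed routes the median is usually the safer figure to report.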
View the headways at stops. stops_frequency is added to the list of gtfs data frames read in by read_gtfs. Again, by default, frequency is calculated for service that happens every weekday from 6 am to 10 pm. See the reference for the get_stop_frequency function for other options (e.g. weekends, other times of day).
## # A tibble: 6 x 6
## route_id direction_id stop_id service_id departures headway
## <chr> <int> <chr> <chr> <int> <dbl>
## 1 1 0 101N ASP18GEN-1087-Weekday-00 177 5.42
## 2 1 0 103N ASP18GEN-1087-Weekday-00 177 5.42
## 3 1 0 104N ASP18GEN-1087-Weekday-00 177 5.42
## 4 1 0 106N ASP18GEN-1087-Weekday-00 178 5.39
## 5 1 0 107N ASP18GEN-1087-Weekday-00 183 5.25
## 6 1 0 108N ASP18GEN-1087-Weekday-00 183 5.25
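The headway values printed above are consistent with dividing the length of the default service window (6 am to 10 pm, i.e. 960 minutes) by the number of departures: 960 / 177 is about 5.42. Whether the package computes exactly this is an assumption. The sketch below shows that arithmetic in base R, with hypothetical helper names (to_seconds, estimate_headway) and a made-up timetable.

```r
# Convert GTFS-style "HH:MM:SS" strings to seconds after midnight.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Average headway in minutes: window length divided by the number of
# departures that fall inside the window [start_hour, end_hour).
estimate_headway <- function(departure_times, start_hour = 6, end_hour = 22) {
  secs <- to_seconds(departure_times)
  n <- sum(secs >= start_hour * 3600 & secs < end_hour * 3600)
  if (n == 0) return(NA_real_)
  (end_hour - start_hour) * 60 / n
}

# Hypothetical stop with a departure every 10 minutes from 05:00 to 23:00.
secs <- seq(5 * 3600, 23 * 3600, by = 600)
times <- sprintf("%02d:%02d:00",
                 as.integer(secs %/% 3600),
                 as.integer((secs %% 3600) %/% 60))
demo_headway <- estimate_headway(times)
demo_headway

# The same arithmetic applied to the departure counts printed above.
table_headways <- round(960 / c(177, 178, 183), 2)
table_headways
```

For the made-up timetable the estimate is 10 minutes, and the three departure counts from the table give 5.42, 5.39, and 5.25, matching the printed headway column.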
You can now map subway routes and color-code each route by how often trains come.
## Calculating headways and spatial features. This may take a while
## Calculating route and stop headways.
When a feed is read, it is checked against the GTFS specification, and an attribute called validation_result is added to the resulting object. This attribute is a tibble describing the files and fields in the GTFS feed and how they compare to the specification.
You can get this tibble from the metadata about the feed.
## # A tibble: 6 x 8
## file file_spec file_provided_s… field field_spec field_provided_…
## <chr> <chr> <lgl> <chr> <chr> <lgl>
## 1 trips req TRUE rout… req TRUE
## 2 trips req TRUE serv… req TRUE
## 3 trips req TRUE trip… req TRUE
## 4 trips req TRUE trip… opt TRUE
## 5 trips req TRUE trip… opt FALSE
## 6 trips req TRUE dire… opt TRUE
## # … with 2 more variables: validation_status <chr>,
## # validation_details <chr>
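Because validation_result is stored as an attribute, base R's attr() is one way to retrieve it, for example attr(nyc, "validation_result"). The sketch below mimics the idea on a stand-in list. The spec table is a simplified illustrative slice rather than the full GTFS specification, and the provided and status columns are hypothetical names, not the package's.

```r
# Simplified, illustrative slice of a specification (not the full GTFS spec).
spec <- data.frame(
  file      = c("trips", "routes", "stop_times", "shapes", "frequencies"),
  file_spec = c("req", "req", "req", "opt", "opt"),
  stringsAsFactors = FALSE
)

# Stand-in feed: a named list of tables (contents omitted for brevity).
feed <- list(trips = data.frame(), routes = data.frame(),
             stop_times = data.frame(), shapes = data.frame())

# Compare the files present in the feed with the spec.
check <- spec
check$provided <- check$file %in% names(feed)      # hypothetical column name
check$status <- ifelse(check$provided, "ok",       # hypothetical status labels
                       ifelse(check$file_spec == "req", "problem", "info"))

# Attach the comparison to the object as an attribute.
attr(feed, "validation_result") <- check

# Retrieve the table later from the object's attributes.
validation <- attr(feed, "validation_result")
validation
```

Storing the result as an attribute keeps it attached to the feed without adding another element to the list of tables.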