Researching vehicles involved in collisions with STATS19 data

library(stats19)
library(dplyr)

Vehicle level variables in the STATS19 datasets

Of the three dataset types in STATS19, the vehicle tables are perhaps the most revealing yet under-explored. They look like this:

v = get_stats19(year = 2017, type = "vehicles")
#> Files identified: dftRoadSafetyData_Vehicles_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Vehicles_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpAph2kW/dftRoadSafetyData_Vehicles_2017/Veh.csv
v
#> # A tibble: 238,926 x 23
#>    accident_index vehicle_referen… vehicle_type towing_and_arti…
#>    <chr>                     <int> <chr>        <chr>           
#>  1 2017010001708                 1 Car          No tow/articula…
#>  2 2017010001708                 2 Motorcycle … No tow/articula…
#>  3 2017010009342                 1 Car          No tow/articula…
#>  4 2017010009342                 2 Car          No tow/articula…
#>  5 2017010009344                 1 Car          No tow/articula…
#>  6 2017010009344                 2 Car          No tow/articula…
#>  7 2017010009344                 3 Car          No tow/articula…
#>  8 2017010009348                 1 Car          No tow/articula…
#>  9 2017010009348                 2 Car          No tow/articula…
#> 10 2017010009350                 1 Car          No tow/articula…
#> # … with 238,916 more rows, and 19 more variables:
#> #   vehicle_manoeuvre <chr>, vehicle_location_restricted_lane <int>,
#> #   junction_location <chr>, skidding_and_overturning <chr>,
#> #   hit_object_in_carriageway <int>, vehicle_leaving_carriageway <int>,
#> #   hit_object_off_carriageway <int>, first_point_of_impact <chr>,
#> #   was_vehicle_left_hand_drive <chr>, journey_purpose_of_driver <chr>,
#> #   sex_of_driver <chr>, age_of_driver <int>, age_band_of_driver <int>,
#> #   engine_capacity_cc <int>, propulsion_code <chr>, age_of_vehicle <int>,
#> #   driver_imd_decile <chr>, driver_home_area_type <int>,
#> #   vehicle_imd_decile <int>

We will categorise the vehicle types to simplify subsequent results:

v = v %>% mutate(vehicle_type2 = case_when(
  # The first matching pattern wins, so "motorcycle" is tested before "cycle"
  grepl(pattern = "motorcycle", vehicle_type, ignore.case = TRUE) ~ "Motorbike",
  grepl(pattern = "Car", vehicle_type, ignore.case = TRUE) ~ "Car",
  grepl(pattern = "Bus", vehicle_type, ignore.case = TRUE) ~ "Bus",
  grepl(pattern = "cycle", vehicle_type, ignore.case = TRUE) ~ "Cycle",
  # grepl(pattern = "Van", vehicle_type, ignore.case = TRUE) ~ "Van",
  grepl(pattern = "Goods", vehicle_type, ignore.case = TRUE) ~ "Goods",
  # Any label not matched above
  TRUE ~ "Other"
))
# barplot(table(v$vehicle_type2))
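
The order of the conditions in case_when() matters, because the first pattern that matches decides the category. A minimal sketch on a handful of made-up labels shows the idea (the vector `types` is illustrative only and is not taken from the data, where the labels may differ):

```r
library(dplyr)

# Hypothetical vehicle_type labels, for illustration only
types = c("Pedal cycle", "Motorcycle 125cc and under", "Car", "Minibus")

recoded = case_when(
  # "motorcycle" is tested first, otherwise "cycle" would also catch motorcycles
  grepl("motorcycle", types, ignore.case = TRUE) ~ "Motorbike",
  grepl("cycle", types, ignore.case = TRUE) ~ "Cycle",
  grepl("car", types, ignore.case = TRUE) ~ "Car",
  grepl("bus", types, ignore.case = TRUE) ~ "Bus",
  TRUE ~ "Other"
)
data.frame(types, recoded)
```

On the real data, table(v$vehicle_type, v$vehicle_type2) is one way to check which original labels end up in each group.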

All of these variables are of potential interest to road safety researchers. Let’s take a look at summaries of a few of them:

table(v$vehicle_type2)
#> 
#>       Bus       Car     Cycle     Goods Motorbike     Other 
#>      5455    173686     18954     18907     19204      2720
summary(v$age_of_driver)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   -1.00   23.00   35.00   35.58   50.00  100.00
summary(v$engine_capacity_cc)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      -1     125    1398    1454    1956   16400
table(v$propulsion_code)
#> 
#>            Electric     Electric diesel                 Gas 
#>                 200                  74                  23 
#>         Gas/Bi-fuel           Heavy oil     Hybrid electric 
#>                 144               80600                3202 
#> New fuel technology              Petrol    Petrol/Gas (LPG) 
#>                   2              101724                  27
summary(v$age_of_vehicle)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -1.000  -1.000   5.000   5.759  10.000  85.000

The output shows vehicle type (a wide range of vehicles is represented), age of driver (young and elderly drivers are often seen as higher risk), engine capacity and propulsion (related to vehicle type and size) and age of vehicle. These factors appear in prior road safety research and debate, and they are also things that policy makers can influence.
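
The minimum of -1 in several of the summaries is the STATS19 code for missing or out-of-range data and not a real measurement, so it will bias statistics such as the mean. A minimal sketch of recoding it to NA before summarising, using a small made-up vector in place of v$age_of_driver:

```r
library(dplyr)

# Made-up ages for illustration; -1 stands for a missing value
age_of_driver = c(-1, 23, 35, 50, -1, 100)

# na_if() replaces every occurrence of -1 with NA
age_clean = na_if(age_of_driver, -1)

mean(age_of_driver)            # missing-value codes included
mean(age_clean, na.rm = TRUE)  # missing-value codes excluded
```

Assuming the same coding holds for the other numeric columns, the same call could be applied to engine_capacity_cc and age_of_vehicle, for example inside mutate().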

Relationships between vehicle type and crash severity

To explore the relationship between vehicles and crash severity, we must first join the vehicles table to the ‘accidents’ table, which records the severity of each collision:

a = get_stats19(year = 2017, type = "accidents")
#> Files identified: dftRoadSafetyData_Accidents_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpAph2kW/dftRoadSafetyData_Accidents_2017/Acc.csv
#> Reading in:
#> /tmp/RtmpAph2kW/dftRoadSafetyData_Accidents_2017/Acc.csv
va = dplyr::inner_join(v, a)
#> Joining, by = "accident_index"
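
The join above relies on dplyr detecting the shared column automatically, as the message shows. A sketch of the same step with the key stated explicitly, on two tiny made-up tables standing in for v and a (the names `v_toy` and `a_toy` and their values are illustrative):

```r
library(dplyr)

# Toy stand-ins for the vehicles and accidents tables
v_toy = tibble(
  accident_index = c("A1", "A1", "A2"),
  vehicle_type2 = c("Car", "Cycle", "Car")
)
a_toy = tibble(
  accident_index = c("A1", "A2"),
  accident_severity = c("Serious", "Slight")
)

# Each vehicle row picks up the attributes of its collision
va_toy = inner_join(v_toy, a_toy, by = "accident_index")

# TRUE when every vehicle has exactly one matching accident record
nrow(va_toy) == nrow(v_toy)
```

Naming the key guards against an accidental join on any other column that happens to share a name in both tables.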

Now we have additional variables available to us:

dim(v)
#> [1] 238926     24
dim(va)
#> [1] 238926     55

Let’s see how crash severity relates to vehicle type, the first of the variables of interest mentioned above. Each cell is a proportion of all vehicles in the dataset:

xtabs(~vehicle_type2 + accident_severity, data = va) %>% prop.table()
#>              accident_severity
#> vehicle_type2        Fatal      Serious       Slight
#>     Bus       0.0002553092 0.0033315755 0.0192444523
#>     Car       0.0079145844 0.1022324904 0.6167976696
#>     Cycle     0.0004896914 0.0165490570 0.0622912534
#>     Goods     0.0019294677 0.0130416949 0.0641621255
#>     Motorbike 0.0015862652 0.0249365912 0.0538534944
#>     Other     0.0002260114 0.0023228950 0.0088353716
xtabs(~vehicle_type2 + accident_severity, data = va) %>% prop.table() %>% plot()
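
Because prop.table() is called without a margin, the proportions above are shares of all vehicles, so they are dominated by cars. To compare severity between vehicle types, one option is to compute proportions within each row using the margin argument. A sketch on made-up data (the data frame `va_toy` is illustrative; with the real data, va would be passed to xtabs() instead):

```r
# Toy stand-in for the joined vehicles and accidents table
va_toy = data.frame(
  vehicle_type2 = c("Car", "Car", "Car", "Car", "Cycle", "Cycle"),
  accident_severity = c("Slight", "Slight", "Slight", "Serious", "Serious", "Slight")
)

tab = xtabs(~ vehicle_type2 + accident_severity, data = va_toy)

# margin = 1 rescales each row (vehicle type) so that it sums to 1
row_props = prop.table(tab, margin = 1)
row_props
```

Each row then gives the share of Fatal, Serious and Slight outcomes within one vehicle type, which makes the types directly comparable regardless of how many vehicles of each type are recorded.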