ropenaq and rnoaa

Complementing air quality data with weather data using rnoaa

Maëlle Salmon


Introduction: getting air quality data

This vignette aims at explaining how you can complement a data.frame with weather data using rnoaa. In this vignette we shall use air quality data from the OpenAQ platform queried with the ropenaq package, for India. Using ropenaq one can get e.g. PM2.5 values over time in Indianapolis over the last days. For getting all data for march we’ll loop over several pages.

First, we need to know how many measures are available for Indianapolis for March 2016.


measurementsIndianapolis <- aq_measurements(city = "Indianapolis", parameter = "pm25",
                                     date_from = as.character(Sys.Date() - 30),
                                     date_to = as.character(Sys.Date()),
                                     limit = 10000 )
save(measurementsIndianapolis, file = "data/measurementsIndianapolis.RData")

We filter negative values.

measurementsIndianapolis %>% head() %>% knitr::kable()
measurementsIndianapolis <- filter(measurementsIndianapolis, value > 0)

We now transform these data into daily data.

# only keep stations with geographical information
measurementsIndianapolis <- filter(measurementsIndianapolis, !
# now transform to daily data
measurementsIndianapolis <- measurementsIndianapolis %>%
  #mutate(day = as.Date(dateLocal)) %>%
  group_by(location, day) %>%
  summarize(value = mean(value),
            longitude = longitude[1],
            latitude = latitude[1]) %>%
measurementsIndianapolis %>% head() %>% knitr::kable()

Air quality and weather are correlated, so one could be interested in getting a time series of say temperature for the same location. The OpenAQ platform itself does not provide weather data but nearly all stations have geographical coordinates. Our goal here will be to use rnoaa to complement this table with precipitation and temperature.

Find weather stations

For finding the right station(s) we shall use the meteo_nearby_stations function. It returns a list with the weather stations nearby each latitude/longitude given as arguments, respecting the other arguments such as maximal radius, first year with data, etc. For finding stations one might have to play a bit with the parameters until there is at least one station for each location.

Here we query stations with a less than 15km distance from the air quality stations, with precipitation and temperature data, and with data starting from 2016. Note that the query takes a while.

station_data <- ghcnd_stations()
lat_lon_df <- select(measurementsIndianapolis,
                     longitude) %>% unique() %>%
  ungroup() %>%
  rename(id = location) %>%
  mutate(id = factor(id))

stationsIndianapolis <- meteo_nearby_stations(lat_lon_df =,
                                       station_data = station_data,
                                       radius = 5,
                                       year_min = as.character(format(Sys.Date(), "%Y")),
                                       var = c("TAVG", "PRCP"))
stationsIndianapolis <- unique(bind_rows(stationsIndianapolis) %>% select(- distance))

save(stationsIndianapolis, file = "data/stationsIndianapolis.RData")
stationsIndianapolis %>% knitr::kable()

Now let us plot the AQ and weather stations on a quick and dirty map with no legend, red for AQ stations, blue for weather stations.

Not shown

map <- get_map(location = "Indianapolis", zoom = 11)
ggmap(map) +
  geom_point(aes(x = longitude, y = latitude),
             data = stationsIndianapolis, col = "blue", size = 4)+
  geom_point(aes(x = longitude, y = latitude),
             data = measurementsIndianapolis, col = "red", size = 4)

Query weather data for these stations

For pulling weather data from these weather monitors, we shall use the meteo_pull_monitors function.

monitors <- stationsIndianapolis$id
all_monitors_clean <- meteo_pull_monitors(monitors,
    date_min = as.character(Sys.Date() - 30),
    date_max = as.character(Sys.Date())) %>%
  dplyr::rename(day = date, location = id)
all_monitors_clean %>% head() %>% knitr::kable()

Here we notice some values are not available. Therefore, we might need to go back to weather stations searching with, for instance, a larger radius. In this case let’s say we’re ok with the result of the search.

Join the two tables, thus complementing the original table

Therefore, in this case we will bind the rows of the air quality table with the weather table.

measurementsIndianapolis <- bind_rows(measurementsIndianapolis, all_monitors_clean)
measurementsIndianapolis %>% head() %>% knitr::kable()

Now some locations are air quality locations and have only missing values in the weather columns, and some locations are weather locations and have only missing values in the air quality columns.

We can plot the data we got.

data_plot <- measurementsIndianapolis %>%
  rename(pm25 = value) %>%
  select(- longitude, - latitude)

data_plot <- tidyr::gather_(data_plot, "parameter", "value",

ggplot(data_plot) +
  geom_line(aes(x = day, y = value, col = location)) +
  facet_grid(parameter ~ ., scales = "free_y")