Visualising the History of openSenseMap.org

Norwin Roosen

2018-06-11

This vignette serves as an example on data wrangling & visualization with opensensmapr, dplyr and ggplot2.

# required packages:
library(opensensmapr) # data download
library(dplyr)        # data wrangling
library(ggplot2)      # plotting
library(lubridate)    # date arithmetic
library(zoo)          # rollmean()

openSenseMap.org has grown quite a bit in the last years; it would be interesting to see how we got to the current 1845 sensor stations, split up by various attributes of the boxes.

While opensensmapr provides extensive methods of filtering boxes by attributes on the server, we do the filtering within R to save time and gain flexibility. So the first step is to retrieve all the boxes:

# if you want to see results for a specific subset of boxes,
# just specify a filter such as grouptag='ifgi' here
boxes = osem_boxes()

Plot count of boxes by time

By looking at the createdAt attribute of each box we know the exact time a box was registered. With this approach we have no information about boxes that were deleted in the meantime, but that’s okay for now.

…and exposure

exposure_counts = boxes %>%
  group_by(exposure) %>%
  mutate(count = row_number(createdAt))

exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
ggplot(exposure_counts, aes(x = createdAt, y = count, colour = exposure)) +
  geom_line() +
  scale_colour_manual(values = exposure_colors) +
  xlab('Registration Date') + ylab('senseBox count')

Outdoor boxes are growing fast! We can also see the introduction of mobile sensor “stations” in 2017. While mobile boxes are still few, we can expect a quick rise in 2018 once the new senseBox MCU with GPS support is released.

Let’s have a quick summary:

exposure_counts %>%
  summarise(
    oldest = min(createdAt),
    newest = max(createdAt),
    count = max(count)
  ) %>%
  arrange(desc(count))
exposure oldest newest count
outdoor 2015-02-18 16:53:41 2018-06-11 19:49:27 1473
indoor 2015-02-08 17:36:40 2018-06-09 21:24:19 295
mobile 2017-05-24 08:16:36 2018-06-11 12:13:18 57
unknown 2014-05-28 15:36:14 2016-06-25 15:11:11 20

…and grouptag

We can try to find out where the increases in growth came from, by analysing the box count by grouptag.

Caveats: Only a small subset of boxes has a grouptag, and we should assume that these groups are actually bigger. Also, we can see that grouptag naming is inconsistent (Luftdaten, luftdaten.info, …)

grouptag_counts = boxes %>%
  group_by(grouptag) %>%
  # only include grouptags with 8 or more members
  filter(length(grouptag) >= 8 && !is.na(grouptag)) %>%
  mutate(count = row_number(createdAt))

# helper for sorting the grouptags by boxcount
sortLvls = function(oldFactor, ascending = TRUE) {
  lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
  factor(oldFactor, levels = lvls)
}
grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)

ggplot(grouptag_counts, aes(x = createdAt, y = count, colour = grouptag)) +
  geom_line(aes(group = grouptag)) +
  xlab('Registration Date') + ylab('senseBox count')

grouptag_counts %>%
  summarise(
    oldest = min(createdAt),
    newest = max(createdAt),
    count = max(count)
  ) %>%
  arrange(desc(count))
grouptag oldest newest count
Luftdaten 2017-03-14 17:01:16 2018-06-08 23:09:27 116
ifgi 2016-06-17 08:04:54 2018-06-09 18:15:47 37
MakeLight 2015-02-18 16:53:41 2018-02-02 13:50:21 15
Bad_Hersfeld 2017-07-18 13:32:03 2018-03-22 09:10:07 13
luftdaten.info 2017-05-01 10:15:44 2018-05-29 13:25:53 13
PGKN 2018-03-22 16:44:00 2018-06-08 13:38:22 11
dwih-sp 2016-08-09 08:06:02 2016-11-23 10:16:04 11
Che Aria Tira? 2018-03-11 10:50:42 2018-03-11 23:11:20 10
Luftdaten.info 2017-04-03 14:10:20 2018-04-16 16:31:24 10
Sofia 2017-04-11 04:40:11 2018-06-07 11:00:54 10
Feinstaub 2017-04-08 06:38:25 2018-03-29 17:27:55 9
Raumanmeri 2017-03-13 11:35:39 2017-04-27 05:36:20 9
IKG 2017-03-21 19:02:11 2017-12-18 14:30:21 8

Plot rate of growth and inactivity per week

First we group the boxes by createdAt into bins of one week:

bins = 'week'
mvavg_bins = 6

growth = boxes %>%
  mutate(week = cut(as.Date(createdAt), breaks = bins)) %>%
  group_by(week) %>%
  summarize(count = length(week)) %>%
  mutate(event = 'registered')

We can do the same for updatedAt, which informs us about the last change to a box, including uploaded measurements. This method of determining inactive boxes is fairly inaccurate and should be considered an approximation, because we have no information about intermediate inactive phases. Also deleted boxes would probably have a big impact here.

inactive = boxes %>%
  # remove boxes that were updated in the last two days,
  # b/c any box becomes inactive at some point by definition of updatedAt
  filter(updatedAt < now() - days(2)) %>%
  mutate(week = cut(as.Date(updatedAt), breaks = bins)) %>%
  group_by(week) %>%
  summarize(count = length(week)) %>%
  mutate(event = 'inactive')

Now we can combine both datasets for plotting:

boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)

ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
  xlab('Time') + ylab(paste('rate per ', bins)) +
  scale_x_date(date_breaks="years", date_labels="%Y") +
  scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
  geom_point(aes(y = count), size = 0.5) +
  # moving average, make first and last value NA (to ensure identical length of vectors)
  geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))

We see a sudden rise in early 2017, which lines up with the fast growing grouptag Luftdaten. This was enabled by an integration of openSenseMap.org into the firmware of the air quality monitoring project luftdaten.info. The dips in mid 2017 and early 2018 could possibly be explained by production/delivery issues of the senseBox hardware, but I have no data on the exact time frames to verify.

Plot duration of boxes being active

While we are looking at createdAt and updatedAt, we can also extract the duration of activity of each box, and look at metrics by exposure and grouptag once more:

…by exposure

duration = boxes %>%
  group_by(exposure) %>%
  filter(!is.na(updatedAt)) %>%
  mutate(duration = difftime(updatedAt, createdAt, units='days'))

ggplot(duration, aes(x = exposure, y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days')

The time of activity averages at only 157 days, though there are boxes with 766 days of activity, spanning a large chunk of openSenseMap’s existence.

…by grouptag

duration = boxes %>%
  group_by(grouptag) %>%
  # only include grouptags with 8 or more members
  filter(length(grouptag) >= 8 && !is.na(grouptag) && !is.na(updatedAt)) %>%
  mutate(duration = difftime(updatedAt, createdAt, units='days'))
  
ggplot(duration, aes(x = grouptag, y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days')

duration %>%
  summarize(
    duration_avg = round(mean(duration)),
    duration_min = round(min(duration)),
    duration_max = round(max(duration)),
    oldest_box = round(max(difftime(now(), createdAt, units='days')))
  ) %>%
  arrange(desc(duration_avg))
grouptag duration_avg duration_min duration_max oldest_box
dwih-sp 639 days 565 days 672 days 672 days
Feinstaub 233 days 4 days 430 days 430 days
Bad_Hersfeld 212 days 81 days 328 days 328 days
ifgi 203 days 0 days 639 days 725 days
Sofia 195 days 0 days 427 days 427 days
Luftdaten 189 days 0 days 440 days 454 days
luftdaten.info 182 days 13 days 376 days 406 days
IKG 170 days 70 days 260 days 447 days
Luftdaten.info 93 days 5 days 376 days 434 days
Che Aria Tira? 92 days 92 days 92 days 92 days
PGKN 48 days 0 days 81 days 81 days
Raumanmeri 45 days 7 days 318 days 455 days

The time of activity averages at only 194 days, though there are boxes with 672 days of activity, spanning a large chunk of openSenseMap’s existence.

…by year of registration

This is less useful, as older boxes are active for a longer time by definition. If you have an idea how to compensate for that, please send a Pull Request!

# NOTE: boxes older than 2016 missing due to missing updatedAt in database
duration = boxes %>%
  mutate(year = cut(as.Date(createdAt), breaks = 'year')) %>%
  group_by(year) %>%
  filter(!is.na(updatedAt)) %>%
  mutate(duration = difftime(updatedAt, createdAt, units='days'))

ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')

More Visualisations

Other visualisations come to mind, and are left as an exercise to the reader. If you implemented some, feel free to add them to this vignette via a Pull Request.