Exploring probability distributions for bivariate temporal granularities

Sayani Gupta

Introduction

Temporal data are available at various resolutions depending on the context. Often in time series analysis, data collected at finer scales needs to be explored also at coarser temporal scales. This approach requires deconstructing time in various possible ways leading to linear (e.g. days, weeks), circular (e.g. day-of-week) or aperiodic (e.g. day-of-month) time granularities. The hierarchical structure of time also creates a natural nested ordering resulting in single or multiple order-up granularities. For a conventional Gregorian calendar, examples of single-order-up granularity include hour-of-day and minute-of-hour, while examples of multiple-order-up granularity may include hour-of-week or minute-of-day. All of these granularities might be useful in throwing light on the periodicity of the data.

Exploration of large quantities of temporal data across so many different deconstructions of time becomes clumsy without a systematic approach.

This vignette describes the tools in the package gravitas for systematically exploring large quantities of temporal data across different time deconstructions by visualizing the probability distributions using a range of graphics in ggplot2.

Analysis of smart meter data

The smart_meter10 data set in the package is a tsibble providing the energy consumption for ten households from customer trials. It is hard to derive insights when we try to visualize periodicity in the data for all households. For example, if we plot the energy consumption across each hour of the day, the structure of the data suggests that there are several data points for each hour of the day for just one household. Hence, for several households, there will be a blob of points for each hour of the day.

smart_meter10 %>% select(customer_id,
                         reading_datetime,
                         general_supply_kwh,
                         everything()
)
#> # A tsibble: 259,235 x 3 [30m] <UTC>
#> # Key:       customer_id [10]
#>    customer_id reading_datetime    general_supply_kwh
#>    <chr>       <dttm>                           <dbl>
#>  1 10006704    2012-06-01 10:00:00              0.24 
#>  2 10006704    2012-06-01 10:30:00              0.245
#>  3 10006704    2012-06-01 11:00:00              0.206
#>  4 10006704    2012-06-01 11:30:00              0.217
#>  5 10006704    2012-06-01 12:00:00              0.214
#>  6 10006704    2012-06-01 12:30:00              0.23 
#>  7 10006704    2012-06-01 13:00:00              0.898
#>  8 10006704    2012-06-01 13:30:00              0.197
#>  9 10006704    2012-06-01 14:00:00              0.255
#> 10 10006704    2012-06-01 14:30:00              0.199
#> # … with 259,225 more rows

So the question is how to explore systematically multiple perspectives of this smart meter data across different time granularities to find regular patterns or anomalies in behaviors? One way can be by visualizing the probability distributions across these granularities.

Let us see how we can explore the energy behavior of a couple of households by making allowances for the following aspects of time granularities: Computation, interaction and visualization.

Computation

Get set of possible granularities with search_gran

Typically, the first thing we should have at our disposal for examining periodicities of energy behavior across time granularities is to know the number of time granularities we can look at exhaustively. If we consider conventional time deconstructions for a Gregorian calendar (second, minute, half-hour, hour, day, week, fortnight, month, quarter, semester, year), the following time granularities can be considered for this analysis.

The interval of this tsibble is 30 minutes, and hence the default for search_gran in this case, provides temporal granularities ranging from half-hour to year. If these options are considered too many, the default options can be modified to limit the possibilities. For example, the most coarse temporal unit can be set to be a “month”.

This looks better. However, some intermediate temporal units might not be pertinent to the analysis and we might want to remove them from the list of granularities that we want to examine using the argument filter_out in search_gran.

Create any temporal granularity with create_gran()

After we have the set of granularities to look at, we should be able to compute any granularity using create_gran. These can be be used for exploring the distribution of the time series across univariate temporal granularity or computing summary statistics across these categorizations.

The data is filtered for two customers and a new granularity “day_fortnight” (day of the fortnight) is created and visualized to understand if the electricity consumption is different across days of the fortnight. From the letter value plot, we can not clearly observe any pattern across days of the fortnight for each household.

The difference in the distribution of energy consumption for these two households can be observed in the following plot. Let us see if we can derive any other insights than the otherwise obvious observation that in most cases the second household has much less consumption than the first one. If we consider letter value F as a regular behavior and letter values beyond F as not-so-regular behavior, we can conclude that the regular behavior of the first household is more stable than the second household. However, the distribution of tail of the first household is more variable, observed through distinct letter values, implying that their not-so-regular behavior is quite extreme.

Interaction

We can visualize the distribution of energy across univariate time granularities for deriving insights on periodic behavior. From the search list, we found that we can look at six granularities, that amounts to analyzing six graphics. However, what happens if we want to see the distribution of energy across two granularities at a time? This is equivalent to looking at the distribution of energy consumption across one granularity conditional on another one. One way can be to plot one of the granularities on the x-axis and another on the facet. Different perspectives of the data can be derived depending on where the granularities are places.

So, what is the number of pairs of granularities we can look at? It is equivalent to taking 2 granularities from 6, which essentially means we need to examine 30 plots. The good news is, not all time granularities can be plotted together and we do not have to analyze so many plots!

Harmony/clash can be identified to considerably reduce the number of visualizations that can aid exploratory analysis.

Let us now look at all the harmonies that we can examine. Fortunately, we are left with only 13 out of 30 visualizations. We can also play with the argument facet_h, which represents the maximum number of levels allowed for the facet variable, to reduce or increase the number of harmonies.

Visualization

Now, we know the pairs of granularities that we need to look at using harmonies. Next, we want to visualize the distribution of the measured variable general_supply_kwh across these harmonies. This is possible through prob_plot. We have several options for visualizing statistical distributions and each comes with some pros and cons which we need to consider while choosing a good one for our context. Traditional methods of plotting distributions include boxplots and violin plots, while more recent methods include ridge plots, quantile plots or letter-value plots.

gran_advice provides checkpoints before plotting a distribution. It includes recommended distribution plots depending on the levels of the two granularities plotted. Assumptions are made to ensure the display is not too cluttered by the space occupied by various kinds of distribution plots. Moreover, it also provides advice on the number of observations to use for drawing the distribution plot so that one can ensure that they are enough before jumping on to plot a distribution. Furthermore, it advises if the number of observations varies significantly across facets or within facets.

The function gran_obs can be employed to see if the number of observations per category is large enough to draw a distribution plot. In this case, the number of observations is much less on weekends compared to weekdays, but enough to have reliable estimates of distribution.

Through prob_plot we can choose one from the list of recommended plots. Warnings are generated if users try to plot clashes, the number of facet variables is too high or the number of observations used to compute statistical summaries for distribution is not large enough. Currently, we can choose from “boxplot”, “violin”, “ridge”, “lv” or “quantile” to visualize distributions.

We visualize the harmony pair (wknd_wday, hour_day) through a box plot. Boxplot of energy consumption is shown across wknd_wday (facet) and hour-day (x-axis) for the same two households. For the second household, outliers are less prominent implying their regular behavior is more stable. For the first household, energy behavior is not significantly different between weekdays and weekends. For the second household, median energy consumption for the early morning hours is extremely high for weekends compared to weekdays.

Similarly, we can plot the other harmony pairs and decide if they are useful in deriving insights about the energy behavior of the customers.

Future development

Acknowledgements

Thanks to PhD supervisors Prof. Rob J Hyndman, Prof. Dianne Cook and Google Summer of Code 2019 mentor Prof. Antony Unwin for their support and always leading by example. The fine balance of encouraging me to work on my ideas and stepping in to help when I need has made the development of this package a great learning experience for me.

Moreover, I want to thank my cohort at NUMBATS, Monash University, especially Mitchell O’Hara-Wild and Nicholas Spyrison for always lending an ear and sharing their wisdom and experience of developing R packages, with such kindness.