hermiter
facilitates the estimation of the full probability density function, cumulative distribution function and quantile function using Hermite series based estimators. The package is applicable to streaming, batch and grouped data. The core methods of the package are written in C++ for speed.
These estimators are particularly useful in the sequential setting (both stationary and non-stationary data streams). In addition, they are useful in efficient, one-pass batch estimation which is particularly relevant in the context of large data sets. Finally, the Hermite series based estimators are applicable in decentralized (distributed) settings in that estimators formed on subsets of the data can be consistently combined. The Hermite series based estimators have the distinct advantage of being able to estimate the full density function, distribution function and quantile function in an online manner.The theoretical and empirical properties of these estimators have been studied in-depth in the articles below. The investigations demonstrate that the Hermite series based estimators are particularly effective in distribution function and quantile estimation.
In order to utilize the hermiter package, the package must be loaded using the following command:
Other packages that are used in this vignette are loaded as follows:
A hermite_estimator S3 object is constructed as below. The argument, N, adjusts the number of terms in the Hermite series based estimator and controls the trade-off between bias and variance. A lower N value implies a higher bias but lower variance and vice versa for higher values of N. The argument, standardize, controls whether or not to standardize the batch of observations before applying the estimator. Standardization usually yields better results and is recommended for most estimation settings.
Once the hermite_estimator object has been constructed, it can be updated with a batch of observations as below.
In the sequential setting, observations are revealed one at a time. A hermite_estimator object can be updated sequentially with a single new observation by utilizing the update_sequential method. Note that when updating the Hermite series based estimator sequentially, a running mean and running standard deviation is maintained in order to standardize the observations in an online manner.
The central advantage of Hermite series based estimators is that they can be updated in a sequential/one-pass manner as above and subsequently probability densities and cumulative probabilities at arbitrary x values can be obtained, along with arbitrary quantiles. The hermite_estimator object only maintains a small and fixed number of coefficients and thus uses minimal memory. The syntax to calculate probability densities, cumulative probabilities and quantiles is presented below.
actual_pdf <- dlogis(x)
actual_cdf <- plogis(x)
df_pdf_cdf <- data.frame(x,pdf_est,cdf_est,actual_pdf,actual_cdf)
actual_quantiles <- qlogis(p)
df_quant <- data.frame(p,quantile_est,actual_quantiles)
ggplot(df_pdf_cdf,aes(x=x)) + geom_line(aes(y=pdf_est, colour="Estimated")) +
geom_line(aes(y=actual_pdf, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) + ylab("Probability Density")
ggplot(df_pdf_cdf,aes(x=x)) + geom_line(aes(y=cdf_est, colour="Estimated")) +
geom_line(aes(y=actual_cdf, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) +
ylab("Cumulative Probability")
ggplot(df_quant,aes(x=actual_quantiles)) + geom_point(aes(y=quantile_est),
color="blue") +
geom_abline(slope=1,intercept = 0) +xlab("Theoretical Quantiles") +
ylab("Estimated Quantiles")
A useful application of the hermite_estimator class is to obtain pdf, cdf and quantile function estimates on groups of data as illustrated in the example below. The fact that the memory usage of hermite_estimator objects is small and fixed irrespective of group size and that batch estimation is fast, implies that it is practical to perform estimation on a very large number of groups.
We first present the example using the data.table
package.
# Prepare Test Data
test_data <- data.frame()
for (i in c(1:5)) {
exponential_data <- rexp(n=1000)
logistic_data <- rlogis(n=1000)
logn_data <- rlnorm(n=1000)
test_data <- rbind(test_data,data.frame(dist_name=rep("exp",
length(exponential_data)),idx=i,observations=exponential_data))
test_data <- rbind(test_data,data.frame(dist_name=rep("logis",
length(logistic_data)),idx=i,observations=logistic_data))
test_data <- rbind(test_data,data.frame(dist_name=rep("lnorm",
length(logn_data)),idx=i,observations=logn_data))
}
setDT(test_data)
# Group observations by distribution and idx and create Hermite estimators
estimates <- test_data[,.(herm_est = list(hermite_estimator(N=10,
standardize = TRUE) %>% update_batch(observations))),
by=.(dist_name,idx)]
estimates
#> dist_name idx herm_est
#> 1: exp 1 <list[8]>
#> 2: logis 1 <list[8]>
#> 3: lnorm 1 <list[8]>
#> 4: exp 2 <list[8]>
#> 5: logis 2 <list[8]>
#> 6: lnorm 2 <list[8]>
#> 7: exp 3 <list[8]>
#> 8: logis 3 <list[8]>
#> 9: lnorm 3 <list[8]>
#> 10: exp 4 <list[8]>
#> 11: logis 4 <list[8]>
#> 12: lnorm 4 <list[8]>
#> 13: exp 5 <list[8]>
#> 14: logis 5 <list[8]>
#> 15: lnorm 5 <list[8]>
Hermite series based estimators can also be consistently combined. In particular, when standardize = FALSE, the pdf, cdf and quantile value results obtained from combining distinct hermite_estimators updated on subsets of a data set are exactly equal to those obtained by constructing a single hermite_estimator and updating on the full data set (corresponding to the concatenation of the aforementioned subsets). When standardize = TRUE, the equivalence is no longer exact, but is accurate enough to be practically useful. Combining hermite_estimators is illustrated below.
# Group observations by distribution and combine Hermite estimators
combined_estimates <- estimates[,.(herm_comb = list(combine_hermite(herm_est))),
by=.(dist_name)]
combined_estimates
#> dist_name herm_comb
#> 1: exp <list[8]>
#> 2: logis <list[8]>
#> 3: lnorm <list[8]>
# Estimate probability densities, cumulative probabilities and quantiles
dens_vals <- combined_estimates[,.(dens_est = list(dens(herm_comb[[1]],
c(0.5,1,1.5,2)))),by=.(dist_name)]
cum_prob_vals <- combined_estimates[,.(cum_prob_est = list(cum_prob(herm_comb[[1]],c(0.5,1,1.5,2)))),by=.(dist_name)]
quantile_vals <- combined_estimates[,.(quantile_est = list(quant(herm_comb[[1]],c(0.25,0.5,0.75)))),by=.(dist_name)]
The same illustrative example is presented using the dplyr
package instead of the data.table
package below.
# Prepare Test Data
test_data <- data.frame()
for (i in c(1:5)) {
exponential_data <- rexp(n=1000)
logistic_data <- rlogis(n=1000)
logn_data <- rlnorm(n=1000)
test_data <- rbind(test_data,data.frame(dist_name=rep("exp",
length(exponential_data)),idx=i,observations=exponential_data))
test_data <- rbind(test_data,data.frame(dist_name=rep("logis",
length(logistic_data)),idx=i,observations=logistic_data))
test_data <- rbind(test_data,data.frame(dist_name=rep("lnorm",
length(logn_data)),idx=i,observations=logn_data))
}
# Group observations by distribution and idx and create Hermite estimators
estimates <- test_data %>% group_by(dist_name,idx) %>% summarise(herm_est = list(hermite_estimator(N=10,standardize = TRUE) %>% update_batch(observations)))
#> `summarise()` regrouping output by 'dist_name' (override with `.groups` argument)
estimates
#> # A tibble: 15 x 3
#> # Groups: dist_name [3]
#> dist_name idx herm_est
#> <chr> <int> <list>
#> 1 exp 1 <list [8]>
#> 2 exp 2 <list [8]>
#> 3 exp 3 <list [8]>
#> 4 exp 4 <list [8]>
#> 5 exp 5 <list [8]>
#> 6 lnorm 1 <list [8]>
#> 7 lnorm 2 <list [8]>
#> 8 lnorm 3 <list [8]>
#> 9 lnorm 4 <list [8]>
#> 10 lnorm 5 <list [8]>
#> 11 logis 1 <list [8]>
#> 12 logis 2 <list [8]>
#> 13 logis 3 <list [8]>
#> 14 logis 4 <list [8]>
#> 15 logis 5 <list [8]>
# Group observations by distribution and combine Hermite estimators
combined_estimates <- estimates %>% group_by(dist_name) %>% summarise(herm_comb
= list(combine_hermite(herm_est)))
#> `summarise()` ungrouping output (override with `.groups` argument)
combined_estimates
#> # A tibble: 3 x 2
#> dist_name herm_comb
#> <chr> <list>
#> 1 exp <list [8]>
#> 2 lnorm <list [8]>
#> 3 logis <list [8]>
# Estimate probability densities, cumulative probabilities and quantiles
dens_vals <- combined_estimates %>%
rowwise() %>% mutate(dens_est = list(dens(herm_comb,c(0.5,1,1.5,2))))
cum_prob_vals <- combined_estimates %>%
rowwise() %>% mutate(cum_prob_est = list(cum_prob(herm_comb,c(0.5,1,1.5,2))))
quantile_vals <- combined_estimates %>%
rowwise() %>% mutate(quantile_est = list(quant(herm_comb,c(0.25,0.5,0.75))))
# Compute Mean Absolute Error
dens_vals <- dens_vals %>%
rowwise() %>% mutate(dens_actual = list(do.call(paste0("d",dist_name),
list(c(0.5,1,1.5,2))))) %>% mutate(mean_abs_error_density =
mean(abs(dens_est-dens_actual)))
cum_prob_vals <- cum_prob_vals %>%
rowwise() %>% mutate(cum_prob_actual = list(do.call(paste0("p",dist_name),
list(c(0.5,1,1.5,2)))))%>% mutate(mean_abs_error_cum_prob = mean(abs(cum_prob_est-cum_prob_actual)))
quantile_vals <- quantile_vals %>%
rowwise() %>% mutate(quantile_actual= list(do.call(paste0("q",dist_name),
list(c(0.25,0.5,0.75)))))%>% mutate(mean_abs_error_quantiles = mean(abs(quantile_est-quantile_actual)))
mean_abs_error_summary <- data.frame(dist_name=dens_vals$dist_name, mean_abs_error_density=dens_vals$mean_abs_error_density, mean_abs_error_cum_prob=cum_prob_vals$mean_abs_error_cum_prob,
mean_abs_error_quantiles=quantile_vals$mean_abs_error_quantiles)
Another useful application of the hermite_estimator class is to obtain pdf, cdf and quantile function estimates on streaming data. The speed of estimation allows the pdf, cdf and quantile functions to be estimated in real time. We illustrate this below for cdf and quantile estimation with a sample Shiny application. We reiterate that the particular usefulness is that the full pdf, cdf and quantile functions are updated in real time. Thus, any arbitrary quantile can be evaluated at any point in time. We include a stub for reading streaming data that generates micro-batches of standard exponential i.i.d. random data. This stub can easily be swapped out for a method reading micro-batches from a Kafka topic or similar.
The Shiny sample code below can be pasted into a single app.R file and run directly.
# Not Run. Copy and paste into app.R and run.
library(shiny)
library(hermiter)
library(ggplot2)
library(magrittr)
ui <- fluidPage(
titlePanel("Streaming Statistics Analysis Example: Exponential
i.i.d. stream"),
sidebarLayout(
sidebarPanel(
sliderInput("percentile", "Percentile:",
min = 0.01, max = 0.99,
value = 0.5, step = 0.01)
),
mainPanel(
plotOutput("plot"),
textOutput("quantile_text")
)
)
)
server <- function(input, output) {
values <- reactiveValues(hermite_est =
hermite_estimator(N = 10, standardize = TRUE))
x <- seq(-15, 15, 0.1)
# Note that the stub below could be replaced with code that reads streaming
# data from various sources, Kafka etc.
read_stream_stub_micro_batch <- reactive({
invalidateLater(1000)
new_observation <- rexp(10)
return(new_observation)
})
updated_cdf_calc <- reactive({
micro_batch <- read_stream_stub_micro_batch()
for (idx in seq_along(micro_batch)) {
values[["hermite_est"]] <- isolate(values[["hermite_est"]]) %>%
update_sequential(micro_batch[idx])
}
cdf_est <- isolate(values[["hermite_est"]]) %>%
cum_prob(x, clipped = TRUE)
df_cdf <- data.frame(x, cdf_est)
return(df_cdf)
})
updated_quantile_calc <- reactive({
values[["hermite_est"]] %>% quant(input$percentile)
})
output$plot <- renderPlot({
ggplot(updated_cdf_calc(), aes(x = x)) + geom_line(aes(y = cdf_est)) +
ylab("Cumulative Probability")
}
)
output$quantile_text <- renderText({
return(paste(input$percentile * 100, "th Percentile:",
round(updated_quantile_calc(), 2)))
})
}
shinyApp(ui = ui, server = server)