abess
The R package abess
implement a polynomial algorithm in the for best-subset selection problem: \[\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \| y - X\beta\|_2^2, \text{ subject to } \|\beta\|_0 \leq s,\] where \(\| \cdot \|_2\) is the \(\ell_2\) norm, \(\|\beta\|_0=\sum_{i=1}^pI( \beta_i\neq 0)\) is the \(\ell_0\) norm of \(\beta\), and the sparsity level \(s\) is usually an unknown non-negative integer. Next, we present an example to show how to use the abess package to solve a simple problem.
We generate a design matrix \(X\) containing 300 observation and each observation has 1000 predictors. The response variable \(y\) is linearly related to the first, second, and fifth predictors in \(X\): \[y = 3X_1 + 1.5X_2 + 2X_5 + \epsilon,\] where \(\epsilon\) is a standard normal random variable.
library(abess)
synthetic_data <- generate.data(n = 300, p = 1000,
beta = c(3, 1.5, 0, 0, 2, rep(0, 995)))
dim(synthetic_data[["x"]])
## [1] 300 1000
head(synthetic_data[["y"]])
## [,1]
## [1,] -4.063922
## [2,] 3.855246
## [3,] -3.041391
## [4,] -1.081257
## [5,] 4.986772
## [6,] 4.470901
dat <- cbind.data.frame("y" = synthetic_data[["y"]],
synthetic_data[["x"]])
Then, we use the main function abess
in the package to fit this dataset. By setting the arguments support.size = s
, abess
function conducts Algorithm 1 in the for best-subset selection with a sparsity level s
. In our example, we set the options: support.size = 3
, and we run Algorithm 1 with the following command:
abess_fit <- abess(y ~ ., data = dat, support.size = 3)
The output of abess
comprises the selected best model:
head(coef(abess_fit, sparse = FALSE))
## 3
## (intercept) -0.01802179
## x1 2.96418205
## x2 1.45090693
## x3 0.00000000
## x4 0.00000000
## x5 1.90592036
The best model’s support set is identical to the ground truth, and the coefficient estimation is the same as the oracle estimator given by lm
function:
lm(y ~ ., data = dat[, c(1, c(1, 2, 5) + 1)])
##
## Call:
## lm(formula = y ~ ., data = dat[, c(1, c(1, 2, 5) + 1)])
##
## Coefficients:
## (Intercept) x1 x2 x5
## -0.01802 2.96418 1.45091 1.90592
Imaging we are unknown about the true sparsity level in real world data, and thus, we need to determine the most proper one. The Algorithm 3 in the is designed for this scenario. abess
is capable of performing this algorithm:
abess_fit <- abess(y ~ ., data = dat)
The output of abess
also comprises the selected best model:
best_size <- abess_fit[["best.size"]]
print(best_size)
## [1] 3
head(coef(abess_fit, support.size = best_size, sparse = FALSE))
## 3
## (intercept) -0.01802179
## x1 2.96418205
## x2 1.45090693
## x3 0.00000000
## x4 0.00000000
## x5 1.90592036
The output model accurately detect the true model size, which implies the Algorithm 3 efficiently find both the optimal sparsity level and true effective predictors.
The consists of 18 variables about crime from the 1995 FBI UCR (e.g., per capita arson crimes and per capita violent crimes), communities information in the U.S. (e.g., the percent of the population considered urban), socio-economic data from the 90s census (e.g., the median family income), and law enforcement data from the 1990 law enforcement management and admin stats survey (e.g., per capita number of police officers). It would be appropriate if any of the crime state in community can be modeled by the basic community information, socio-economic and law enforcement state in community. Here, without the loss of generality, per capita violent crimes is chosen as the response variable, and 102 numerical variables as well as their pairwise interactions is considered as predictors. The pre-processed dataset for statistical modeling has 200 observations and 5253 predictors, and the code for pre-processing are openly shared in .
The pre-processed dataset can be freely downloaded by running:
working_directory <- getwd()
if (file.exists("crime.rda")) {
load("crime.rda")
} else {
crime_data_url <- "https://github.com/abess-team/abess/raw/master/R-package/data-raw/crime.rda"
download.file(crime_data_url, "crime.rda")
load(file.path(working_directory, "crime.rda"))
}
As mentioned before, this dataset comprises 5000+ features, much larger than the number of observations:
dim(crime)
## [1] 500 5254
And thus, it would be better to first perform feature screening, which is also supported by the abess
function. Suppose we are interested in retaining 1000 variables with the largest marginal utility, then we can conduct the command:
abess_fit <- abess(y ~ ., data = crime, screening.num = 1000)
abess_fit
## Call:
## abess.formula(formula = y ~ ., data = crime, screening.num = 1000)
##
## support.size dev GIC
## 1 0 381863.43 6426.409
## 2 1 172846.92 6042.701
## 3 2 155974.89 6003.965
## 4 3 147047.98 5987.116
## 5 4 141513.44 5980.554
## 6 5 135523.59 5971.549
## 7 6 132507.52 5972.916
## 8 7 128890.08 5971.696
## 9 8 123742.00 5963.935
## 10 9 120870.35 5964.815
## 11 10 118266.27 5966.545
## 12 11 116297.21 5970.770
## 13 12 113114.92 5969.517
## 14 13 110998.98 5972.696
## 15 14 109047.24 5976.445
## 16 15 108012.14 5984.296
## 17 16 106262.19 5988.749
## 18 17 105408.98 5997.338
## 19 18 104432.72 6005.306
## 20 19 103561.37 6013.736
## 21 20 103045.98 6023.861
## 22 21 102150.55 6032.117
## 23 22 101433.48 6041.215
## 24 23 100600.07 6049.709
## 25 24 97994.13 6049.207
## 26 25 97298.40 6058.264
## 27 26 96514.02 6066.836
## 28 27 95857.94 6076.046
## 29 28 95100.64 6084.700
## 30 29 94671.51 6095.058
## 31 30 93423.23 6101.042
## 32 31 92217.23 6107.165
## 33 32 91889.75 6118.006
The returned object of abess
includes the features selected by screening. We exhibit six variables of them:
head(abess_fit[["screening.vars"]])
## [1] "pctBlack" "pctWhite" "medIncome" "pctWdiv" "pctPubAsst"
## [6] "medFamIncome"
Then, by the generic extract
function, we can obtain the best model detected by ABESS
algorithm, and get the variables in the best model:
best_model <- extract(abess_fit)
str(best_model)
## List of 7
## $ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:8] 303 331 368 442 1747 3014 3267 3950
## .. ..@ p : int [1:2] 0 8
## .. ..@ Dim : int [1:2] 5253 1
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:5253] "pop" "perHoush" "pctBlack" "pctWhite" ...
## .. .. ..$ : chr "8"
## .. ..@ x : num [1:8] 0.307252 -0.457335 0.537162 -0.07251 0.000439 ...
## .. ..@ factors : list()
## $ intercept : num 530
## $ support.size: int 8
## $ support.vars: chr [1:8] "pctBlack:pctWhite" "pctBlack:pctCollGrad" "pctBlack:pctPopDenseHous" "pctWhite:pctKids2Par" ...
## $ support.beta: num [1:8] 0.307252 -0.457335 0.537162 -0.07251 0.000439 ...
## $ dev : num 123742
## $ tune.value : num 5964
best_vars <- best_model[["support.vars"]]
best_vars
## [1] "pctBlack:pctWhite"
## [2] "pctBlack:pctCollGrad"
## [3] "pctBlack:pctPopDenseHous"
## [4] "pctWhite:pctKids2Par"
## [5] "pctPubAsst:ownHousQrange"
## [6] "pctEmployMfg:pctVacantBoarded"
## [7] "pctMaleDivorc:pctSmallHousUnits"
## [8] "pctKidsBornNevrMarr:pctVacantBoarded"