## Briey introduction

The R package abess implement a polynomial algorithm in the for best-subset selection problem: $\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \| y - X\beta\|_2^2, \text{ subject to } \|\beta\|_0 \leq s,$ where $$\| \cdot \|_2$$ is the $$\ell_2$$ norm, $$\|\beta\|_0=\sum_{i=1}^pI( \beta_i\neq 0)$$ is the $$\ell_0$$ norm of $$\beta$$, and the sparsity level $$s$$ is usually an unknown non-negative integer. Next, we present an example to show how to use the abess package to solve a simple problem.

## Quick example

### Fixed support size best subset selection

We generate a design matrix $$X$$ containing 300 observation and each observation has 1000 predictors. The response variable $$y$$ is linearly related to the first, second, and fifth predictors in $$X$$: $y = 3X_1 + 1.5X_2 + 2X_5 + \epsilon,$ where $$\epsilon$$ is a standard normal random variable.

library(abess)
synthetic_data <- generate.data(n = 300, p = 1000,
beta = c(3, 1.5, 0, 0, 2, rep(0, 995)))
dim(synthetic_data[["x"]])
##   300 1000
head(synthetic_data[["y"]])
##           [,1]
## [1,] -4.063922
## [2,]  3.855246
## [3,] -3.041391
## [4,] -1.081257
## [5,]  4.986772
## [6,]  4.470901
dat <- cbind.data.frame("y" = synthetic_data[["y"]],
synthetic_data[["x"]])

Then, we use the main function abess in the package to fit this dataset. By setting the arguments support.size = s, abess function conducts Algorithm 1 in the for best-subset selection with a sparsity level s. In our example, we set the options: support.size = 3, and we run Algorithm 1 with the following command:

abess_fit <- abess(y ~ ., data = dat, support.size = 3)

The output of abess comprises the selected best model:

head(coef(abess_fit, sparse = FALSE))
##                       3
## (intercept) -0.01802179
## x1           2.96418205
## x2           1.45090693
## x3           0.00000000
## x4           0.00000000
## x5           1.90592036

The best model’s support set is identical to the ground truth, and the coefficient estimation is the same as the oracle estimator given by lm function:

lm(y ~ ., data = dat[, c(1, c(1, 2, 5) + 1)])
##
## Call:
## lm(formula = y ~ ., data = dat[, c(1, c(1, 2, 5) + 1)])
##
## Coefficients:
## (Intercept)           x1           x2           x5
##    -0.01802      2.96418      1.45091      1.90592

Imaging we are unknown about the true sparsity level in real world data, and thus, we need to determine the most proper one. The Algorithm 3 in the is designed for this scenario. abess is capable of performing this algorithm:

abess_fit <- abess(y ~ ., data = dat)

The output of abess also comprises the selected best model:

best_size <- abess_fit[["best.size"]]
print(best_size)
##  3
head(coef(abess_fit, support.size = best_size, sparse = FALSE))
##                       3
## (intercept) -0.01802179
## x1           2.96418205
## x2           1.45090693
## x3           0.00000000
## x4           0.00000000
## x5           1.90592036

The output model accurately detect the true model size, which implies the Algorithm 3 efficiently find both the optimal sparsity level and true effective predictors.

### Feature screening for ultra-high dimensional dataset

The consists of 18 variables about crime from the 1995 FBI UCR (e.g., per capita arson crimes and per capita violent crimes), communities information in the U.S. (e.g., the percent of the population considered urban), socio-economic data from the 90s census (e.g., the median family income), and law enforcement data from the 1990 law enforcement management and admin stats survey (e.g., per capita number of police officers). It would be appropriate if any of the crime state in community can be modeled by the basic community information, socio-economic and law enforcement state in community. Here, without the loss of generality, per capita violent crimes is chosen as the response variable, and 102 numerical variables as well as their pairwise interactions is considered as predictors. The pre-processed dataset for statistical modeling has 200 observations and 5253 predictors, and the code for pre-processing are openly shared in .

working_directory <- getwd()
if (file.exists("crime.rda")) {
} else {
crime_data_url <- "https://github.com/abess-team/abess/raw/master/R-package/data-raw/crime.rda"
}

As mentioned before, this dataset comprises 5000+ features, much larger than the number of observations:

dim(crime)
##   500 5254

And thus, it would be better to first perform feature screening, which is also supported by the abess function. Suppose we are interested in retaining 1000 variables with the largest marginal utility, then we can conduct the command:

abess_fit <- abess(y ~ ., data = crime, screening.num = 1000)
abess_fit
## Call:
## abess.formula(formula = y ~ ., data = crime, screening.num = 1000)
##
##    support.size       dev      GIC
## 1             0 381863.43 6426.409
## 2             1 172846.92 6042.701
## 3             2 155974.89 6003.965
## 4             3 147047.98 5987.116
## 5             4 141513.44 5980.554
## 6             5 135523.59 5971.549
## 7             6 132507.52 5972.916
## 8             7 128890.08 5971.696
## 9             8 123742.00 5963.935
## 10            9 120870.35 5964.815
## 11           10 118266.27 5966.545
## 12           11 116297.21 5970.770
## 13           12 113114.92 5969.517
## 14           13 110998.98 5972.696
## 15           14 109047.24 5976.445
## 16           15 108012.14 5984.296
## 17           16 106262.19 5988.749
## 18           17 105408.98 5997.338
## 19           18 104432.72 6005.306
## 20           19 103561.37 6013.736
## 21           20 103045.98 6023.861
## 22           21 102150.55 6032.117
## 23           22 101433.48 6041.215
## 24           23 100600.07 6049.709
## 25           24  97994.13 6049.207
## 26           25  97298.40 6058.264
## 27           26  96514.02 6066.836
## 28           27  95857.94 6076.046
## 29           28  95100.64 6084.700
## 30           29  94671.51 6095.058
## 31           30  93423.23 6101.042
## 32           31  92217.23 6107.165
## 33           32  91889.75 6118.006

The returned object of abess includes the features selected by screening. We exhibit six variables of them:

head(abess_fit[["screening.vars"]])
##  "pctBlack"     "pctWhite"     "medIncome"    "pctWdiv"      "pctPubAsst"
##  "medFamIncome"

Then, by the generic extract function, we can obtain the best model detected by ABESS algorithm, and get the variables in the best model:

best_model <- extract(abess_fit)
str(best_model)
## List of 7
##  $beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ## .. ..@ i : int [1:8] 303 331 368 442 1747 3014 3267 3950 ## .. ..@ p : int [1:2] 0 8 ## .. ..@ Dim : int [1:2] 5253 1 ## .. ..@ Dimnames:List of 2 ## .. .. ..$ : chr [1:5253] "pop" "perHoush" "pctBlack" "pctWhite" ...
##   .. .. ..$: chr "8" ## .. ..@ x : num [1:8] 0.307252 -0.457335 0.537162 -0.07251 0.000439 ... ## .. ..@ factors : list() ##$ intercept   : num 530
##  $support.size: int 8 ##$ support.vars: chr [1:8] "pctBlack:pctWhite" "pctBlack:pctCollGrad" "pctBlack:pctPopDenseHous" "pctWhite:pctKids2Par" ...
##  $support.beta: num [1:8] 0.307252 -0.457335 0.537162 -0.07251 0.000439 ... ##$ dev         : num 123742
##  \$ tune.value  : num 5964
best_vars <- best_model[["support.vars"]]
best_vars
##  "pctBlack:pctWhite"
##  "pctKidsBornNevrMarr:pctVacantBoarded"