Integer variables.

2019-04-22

The information_gain and discretize functions are the two most important functions in the FSelectorRcpp package. Usually, there are easy to use, and the user should not worry about anything. However, there’s one case where the user intervention may be required.

By default information_gain and discretize discretizes numeric and integer columns, and leaves other ones as is. In the example below the x column is discretized:

library(FSelectorRcpp)
data <- data.frame(
y = factor(c("A", "A", "A", "B", "B", "A")),
x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
z = c("x", "x", "y", "y", "y", "x"),
stringsAsFactors = FALSE)
discretize(data, y ~ .)
#>   y          x z
#> 1 A (-Inf,5.7] x
#> 2 A (-Inf,5.7] x
#> 3 A (-Inf,5.7] y
#> 4 B (5.7, Inf] y
#> 5 B (5.7, Inf] y
#> 6 A (-Inf,5.7] x

So far, so good. However, there is a problem when the column is of a type integer. Because integers might be used to encode levels of a factor, e.g., a number of a day in a week (0-6, 1-7), id from some mapping (e.g., 1 - Poland, 2 - USA, 3 - Germany), and probably many other (gender and so on), possibly any variable that is going to be one-hot encoded in the final model. In such cases, discretizing those values don’t make any sense, because they’re already discrete. On the other hand, some variables might be encoded as integers, but they could be discretized. E.g., age, income, height, weight, and so on.

It would be tough for FSelectorRcpp to guess if the integer encoded variable should be discretized because it depends heavily on the context. So we (as the authors of the FSelectorRcpp) decided that by default the integers columns will be treated like numerics, and they will be discretized. However, the user can control this behavior by using discIntegers parameter.

See the example below to get some idea about the behavior of the discretize function when integers columns are present:

library(FSelectorRcpp)
data <- data.frame(
y = factor(c("A", "A", "A", "B", "B", "A")),
x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
z = c("x", "x", "y", "y", "y", "x"),
int = as.integer(c(1, 1, 1, 2, 2, 1)),
uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11)),
stringsAsFactors = FALSE)

# default (integers are discretized)
discretize(data, y ~ .)
#>   y          x z        int uniqueInt
#> 1 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]
#> 2 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]
#> 3 A (-Inf,5.7] y (-Inf,1.5] (-Inf,21]
#> 4 B (5.7, Inf] y (1.5, Inf] (21, Inf]
#> 5 B (5.7, Inf] y (1.5, Inf] (21, Inf]
#> 6 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]

# discIntegers is set to FALSE - integers are left as is.
discretize(data, y ~ ., discIntegers = FALSE)
#>   y          x z int uniqueInt
#> 1 A (-Inf,5.7] x   1        10
#> 2 A (-Inf,5.7] x   1        20
#> 3 A (-Inf,5.7] y   1        11
#> 4 B (5.7, Inf] y   2        22
#> 5 B (5.7, Inf] y   2        25
#> 6 A (-Inf,5.7] x   1        11

There might be a case that you have both types of integer columns in your data, and you can’t directly use discIntegers. In this case, you need to manually convert to numeric the columns which should be discretized.

The code below shows a simple approach to convert the columns to numeric if they contain a lot of distinct values:

can_discretize <- function(x, treshold = 0.9) {

is.integer(x) &&
(length(unique(x)) / length(x) > treshold)
}

can_discretize(1:10)
#> [1] TRUE
can_discretize(rnorm(10))
#> [1] FALSE
can_discretize(as.integer(c(rep(1,10), rep(2, 10))))
#> [1] FALSE

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

set.seed(123)
dt <- data_frame(
x = 1:20,
y = rnorm(20),
z = as.integer(c(rep(1,10), rep(2, 10)))
)

glimpse(dt)
#> Observations: 20
#> Variables: 3
#> $x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #>$ y <dbl> -0.56047565, -0.23017749, 1.55870831, 0.07050839, 0.12928774...
#> $z <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 glimpse(dt %>% mutate_if(can_discretize, as.numeric)) #> Observations: 20 #> Variables: 3 #>$ x <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
#> $y <dbl> -0.56047565, -0.23017749, 1.55870831, 0.07050839, 0.12928774... #>$ z <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2