Binning is the process of transforming numerical or continuous data into categorical data. It is a common data pre-processing step in model building.
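rbin supports manual, quantile, equal length and Winsorized binning of numeric variables, lets you combine the levels of factor/categorical variables, creates dummy variables from the resulting bins, and reports weight of evidence, entropy and information value for each bin.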
For manual binning, you need to specify the cut points for the bins. rbin uses left-closed, right-open intervals ([0,1) = {x | 0 ≤ x < 1}) when creating bins. The number of cut points you specify is one less than the number of bins you want to create, i.e. if you want to create 10 bins, you need to specify only 9 cut points, as in the example below. The accompanying RStudio addin, rbinAddin(), can be used to iteratively bin the data and to enforce a monotonic increasing or decreasing trend. After finalizing the bins, you can use rbin_create() to create the dummy variables.
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
bins
#> Binning Summary
#> ---------------------------
#> Method Manual
#> Response y
#> Predictor age
#> Bins 10
#> Count 4521
#> Goods 517
#> Bads 4004
#> Entropy 0.5
#> Information Value 0.12
#>
#>
#> # A tibble: 10 x 7
#> cut_point bin_count good bad woe iv entropy
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 < 29 410 71 339 -0.484 0.0255 0.665
#> 2 < 31 313 41 272 -0.155 0.00176 0.560
#> 3 < 34 567 55 512 0.184 0.00395 0.459
#> 4 < 36 396 45 351 0.00712 0.00000443 0.511
#> 5 < 39 519 47 472 0.260 0.00701 0.438
#> 6 < 42 431 33 398 0.443 0.0158 0.390
#> 7 < 46 449 47 402 0.0993 0.000942 0.484
#> 8 < 51 521 40 481 0.440 0.0188 0.391
#> 9 < 56 445 49 396 0.0426 0.000176 0.500
#> 10 >= 56 470 89 381 -0.593 0.0456 0.700
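The woe, iv and entropy columns can be reproduced from the good and bad counts in the table. The sketch below uses only base R and the counts printed above. The formulas, including the sign convention log(bad share / good share), are inferred from the printed values and are an assumption about how rbin computes them, not a copy of its internals.

```r
# Counts copied from the printed rbin_manual() table.
good <- c(71, 41, 55, 45, 47, 33, 47, 40, 49, 89)
bad  <- c(339, 272, 512, 351, 472, 398, 402, 481, 396, 381)

# Share of all goods and of all bads that falls into each bin.
dist_good <- good / sum(good)
dist_bad  <- bad / sum(bad)

# Weight of evidence and each bin's contribution to the information value.
# Sign convention (bad share over good share) is inferred from the output.
woe <- log(dist_bad / dist_good)
iv  <- (dist_bad - dist_good) * woe

# Base-2 entropy of the good rate within each bin.
p <- good / (good + bad)
entropy <- -(p * log2(p) + (1 - p) * log2(1 - p))

# Summary figures: total information value and count-weighted entropy.
total_iv <- sum(iv)
total_entropy <- sum((good + bad) / sum(good + bad) * entropy)

round(data.frame(woe, iv, entropy), 3)
round(c(iv = total_iv, entropy = total_entropy), 2)
```

Under these assumptions, the first bin gives a weight of evidence of about -0.484 and the per-bin information values sum to about 0.12, in line with the summary above.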
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
rbin_create(mbank, age, bins)
#> # A tibble: 4,521 x 26
#> age job marital education default balance housing loan contact
#> <int> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
#> 1 34 tech~ married tertiary no 297 yes no cellul~
#> 2 49 serv~ married secondary no 180 yes yes unknown
#> 3 38 admi~ single secondary no 262 no no cellul~
#> 4 47 serv~ married secondary no 367 yes no cellul~
#> 5 51 self~ single secondary no 1640 yes no unknown
#> 6 40 unem~ married secondary no 3382 yes no unknown
#> 7 58 reti~ married secondary no 1227 no no cellul~
#> 8 32 unem~ married primary no 309 yes no teleph~
#> 9 46 blue~ married secondary no 922 yes no teleph~
#> 10 32 serv~ married tertiary no 0 no no cellul~
#> # ... with 4,511 more rows, and 17 more variables: day <int>, month <fct>,
#> # duration <int>, campaign <int>, pdays <int>, previous <int>,
#> # poutcome <fct>, y <fct>, `age_<_31` <dbl>, `age_<_34` <dbl>,
#> # `age_<_36` <dbl>, `age_<_39` <dbl>, `age_<_42` <dbl>,
#> # `age_<_46` <dbl>, `age_<_51` <dbl>, `age_<_56` <dbl>,
#> # `age_>=_56` <dbl>
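To make the interval convention and the dummy columns concrete, here is a minimal base R sketch on a handful of made-up ages. It is not how rbin_create() is implemented. It only mirrors what the printed output suggests: left-closed, right-open bins, one 0/1 column per bin, and no column for the first bin, which acts as the reference level.

```r
ages <- c(25, 29, 30, 56, 84)                      # made-up values
cuts <- c(29, 31, 34, 36, 39, 42, 46, 51, 56)      # 9 cut points, 10 bins

# findInterval() uses left-closed, right-open intervals by default,
# so an age equal to a cut point falls into the bin that starts there.
bin_id <- findInterval(ages, cuts) + 1

# Column names follow the pattern seen in the rbin_create() output.
labels <- c(paste0("age_<_", cuts), paste0("age_>=_", cuts[length(cuts)]))

# One indicator column per bin except the first (reference) bin.
dummies <- sapply(2:length(labels), function(k) as.numeric(bin_id == k))
colnames(dummies) <- labels[-1]

bin_id
dummies[, c("age_<_31", "age_>=_56")]
```

An age of 29 lands in the second bin (`age_<_31`), not the first, because the first bin covers only ages strictly below 29.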
You can collapse or combine levels of a factor/categorical variable using rbin_factor_combine() and then use rbin_factor() to look at weight of evidence, entropy and information value. After finalizing the bins, you can use rbin_factor_create() to create the dummy variables. You can use the RStudio addin, rbinFactorAddin(), to interactively combine the levels and create dummy variables after finalizing the bins.
upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
table(out$education)
#>
#> primary unknown upper
#> 691 179 3651
out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper")
table(out$education)
#>
#> primary unknown upper
#> 691 179 3651
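For comparison, the same kind of merge can be done in base R by renaming factor levels, since assigning the same name to several levels collapses them into one. This is only an illustration on a small made-up vector, not a description of what rbin_factor_combine() does internally.

```r
# Made-up factor with the same four levels as mbank$education.
education <- factor(c("primary", "secondary", "tertiary", "unknown", "secondary"))

# Giving two levels the same name merges them into a single level.
to_merge <- c("secondary", "tertiary")
levels(education)[levels(education) %in% to_merge] <- "upper"

table(education)
```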
bins <- rbin_factor(mbank, y, education)
bins
#> Binning Summary
#> ---------------------------
#> Method Custom
#> Response y
#> Predictor education
#> Levels 4
#> Count 4521
#> Goods 517
#> Bads 4004
#> Entropy 0.51
#> Information Value 0.05
#>
#>
#> # A tibble: 4 x 7
#> level bin_count good bad woe iv entropy
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 tertiary 1299 195 1104 -0.313 0.0318 0.610
#> 2 secondary 2352 231 2121 0.170 0.0141 0.463
#> 3 unknown 179 25 154 -0.229 0.00227 0.583
#> 4 primary 691 66 625 0.201 0.00572 0.455
upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
rbin_factor_create(out, education)
#> # A tibble: 4,521 x 19
#> age job marital default balance housing loan contact day month
#> <int> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <int> <fct>
#> 1 34 tech~ married no 297 yes no cellul~ 29 jan
#> 2 49 serv~ married no 180 yes yes unknown 2 jun
#> 3 38 admi~ single no 262 no no cellul~ 3 feb
#> 4 47 serv~ married no 367 yes no cellul~ 12 may
#> 5 51 self~ single no 1640 yes no unknown 15 may
#> 6 40 unem~ married no 3382 yes no unknown 14 may
#> 7 58 reti~ married no 1227 no no cellul~ 14 aug
#> 8 32 unem~ married no 309 yes no teleph~ 13 may
#> 9 46 blue~ married no 922 yes no teleph~ 18 nov
#> 10 32 serv~ married no 0 no no cellul~ 21 nov
#> # ... with 4,511 more rows, and 9 more variables: duration <int>,
#> # campaign <int>, pdays <int>, previous <int>, poutcome <fct>, y <fct>,
#> # education <fct>, education_unknown <dbl>, education_upper <dbl>
Quantile binning uses quantiles as cut points so that each bin holds roughly the same number of observations.
bins <- rbin_quantiles(mbank, y, age, 10)
bins
#> Binning Summary
#> -----------------------------
#> Method Quantile
#> Response y
#> Predictor age
#> Bins 10
#> Count 4521
#> Goods 517
#> Bads 4004
#> Entropy 0.5
#> Information Value 0.12
#>
#>
#> # A tibble: 10 x 7
#> cut_point bin_count good bad woe iv entropy
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 < 29 410 71 339 -0.484 0.0255 0.665
#> 2 < 31 313 41 272 -0.155 0.00176 0.560
#> 3 < 34 567 55 512 0.184 0.00395 0.459
#> 4 < 36 396 45 351 0.00712 0.00000443 0.511
#> 5 < 39 519 47 472 0.260 0.00701 0.438
#> 6 < 42 431 33 398 0.443 0.0158 0.390
#> 7 < 46 449 47 402 0.0993 0.000942 0.484
#> 8 < 51 521 40 481 0.440 0.0188 0.391
#> 9 < 56 445 49 396 0.0426 0.000176 0.500
#> 10 >= 56 470 89 381 -0.593 0.0456 0.700
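The idea behind quantile cut points can be sketched in base R as follows. The data are synthetic, and stats::quantile() is used with its default definition; which quantile definition rbin_quantiles() uses is not stated here, so treat this as an approximation of the technique.

```r
set.seed(42)
x <- sample(18:84, 1000, replace = TRUE)   # synthetic ages, not mbank

n_bins <- 10

# Interior deciles: 9 cut points for 10 bins.
probs <- seq_len(n_bins - 1) / n_bins
cuts <- stats::quantile(x, probs = probs, names = FALSE)

# Assign each value to a left-closed, right-open bin.
bin_id <- findInterval(x, cuts) + 1

# Counts are close to, but not exactly, 100 per bin because of ties.
table(bin_id)
```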
Equal length binning creates bins of equal width. It differs from equal frequency binning, which creates bins that each contain roughly the same number of observations.
bins <- rbin_equal_length(mbank, y, age, 10)
bins
#> Binning Summary
#> ---------------------------------
#> Method Equal Length
#> Response y
#> Predictor age
#> Bins 10
#> Count 4521
#> Goods 517
#> Bads 4004
#> Entropy 0.5
#> Information Value 0.17
#>
#>
#> # A tibble: 10 x 7
#> cut_point bin_count good bad woe iv entropy
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 < 24.6 85 24 61 -1.11 0.0347 0.859
#> 2 < 31.2 822 106 716 -0.137 0.00358 0.555
#> 3 < 37.8 1133 115 1018 0.134 0.00425 0.474
#> 4 < 44.4 943 82 861 0.304 0.0172 0.426
#> 5 < 51 623 52 571 0.349 0.0147 0.414
#> 6 < 57.6 612 66 546 0.0660 0.000574 0.493
#> 7 < 64.2 229 43 186 -0.582 0.0214 0.697
#> 8 < 70.8 34 12 22 -1.44 0.0255 0.937
#> 9 < 77.4 25 13 12 -2.13 0.0471 0.999
#> 10 >= 77.4 15 4 11 -1.04 0.00517 0.837
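Equal-width cut points follow directly from the minimum, the maximum and the number of bins. In the sketch below, `equal_length_cuts()` is a hypothetical helper, and the range of 18 to 84 is an assumption: it is the range implied by the printed cut points (steps of 6.6 starting at 24.6), not a value read from mbank.

```r
# Hypothetical helper: interior cut points of `bins` equal-width intervals.
equal_length_cuts <- function(lower, upper, bins) {
  stopifnot(bins >= 2, upper > lower)
  # bins + 1 equally spaced edges; drop the two outer edges.
  seq(lower, upper, length.out = bins + 1)[2:bins]
}

# Assumed age range of 18 to 84, inferred from the output above.
cuts <- equal_length_cuts(18, 84, 10)
cuts
```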
Winsorized binning is similar to equal length binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data pre-processing stage. For Winsorized binning, the Winsorized minimum and maximum are computed first; the split points are then calculated the same way as in equal length binning. In the example below, winsor_rate = 0.05 sets the Winsorization rate.
bins <- rbin_winsorize(mbank, y, age, 10, winsor_rate = 0.05)
bins
#> Binning Summary
#> ------------------------------
#> Method Winsorize
#> Response y
#> Predictor age
#> Bins 10
#> Count 4521
#> Goods 517
#> Bads 4004
#> Entropy 0.51
#> Information Value 0.1
#>
#>
#> # A tibble: 10 x 7
#> cut_point bin_count good bad woe iv entropy
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 < 30.2 723 112 611 -0.350 0.0224 0.622
#> 2 < 33.4 567 55 512 0.184 0.00395 0.459
#> 3 < 36.6 573 58 515 0.137 0.00225 0.473
#> 4 < 39.8 497 44 453 0.285 0.00798 0.432
#> 5 < 43 396 37 359 0.225 0.00408 0.448
#> 6 < 46.2 461 43 418 0.227 0.00482 0.447
#> 7 < 49.4 281 22 259 0.419 0.00927 0.396
#> 8 < 52.6 309 32 277 0.111 0.000811 0.480
#> 9 < 55.8 244 25 219 0.123 0.000781 0.477
#> 10 >= 55.8 470 89 381 -0.593 0.0456 0.700
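The two steps described above (find the Winsorized limits, then split the range into equal widths) can be sketched in base R. `winsor_cuts()` is a hypothetical helper, the data are made up, and the limits are taken as plain quantiles with R's default definition, which is an assumption about the method rather than rbin's actual code.

```r
# Hypothetical helper: equal-width cut points between Winsorized limits.
winsor_cuts <- function(x, bins, rate) {
  stopifnot(bins >= 2, rate >= 0, rate < 0.5)
  # Limits at the `rate` and `1 - rate` quantiles replace min and max.
  limits <- stats::quantile(x, probs = c(rate, 1 - rate), names = FALSE)
  seq(limits[1], limits[2], length.out = bins + 1)[2:bins]
}

x <- c(1:99, 1000)   # made-up data with one extreme value

# Plain equal-width cut points are dragged upward by the outlier.
raw_cuts <- seq(min(x), max(x), length.out = 5)[2:4]

# Winsorized cut points stay within the bulk of the data.
win_cuts <- winsor_cuts(x, bins = 4, rate = 0.05)

raw_cuts
win_cuts
```

The cut points printed above (30.2 to 55.8 in steps of 3.2) are consistent with Winsorized limits of 27 and 59 for age.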