Introduction to the Package tree.bins

Piro Polo

2018-06-13

Overview

When conducting data analysis or using machine learning algorithms, you may encounter categorical variables with many levels. In these scenarios, decision trees can be used to decide how best to collapse these variables into a more manageable number of levels. I created the package ‘tree.bins’ to let users recategorize categorical variables, conditional on a response variable, by iteratively building a decision tree for each categorical variable (class factor) and the selected response variable. Each decision tree is built with the rpart() function from the ‘rpart’ package. The rules from the leaves of the tree are extracted and used to recategorize (bin) the corresponding categorical variable (predictor). This step is performed for each categorical variable in the data set passed to the data parameter. Only variables with more than two factor levels are considered. The final output is a data set containing the recategorized variables, a list containing a mapping table for each candidate variable, or both. For more details see Dr. Yan-yan Song’s article (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/) or T. Hastie et al. (2009, ISBN: 978-0-387-84857-0). For detailed examples and functionality see the vignettes.

Introduction

When working with large data sets, there may be a need to recategorize candidate variables by some criterion. The ‘tree.bins’ package allows users to recategorize these variables through a decision tree method derived from the rpart() function of the ‘rpart’ package. The package is especially useful if the data set contains several factor class variables with an unusually large number of levels. Its intended purpose is to recategorize predictors in order to reduce the number of dummy variables created when applying a statistical method to model a response. This can result in more parsimonious and/or more accurate models. The first half of this document walks through a typical problem, a variable with many levels, and the latter half covers ‘tree.bins’ functionality and usage.

Pre-Categorization: Typical Variable for Consideration

This section illustrates a typical variable that could be considered for recategorization.

Visualization of Candidate Variable

Using a subset of the Ames data set, the chunk below plots the average home sale price for each level of Neighborhood.

# AmesSubset ships with tree.bins; the other packages supply %>%, ggplot() and theme_economist()
library(tree.bins)
library(dplyr)
library(ggplot2)
library(ggthemes)

# Average sale price (in thousands) per neighborhood, plotted from highest to lowest
AmesSubset %>% 
  select(SalePrice, Neighborhood) %>% 
  group_by(Neighborhood) %>% 
  summarise(AvgPrice = mean(SalePrice)/1000) %>% 
  ggplot(aes(x = reorder(Neighborhood, -AvgPrice), y = AvgPrice)) +
  geom_col(fill = "#389135") + 
  labs(x = "Neighborhoods", y = "Avg Price (in thousands)", 
       title = "Average Home Prices of Neighborhoods") +
  theme_economist() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 90, hjust = 1, size = 8),
        axis.title.x = element_text(size = 12), axis.text.y = element_text(size = 9),
        axis.title.y = element_text(size = 12))

Notice that many neighborhoods have similar average sale prices. This suggests that the levels of the Neighborhood variable could be combined and recategorized into fewer levels.

Statistical Method Implementation of Candidate Variable

The following shows the results of fitting a statistical learning method (linear regression in this example) on the Neighborhood categorical variable without first using the tree.bins() function.

fit <- lm(formula = SalePrice ~ Neighborhood, data = AmesSubset)
summary(fit)
#> 
#> Call:
#> lm(formula = SalePrice ~ Neighborhood, data = AmesSubset)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -163138  -27138   -4526   20405  433829 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         189845.3    11430.8  16.608  < 2e-16 ***
#> NeighborhoodBlueste -56359.6    22861.6  -2.465 0.013774 *  
#> NeighborhoodBrDale  -85937.8    16366.4  -5.251 1.67e-07 ***
#> NeighborhoodBrkSide -63894.3    12860.7  -4.968 7.33e-07 ***
#> NeighborhoodClearCr  14812.4    14903.9   0.994 0.320410    
#> NeighborhoodCollgCr  12799.6    12093.5   1.058 0.290009    
#> NeighborhoodCrawfor  15204.5    12971.2   1.172 0.241264    
#> NeighborhoodEdwards -57332.8    12269.8  -4.673 3.17e-06 ***
#> NeighborhoodGilbert   -403.9    12398.5  -0.033 0.974018    
#> NeighborhoodGreens    8454.7    26066.3   0.324 0.745703    
#> NeighborhoodGrnHill  90154.7    38763.8   2.326 0.020131 *  
#> NeighborhoodIDOTRR  -86378.8    13199.2  -6.544 7.56e-11 ***
#> NeighborhoodLandmrk -52845.3    53615.2  -0.986 0.324428    
#> NeighborhoodMeadowV -97588.5    15121.5  -6.454 1.36e-10 ***
#> NeighborhoodMitchel -27485.2    12878.0  -2.134 0.032940 *  
#> NeighborhoodNAmes   -45419.1    11794.3  -3.851 0.000121 ***
#> NeighborhoodNoRidge 131325.3    13581.8   9.669  < 2e-16 ***
#> NeighborhoodNPkVill -49223.4    17382.7  -2.832 0.004675 ** 
#> NeighborhoodNridgHt 127292.8    12422.5  10.247  < 2e-16 ***
#> NeighborhoodNWAmes    2118.9    12631.2   0.168 0.866794    
#> NeighborhoodOldTown -64336.5    12166.8  -5.288 1.37e-07 ***
#> NeighborhoodSawyer  -53419.4    12531.9  -4.263 2.11e-05 ***
#> NeighborhoodSawyerW -11867.9    12913.9  -0.919 0.358202    
#> NeighborhoodSomerst  42257.6    12326.2   3.428 0.000620 ***
#> NeighborhoodStoneBr 115745.8    14538.5   7.961 2.81e-15 ***
#> NeighborhoodSWISU   -53087.1    14311.7  -3.709 0.000213 ***
#> NeighborhoodTimber   66699.6    13581.8   4.911 9.79e-07 ***
#> NeighborhoodVeenker  64328.2    17090.1   3.764 0.000172 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 52380 on 2021 degrees of freedom
#> Multiple R-squared:  0.5627, Adjusted R-squared:  0.5568 
#> F-statistic:  96.3 on 27 and 2021 DF,  p-value: < 2.2e-16

Notice that 27 dummy variables are created, one for each level of the Neighborhood variable beyond the baseline level (Blmngtn).
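As a rough illustration of why the number of levels matters, the base R sketch below counts the columns that model.matrix() builds for a factor. The vector nbhd is made up for this example and is not part of the Ames data.

```r
# Toy factor with four levels (hypothetical values, not the Ames data)
nbhd <- factor(c("A", "B", "C", "D", "A", "B", "C", "D"))

# With the default treatment contrasts, model.matrix() creates an
# intercept column plus one dummy column per non-baseline level
mm <- model.matrix(~ nbhd)

n.dummies <- ncol(mm) - 1  # drop the intercept column
n.dummies
nlevels(nbhd) - 1          # same count: number of levels minus the baseline
```

By the same arithmetic, the 28 levels of Neighborhood produce the 27 coefficients in the summary above.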

Visualizing the Leaves Created by a Decision Tree

The steps below illustrate how rpart() groups the different levels of Neighborhood into separate leaves. These leaves are used to generate the mappings that are extracted and applied within tree.bins() to recategorize the current data.

d.tree <- rpart::rpart(formula = SalePrice ~ Neighborhood, data = AmesSubset)
rpart.plot::rpart.plot(d.tree)

These five leaves are the categories that tree.bins() will use to recategorize the variable Neighborhood.
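To make the link between leaves and bins concrete, here is a minimal sketch of one way to read a level-to-leaf mapping off a fitted rpart tree. It uses made-up toy data and the where component of the rpart object; it is only an illustration of the idea, not necessarily how tree.bins() extracts its rules internally.

```r
library(rpart)

set.seed(1)
# Toy data (made up): six levels whose means fall into three clear groups
toy <- data.frame(
  grp = factor(rep(c("a", "b", "c", "d", "e", "f"), each = 20)),
  y   = rep(c(10, 10, 50, 50, 100, 100), each = 20) + rnorm(120)
)

fit <- rpart(y ~ grp, data = toy)

# fit$where gives, for every observation, the row of fit$frame
# (that is, the leaf) in which the observation ends up
leaf.map <- unique(data.frame(level = toy$grp, leaf = fit$where))
leaf.map[order(leaf.map$leaf), ]
```

Levels that share a leaf id would end up in the same bin.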

Post-Categorization: Typical Variable for Consideration

This section illustrates the result of using tree.bins() to recategorize a typical variable.

Recategorization of Candidate Variable

Continuing from the above example, we can clearly see that many of the levels within the Neighborhood variable are similar in relation to the response. To limit the number of dummy variables created by a statistical learning method, we would like to group the categories that display similar associations with the response into one bin. We could create visualizations to identify these similarities for each variable, but this would be extremely tedious, not to mention subjective.

A better method is to use the rules generated by a decision tree, which can be obtained with the rpart() function in the ‘rpart’ package. However, doing this by hand remains tedious, especially when there are numerous factor class variables to consider. The tree.bins() function lets the user iteratively recategorize each factor class variable in the specified data set.

sample.df <- AmesSubset %>% select(Neighborhood, MS.Zoning, SalePrice)
binned.df <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", return = "new.fctrs")
levels(sample.df$Neighborhood) #current levels of Neighborhood
#>  [1] "Blmngtn" "Blueste" "BrDale"  "BrkSide" "ClearCr" "CollgCr" "Crawfor"
#>  [8] "Edwards" "Gilbert" "Greens"  "GrnHill" "IDOTRR"  "Landmrk" "MeadowV"
#> [15] "Mitchel" "NAmes"   "NoRidge" "NPkVill" "NridgHt" "NWAmes"  "OldTown"
#> [22] "Sawyer"  "SawyerW" "Somerst" "StoneBr" "SWISU"   "Timber"  "Veenker"
unique(binned.df$Neighborhood) #new levels of Neighborhood
#> [1] "bin#.4" "bin#.3" "bin#.5" "bin#.2" "bin#.1"

The control parameter in tree.bins() serves the same purpose as the control parameter in rpart(). If the user specifies a value for this parameter, that value is used to prune the tree built for each variable passed into the data parameter. Remember that a separate decision tree is built to recategorize each variable into new levels.

binned.df2 <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .001), return = "new.fctrs")
unique(binned.df2$Neighborhood) #new levels of Neighborhood
#>  [1] "bin#.7"  "bin#.3"  "bin#.9"  "bin#.4"  "bin#.1"  "bin#.2"  "bin#.8" 
#>  [8] "bin#.5"  "bin#.6"  "bin#.10"
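The effect of cp on the number of bins can be seen directly in rpart(). The sketch below fits two trees on made-up toy data and counts their leaves; n.leaves() is a helper defined only for this illustration and is not part of ‘rpart’ or ‘tree.bins’.

```r
library(rpart)

set.seed(1)
# Toy data (made up): eight levels with steadily increasing means
toy <- data.frame(
  grp = factor(rep(letters[1:8], each = 30)),
  y   = rep(seq(10, 80, by = 10), each = 30) + rnorm(240)
)

# Hypothetical helper for this sketch: count the leaves of a fitted tree
n.leaves <- function(fit) sum(fit$frame$var == "<leaf>")

coarse <- rpart(y ~ grp, data = toy, control = rpart.control(cp = 0.5))
fine   <- rpart(y ~ grp, data = toy, control = rpart.control(cp = 0.001))

n.leaves(coarse)  # only splits with a large improvement are kept
n.leaves(fine)    # smaller improvements are accepted, so more leaves
```

A smaller cp lets the tree keep splits that improve the fit only slightly, which is why cp = .001 yields more bins for Neighborhood than the earlier call without a control argument.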

The user can also create a two-column data.frame and pass this object into the control parameter. The first column must contain the name(s) of variables in the data.frame specified in the data parameter. The second column must contain the cp value for each respective variable. Any variable not included in this user-created data.frame will use the cp value generated within the rpart() function. Lastly, the column names of this user-created data.frame are irrelevant; only the elements matter.

cp.df <- data.frame(Variables = c("Neighborhood", "MS.Zoning"), CP = c(.001, .1))
binned.df3 <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = cp.df, return = "new.fctrs")
unique(binned.df3$Neighborhood) #new levels of Neighborhood
#>  [1] "bin#.7"  "bin#.3"  "bin#.9"  "bin#.4"  "bin#.1"  "bin#.2"  "bin#.8" 
#>  [8] "bin#.5"  "bin#.6"  "bin#.10"
unique(binned.df3$MS.Zoning) #new levels of MS.Zoning
#> [1] "bin#.1" "bin#.2"
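The per-variable lookup described above can be pictured with a few lines of base R. This is a sketch of the idea only: cp.for() is a hypothetical helper, not a function exported by ‘tree.bins’, and the fallback of 0.01 is the default cp of rpart.control(), used here as an assumption about what an unlisted variable would receive.

```r
cp.df <- data.frame(Variables = c("Neighborhood", "MS.Zoning"), CP = c(.001, .1))

# Hypothetical helper: look up the cp for one variable by position
# (first column = variable names, second column = cp values), so the
# column names themselves never matter
cp.for <- function(var, cp.tbl, default = 0.01) {
  hit <- match(var, cp.tbl[[1]])
  if (is.na(hit)) default else cp.tbl[[2]][hit]
}

cp.for("Neighborhood", cp.df)  # listed variable: uses its own cp
cp.for("Lot.Shape", cp.df)     # unlisted variable: falls back to the default
```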

The Different Return Options of tree.bins()

Depending on what information is most useful to the user, tree.bins() can return the recategorized data.frame, a list of lookup tables, or both. The lookup tables contain the old-to-new value mappings for each variable recategorized by tree.bins().

The “new.fctrs” option returns the recategorized data.frame.

head(binned.df)
#>    SalePrice Neighborhood MS.Zoning
#> 1:    105000       bin#.4    bin#.1
#> 2:    244000       bin#.4    bin#.2
#> 3:    189900       bin#.3    bin#.2
#> 4:    195500       bin#.3    bin#.2
#> 5:    191500       bin#.5    bin#.2
#> 6:    236500       bin#.5    bin#.2

The “lkup.list” option returns a list of the lookup tables.

lookup.list <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "lkup.list")
head(lookup.list[[1]])
#>   Neighborhood Categories
#> 1       BrDale     bin#.1
#> 2      BrkSide     bin#.1
#> 3       IDOTRR     bin#.1
#> 4      MeadowV     bin#.1
#> 5      OldTown     bin#.1
#> 6      Somerst     bin#.2

The “both” option returns an object containing both the new.fctrs and lkup.list outputs. Each can be accessed with the “$” operator.

both <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "both")
head(both$new.fctrs)
#>    SalePrice Neighborhood MS.Zoning
#> 1:    105000       bin#.4    bin#.1
#> 2:    244000       bin#.4    bin#.2
#> 3:    189900       bin#.3    bin#.2
#> 4:    195500       bin#.3    bin#.2
#> 5:    191500       bin#.5    bin#.2
#> 6:    236500       bin#.5    bin#.2
head(both$lkup.list)
#> [[1]]
#>    Neighborhood Categories
#> 1        BrDale     bin#.1
#> 2       BrkSide     bin#.1
#> 3        IDOTRR     bin#.1
#> 4       MeadowV     bin#.1
#> 5       OldTown     bin#.1
#> 6       Somerst     bin#.2
#> 7        Timber     bin#.2
#> 8       Veenker     bin#.2
#> 9       Blmngtn     bin#.3
#> 10      ClearCr     bin#.3
#> 11      CollgCr     bin#.3
#> 12      Crawfor     bin#.3
#> 13      Gilbert     bin#.3
#> 14       Greens     bin#.3
#> 15       NWAmes     bin#.3
#> 16      SawyerW     bin#.3
#> 17      Blueste     bin#.4
#> 18      Edwards     bin#.4
#> 19      Landmrk     bin#.4
#> 20      Mitchel     bin#.4
#> 21        NAmes     bin#.4
#> 22      NPkVill     bin#.4
#> 23       Sawyer     bin#.4
#> 24        SWISU     bin#.4
#> 25      GrnHill     bin#.5
#> 26      NoRidge     bin#.5
#> 27      NridgHt     bin#.5
#> 28      StoneBr     bin#.5
#> 
#> [[2]]
#>   MS.Zoning Categories
#> 1   A (agr)     bin#.1
#> 2   C (all)     bin#.1
#> 3   I (all)     bin#.1
#> 4        RH     bin#.1
#> 5        RM     bin#.1
#> 6        FV     bin#.2
#> 7        RL     bin#.2

Using the bin.oth() Function

With tree.bins(), the user can recategorize factor class variables only in the data.frame passed into the data parameter. If similar data will continue to be collected, or a separate data set will be used to test the performance of the model, the user may want to recategorize the new data.frame with the same lookup tables that were generated from the first data.frame. The bin.oth() function does this: it bins another data.frame using an existing lookup list. The example below passes sample.df, a subset of the AmesSubset data, together with the lookup list generated by tree.bins(), and returns the recategorized data.frame.

oth.binned.df <- bin.oth(list = lookup.list, data = sample.df)
head(oth.binned.df)
#>    SalePrice Neighborhood MS.Zoning
#> 1:    105000       bin#.4    bin#.1
#> 2:    244000       bin#.4    bin#.2
#> 3:    189900       bin#.3    bin#.2
#> 4:    195500       bin#.3    bin#.2
#> 5:    191500       bin#.5    bin#.2
#> 6:    236500       bin#.5    bin#.2
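Conceptually, applying a lookup table to another data.frame amounts to matching old values against the table and substituting the new ones. The base R sketch below shows that idea with three of the Neighborhood mappings listed earlier; new.df is made up, and the code illustrates the general technique rather than the actual implementation of bin.oth().

```r
# Small lookup table using three mappings from the Neighborhood table above
lookup <- data.frame(
  Neighborhood = c("BrDale", "Somerst", "Blmngtn"),
  Categories   = c("bin#.1", "bin#.2", "bin#.3"),
  stringsAsFactors = FALSE
)

# Hypothetical new data to be binned with the existing mapping
new.df <- data.frame(
  Neighborhood = c("Somerst", "BrDale", "Blmngtn", "Somerst"),
  stringsAsFactors = FALSE
)

# match() finds each old value's row in the lookup table;
# indexing Categories by that row swaps in the new bin label
new.df$Neighborhood <- lookup$Categories[match(new.df$Neighborhood, lookup$Neighborhood)]
new.df
```

In this sketch, a value that does not appear in the lookup table would come back as NA, so it is worth checking new data for unseen levels before binning.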