# vtreat significance

#### 2016-10-24

vtreat::prepare includes a required argument pruneSig that (if not NULL) is used to prune variables. Obviously significance depends on training set size (so is not an intrinsic property of just the variables) and there are issues of bias in the estimate (which vtreat attempts to eliminate by estimating significance of complex sub-model variables on cross-validated or out of sample data). As always there is a question of what to set a significance control to.

Our advice is the following pragmatic:

Use variable filtering on wide datasets (datasets with many columns or variables). Most machine learning algorithms can not defend themselves against large numbers of noise variables (including those algorithms that have cross-validation procedures built in). Examples are given here.

As an upper bound think of setting pruneSig below 1/numberOfColumns. Setting pruneSig to 1/numberOfColumns means that (in expectation) only a constant number of pure noise variables (variables with no actual relation to the outcome we are trying to predict) should create columns. This means (under some assumptions, and in expectation) we expect only a bounded number of noisy columns to be exposed to downstream statistical and machine learning algorithms (which they can presumably handle).

As a lower bound think of what sort of good variables get thrown out at a given setting of pruneSig. For example suppose our problem is categorization in a data set with n/2 positive examples and n/2 negative examples. Consider the observed significance of a rare indicator variable that is on k times in training and is only on for positive instances. A random variable that is on k times would achieve this purity with probability $$2^{-k}$$, so we expect it to have a -log(significance) in the ballpark of k. So a pruneSig of $$2^{-k}$$ will filter all such variables out (be they good or bad). Thus if you want levels or indicators that are on only a z fraction of the time on a training set of size n you want pruneSig >> $$2^{-z*n}$$.

Example:

signk <- function(n,k) {
sigTab <- data.frame(y=c(rep(TRUE,n/2),rep(FALSE,n/2)),v=FALSE)
sigTab[seq_len(k),'v'] <- TRUE
vtreat::designTreatmentsC(sigTab,'v','y',TRUE,verbose=FALSE)$scoreFrame[1,'sig'] } sigTab <- data.frame(k=c(1,2,3,4,5,10,20,50,100)) # If you want to see a rare but perfect indicator of positive class # that's only on k times out of 1000, this is the lower bound on pruneSig sigTab$sigEst = vapply(sigTab$k,function(k) signk(1000,k),numeric(1)) sigTab$minusLogSig = -log(sigTab$sigEst) # we expect this to be approximately k print(sigTab) ## k sigEst minusLogSig ## 1 1 2.388636e-01 1.431863 ## 2 2 9.565153e-02 2.347044 ## 3 3 4.119677e-02 3.189395 ## 4 4 1.836242e-02 3.997449 ## 5 5 8.351092e-03 4.785363 ## 6 10 1.863495e-04 8.587887 ## 7 20 1.131954e-07 15.994150 ## 8 50 2.209988e-17 38.350959 ## 9 100 1.952762e-34 77.618649 For a data set with 100 variables (and 1000 rows), you might want to set pruneSig <= 0.01 to limit the number of pure noise variables that enter the model. Note that this value is smaller than the lower bounds given above for $$k < 5$$. This means that in a data set of this width and length, you may not be able to detect rare but perfect indicators that occur fewer than 5 times. You would have a chance of using such rare indicators in a catN or catB effects coded variable. Below we design a data frame with a perfect categorical variable (completely determines the outcome y) where each level occurs exactly 2 times. The individual levels are insignificant, but we can still extract a significant catB effect coded variable. set.seed(3346) n <- 1000 k <- 4 d <- data.frame(y=rbinom(n,size=1,prob=0.5)>0) d$catVarNoise <- rep(paste0('lev',sprintf("%03d",1:floor(n/k))),(k+1))[1:n]
d$catVarPerfect <- paste0(d$catVar,substr(as.character(d$y),1,1)) d <- d[order(d$catVarPerfect),]
head(d)
##         y catVarNoise catVarPerfect
## 1   FALSE      lev001       lev001F
## 501 FALSE      lev001       lev001F
## 251  TRUE      lev001       lev001T
## 751  TRUE      lev001       lev001T
## 2   FALSE      lev002       lev002F
## 252 FALSE      lev002       lev002F
treatmentsC <- vtreat::designTreatmentsC(d,c('catVarNoise','catVarPerfect'),'y',TRUE)
## [1] "desigining treatments Mon Oct 24 08:41:25 2016"
## [1] "designing treatments Mon Oct 24 08:41:25 2016"
## [1] " have level statistics Mon Oct 24 08:41:25 2016"
## [1] "design var catVarNoise Mon Oct 24 08:41:25 2016"
## [1] "design var catVarPerfect Mon Oct 24 08:41:25 2016"
## [1] " scoring treatments Mon Oct 24 08:41:25 2016"
## [1] "have treatment plan Mon Oct 24 08:41:25 2016"
## [1] "rescoring complex variables Mon Oct 24 08:41:25 2016"
## [1] "done rescoring complex variables Mon Oct 24 08:41:25 2016"
# Estimate effect significance (not coeficient significance).
estSigGLM <- function(xVar,yVar,numberOfHiddenDegrees=0) {
d <- data.frame(x=xVar,y=yVar,stringsAsFactors = FALSE)
model <- stats::glm(stats::as.formula('y~x'),
data=d,
delta_deviance <- model$null.deviance - model$deviance
delta_df <- model$df.null - model$df.residual + numberOfHiddenDegrees
pRsq <- 1.0 - model$deviance/model$null.deviance
sig <- stats::pchisq(delta_deviance, delta_df, lower.tail=FALSE)
sig
}

prepD <- vtreat::prepare(treatmentsC,d,pruneSig=c())

vtreat produces good variable significances using out of sample simulation (cross frames).

print(treatmentsC$scoreFrame[c('varName','origName','sig')]) ## varName origName sig ## 1 catVarNoise_catB catVarNoise 4.683998e-01 ## 2 catVarPerfect_catP catVarPerfect 6.198374e-01 ## 3 catVarPerfect_catB catVarPerfect 1.802669e-211 Signal carrying complex variables can score as signficant, even those composed of rare levels. summary(glm(y~d$catVarPerfect=='lev001T',data=d,family=binomial))
##
## Call:
## glm(formula = y ~ d$catVarPerfect == "lev001T", family = binomial, ## data = d) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.152 -1.152 -1.152 1.203 1.203 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.06014 0.06334 -0.949 0.342 ## d$catVarPerfect == "lev001T"TRUE  13.62620  378.59287   0.036    0.971
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1385.5  on 999  degrees of freedom
## Residual deviance: 1382.6  on 998  degrees of freedom
## AIC: 1386.6
##
## Number of Fisher Scoring iterations: 12
estSigGLM(prepD$catVarPerfect_catB,prepD$y,0) # wrong est
## Warning: glm.fit: algorithm did not converge
## [1] 2.958641e-303
estSigGLM(prepD$catVarPerfect_catB,prepD$y,
numberOfHiddenDegrees=length(unique(d$catVarPerfect))-1) ## Warning: glm.fit: algorithm did not converge ## [1] 3.963376e-90 Noise variables (those without a relation to outcome) are also scored correctly as long was we account for the degrees of freedom. summary(glm(y~d$catVarNoise=='lev001',data=d,family=binomial))
##
## Call:
## glm(formula = y ~ d$catVarNoise == "lev001", family = binomial, ## data = d) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.177 -1.154 -1.154 1.201 1.201 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.05624 0.06340 -0.887 0.375 ## d$catVarNoise == "lev001"TRUE  0.05624    1.00201   0.056    0.955
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1385.5  on 999  degrees of freedom
## Residual deviance: 1385.5  on 998  degrees of freedom
## AIC: 1389.5
##
## Number of Fisher Scoring iterations: 3
estSigGLM(prepD$catVarNoise_catB,prepD$y,0) # wrong est
## [1] 1.223667e-63
estSigGLM(prepD$catVarNoise_catB,prepD$y,
numberOfHiddenDegrees=length(unique(d\$catVarNoise))-1)
## [1] 0.07074029