The DisImpact
R package contains functions that help in determining disproportionate impact (DI) based on the following methodologies:
# From CRAN (Official)
## install.packages('DisImpact')
# From github (Development)
## devtools::install_github('vinhdizzo/DisImpact')
library(DisImpact)
library(dplyr) # Ease in manipulations with data frames
To illustrate the functionality of the package, let’s load a toy data set:
# Load fake data set
data(student_equity)
The toy data set can be summarized as follows:
# Summarize toy data
dim(student_equity)
## [1] 20000 3
dSumm <- student_equity %>%
group_by(Cohort, Ethnicity) %>%
summarize(n=n(), Transfer_Rate=mean(Transfer))
dSumm
## # A tibble: 12 x 4
## # Groups: Cohort [?]
## Cohort Ethnicity n Transfer_Rate
## <int> <chr> <int> <dbl>
## 1 2017 Asian 3000 0.6873333
## 2 2017 Black 1000 0.3100000
## 3 2017 Hispanic 2000 0.2050000
## 4 2017 Multi-Ethnicity 500 0.5240000
## 5 2017 Native American 100 0.4300000
## 6 2017 White 3400 0.6038235
## 7 2018 Asian 3000 0.7433333
## 8 2018 Black 1000 0.2970000
## 9 2018 Hispanic 2000 0.2185000
## 10 2018 Multi-Ethnicity 500 0.4840000
## 11 2018 Native American 100 0.3500000
## 12 2018 White 3400 0.6314706
di_ppg
is the main work function, and it can take on vectors or column names the tidy way:
# Vector
di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.5264 0.03000000 0.6853333
## 2 Black 2000 607 0.3035000 0.5264 0.03000000 0.2735000
## 3 Hispanic 4000 847 0.2117500 0.5264 0.03000000 0.1817500
## 4 Multi-Ethnicity 1000 504 0.5040000 0.5264 0.03099032 0.4730097
## 5 Native American 200 78 0.3900000 0.5264 0.06929646 0.3207035
## 6 White 6800 4200 0.6176471 0.5264 0.03000000 0.5876471
## pct_hi di_indicator
## 1 0.7453333 0
## 2 0.3335000 1
## 3 0.2417500 1
## 4 0.5349903 0
## 5 0.4592965 1
## 6 0.6476471 0
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.5264 0.03000000 0.6853333
## 2 Black 2000 607 0.3035000 0.5264 0.03000000 0.2735000
## 3 Hispanic 4000 847 0.2117500 0.5264 0.03000000 0.1817500
## 4 Multi-Ethnicity 1000 504 0.5040000 0.5264 0.03099032 0.4730097
## 5 Native American 200 78 0.3900000 0.5264 0.06929646 0.3207035
## 6 White 6800 4200 0.6176471 0.5264 0.03000000 0.5876471
## pct_hi di_indicator
## 1 0.7453333 0
## 2 0.3335000 1
## 3 0.2417500 1
## 4 0.5349903 0
## 5 0.4592965 1
## 6 0.6476471 0
Sometimes, one might want to break out the DI calculation by cohort:
# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
as.data.frame
## cohort group n success pct reference moe
## 1 2017 Asian 3000 2062 0.6873333 0.5140 0.03000000
## 2 2017 Black 1000 310 0.3100000 0.5140 0.03099032
## 3 2017 Hispanic 2000 410 0.2050000 0.5140 0.03000000
## 4 2017 Multi-Ethnicity 500 262 0.5240000 0.5140 0.04382693
## 5 2017 Native American 100 43 0.4300000 0.5140 0.09800000
## 6 2017 White 3400 2053 0.6038235 0.5140 0.03000000
## 7 2018 Asian 3000 2230 0.7433333 0.5388 0.03000000
## 8 2018 Black 1000 297 0.2970000 0.5388 0.03099032
## 9 2018 Hispanic 2000 437 0.2185000 0.5388 0.03000000
## 10 2018 Multi-Ethnicity 500 242 0.4840000 0.5388 0.04382693
## 11 2018 Native American 100 35 0.3500000 0.5388 0.09800000
## 12 2018 White 3400 2147 0.6314706 0.5388 0.03000000
## pct_lo pct_hi di_indicator
## 1 0.6573333 0.7173333 0
## 2 0.2790097 0.3409903 1
## 3 0.1750000 0.2350000 1
## 4 0.4801731 0.5678269 0
## 5 0.3320000 0.5280000 0
## 6 0.5738235 0.6338235 0
## 7 0.7133333 0.7733333 0
## 8 0.2660097 0.3279903 1
## 9 0.1885000 0.2485000 1
## 10 0.4401731 0.5278269 1
## 11 0.2520000 0.4480000 1
## 12 0.6014706 0.6614706 0
di_ppg
is also applicable to summarized data; just pass the counts to success
and group size to weight
:
di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>%
as.data.frame
## cohort group n success pct reference moe
## 1 2017 Asian 3000 2062 0.6873333 0.5140 0.03000000
## 2 2017 Black 1000 310 0.3100000 0.5140 0.03099032
## 3 2017 Hispanic 2000 410 0.2050000 0.5140 0.03000000
## 4 2017 Multi-Ethnicity 500 262 0.5240000 0.5140 0.04382693
## 5 2017 Native American 100 43 0.4300000 0.5140 0.09800000
## 6 2017 White 3400 2053 0.6038235 0.5140 0.03000000
## 7 2018 Asian 3000 2230 0.7433333 0.5388 0.03000000
## 8 2018 Black 1000 297 0.2970000 0.5388 0.03099032
## 9 2018 Hispanic 2000 437 0.2185000 0.5388 0.03000000
## 10 2018 Multi-Ethnicity 500 242 0.4840000 0.5388 0.04382693
## 11 2018 Native American 100 35 0.3500000 0.5388 0.09800000
## 12 2018 White 3400 2147 0.6314706 0.5388 0.03000000
## pct_lo pct_hi di_indicator
## 1 0.6573333 0.7173333 0
## 2 0.2790097 0.3409903 1
## 3 0.1750000 0.2350000 1
## 4 0.4801731 0.5678269 0
## 5 0.3320000 0.5280000 0
## 6 0.5738235 0.6338235 0
## 7 0.7133333 0.7733333 0
## 8 0.2660097 0.3279903 1
## 9 0.1885000 0.2485000 1
## 10 0.4401731 0.5278269 1
## 11 0.2520000 0.4480000 1
## 12 0.6014706 0.6614706 0
The user could also pass in custom reference points for comparison (eg, a state-wide rate). di_ppg
accepts either a single reference point to be used or a vector of reference points, one for each cohort. For the latter, the vector of reference points will be taken to correspond to the cohort
variable, alphabetically ordered.
# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.54 0.03000000 0.6853333
## 2 Black 2000 607 0.3035000 0.54 0.03000000 0.2735000
## 3 Hispanic 4000 847 0.2117500 0.54 0.03000000 0.1817500
## 4 Multi-Ethnicity 1000 504 0.5040000 0.54 0.03099032 0.4730097
## 5 Native American 200 78 0.3900000 0.54 0.06929646 0.3207035
## 6 White 6800 4200 0.6176471 0.54 0.03000000 0.5876471
## pct_hi di_indicator
## 1 0.7453333 0
## 2 0.3335000 1
## 3 0.2417500 1
## 4 0.5349903 1
## 5 0.4592965 1
## 6 0.6476471 0
# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>%
as.data.frame
## Joining, by = "cohort"
## cohort group n success pct reference moe
## 1 2017 Asian 3000 2062 0.6873333 0.50 0.03000000
## 2 2017 Black 1000 310 0.3100000 0.50 0.03099032
## 3 2017 Hispanic 2000 410 0.2050000 0.50 0.03000000
## 4 2017 Multi-Ethnicity 500 262 0.5240000 0.50 0.04382693
## 5 2017 Native American 100 43 0.4300000 0.50 0.09800000
## 6 2017 White 3400 2053 0.6038235 0.50 0.03000000
## 7 2018 Asian 3000 2230 0.7433333 0.55 0.03000000
## 8 2018 Black 1000 297 0.2970000 0.55 0.03099032
## 9 2018 Hispanic 2000 437 0.2185000 0.55 0.03000000
## 10 2018 Multi-Ethnicity 500 242 0.4840000 0.55 0.04382693
## 11 2018 Native American 100 35 0.3500000 0.55 0.09800000
## 12 2018 White 3400 2147 0.6314706 0.55 0.03000000
## pct_lo pct_hi di_indicator
## 1 0.6573333 0.7173333 0
## 2 0.2790097 0.3409903 1
## 3 0.1750000 0.2350000 1
## 4 0.4801731 0.5678269 0
## 5 0.3320000 0.5280000 0
## 6 0.5738235 0.6338235 0
## 7 0.7133333 0.7733333 0
## 8 0.2660097 0.3279903 1
## 9 0.1885000 0.2485000 1
## 10 0.4401731 0.5278269 1
## 11 0.2520000 0.4480000 1
## 12 0.6014706 0.6614706 0
The margin of error (MOE) in di_ppg
has 2 underlying assumptions (defaults):
To override 1, the user could specify min_moe
in di_ppg
. To override 2, the user could specify use_prop_in_moe=TRUE
in di_ppg
.
# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.5264 0.02000000 0.6953333
## 2 Black 2000 607 0.3035000 0.5264 0.02191347 0.2815865
## 3 Hispanic 4000 847 0.2117500 0.5264 0.02000000 0.1917500
## 4 Multi-Ethnicity 1000 504 0.5040000 0.5264 0.03099032 0.4730097
## 5 Native American 200 78 0.3900000 0.5264 0.06929646 0.3207035
## 6 White 6800 4200 0.6176471 0.5264 0.02000000 0.5976471
## pct_hi di_indicator
## 1 0.7353333 0
## 2 0.3254135 1
## 3 0.2317500 1
## 4 0.5349903 0
## 5 0.4592965 1
## 6 0.6376471 0
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.5264 0.02000000 0.6953333
## 2 Black 2000 607 0.3035000 0.5264 0.02015028 0.2833497
## 3 Hispanic 4000 847 0.2117500 0.5264 0.02000000 0.1917500
## 4 Multi-Ethnicity 1000 504 0.5040000 0.5264 0.03098933 0.4730107
## 5 Native American 200 78 0.3900000 0.5264 0.06759869 0.3224013
## 6 White 6800 4200 0.6176471 0.5264 0.02000000 0.5976471
## pct_hi di_indicator
## 1 0.7353333 0
## 2 0.3236503 1
## 3 0.2317500 1
## 4 0.5349893 0
## 5 0.4575987 1
## 6 0.6376471 0
di_prop_index
is the main work function for this method, and it can take on vectors or column names the tidy way:
# Without cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## group n success pct_success pct_group di_prop_index
## 1 Asian 6000 4292 0.407674772 0.30 1.3589159
## 2 Black 2000 607 0.057655775 0.10 0.5765578
## 3 Hispanic 4000 847 0.080452128 0.20 0.4022606
## 4 Multi-Ethnicity 1000 504 0.047872340 0.05 0.9574468
## 5 Native American 200 78 0.007408815 0.01 0.7408815
## 6 White 6800 4200 0.398936170 0.34 1.1733417
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
as.data.frame
## group n success pct_success pct_group di_prop_index
## 1 Asian 6000 4292 0.407674772 0.30 1.3589159
## 2 Black 2000 607 0.057655775 0.10 0.5765578
## 3 Hispanic 4000 847 0.080452128 0.20 0.4022606
## 4 Multi-Ethnicity 1000 504 0.047872340 0.05 0.9574468
## 5 Native American 200 78 0.007408815 0.01 0.7408815
## 6 White 6800 4200 0.398936170 0.34 1.1733417
# With cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## cohort group n success pct_success pct_group di_prop_index
## 1 2017 Asian 3000 2062 0.401167315 0.30 1.3372244
## 2 2017 Black 1000 310 0.060311284 0.10 0.6031128
## 3 2017 Hispanic 2000 410 0.079766537 0.20 0.3988327
## 4 2017 Multi-Ethnicity 500 262 0.050972763 0.05 1.0194553
## 5 2017 Native American 100 43 0.008365759 0.01 0.8365759
## 6 2017 White 3400 2053 0.399416342 0.34 1.1747539
## 7 2018 Asian 3000 2230 0.413882702 0.30 1.3796090
## 8 2018 Black 1000 297 0.055122494 0.10 0.5512249
## 9 2018 Hispanic 2000 437 0.081106162 0.20 0.4055308
## 10 2018 Multi-Ethnicity 500 242 0.044914625 0.05 0.8982925
## 11 2018 Native American 100 35 0.006495917 0.01 0.6495917
## 12 2018 White 3400 2147 0.398478099 0.34 1.1719944
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
as.data.frame
## cohort group n success pct_success pct_group di_prop_index
## 1 2017 Asian 3000 2062 0.401167315 0.30 1.3372244
## 2 2017 Black 1000 310 0.060311284 0.10 0.6031128
## 3 2017 Hispanic 2000 410 0.079766537 0.20 0.3988327
## 4 2017 Multi-Ethnicity 500 262 0.050972763 0.05 1.0194553
## 5 2017 Native American 100 43 0.008365759 0.01 0.8365759
## 6 2017 White 3400 2053 0.399416342 0.34 1.1747539
## 7 2018 Asian 3000 2230 0.413882702 0.30 1.3796090
## 8 2018 Black 1000 297 0.055122494 0.10 0.5512249
## 9 2018 Hispanic 2000 437 0.081106162 0.20 0.4055308
## 10 2018 Multi-Ethnicity 500 242 0.044914625 0.05 0.8982925
## 11 2018 Native American 100 35 0.006495917 0.01 0.6495917
## 12 2018 White 3400 2147 0.398478099 0.34 1.1719944
di_80_index
is the main work function for this method, and it can take on vectors or column names the tidy way:
# Without cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## group n success pct reference di_80_index
## 1 Asian 6000 4292 0.7153333 0.7153333 1.0000000
## 2 Black 2000 607 0.3035000 0.7153333 0.4242777
## 3 Hispanic 4000 847 0.2117500 0.7153333 0.2960158
## 4 Multi-Ethnicity 1000 504 0.5040000 0.7153333 0.7045666
## 5 Native American 200 78 0.3900000 0.7153333 0.5452004
## 6 White 6800 4200 0.6176471 0.7153333 0.8634395
## di_indicator
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
as.data.frame
## group n success pct reference di_80_index
## 1 Asian 6000 4292 0.7153333 0.7153333 1.0000000
## 2 Black 2000 607 0.3035000 0.7153333 0.4242777
## 3 Hispanic 4000 847 0.2117500 0.7153333 0.2960158
## 4 Multi-Ethnicity 1000 504 0.5040000 0.7153333 0.7045666
## 5 Native American 200 78 0.3900000 0.7153333 0.5452004
## 6 White 6800 4200 0.6176471 0.7153333 0.8634395
## di_indicator
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
# With cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## cohort group n success pct reference di_80_index
## 1 2017 Asian 3000 2062 0.6873333 0.6873333 1.0000000
## 2 2017 Black 1000 310 0.3100000 0.6873333 0.4510184
## 3 2017 Hispanic 2000 410 0.2050000 0.6873333 0.2982541
## 4 2017 Multi-Ethnicity 500 262 0.5240000 0.6873333 0.7623666
## 5 2017 Native American 100 43 0.4300000 0.6873333 0.6256062
## 6 2017 White 3400 2053 0.6038235 0.6873333 0.8785017
## 7 2018 Asian 3000 2230 0.7433333 0.7433333 1.0000000
## 8 2018 Black 1000 297 0.2970000 0.7433333 0.3995516
## 9 2018 Hispanic 2000 437 0.2185000 0.7433333 0.2939462
## 10 2018 Multi-Ethnicity 500 242 0.4840000 0.7433333 0.6511211
## 11 2018 Native American 100 35 0.3500000 0.7433333 0.4708520
## 12 2018 White 3400 2147 0.6314706 0.7433333 0.8495120
## di_indicator
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
## 7 0
## 8 1
## 9 1
## 10 1
## 11 1
## 12 0
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
as.data.frame
## cohort group n success pct reference di_80_index
## 1 2017 Asian 3000 2062 0.6873333 0.6873333 1.0000000
## 2 2017 Black 1000 310 0.3100000 0.6873333 0.4510184
## 3 2017 Hispanic 2000 410 0.2050000 0.6873333 0.2982541
## 4 2017 Multi-Ethnicity 500 262 0.5240000 0.6873333 0.7623666
## 5 2017 Native American 100 43 0.4300000 0.6873333 0.6256062
## 6 2017 White 3400 2053 0.6038235 0.6873333 0.8785017
## 7 2018 Asian 3000 2230 0.7433333 0.7433333 1.0000000
## 8 2018 Black 1000 297 0.2970000 0.7433333 0.3995516
## 9 2018 Hispanic 2000 437 0.2185000 0.7433333 0.2939462
## 10 2018 Multi-Ethnicity 500 242 0.4840000 0.7433333 0.6511211
## 11 2018 Native American 100 35 0.3500000 0.7433333 0.4708520
## 12 2018 White 3400 2147 0.6314706 0.7433333 0.8495120
## di_indicator
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
## 7 0
## 8 1
## 9 1
## 10 1
## 11 1
## 12 0
Suppose we have a variable that indicates something negative. We could calculate DI on the converse of it:
di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 1708 0.2846667 0.4736 0.03000000 0.2546667
## 2 Black 2000 1393 0.6965000 0.4736 0.03000000 0.6665000
## 3 Hispanic 4000 3153 0.7882500 0.4736 0.03000000 0.7582500
## 4 Multi-Ethnicity 1000 496 0.4960000 0.4736 0.03099032 0.4650097
## 5 Native American 200 122 0.6100000 0.4736 0.06929646 0.5407035
## 6 White 6800 2600 0.3823529 0.4736 0.03000000 0.3523529
## pct_hi di_indicator
## 1 0.3146667 1
## 2 0.7265000 0
## 3 0.8182500 0
## 4 0.5269903 0
## 5 0.6792965 0
## 6 0.4123529 1
We can compute the success, group, and cohort variables on the fly:
# Transform success
a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05))
mean(a)
## [1] 0.0493
di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>%
as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4392 0.7320000 0.5504 0.03000000 0.7020000
## 2 Black 2000 678 0.3390000 0.5504 0.03000000 0.3090000
## 3 Hispanic 4000 989 0.2472500 0.5504 0.03000000 0.2172500
## 4 Multi-Ethnicity 1000 537 0.5370000 0.5504 0.03099032 0.5060097
## 5 Native American 200 83 0.4150000 0.5504 0.06929646 0.3457035
## 6 White 6800 4329 0.6366176 0.5504 0.03000000 0.6066176
## pct_hi di_indicator
## 1 0.7620000 0
## 2 0.3690000 1
## 3 0.2772500 1
## 4 0.5679903 0
## 5 0.4842965 1
## 6 0.6666176 0
# Collapse Black and Hispanic
di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>% as.data.frame
## group n success pct reference moe pct_lo
## 1 Asian 6000 4292 0.7153333 0.5264 0.03000000 0.6853333
## 2 Black/Hispanic 6000 1454 0.2423333 0.5264 0.03000000 0.2123333
## 3 Multi-Ethnicity 1000 504 0.5040000 0.5264 0.03099032 0.4730097
## 4 Native American 200 78 0.3900000 0.5264 0.06929646 0.3207035
## 5 White 6800 4200 0.6176471 0.5264 0.03000000 0.5876471
## pct_hi di_indicator
## 1 0.7453333 0
## 2 0.2723333 1
## 3 0.5349903 0
## 4 0.4592965 1
## 5 0.6476471 0