cdfquantreg: Smithson & Verkuilen data example

Yiyun Shou, Michael Smithson

Example 1: Juror Judgment Study

Deady (2004) studied naive mock-jurors’ responses to two types of vertic setups. One has two options only (guilt vs. acquittal) and the other has an additional option “not proven”. The researcher also investigated the influence of conflicting testimonial evidence on judgments. The study had 2 (two vs. three-option verdict) by 2 (conflict vs. no-conflict conditons) factorial design. The dependent variable was the juror’s degree of confidence in her or his verdict, expressed as a percentage rating (0–100). The data included the responses from 104 first-year psychology students at The Australian National University.

About the data

The data included three variables.

library(cdfquantreg)
data(cdfqrExampleData)
#Quick overview of the variables
rbind(head(JurorData,4),tail(JurorData,4))
##     vert confl crc99
## 1     -1    -1 0.500
## 2     -1    -1 0.698
## 3     -1    -1 0.797
## 4     -1    -1 0.698
## 101    1     1 0.946
## 102    1     1 0.698
## 103    1     1 0.995
## 104    1     1 0.847

Model fit example

The hypothesis was that the conflicting evidence would lower juror confidence and that the three-option condition would increase it.

That is, it was expected that there would be significant interaction effect between vert and confl when predicting crc99. To test the hypothesis, we fit a a model in cdfquantreg by using the T2-T2 distribution as a demonstration.

# We use T2-T2 distribution
fd <- "logit" # The parent distribution
sd <- "logistic" # The child distribution

# Fit the null model
fit_null <- cdfquantreg(crc99 ~ 1 | 1, fd, sd, data = JurorData)

# Fit a main effect model
fit1 <- cdfquantreg(crc99 ~ vert + confl | 1, fd, sd, data = JurorData)

# Fit the full model
fit2 <- cdfquantreg(crc99 ~ vert*confl | 1, fd, sd, data = JurorData)

anova(fit1,fit2)
## Likelihood ratio tests 
## 
##   Resid. Df -2Loglik Df LR stat Pr(>Chi)
## 1       100  -60.535                    
## 2        99  -62.309  1  1.7737   0.1829
# Obtain the statistics for the null model
summary(fit2)
## Family:  logit logistic 
## Call:  cdfquantreg(formula = crc99 ~ vert * confl | 1, data = JurorData,  
##     fd = fd, sd = sd) 
## 
## Mu coefficients (Location submodel)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2156     0.1480   8.211 2.22e-16 ***
## vert          0.1680     0.1477   1.137    0.255    
## confl         0.1296     0.1477   0.877    0.380    
## vert:confl    0.1969     0.1478   1.332    0.183    
## 
## Sigma coefficients (Dispersion submodel)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.11581    0.08511  -1.361    0.174
## 
## Converge:  0
## Log-Likelihood:  31.1545 
## 
## Gradient:  -2e-04 -3e-04 -2e-04 3e-04 7e-04

The results were similar to the ones outlined in Smithson & Verkuilen (2006). There was no signficant interaction between vert and confl in the location submodel.

Next, we replicate the steps in Smithson & Verkuilen by fitting models with the dispersion submodel, expanding the formula to crc99 ~ vert*confl |vert + confl and crc99 ~ vert*confl |vert*confl.

# Fit a main effect model
fit3 <- cdfquantreg(crc99 ~ vert*confl |vert + confl, fd, sd, data = JurorData)

# Fit the full model
fit4 <- cdfquantreg(crc99 ~ vert*confl |vert*confl, fd, sd, data = JurorData)

anova(fit2, fit3, fit4)
## Likelihood ratio tests 
## 
##   Resid. Df -2Loglik Df LR stat Pr(>Chi)   
## 1        99  -62.309                       
## 2        97  -71.570  2  9.2612 0.009749 **
## 3        96  -72.776  1  1.2063 0.272068   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Obtain the statistics for the null model
summary(fit4)
## Family:  logit logistic 
## Call:  
## cdfquantreg(formula = crc99 ~ vert * confl | vert * confl, data = JurorData,  
##     fd = fd, sd = sd) 
## 
## Mu coefficients (Location submodel)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2064     0.1555   7.758 8.66e-15 ***
## vert          0.1869     0.1555   1.202    0.229    
## confl         0.1353     0.1555   0.870    0.384    
## vert:confl    0.1890     0.1555   1.215    0.224    
## 
## Sigma coefficients (Dispersion submodel)
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.15087    0.08438  -1.788   0.0738 . 
## vert         0.24287    0.08438   2.878   0.0040 **
## confl       -0.08058    0.08438  -0.955   0.3396   
## vert:confl  -0.09272    0.08438  -1.099   0.2718   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Converge:  0
## Log-Likelihood:  36.3882 
## 
## Gradient:  0 0 0 0 0 -1e-04 0 0

As shown in the results, the number of options had signficant effect on the dispersion of the distribution. While whether the evidence is conflict or not does not influence the dispersion. There was no signficant interaction between the two predictors.

The following figures shows how well the model fit the actual response distributions.

# Compare the empirical distribution and the fitted values distribution
breaks <- seq(0,1,length.out =11)

plot(fit4,xlim = c(0.1,1),ylim = c(0,3), breaks = breaks)

Model diagosis

We can check the fitted values as well as model residuals. Currently three types of residuals are avaiable: raw residuals, Pearson’s residuals and deviance residuals.

par(mfrow=c(2,2),mar = c(2,3,2,2))
# Plot the fitted values
plot(fitted(fit4, "full"), main = "Fitted Values")

# Check Residuals
plot(residuals(fit4, "raw"), main = "Raw Residuals")

plot(residuals(fit4, "pearson"), main = "Pearson Residuals")

plot(residuals(fit4, "deviance"), main = "Deviance Residuals")

Example 2: Stress-Anxiety data introduction

A second example is a data from a study that investigates the relationship between stress and anxiety. The data includes 166 nonclinical women in Townsville, Queensland, Australia. The variables are linearly transformed scales from the Depression Anxiety Stress Scales, which normally range from 0 to 42.

About the data

The scores of the two continuous variables Anxiety and Stress were firstly linearly transformed to the (0, 1) interval. The figure below shows kernel density estimates for the two variables. As one would expect in a nonclinical population, there is an active floor for each variable, with this being most pronounced for anxiety. It should be clear that anxiety is strongly skewed.

head(AnxStrData, 8)
##   Anxiety Stress
## 1    0.01   0.01
## 2    0.17   0.29
## 3    0.01   0.17
## 4    0.05   0.41
## 5    0.09   0.21
## 6    0.41   0.45
## 7    0.05   0.21
## 8    0.01   0.01
plot(density(AnxStrData$Anxiety), main = "Anxiety and Stress")
lines(density(AnxStrData$Stress), lty = 2)

Heteroscedasticity is to be expected, as theoretically it is likely that someone experiencing anxiety will also experience a great deal of stress, but the converse is not true, because many people might experience stress without being anxious. That is, stress could be thought of as a necessary but not sufficient condition for anxiety. If so, the relationship between the variables is not one-to-one plus noise, as predicted by ordinary homoscedastic regression, but instead is many-to-one plus noise, which involves heteroscedasticity.

Model fit

We can fit the cdf model the same way as we did for Example data set 1.

# Fit the null model
fit_null <- cdfquantreg(Anxiety ~ 1 | 1, fd, sd, data = AnxStrData)

# Fit the location model
fit1 <- cdfquantreg(Anxiety ~ Stress | 1, fd, sd, data = AnxStrData)

# Fit the full model
fit2 <- cdfquantreg(Anxiety ~ Stress | Stress, fd, sd, data = AnxStrData)

anova(fit_null,fit1, fit2)
## Likelihood ratio tests 
## 
##   Resid. Df -2Loglik Df LR stat Pr(>Chi)    
## 1       164  -518.96                        
## 2       163  -633.74  1  114.78  < 2e-16 ***
## 3       162  -639.49  1    5.74  0.01658 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit2)
## Family:  logit logistic 
## Call:  
## cdfquantreg(formula = Anxiety ~ Stress | Stress, data = AnxStrData,  
##     fd = fd, sd = sd) 
## 
## Mu coefficients (Location submodel)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.7663     0.1174  -40.59   <2e-16 ***
## Stress        5.6518     0.4772   11.84   <2e-16 ***
## 
## Sigma coefficients (Dispersion submodel)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7564     0.1425  -5.307 1.12e-07 ***
## Stress        1.1007     0.4786   2.300   0.0215 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Converge:  0
## Log-Likelihood:  319.7427 
## 
## Gradient:  0 0 0 0

It is clear that Stress influenced both the mean and the dispersion of scores on anxiety.

Model diagosis

# Compare the empirical distribution and the fitted values distribution
plot(fit2)

# Plot the fitted values
plot(fitted(fit2, "full"))

# Check Residuals
plot(residuals(fit2, "raw"))