Dexter is an R (R Development Core Team 2005) package intended as a robust and fairly comprehensive system for managing and analyzing test data organized in booklets. It includes facilities for importing and managing test data, assessing and improving the quality of data through basic test-and-item analysis, fitting an IRT model, and computing various estimates of ability.
Dexter differs from other psychometric software both in the tools it includes and in those it omits. The data management component is well developed to promote data integrity and prevent SQL injections and other mishaps. Many psychometric methods not found elsewhere are provided, such as Haberman’s (2007) interaction model generalized for polytomous items, new methods for exploratory and confirmatory DIF analysis, support for the 3DC method of standard setting, and more. On the other hand, there is no support for multivariate IRT models, for the 2PL and 3PL models, or for other methods sufficiently represented in other packages. The central IRT model is a polytomous generalization of the extended marginal Rasch model: we call it the Extended NOminal Response Model (ENORM).
Dexter scales well to a wide range of situations, from tiny examples to production-size problems. The level of complexity and flexibility offered to the user scales along with problem size. For example, users fluent in SQL may wish to tweak the underlying database from either R or SQL, but it is perfectly possible to use Dexter without even realizing that a database is at work in the background. Visual utilities based on shiny are provided for some of the most common analyses, but all methods and graphs can also be accessed through R syntax for customized analyses and reports.
Simplicity of use is achieved by imposing some simple and clear-cut constraints:
To enter and analyse data, the user must start a new project with function `start_new_project`. This takes as a vital argument a data frame of scoring rules. The data frame must contain, in any order, three columns named exactly `item`, `response`, and `score`, and it must provide an exhaustive list of all items in the test (across all booklets), all admissible responses to each item, and the score that will be given to each response.
Open items are accommodated by entering the given scores in place of the original responses and providing a trivial scoring rule (0 is scored as 0, 1 as 1, etc.). Missing value codes may be listed explicitly, but any actual responses not matched in the scoring rules will be automatically assigned a score of 0.
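As an illustration, here is a minimal sketch of a scoring rules data frame; the item IDs and responses are hypothetical, with one multiple-choice item and one open item scored through a trivial rule:
# hypothetical rules: a multiple-choice item and an open item whose
# 'responses' are already scores
rules = data.frame(
  item     = c(rep("mc1", 3), rep("open1", 3)),
  response = c("A", "B", "C", "0", "1", "2"),
  score    = c( 0,   1,   0,   0,   1,   2 )
)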
Another argument that may be passed to `start_new_project` is a name for the SQLite database file (there is a default value). The function creates an empty database under that name, and users are then able to input response data. Finally, the names of any person covariates to be entered along with the cognitive data must be declared here.
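A call might then look as follows (a sketch; the file name is illustrative, and the same pattern is used in the worked example below):
# create a project from the hypothetical rules above, declaring one
# person covariate up front
db = start_new_project(rules, "myproject.db",
                       covariates = list(gender = "<unknown>"))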
While various approaches to inputting data may be possible, we believe that the one provided with Dexter is easy and safe. As most researchers seem to keep their data in the rectangular formats provided by software like Excel, SPSS, or even R, we propose the repeated use of function `add_booklet`. The user will first use functions like `read.table` and `read.csv`, or packages like readxl, haven, readr, and foreign, to read the data for each booklet separately. Function `add_booklet` will then add the data to the database and return a detailed report on its actions. It is very important to read this report carefully.
Basically, the data from all columns whose names have an exact match in the scoring rules table will be treated as responses to items. Hence, great care must be taken with column names; apart from that, the process is fairly automatic.
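The typical workflow is thus one read-and-add cycle per booklet; the file and booklet names below are hypothetical:
# read one booklet from a flat file and add it to the project
bk1 = read.csv("booklet1.csv")
report = add_booklet(db, bk1, "booklet1")
str(report)  # check which columns were treated as items and covariates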
In addition to the person properties provided with the booklets, users can also provide an arbitrary number of item properties. These are supplied as a data frame, one column per item property, including an item ID column called exactly `item`. The data frame is then passed to function `add_item_properties`. The function returns a detailed report about any items not matched in the database, or items present in the database but not given item properties.
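For example (a sketch, with a hypothetical property):
# one row per item, one column per property, plus the mandatory 'item' column
iprop = data.frame(item   = c("mc1", "open1"),
                   format = c("choice", "open"))
add_item_properties(db, iprop)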
Functions like `show_booklets`, `show_items`, `show_item_properties`, and `show_person_properties` help users keep track of what has been entered in the database.
All analyses and graphs are available through R syntax, which facilitates automation and easy programming of customized reports. In addition, there are two interactive functions, `iTIA` and `iModels`.

Tools for assessing the quality of items and tests include:

* a very traditional table of test-and-item analysis (mean scores, p-values, correlations between the item and the sum/rest score on the test, etc.), produced with function `tia_tables` and supplemented with distractor plots;
* a much less traditional comparison between the Rasch / partial credit model and the interaction model proposed by Haberman (2007), produced with function `fit_inter` and equipped with print and plot methods.
The interaction model was initially developed for dichotomous items. We have generalised it for items having more than two (potentially, a largish number of) response categories. A useful upshot is that we can represent scores on subtests as super-items and analyse these as normal items (function `fit_domains`).
Function `profile_plot` produces a novel plot that provides information related to both DIF analysis and the profile analysis proposed by Verhelst (2012).
When the test contains multiple test forms (booklets), the design is determined automatically as the data are entered and compared against the scoring rules. Function `design_is_connected` is useful in determining whether the design is connected.
The main function to calibrate an IRT model is `fit_enorm`. Examinee proficiency can be estimated by maximum likelihood (function `ability`) or via plausible values (function `plausible_values`).
Most of the functions accept an arbitrarily complex logical predicate involving booklets, items, responses, persons, person and item covariates, etc., allowing the user to subset the data. An obvious use is to exclude, in a flexible way, some data from analysis. As subsetting can be applied to each function independently, we can, e.g., estimate a model on one subset of examinees and apply it to another.
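A sketch of that last idea, assuming a person covariate `gender` has been declared (the predicate syntax is the same as in the examples below):
# calibrate on one group of examinees ...
prms_f = fit_enorm(db, gender == 'Female')
# ... and draw plausible values for another group from those parameters
pv_m = plausible_values(db, prms_f, gender == 'Male')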
As an example of first steps with Dexter, consider the verbal aggression data (Vansteelandt 2000), analysed in great detail in De Boeck and Wilson (2004) and in many other publications. 243 females and 73 males have assessed on a 3-point scale (‘yes’, ‘perhaps’, or ‘no’) how likely they are to become verbally aggressive in four different frustrating situations. Responses are further structured by type (curse, scold, or shout) and mode (want to do or actually do), resulting in 24 items.
A new project always starts with a complete enumeration of all items, all admissible responses, and the scores assigned to each distinct response. The data does not come from a cognitive test, so the scoring rules are trivial. We have supplied them in data set `verbAggrRules`:
library(dexter)
head(verbAggrRules)
## item response score
## 1 S1WantCurse 0 0
## 2 S1WantCurse 1 1
## 3 S1WantCurse 2 2
## 4 S1DoCurse 0 0
## 5 S1DoCurse 1 1
## 6 S1DoCurse 2 2
db = start_new_project(verbAggrRules, "verbAggression.db", covariates=list(gender="<unknown>"))
There is a single booklet, provided in data set `verbAggrData`; we add it to the project:
head(verbAggrData)
## Gender S1DoCurse S1DoScold S1DoShout S1WantCurse S1WantScold S1WantShout
## 1 Male 1 0 1 0 0 0
## 2 Female 1 2 1 2 2 2
## 3 Female 0 0 0 1 0 0
## 4 Female 2 0 0 2 2 0
## 5 Female 0 0 0 0 1 0
## 6 Female 2 2 0 0 2 0
## S2DoCurse S2DoScold S2DoShout S2WantCurse S2WantScold S2WantShout
## 1 1 0 0 0 0 0
## 2 2 2 1 2 2 1
## 3 0 0 0 1 0 0
## 4 1 1 0 1 1 0
## 5 0 0 0 0 1 0
## 6 2 2 0 0 2 0
## S3DoCurse S3DoScold S3DoShout S3WantCurse S3WantScold S3WantShout
## 1 1 0 0 0 0 1
## 2 0 0 0 1 0 1
## 3 0 0 0 0 0 0
## 4 1 0 0 2 0 0
## 5 0 0 0 0 1 0
## 6 1 1 0 2 1 0
## S4DoCurse S4DoScold S4DoShout S4WantCurse S4WantScold S4WantShout
## 1 2 2 2 2 0 0
## 2 1 1 1 1 1 1
## 3 1 0 0 1 0 0
## 4 1 1 0 2 2 0
## 5 2 0 0 2 0 0
## 6 2 2 0 2 2 0
add_booklet(db, verbAggrData, "agg")
## $items
## [1] "S1DoCurse" "S1DoScold" "S1DoShout" "S1WantCurse" "S1WantScold"
## [6] "S1WantShout" "S2DoCurse" "S2DoScold" "S2DoShout" "S2WantCurse"
## [11] "S2WantScold" "S2WantShout" "S3DoCurse" "S3DoScold" "S3DoShout"
## [16] "S3WantCurse" "S3WantScold" "S3WantShout" "S4DoCurse" "S4DoScold"
## [21] "S4DoShout" "S4WantCurse" "S4WantScold" "S4WantShout"
##
## $covariates
## [1] "gender"
##
## $columns_ignored
## character(0)
##
## $auto_add_unknown_rules
## [1] TRUE
##
## $zero_rules_added
## [1] item_id response
## <0 rows> (or 0-length row.names)
The report looks clean. We can take a look at the booklets and items:
show_booklets(db)
## booklet_id n_items n_persons
## 1 agg 24 316
show_items(db)
## item_id
## 1 S1DoCurse
## 2 S1DoScold
## 3 S1DoShout
## 4 S1WantCurse
## 5 S1WantScold
## 6 S1WantShout
## 7 S2DoCurse
## 8 S2DoScold
## 9 S2DoShout
## 10 S2WantCurse
## 11 S2WantScold
## 12 S2WantShout
## 13 S3DoCurse
## 14 S3DoScold
## 15 S3DoShout
## 16 S3WantCurse
## 17 S3WantScold
## 18 S3WantShout
## 19 S4DoCurse
## 20 S4DoScold
## 21 S4DoShout
## 22 S4WantCurse
## 23 S4WantScold
## 24 S4WantShout
We can also add the item properties, i.e., the three factors defining the experimental design. Again, we have supplied them in a separate data set:
data("verbAggrProperties")
head(verbAggrProperties)
## item behavior mode blame situation
## 1 S1DoCurse Curse Do Other Bus
## 2 S1DoScold Scold Do Other Bus
## 3 S1DoShout Shout Do Other Bus
## 4 S1WantCurse Curse Want Other Bus
## 5 S1WantScold Scold Want Other Bus
## 6 S1WantShout Shout Want Other Bus
add_item_properties(db, verbAggrProperties)
show_item_properties(db)
## # A tibble: 96 x 3
## # Groups: value [11]
## item_property value N
## <chr> <chr> <int>
## 1 behavior Curse 8
## 2 behavior Scold 8
## 3 behavior Shout 8
## 4 behavior Curse 8
## 5 behavior Scold 8
## 6 behavior Shout 8
## 7 behavior Curse 8
## 8 behavior Scold 8
## 9 behavior Shout 8
## 10 behavior Curse 8
## # ... with 86 more rows
show_person_properties(db)
## # A tibble: 316 x 3
## # Groups: value [2]
## person_property value N
## <chr> <chr> <int>
## 1 gender Male 73
## 2 gender Female 243
## 3 gender Male 73
## 4 gender Female 243
## 5 gender Male 73
## 6 gender Male 73
## 7 gender Female 243
## 8 gender Female 243
## 9 gender Female 243
## 10 gender Female 243
## # ... with 306 more rows
Let us do the simplest analysis based on classical test theory:
tt = tia_tables(db)
knitr::kable(tt$itemStats, digits=3)
booklet_id | item_id | meanScore | sdScore | maxScore | pvalue | rit | rir | n |
---|---|---|---|---|---|---|---|---|
agg | S1DoCurse | 1.082 | 0.808 | 2 | 0.541 | 0.582 | 0.519 | 316 |
agg | S1DoScold | 0.832 | 0.817 | 2 | 0.416 | 0.651 | 0.596 | 316 |
agg | S1DoShout | 0.468 | 0.710 | 2 | 0.234 | 0.520 | 0.460 | 316 |
agg | S1WantCurse | 1.123 | 0.828 | 2 | 0.562 | 0.537 | 0.468 | 316 |
agg | S1WantScold | 0.930 | 0.852 | 2 | 0.465 | 0.593 | 0.528 | 316 |
agg | S1WantShout | 0.712 | 0.778 | 2 | 0.356 | 0.529 | 0.464 | 316 |
agg | S2DoCurse | 1.003 | 0.834 | 2 | 0.502 | 0.590 | 0.527 | 316 |
agg | S2DoScold | 0.684 | 0.781 | 2 | 0.342 | 0.633 | 0.578 | 316 |
agg | S2DoShout | 0.326 | 0.616 | 2 | 0.163 | 0.532 | 0.481 | 316 |
agg | S2WantCurse | 1.222 | 0.774 | 2 | 0.611 | 0.529 | 0.465 | 316 |
agg | S2WantScold | 0.959 | 0.840 | 2 | 0.479 | 0.562 | 0.494 | 316 |
agg | S2WantShout | 0.734 | 0.816 | 2 | 0.367 | 0.563 | 0.498 | 316 |
agg | S3DoCurse | 0.576 | 0.693 | 2 | 0.288 | 0.497 | 0.438 | 316 |
agg | S3DoScold | 0.294 | 0.557 | 2 | 0.147 | 0.505 | 0.458 | 316 |
agg | S3DoShout | 0.104 | 0.345 | 2 | 0.052 | 0.344 | 0.310 | 316 |
agg | S3WantCurse | 0.810 | 0.766 | 2 | 0.405 | 0.474 | 0.406 | 316 |
agg | S3WantScold | 0.462 | 0.654 | 2 | 0.231 | 0.521 | 0.467 | 316 |
agg | S3WantShout | 0.282 | 0.534 | 2 | 0.141 | 0.438 | 0.389 | 316 |
agg | S4DoCurse | 0.883 | 0.786 | 2 | 0.441 | 0.528 | 0.463 | 316 |
agg | S4DoScold | 0.566 | 0.725 | 2 | 0.283 | 0.567 | 0.510 | 316 |
agg | S4DoShout | 0.225 | 0.513 | 2 | 0.112 | 0.424 | 0.377 | 316 |
agg | S4WantCurse | 0.978 | 0.774 | 2 | 0.489 | 0.495 | 0.427 | 316 |
agg | S4WantScold | 0.589 | 0.744 | 2 | 0.294 | 0.573 | 0.515 | 316 |
agg | S4WantShout | 0.424 | 0.684 | 2 | 0.212 | 0.453 | 0.391 | 316 |
knitr::kable(tt$testStats, digits=3)
booklet_id | nItems | alpha | meanP | meanRit | meanRir | maxTestScore | N |
---|---|---|---|---|---|---|---|
agg | 24 | 0.888 | 0.339 | 0.527 | 0.468 | 48 | 316 |
The Rasch model and the interaction model can be estimated with the `fit_inter` function. This always happens on a booklet-per-booklet basis, so the booklet must always be specified. If more than one booklet is specified, the function will try to find their intersection, i.e., the maximal rectangular data set of items common to all booklets specified. If the intersection is empty, there will be just a warning message. Unfortunately, we cannot demonstrate this with the verbal aggression data, as it contains a single booklet.
m = fit_inter(db, booklet_id=='agg')
print(m)
## $item_id
## [1] "S1DoCurse" "S1DoScold" "S1DoShout" "S1WantCurse" "S1WantScold"
## [6] "S1WantShout" "S2DoCurse" "S2DoScold" "S2DoShout" "S2WantCurse"
## [11] "S2WantScold" "S2WantShout" "S3DoCurse" "S3DoScold" "S3DoShout"
## [16] "S3WantCurse" "S3WantScold" "S3WantShout" "S4DoCurse" "S4DoScold"
## [21] "S4DoShout" "S4WantCurse" "S4WantScold" "S4WantShout"
##
## attr(,"class")
## [1] "tbl_df" "tbl" "data.frame"
The print method shows only the item IDs, because the parameters of the interaction model are not very easy to interpret. The models are most useful when plotted:
plot(m, "S1DoScold", show.observed=TRUE)
By default, the plot shows the regressions of the item score on the total score. The interaction model is shown with a thicker but lighter line, and the Rasch model with a thinner, darker line. The unsmoothed data is not shown by default, so we had to change the `show.observed` parameter. The ‘curtains’ are drawn at the 5% lowest and the 5% highest sum scores.
The Rasch model, or, in this case, the partial credit model, tends to fit these items very well, so the two curves practically coincide. This need not always be the case. The interaction model is quite interesting for practice. It shares the conditioning properties of the Rasch model, which means that we can predict the expected item score at each observed total score. Hence, we can represent the model and the observed data on the same plot, without tricks or circular reasoning.
Furthermore, the interaction model reproduces the mean item scores and the correlations between the item scores and the total score: the vital classical test statistics in the table above. In other words, the interaction model represents, as far as we are concerned, the data, and the two item-total regressions capture well the systematic departures of the Rasch/PCM model from the data. What remains is predominantly random noise; if we had an item where the Rasch model does not fit (not available in this data set), we would observe that the pink dots tend to fluctuate randomly around the interaction model.
We can show the response category probabilities instead of the expected item score:
plot(m, 'S1DoCurse', summate=FALSE)
Now fit and display models for the sum scores on each of the four frustrating situations:
mSit = fit_domains(db, item_property= "situation")
plot(mSit)
Finally, a profile plot comparing the responses of males and females on the two modes: wanting to curse (scold, shout) versus actually cursing (scolding, shouting):
profile_plot(db, item_property='mode', covariate='gender')
We would not like to see such a differential effect in a cognitive test, but with this data set the result is quite interesting.
Function `fit_enorm` estimates the parameters of the Extended NOminal Response Model, the principal IRT model in Dexter. In fact, we have already seen this model above, where it was compared to the interaction model. However, similar to the methods associated with CTT, the interaction model requires complete data and can only be applied on a booklet-per-booklet basis, while `fit_enorm` also works with incomplete designs (function `design_is_connected` is useful in determining whether the design is connected):
parms = fit_enorm(db)
The default estimation method is conditional maximum likelihood (CML), which is usually quite fast. As an alternative, we have provided a reasonably fast Gibbs sampler:
parms_gibbs = fit_enorm(db, method='Bayes')
The function returns a parameters object, which can be passed to the functions that estimate person parameters. Currently, there are two of them: `ability`, which produces maximum likelihood (MLE) and Bayes expected a posteriori (EAP) estimates, and `plausible_values`. Being able to choose between a frequentist and a Bayesian approach to the estimation of either the item parameters or the person parameters enables the user to consider various sources of uncertainty.
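For instance, a sketch (the `method` argument is assumed from the description above):
# MLE ability estimates based on the CML item parameters
abil = ability(db, parms, method = 'MLE')
head(abil)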
Let us draw plausible values for our subjects on what is now a scale of verbal aggression, and examine their distribution.
pv = plausible_values(db, parms)
plot(density(pv$PV1))
For a direct comparison on gender:
pv = merge(pv, dbReadTable(db, 'dxPersons'))
boxplot(PV1~gender, data=pv)
It appears that we cannot find a marked difference in verbal aggression between male and female respondents overall. Compare this result to the very interesting structural differences revealed by the profile plot above.
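To check this more formally, one could run a quick test in base R on the merged data frame above (a sketch):
# compare the plausible values of the two gender groups
t.test(PV1 ~ gender, data = pv)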
To show Dexter at work with a large set of cognitive test data, we would like to analyse a PISA data set. Such a problem is probably too large for a vignette, so we will only present a possible script without actually running it.
library(dplyr)
library(readr)
library(SAScii)
# Fetching the data from the OECD site requires a certain ... dexterity
# for which we are indebted to a kind soul on stackexchange.
# Download the data dictionary and read it in with SAScii
temp = tempfile()
download.file("https://www.oecd.org/pisa/pisaproducts/PISA2012_SAS_scored_cognitive_item.sas", temp)
dict_scored = parse.SAScii(sas_ri = temp)
unlink(temp)
# Download the scored cognitive data
temp = tempfile()
download.file("https://www.oecd.org/pisa/pisaproducts/INT_COG12_S_DEC03.zip",temp, mode="wb")
unzip(temp, "INT_COG12_S_DEC03.txt")
scored = read_fwf(file = 'INT_COG12_S_DEC03.txt',
col_positions = fwf_widths(dict_scored$width), progress = TRUE)
colnames(scored) = dict_scored$varname
unlink(temp)
# Keep only the maths items
standard = 1:13
pisa12_M = subset(scored, (BOOKID %in% standard),
select=c("CNT", "BOOKID", grep("PM",names(scored), value=TRUE)))
rm(scored)
# Items missing by design are coded with 7, and actual nonresponse with 8
# Will replace both with NA for simplicity.
tempBOOKID = pisa12_M$BOOKID
pisa12_M[pisa12_M==7] = NA
pisa12_M[pisa12_M==8] = NA
pisa12_M$BOOKID = tempBOOKID
rm(tempBOOKID)
allNA = apply(pisa12_M, 2, function(x)all(is.na(x)))
pisa12_M = pisa12_M[, !allNA]
mathItems = grep("PM",names(pisa12_M), value=TRUE)
# prepare the scoring rules
responses = lapply(mathItems, function(x)unique(pisa12_M[,x]))
foo = lapply(responses, function(x){
y=data.frame(item=names(x),response=x[,1],score=0L)
names(y)[2]="response"
y
})
rules = Reduce(rbind, foo)
rules$score[rules$response==1] = 1
rules$score[rules$response==2] = 2
rules=rules[order(rules$item, rules$response),]
# create the new project
db = start_new_project(rules, "pisa_math.db", covariates=list(cnt='<unknown country>'))
# add all booklets one by one, deleting columns that may be all NA
pisa12_M %>%
group_by(BOOKID) %>%
do({
allMissing = apply(., 2, function(x)all(is.na(x)))
dat = .[,!allMissing]
print(.$BOOKID[1])
add_booklet(db, dat, .$BOOKID[1])
})
rm(pisa12_M)
# add some item properties -- we have supplied a data set to
# make things a bit easier
items = data.frame(item=show_items(db)$item_id)
domain = subset(
merge(items, PISA_item_class, by.x="item", by.y="ItemCode"),
select = c("item", "Content")
)
domain$isSaS[domain$Content=="Space and shape"] = "SaS"
domain$isSaS[domain$Content!="Space and shape"] = "notSaS"
add_item_properties(db, domain)
show_item_properties(db)
show_person_properties(db)
# Fit the two models for booklet 1, all countries (takes about
# 10 seconds on my current machine)
system.time({m <- fit_inter(db, booklet_id=='1')})
plot(m)
# Analyse by domain
md = fit_domains(db, item_property='Content', booklet_id=='1')
plot(md, nr=2, nc=2)
# Compare three countries on 'Space and shape' vs NOT 'Space and shape'
dev.off()  # reset the plotting layout left over from plot(md, nr=2, nc=2)
profile_plot(db, item_property='isSaS', covariate='cnt',
             booklet_id=='1' & cnt %in% c('DEU','FRA','ITA'))
dbDisconnect(db)
De Boeck, Paul, and Mark Wilson, eds. 2004. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. New York: Springer.
R Development Core Team. 2005. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Vansteelandt, K. 2000. “Formal Methods for Contextualized Personality Psychology.” PhD thesis, K. U. Leuven.