rqdatatable is an implementation of the rquery piped Codd-style relational algebra hosted on data.table. rquery allows the expression of complex transformations as a series of relational operators, and rqdatatable implements the operators using data.table.
For example, scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.
library("rqdatatable")
## Loading required package: rquery
# data example
dL <- build_frame(
  "subjectID", "surveyCategory"     , "assessmentTotal" |
  1          , "withdrawal behavior", 5                 |
  1          , "positive re-framing", 2                 |
  2          , "withdrawal behavior", 3                 |
  2          , "positive re-framing", 4                 )
scale <- 0.237
# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)) %.>%
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>%
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID',
                      'diagnosis',
                      'probability')) %.>%
  orderby(., cols = 'subjectID')
We can show the expanded form of the query tree.
table(dL;
subjectID,
surveyCategory,
assessmentTotal) %.>%
extend(.,
probability := exp(assessmentTotal * 0.237)) %.>%
extend(.,
probability := probability / sum(probability),
p= subjectID) %.>%
extend(.,
row_number := row_number(),
p= subjectID,
o= "probability" DESC, "surveyCategory" DESC) %.>%
select_rows(.,
row_number <= 1) %.>%
rename(.,
c('diagnosis' = 'surveyCategory')) %.>%
select_columns(.,
subjectID, diagnosis, probability) %.>%
orderby(., subjectID)
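The call that prints an expanded form like the one above is not shown in this section. A minimal sketch, assuming rquery's format() method for pipeline objects and mk_td() to describe a table by name and columns; the two-step pipeline is a shortened stand-in for the full example, and the exact text printed may differ between rquery versions:

```r
library("rquery")

# Describe a table by name and column names only (no data is needed to
# build or print a pipeline).
td <- mk_td("dL", c("subjectID", "surveyCategory", "assessmentTotal"))

# A shortened, illustrative pipeline: the first two stages of the example.
ops <- td %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * 0.237)) %.>%
  extend_nse(.,
             probability := probability / sum(probability),
             partitionby = "subjectID")

# format() renders the operator tree as text; cat() prints it.
txt <- format(ops)
cat(txt, "\n")
```

Printing the tree before running it is a cheap way to check that values such as the scale constant were captured as intended.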
And execute it using data.table.
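The call that runs the pipeline is not shown in this section. A minimal sketch, assuming rqdatatable's ex_data_table() as the entry point and assuming it finds the table named dL in the calling environment (the name recorded by local_td(dL)); the data is rebuilt with data.frame() only to keep the block self-contained:

```r
library("rqdatatable")  # also attaches rquery

# Same data as the example, built with base R for self-containment.
dL <- data.frame(
  subjectID = c(1, 1, 2, 2),
  surveyCategory = c("withdrawal behavior", "positive re-framing",
                     "withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 2, 3, 4),
  stringsAsFactors = FALSE)
scale <- 0.237

# The pipeline from the example above.
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * scale)) %.>%
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>%
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID', 'diagnosis', 'probability')) %.>%
  orderby(., cols = 'subjectID')

# Assumption: ex_data_table() resolves "dL" by name in this environment.
res <- ex_data_table(rquery_pipeline)
print(res)
```
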
## subjectID diagnosis probability
## 1: 1 withdrawal behavior 0.6706221
## 2: 2 positive re-framing 0.5589742
One can also apply the pipeline to new tables.
build_frame(
"subjectID", "surveyCategory" , "assessmentTotal" |
7 , "withdrawal behavior", 5 |
7 , "positive re-framing", 20 ) %.>%
rquery_pipeline
## subjectID diagnosis probability
## 1: 7 positive re-framing 0.9722128
Initial benchmarking of rqdatatable is very favorable (notes here).
Note that rqdatatable has an “immediate mode” which allows direct application of pipeline stages without pre-assembling the pipeline. “Immediate mode” is a convenience for ad-hoc analyses, and has some negative performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found here.
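The two styles can be contrasted in a short sketch. It assumes that, with rqdatatable attached, rquery operators such as extend_nse() accept a data.frame directly and execute each stage at once; the data frame d and the result names are illustrative only:

```r
library("rqdatatable")  # also attaches rquery

d <- data.frame(subjectID = c(1, 1, 2, 2),
                assessmentTotal = c(5, 2, 3, 4))

# Immediate mode (assumed behavior): each operator is applied to the data
# as soon as it is reached, so no pipeline object is built.
res_immediate <- d %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * 0.237)) %.>%
  extend_nse(.,
             probability := probability / sum(probability),
             partitionby = "subjectID")

# Pre-assembled pipeline: build the operator tree once from a table
# description, then apply it to data (and reuse it on other tables).
ops <- local_td(d) %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * 0.237)) %.>%
  extend_nse(.,
             probability := probability / sum(probability),
             partitionby = "subjectID")
res_pipeline <- d %.>% ops
```

Both forms should give the same probabilities; the pre-assembled form is the one recommended above for most work.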
rqdatatable is a fairly complete implementation of rquery. The main difference is that the rqdatatable implementations of sql_node() and theta_join() work by round-tripping through a database handle specified by the rquery.rquery_db_executor option (so they are not very desirable implementations).
To install rqdatatable, please use install.packages("rqdatatable").