Usage Examples

2016-10-18

Introduction

expss package implements some popular functions from spreadsheets and SPSS Statistics software. Implementations are not complete copies of their originals. I try to make them consistent with other R functions. See examples below. You can get help about any function by typing ?function_name in the R console.

Excel functions

Excel toy table:

A B C
1 2 15 50
2 1 70 80
3 3 30 40
4 2 30 40

Code for creating the same table in R:

w = read.csv(text = "
a,b,c
2,15,50
1,70,80
3,30,40
2,30,40"
)

w is the name of our table.

IF

Excel: IF(B1>60, 1, 0)

R: Here we create new column with name d with results. ifelse function is from base R not from ‘expss’ package but included here for completeness.

w$d = ifelse(w$b>60, 1, 0)

If we need to use multiple transformations it is often convenient to use modify function. Inside modify we can put arbitrary number of statements:

w = modify(w, {
    d = ifelse(b>60, 1, 0)
    e = 42
    abc_sum = sum_row(a, b, c)
    abc_mean = mean_row(a, b, c)
})
COUNTIF

Count 1’s in entire dataset.

Excel: COUNTIF(A1:C4, 1)

R:

count_if(1, w)

or

with(w, count_if(1, a, b, c))

Count values greater than 1 in each row of dataset.

Excel: COUNTIF(A1:C1, ">1")

R:

w$d = count_row_if(gt(1), w)  

or

w = modify(w, {
    d = count_row_if(gt(1), a, b, c) 
})

Count values less than or equal to 1 in column A of dataset.

Excel: COUNTIF(A1:A4, "<=1")

R:

count_col_if(lte(1), w$a)

Table of criteria:

Excel R
“<1” lt(1)
“<=1” lte(1)
“<>1” neq(1)
“=1” eq(1)
“>=1” gte(1)
“>1” gt(1)
SUM/AVERAGE

Sum all values in dataset.

Excel: SUM(A1:C4)

R:

sum(w, na.rm = TRUE)

Calculate average of each row of dataset.

Excel: AVERAGE(A1:C1)

R:

w$d = mean_row(w)  

or

w = modify(w, {
    d = mean_row(a, b, c) 
})

Sum values of column A of dataset.

Excel: SUM(A1:A4)

R:

sum_col(w$a)
SUMIF/AVERAGEIF

Sum values greater than 40 in entire dataset.

Excel: SUMIF(A1:C4, ">40")

R:

sum_if(gt(40), w)

or

with(w, sum_if(gt(40), a, b, c))

Sum values less than 40 in each row of dataset.

Excel: SUMIF(A1:C1, "<40")

R:

w$d = sum_row_if(lt(40), w)  

or

w = modify(w, {
    d = sum_row_if(lt(40), a, b, c) 
})

Calculate average of B column with column A values less than 3.

Excel: AVERAGEIF(A1:A4, "<3", B1:B4)

R:

mean_col_if(lt(3), w$a, data = w$b)

or, if we want calculate means for both b and c columns:

with(w, mean_col_if(lt(3), a, data = dtfrm(b, c)))
VLOOKUP

Our dictionary for lookup:

X Y
1 1 apples
2 2 oranges
3 3 peaches

Code for creating the same dictionary in R:

dict = read.csv(text = "
x,y
1,apples
2,oranges
3,peaches",
stringsAsFactors = FALSE
)

Excel: VLOOKUP(A1, $X$1:$Y$3, 2, FALSE)

R:

w$d = vlookup(w$a, dict, 2)

or, we can use column names:

w$d = vlookup(w$a, dict, "y")

SPSS functions

COMPUTE

SPSS:

COMPUTE d = 1.

R:

w$d = 1

or, in specific data.frame

default_dataset(w)

.compute({
    d = 1
})

There can be arbitrary number of statements inside .compute.

IF

SPSS:

IF(a = 3) d = 2.

R:

Default dataset should be already predefined as in previous example.

.compute({
    d = ifelse(a == 3, 2, NA)
})

or,

.compute({
    d = ifs(a == 3 ~ 2)
})
DO IF

SPSS:

DO IF (a>1).
    COMPUTE d = 4.
END IF.

R:

.do_if(a>1, {
    d = 4
})

There can be arbitrary number of statements inside .do_if.

COUNT

SPSS:

COUNT cnt = a1 TO a5 (LO THRU HI).

R:

cnt = count_row_if(lo %thru% hi, a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5 (SYSMIS).

R:

cnt = count_row_if(NA, a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5 (1 THRU 5).

R:

cnt = count_row_if(1 %thru% 5, a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5 (1 THRU HI).

R:

cnt = count_row_if(1 %thru% hi, a1 %to% a5)

or,

cnt = count_row_if(gte(1), a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5 (LO THRU 1).

R:

cnt = count_row_if(lo %thru% 1, a1 %to% a5)

or,

cnt = count_row_if (lte(1), a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5 (1 THRU 5, 99).

R:

cnt = count_row_if(1 %thru% 5 | 99, a1 %to% a5)

SPSS:

COUNT cnt = a1 TO a5(1,2,3,4,5, SYSMIS).

R:

cnt = count_row_if(c(1:5, NA), a1 %to% a5)

count_row_if can be used with default dataset inside the .compute.

RECODE

SPSS:

RECODE V1 (0=1) (1=0) (2, 3=-1) (9=9) (ELSE=SYSMIS)

R:

if_val(v1) = c(0 ~ 1, 1 ~ 0, 2:3 ~ -1, 9 ~ 9, other ~ NA)

SPSS:

RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0).

R:

if_val(qvar) = c(1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, other ~ 0)

SPSS:

RECODE STRNGVAR ('A', 'B', 'C'='A')('D', 'E', 'F'='B')(ELSE=' '). 

R:

if_val(strngvar) = c(c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B', other ~ ' ')

SPSS:

RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER. 

R:

voter = if_val(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0)

if_val can be used with default dataset inside the .compute.

VARIABLE LABELS

SPSS:

VARIABLE LABELS a "Fruits"
                b "Cost"
                c "Price".

R:

.compute({
    var_lab(a) = "Fruits"
    var_lab(b) = "Cost"
    var_lab(c) = "Price"
})
VALUE LABELS

SPSS:

VALUE LABELS a
    1 "apples"
    2 "oranges"
    3 "peaches". 

R:

.compute({
    val_lab(a) = ml_left("
        1 apples
        2 oranges
        3 peaches 
    ")
})

or, without using default dataset:

val_lab(w$a) = ml_left("
    1 apples
    2 oranges
    3 peaches 
")
Simple tables

R:

fre(w$a) # Frequency of fruits
##   Fruits Count Valid percent Percent Responses, % Cumulative responses, %
##   apples     1            25      25           25                      25
##  oranges     2            50      50           50                      75
##  peaches     1            25      25           25                     100
##   #Total     4           100     100          100                      NA
##     <NA>     0            NA       0           NA                      NA
cro_cpct(w$b, w$a) # Column percent of cost by fruits
##    Cost apples oranges peaches #Total
##      15     NA      50      NA     25
##      30     NA      50     100     50
##      70    100      NA      NA     25
##  #Total      1       2       1      4
cro_mean(dtfrm(w$b, w$c), w$a) # Mean cost and price by fruits
##        apples oranges peaches #Total
##   Cost     70    22.5      30  36.25
##  Price     80    45.0      40  52.50

or, the same with default dataset:

.fre(a) # Frequency of fruits
.cro_cpct(b, a) # Column percent of cost by fruits
.cro_mean(dtfrm(b, c), a) # Mean cost and price by fruits

Session example

It is rather artificial dataset with data from product test of two samples of chocolate sweets. 150 respondents tested two kinds of sweets (codenames: VSX123 and SDF546). Sample was divided into two groups (cells) of 75 respondents in each group. In cell 1 product VSX123 was presented first and then SDF546. In cell 2 sweets were presented in reversed order. Questions about respondent impressions about first product are in the block A (and about second tested product in the block B). At the end of the questionnaire there is a question about preferences between sweets.

List of variables:

library(expss)
library(knitr)
options(digits = 2) # for pretty printing
data(product_test)

default_dataset(product_test)
## Set default dataset to 'product_test'
## here we recode variables from first/second tested product to separate variables for each product

# create empty variables - 'h' variables for VSX123 and 'p' variables for 'SDF456'
.set(c("h1_`1:6`","h22", "p1_`1:6`", "p22"))

# recode variables according to their cells
.recode(vars("h1_`1:6`","h22", "p1_`1:6`", "p22"), 
        cell == 1 ~ vars("a1_`1:6`","a22", "b1_`1:6`", "b22"),
        cell == 2 ~ vars("b1_`1:6`","b22", "a1_`1:6`", "a22")
)

# here we prepare likes codeframe for future usage
codeframe_likes = ml_left("
1 Liked everything
2 Disliked everything
3 Chocolate
4 Appearance
5 Taste
6 Stuffing
7 Nuts
8 Consistency
98 Other
99 Hard to answer
")

# recode preferences from first/second product to true names
# for first cell there are no changes, for second cell we should change 1 and 2.
.do_if(cell == 1, {
    c1r = c1
})
.do_if(cell == 2, {
    c1r = if_val(c1, 1 ~ 2, 2 ~ 1, other ~ copy)
})
.compute({
    # recode age by groups
    age_cat = if_val(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2)
    # counter number of likes
    # codes 1, 3-98. 2 and 99 are ignored.
    h_likes = count_row_if(1 | 3 %thru% 98, h1_1 %to% h1_6) 
    p_likes = count_row_if(1 | 3 %thru% 98, p1_1 %to% p1_6) 
    
    # Apply labels

    var_lab(c1r) = "Preferences"
    val_lab(c1r) = ml_left("
        1 VSX123 
        2 SDF456
        3 Hard to say
    ")
    
    var_lab(age_cat) = "Age"
    val_lab(age_cat) = c("18 - 25" = 1, "26 - 35" = 2)
    
    var_lab(h1_1) = "Likes. VSX123"
    var_lab(p1_1) = "Likes. SDF456"
    val_lab(h1_1) = codeframe_likes
    val_lab(p1_1) = codeframe_likes
    
    var_lab(h_likes) = "Number of likes. VSX123"
    var_lab(p_likes) = "Number of likes. SDF456"
    
    var_lab(h22) = "Overall quality. VSX123"
    var_lab(p22) = "Overall quality. SDF456"
    val_lab(h22) = ml_left("
                           1 Extremely poor 
                           2 Very poor
                           3 Quite poor
                           4 Neither good, nor poor
                           5 Quite good
                           6 Very good
                           7 Excellent
                           ")
    val_lab(p22) = val_lab(h22)
})

# Tables. 
# 'kable' function just makes tables prettier in this document. 

# column percents.
kable(.fre(c1r))
Preferences Count Valid percent Percent Responses, % Cumulative responses, %
VSX123 94 63 63 63 63
SDF456 50 33 33 33 96
Hard to say 6 4 4 4 100
#Total 150 100 100 100 NA
0 NA 0 NA NA
# is there significant difference between preferences?
# 'na_if(c1r, 3)' remove 'hard to say' from vector 
.with(chisq.test(table(na_if(c1r, 3)))) # yes, it is significant
## 
##  Chi-squared test for given probabilities
## 
## data:  table(na_if(c1r, 3))
## X-squared = 10, df = 1, p-value = 2e-04
kable(.cro_cpct(c1r, age_cat))
Preferences 18 - 25 26 - 35 #Total
VSX123 65.7 60 63
SDF456 31.4 35 33
Hard to say 2.9 5 4
#Total 70.0 80 150
kable(.cro_cpct(h22, age_cat))
Overall quality. VSX123 18 - 25 26 - 35 #Total
Quite poor 2.9 1.2 2
Neither good, nor poor 11.4 10.0 11
Quite good 45.7 33.8 39
Very good 24.3 41.2 33
Excellent 15.7 13.8 15
#Total 70.0 80.0 150
kable(.cro_cpct(p22, age_cat))
Overall quality. SDF456 18 - 25 26 - 35 #Total
Very poor NA 1.2 0.67
Quite poor 4.3 1.2 2.67
Neither good, nor poor 20.0 13.8 16.67
Quite good 27.1 35.0 31.33
Very good 35.7 35.0 35.33
Excellent 12.9 13.8 13.33
#Total 70.0 80.0 150.00
kable(.cro_cpct(h1_1 %to% h1_6, age_cat))
Likes. VSX123 18 - 25 26 - 35 #Total
Disliked everything 1.4 5 3.3
Chocolate 38.6 30 34.0
Appearance 21.4 36 29.3
Taste 38.6 26 32.0
Stuffing 20.0 34 27.3
Nuts 72.9 61 66.7
Consistency 4.3 19 12.0
#Total 70.0 80 150.0
kable(.cro_cpct(p1_1 %to% p1_6, age_cat))
Likes. SDF456 18 - 25 26 - 35 #Total
Disliked everything 1.4 1.2 1.33
Chocolate 27.1 36.2 32.00
Appearance 35.7 28.8 32.00
Taste 42.9 36.2 39.33
Stuffing 24.3 30.0 27.33
Nuts 60.0 62.5 61.33
Consistency 5.7 13.8 10.00
Other NA 1.2 0.67
#Total 70.0 80.0 150.00
# means
kable(.cro_mean(dtfrm(h22, p22, h_likes, p_likes), age_cat))
18 - 25 26 - 35 #Total
Overall quality. VSX123 5.4 5.6 5.5
Overall quality. SDF456 5.3 5.4 5.4
Number of likes. VSX123 2.0 2.1 2.0
Number of likes. SDF456 2.0 2.1 2.0

We can save labelled dataset as *.csv file with accompanying R code for labelling.

write_labelled_csv(product_test, file  filename = "product_test.csv")

Or, we can save dataset as *.csv file with SPSS syntax to read data and apply labels.

write_labelled_spss(product_test, file  filename = "product_test.csv")