*expss* package implements some popular functions from spreadsheets and SPSS Statistics software. Implementations are not complete copies of their originals. I try to make them consistent with other R functions. See examples below. You can get help about any function by typing `?function_name`

in the R console.

Excel toy table:

A | B | C | |
---|---|---|---|

1 |
2 | 15 | 50 |

2 |
1 | 70 | 80 |

3 |
3 | 30 | 40 |

4 |
2 | 30 | 40 |

Code for creating the same table in R:

```
w = read.csv(text = "
a,b,c
2,15,50
1,70,80
3,30,40
2,30,40"
)
```

`w`

is the name of our table.

**Excel: IF(B1>60, 1, 0)**

**R:** Here we create new column with name `d`

with results. `ifelse`

function is from base R not from ‘expss’ package but included here for completeness.

`w$d = ifelse(w$b>60, 1, 0)`

If we need to use multiple transformations it is often convenient to use `modify`

function. Inside `modify`

we can put arbitrary number of statements:

```
w = modify(w, {
d = ifelse(b>60, 1, 0)
e = 42
abc_sum = sum_row(a, b, c)
abc_mean = mean_row(a, b, c)
})
```

Count 1’s in entire dataset.

**Excel: COUNTIF(A1:C4, 1)**

**R:**

`count_if(1, w)`

or

`with(w, count_if(1, a, b, c))`

Count values greater than 1 in each row of dataset.

**Excel: COUNTIF(A1:C1, ">1")**

**R**:

`w$d = count_row_if(gt(1), w) `

or

```
w = modify(w, {
d = count_row_if(gt(1), a, b, c)
})
```

Count values less than or equal to 1 in column A of dataset.

**Excel: COUNTIF(A1:A4, "<=1")**

**R**:

`count_col_if(lte(1), w$a)`

**Table of criteria:**

Excel | R |
---|---|

“<1” | lt(1) |

“<=1” | lte(1) |

“<>1” | neq(1) |

“=1” | eq(1) |

“>=1” | gte(1) |

“>1” | gt(1) |

Sum all values in dataset.

**Excel: SUM(A1:C4)**

**R:**

`sum(w, na.rm = TRUE)`

Calculate average of each row of dataset.

**Excel: AVERAGE(A1:C1)**

**R**:

`w$d = mean_row(w) `

or

```
w = modify(w, {
d = mean_row(a, b, c)
})
```

Sum values of column `A`

of dataset.

**Excel: SUM(A1:A4)**

**R**:

`sum_col(w$a)`

Sum values greater than 40 in entire dataset.

**Excel: SUMIF(A1:C4, ">40")**

**R:**

`sum_if(gt(40), w)`

or

`with(w, sum_if(gt(40), a, b, c))`

Sum values less than 40 in each row of dataset.

**Excel: SUMIF(A1:C1, "<40")**

**R**:

`w$d = sum_row_if(lt(40), w) `

or

```
w = modify(w, {
d = sum_row_if(lt(40), a, b, c)
})
```

Calculate average of `B`

column with column `A`

values less than 3.

**Excel: AVERAGEIF(A1:A4, "<3", B1:B4)**

**R**:

`mean_col_if(lt(3), w$a, data = w$b)`

or, if we want calculate means for both `b`

and `c`

columns:

`with(w, mean_col_if(lt(3), a, data = dtfrm(b, c)))`

Our dictionary for lookup:

X | Y | |
---|---|---|

1 |
1 | apples |

2 |
2 | oranges |

3 |
3 | peaches |

Code for creating the same dictionary in R:

```
dict = read.csv(text = "
x,y
1,apples
2,oranges
3,peaches",
stringsAsFactors = FALSE
)
```

**Excel: VLOOKUP(A1, $X$1:$Y$3, 2, FALSE)**

**R**:

`w$d = vlookup(w$a, dict, 2)`

or, we can use column names:

`w$d = vlookup(w$a, dict, "y")`

**SPSS**:

`COMPUTE d = 1.`

**R**:

`w$d = 1`

or, in specific data.frame

```
default_dataset(w)
.compute({
d = 1
})
```

There can be arbitrary number of statements inside `.compute`

.

**SPSS**:

`IF(a = 3) d = 2.`

**R**:

Default dataset should be already predefined as in previous example.

```
.compute({
d = ifelse(a == 3, 2, NA)
})
```

or,

```
.compute({
d = ifs(a == 3 ~ 2)
})
```

**SPSS**:

```
DO IF (a>1).
COMPUTE d = 4.
END IF.
```

**R**:

```
.do_if(a>1, {
d = 4
})
```

There can be arbitrary number of statements inside `.do_if`

.

**SPSS**:

`COUNT cnt = a1 TO a5 (LO THRU HI).`

**R**:

`cnt = count_row_if(lo %thru% hi, a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5 (SYSMIS).`

**R**:

`cnt = count_row_if(NA, a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5 (1 THRU 5).`

**R**:

`cnt = count_row_if(1 %thru% 5, a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5 (1 THRU HI).`

**R**:

`cnt = count_row_if(1 %thru% hi, a1 %to% a5)`

or,

`cnt = count_row_if(gte(1), a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5 (LO THRU 1).`

**R**:

`cnt = count_row_if(lo %thru% 1, a1 %to% a5)`

or,

`cnt = count_row_if (lte(1), a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5 (1 THRU 5, 99).`

**R**:

`cnt = count_row_if(1 %thru% 5 | 99, a1 %to% a5)`

**SPSS**:

`COUNT cnt = a1 TO a5(1,2,3,4,5, SYSMIS).`

**R**:

`cnt = count_row_if(c(1:5, NA), a1 %to% a5)`

`count_row_if`

can be used with default dataset inside the `.compute`

.

**SPSS**:

`RECODE V1 (0=1) (1=0) (2, 3=-1) (9=9) (ELSE=SYSMIS)`

**R**:

`if_val(v1) = c(0 ~ 1, 1 ~ 0, 2:3 ~ -1, 9 ~ 9, other ~ NA)`

**SPSS**:

`RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0).`

**R**:

`if_val(qvar) = c(1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, other ~ 0)`

**SPSS**:

`RECODE STRNGVAR ('A', 'B', 'C'='A')('D', 'E', 'F'='B')(ELSE=' '). `

**R**:

`if_val(strngvar) = c(c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B', other ~ ' ')`

**SPSS**:

`RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER. `

**R**:

`voter = if_val(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0)`

`if_val`

can be used with default dataset inside the `.compute`

.

**SPSS**:

```
VARIABLE LABELS a "Fruits"
b "Cost"
c "Price".
```

**R**:

```
.compute({
var_lab(a) = "Fruits"
var_lab(b) = "Cost"
var_lab(c) = "Price"
})
```

**SPSS**:

```
VALUE LABELS a
1 "apples"
2 "oranges"
3 "peaches".
```

**R**:

```
.compute({
val_lab(a) = ml_left("
1 apples
2 oranges
3 peaches
")
})
```

or, without using default dataset:

```
val_lab(w$a) = ml_left("
1 apples
2 oranges
3 peaches
")
```

**R**:

`fre(w$a) # Frequency of fruits`

```
## Fruits Count Valid percent Percent Responses, % Cumulative responses, %
## apples 1 25 25 25 25
## oranges 2 50 50 50 75
## peaches 1 25 25 25 100
## #Total 4 100 100 100 NA
## <NA> 0 NA 0 NA NA
```

`cro_cpct(w$b, w$a) # Column percent of cost by fruits`

```
## Cost apples oranges peaches #Total
## 15 NA 50 NA 25
## 30 NA 50 100 50
## 70 100 NA NA 25
## #Total 1 2 1 4
```

`cro_mean(dtfrm(w$b, w$c), w$a) # Mean cost and price by fruits`

```
## apples oranges peaches #Total
## Cost 70 22.5 30 36.25
## Price 80 45.0 40 52.50
```

or, the same with default dataset:

```
.fre(a) # Frequency of fruits
.cro_cpct(b, a) # Column percent of cost by fruits
.cro_mean(dtfrm(b, c), a) # Mean cost and price by fruits
```

It is rather artificial dataset with data from product test of two samples of chocolate sweets. 150 respondents tested two kinds of sweets (codenames: VSX123 and SDF546). Sample was divided into two groups (cells) of 75 respondents in each group. In cell 1 product VSX123 was presented first and then SDF546. In cell 2 sweets were presented in reversed order. Questions about respondent impressions about first product are in the block A (and about second tested product in the block B). At the end of the questionnaire there is a question about preferences between sweets.

List of variables:

`id`

Respondent Id`cell`

First tested product (cell number)`s2a`

Age`a1_1-a1_6`

What did you like in these sweets? Multiple response. First tested product`a22`

Overall quality. First tested product`b1_1-b1_6`

What did you like in these sweets? Multiple response. Second tested product`b22`

Overall quality. Second tested product`c1`

Preferences

```
library(expss)
library(knitr)
options(digits = 2) # for pretty printing
data(product_test)
default_dataset(product_test)
```

`## Set default dataset to 'product_test'`

```
## here we recode variables from first/second tested product to separate variables for each product
# create empty variables - 'h' variables for VSX123 and 'p' variables for 'SDF456'
.set(c("h1_`1:6`","h22", "p1_`1:6`", "p22"))
# recode variables according to their cells
.recode(vars("h1_`1:6`","h22", "p1_`1:6`", "p22"),
cell == 1 ~ vars("a1_`1:6`","a22", "b1_`1:6`", "b22"),
cell == 2 ~ vars("b1_`1:6`","b22", "a1_`1:6`", "a22")
)
# here we prepare likes codeframe for future usage
codeframe_likes = ml_left("
1 Liked everything
2 Disliked everything
3 Chocolate
4 Appearance
5 Taste
6 Stuffing
7 Nuts
8 Consistency
98 Other
99 Hard to answer
")
# recode preferences from first/second product to true names
# for first cell there are no changes, for second cell we should change 1 and 2.
.do_if(cell == 1, {
c1r = c1
})
.do_if(cell == 2, {
c1r = if_val(c1, 1 ~ 2, 2 ~ 1, other ~ copy)
})
.compute({
# recode age by groups
age_cat = if_val(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2)
# counter number of likes
# codes 1, 3-98. 2 and 99 are ignored.
h_likes = count_row_if(1 | 3 %thru% 98, h1_1 %to% h1_6)
p_likes = count_row_if(1 | 3 %thru% 98, p1_1 %to% p1_6)
# Apply labels
var_lab(c1r) = "Preferences"
val_lab(c1r) = ml_left("
1 VSX123
2 SDF456
3 Hard to say
")
var_lab(age_cat) = "Age"
val_lab(age_cat) = c("18 - 25" = 1, "26 - 35" = 2)
var_lab(h1_1) = "Likes. VSX123"
var_lab(p1_1) = "Likes. SDF456"
val_lab(h1_1) = codeframe_likes
val_lab(p1_1) = codeframe_likes
var_lab(h_likes) = "Number of likes. VSX123"
var_lab(p_likes) = "Number of likes. SDF456"
var_lab(h22) = "Overall quality. VSX123"
var_lab(p22) = "Overall quality. SDF456"
val_lab(h22) = ml_left("
1 Extremely poor
2 Very poor
3 Quite poor
4 Neither good, nor poor
5 Quite good
6 Very good
7 Excellent
")
val_lab(p22) = val_lab(h22)
})
# Tables.
# 'kable' function just makes tables prettier in this document.
# column percents.
kable(.fre(c1r))
```

Preferences | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
---|---|---|---|---|---|

VSX123 | 94 | 63 | 63 | 63 | 63 |

SDF456 | 50 | 33 | 33 | 33 | 96 |

Hard to say | 6 | 4 | 4 | 4 | 100 |

#Total | 150 | 100 | 100 | 100 | NA |

0 | NA | 0 | NA | NA |

```
# is there significant difference between preferences?
# 'na_if(c1r, 3)' remove 'hard to say' from vector
.with(chisq.test(table(na_if(c1r, 3)))) # yes, it is significant
```

```
##
## Chi-squared test for given probabilities
##
## data: table(na_if(c1r, 3))
## X-squared = 10, df = 1, p-value = 2e-04
```

`kable(.cro_cpct(c1r, age_cat))`

Preferences | 18 - 25 | 26 - 35 | #Total |
---|---|---|---|

VSX123 | 65.7 | 60 | 63 |

SDF456 | 31.4 | 35 | 33 |

Hard to say | 2.9 | 5 | 4 |

#Total | 70.0 | 80 | 150 |

`kable(.cro_cpct(h22, age_cat))`

Overall quality. VSX123 | 18 - 25 | 26 - 35 | #Total |
---|---|---|---|

Quite poor | 2.9 | 1.2 | 2 |

Neither good, nor poor | 11.4 | 10.0 | 11 |

Quite good | 45.7 | 33.8 | 39 |

Very good | 24.3 | 41.2 | 33 |

Excellent | 15.7 | 13.8 | 15 |

#Total | 70.0 | 80.0 | 150 |

`kable(.cro_cpct(p22, age_cat))`

Overall quality. SDF456 | 18 - 25 | 26 - 35 | #Total |
---|---|---|---|

Very poor | NA | 1.2 | 0.67 |

Quite poor | 4.3 | 1.2 | 2.67 |

Neither good, nor poor | 20.0 | 13.8 | 16.67 |

Quite good | 27.1 | 35.0 | 31.33 |

Very good | 35.7 | 35.0 | 35.33 |

Excellent | 12.9 | 13.8 | 13.33 |

#Total | 70.0 | 80.0 | 150.00 |

`kable(.cro_cpct(h1_1 %to% h1_6, age_cat))`

Likes. VSX123 | 18 - 25 | 26 - 35 | #Total |
---|---|---|---|

Disliked everything | 1.4 | 5 | 3.3 |

Chocolate | 38.6 | 30 | 34.0 |

Appearance | 21.4 | 36 | 29.3 |

Taste | 38.6 | 26 | 32.0 |

Stuffing | 20.0 | 34 | 27.3 |

Nuts | 72.9 | 61 | 66.7 |

Consistency | 4.3 | 19 | 12.0 |

#Total | 70.0 | 80 | 150.0 |

`kable(.cro_cpct(p1_1 %to% p1_6, age_cat))`

Likes. SDF456 | 18 - 25 | 26 - 35 | #Total |
---|---|---|---|

Disliked everything | 1.4 | 1.2 | 1.33 |

Chocolate | 27.1 | 36.2 | 32.00 |

Appearance | 35.7 | 28.8 | 32.00 |

Taste | 42.9 | 36.2 | 39.33 |

Stuffing | 24.3 | 30.0 | 27.33 |

Nuts | 60.0 | 62.5 | 61.33 |

Consistency | 5.7 | 13.8 | 10.00 |

Other | NA | 1.2 | 0.67 |

#Total | 70.0 | 80.0 | 150.00 |

```
# means
kable(.cro_mean(dtfrm(h22, p22, h_likes, p_likes), age_cat))
```

18 - 25 | 26 - 35 | #Total | |
---|---|---|---|

Overall quality. VSX123 | 5.4 | 5.6 | 5.5 |

Overall quality. SDF456 | 5.3 | 5.4 | 5.4 |

Number of likes. VSX123 | 2.0 | 2.1 | 2.0 |

Number of likes. SDF456 | 2.0 | 2.1 | 2.0 |

We can save labelled dataset as *.csv file with accompanying R code for labelling.

`write_labelled_csv(product_test, file filename = "product_test.csv")`

Or, we can save dataset as *.csv file with SPSS syntax to read data and apply labels.

`write_labelled_spss(product_test, file filename = "product_test.csv")`