First read the Employee data included as part of lessR.
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##     Variable                  Missing  Unique
##         Name       Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years    integer      36       1      16   7  NA  15 ... 1  2  10
##  2    Gender  character      37       0       2   M  M  M ... F  F  M
##  3      Dept  character      36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary     double      37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat  character      35       2       3   med  low  low ... high  low  high
##  6      Plan    integer      37       0       3   1  1  3 ... 2  2  1
##  7       Pre    integer      37       0      27   82  62  96 ... 83  59  80
##  8      Post    integer      37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------
Obtain the summary statistics and 95% confidence interval for a single variable by specifying that variable with ttest().
##
##
## ------ Description ------
##
## Salary: n.miss = 0, n = 37, mean = 73795.557, sd = 21799.533
##
##
## ------ Normality Assumption ------
##
## Sample mean assumed normal because n>30, so no test needed.
##
##
## ------ Inference ------
##
## t-cutoff for 95% range of variation: tcut = 2.028
## Standard Error of Mean: SE = 3583.821
##
## Margin of Error for 95% Confidence Level: 7268.326
## 95% Confidence Interval for Mean: 66527.230 to 81063.883
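The inference section of this output can be cross-checked by hand. The following is a minimal base R sketch, not the lessR call that produced the output; it assumes only the summary statistics printed above, and the variable names (n, m, s, and so on) are chosen here for illustration.

```r
# Summary statistics copied from the Description section of the output above
n <- 37
m <- 73795.557
s <- 21799.533

se   <- s / sqrt(n)               # standard error of the mean
tcut <- qt(0.975, df = n - 1)     # two-sided 95% t cutoff, df = 36
moe  <- tcut * se                 # margin of error
ci   <- c(lower = m - moe, upper = m + moe)

# Compare with SE, tcut, margin of error, and the interval reported above
round(c(SE = se, tcut = tcut, MOE = moe), 3)
round(ci, 3)
```

Because the inputs are rounded to three decimal digits, the last printed digit may differ slightly from the lessR output.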
Add a hypothesis test to the above.
##
##
## ------ Description ------
##
## Salary: n.miss = 0, n = 37, mean = 73795.557, sd = 21799.533
##
##
## ------ Normality Assumption ------
##
## Sample mean assumed normal because n>30, so no test needed.
##
##
## ------ Inference ------
##
## t-cutoff for 95% range of variation: tcut = 2.028
## Standard Error of Mean: SE = 3583.821
##
## Hypothesized Value H0: mu = 52000
## Hypothesis Test of Mean: t-value = 6.082, df = 36, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 7268.326
## 95% Confidence Interval for Mean: 66527.230 to 81063.883
##
##
## ------ Effect Size ------
##
## Distance of sample mean from hypothesized: 21795.557
## Standardized Distance, Cohen's d: 1.000
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for 12035.673
## --------------------------------------------------
Analysis of the above from summary statistics only.
##
##
## ------ Description ------
##
## Salary: n = 37, mean = 73795.56, sd = 21799.53
##
##
## ------ Inference ------
##
## t-cutoff for 95% range of variation: tcut = 2.028
## Standard Error of Mean: SE = 3583.821
##
## Hypothesized Value H0: mu = 52000
## Hypothesis Test of Mean: t-value = 6.082, df = 36, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 7268.326
## 95% Confidence Interval for Mean: 66527.231 to 81063.883
##
##
## ------ Effect Size ------
##
## Distance of sample mean from hypothesized: 21795.557
## Standardized Distance, Cohen's d: 1.000
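The hypothesis test and effect size follow directly from the same three summary statistics. As a hedged illustration in base R (the variable names are hypothetical and this is not the lessR implementation), the t-value, p-value, and Cohen's d can be recomputed as follows.

```r
# Summary statistics and hypothesized mean taken from the output above
n   <- 37
m   <- 73795.557
s   <- 21799.533
mu0 <- 52000

se      <- s / sqrt(n)                          # standard error of the mean
t_value <- (m - mu0) / se                       # one-sample t statistic
p_value <- 2 * pt(-abs(t_value), df = n - 1)    # two-sided p-value
cohen_d <- (m - mu0) / s                        # standardized distance from mu0

round(c(t = t_value, df = n - 1, p = p_value, d = cohen_d), 3)
```

The distance of 21795.557 from the hypothesized value is almost exactly one standard deviation, which is why Cohen's d rounds to 1.000.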
Full analysis with the ttest() function, abbreviated as tt(), in formula mode.
##
## Compare Salary across Gender levels M and F
##
## ------ Describe ------
##
## Salary for Gender M: n.miss = 0, n = 18, mean = 81147.458, sd = 23128.436
## Salary for Gender F: n.miss = 0, n = 19, mean = 66830.598, sd = 18438.456
##
## Mean Difference of Salary: 14316.860
##
## Weighted Average Standard Deviation: 20848.636
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Salary.
## Group M Shapiro-Wilk normality test: W = 0.962, p-value = 0.647
## Group F Shapiro-Wilk normality test: W = 0.828, p-value = 0.003
##
## Null hypothesis is equal variances of Salary, i.e., homogeneous.
## Variance Ratio test: F = 534924536.348/339976675.129 = 1.573, df = 17;18, p-value = 0.349
## Levene's test, Brown-Forsythe: t = 1.302, df = 35, p-value = 0.201
##
##
## ------ Infer ------
##
## --- Assume equal population variances of Salary for each Gender
##
## t-cutoff for 95% range of variation: tcut = 2.030
## Standard Error of Mean Difference: SE = 6857.494
##
## Hypothesis Test of 0 Mean Diff: t = 2.088, df = 35, p-value = 0.044
##
## Margin of Error for 95% Confidence Level: 13921.454
## 95% Confidence Interval for Mean Difference: 395.406 to 28238.314
##
##
## --- Do not assume equal population variances of Salary for each Gender
##
## t-cutoff: tcut = 2.036
## Standard Error of Mean Difference: SE = 6900.112
##
## Hypothesis Test of 0 Mean Diff: t = 2.075, df = 32.505, p-value = 0.046
##
## Margin of Error for 95% Confidence Level: 14046.505
## 95% Confidence Interval for Mean Difference: 270.355 to 28363.365
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Salary for each Gender
##
## Standardized Mean Difference of Salary, Cohen's d: 0.687
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Gender M: 14777.329
## Density bandwidth for Gender F: 11630.959
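The equal-variance results above can be reproduced from the two group summaries alone. The sketch below uses base R only and assumes the group statistics printed in the Describe section; it is a cross-check of the arithmetic, not the code lessR runs internally, and all variable names are illustrative.

```r
# Group summaries copied from the Describe section above
n1 <- 18; m1 <- 81147.458; s1 <- 23128.436   # Gender M
n2 <- 19; m2 <- 66830.598; s2 <- 18438.456   # Gender F

mean_diff <- m1 - m2
dof       <- n1 + n2 - 2

# Pooled (weighted average) standard deviation
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / dof)

se      <- sp * sqrt(1 / n1 + 1 / n2)      # standard error of the mean difference
t_value <- mean_diff / se                  # pooled-variance t statistic
p_value <- 2 * pt(-abs(t_value), dof)      # two-sided p-value
moe     <- qt(0.975, dof) * se             # margin of error, 95% level
cohen_d <- mean_diff / sp                  # standardized mean difference

round(c(sp = sp, SE = se, t = t_value, p = p_value, MOE = moe, d = cohen_d), 3)
```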
The brief version of the output contains just the basics.
##
## Compare Salary across Gender levels M and F
##
## --- Describe ---
##
## Salary for Gender M: n.miss = 0, n = 18, mean = 81147.458, sd = 23128.436
## Salary for Gender F: n.miss = 0, n = 19, mean = 66830.598, sd = 18438.456
##
## Mean Difference of Salary: 14316.860
## Weighted Average Standard Deviation: 20848.636
## Standardized Mean Difference of Salary: 0.687
##
## --- Infer ---
##
## t-cutoff for 95% range of variation: tcut = 2.030
## Standard Error of Mean Difference: SE = 6857.494
##
## Hypothesis Test of 0 Mean Diff: t = 2.088, df = 35, p-value = 0.044
##
## Margin of Error for 95% Confidence Level: 13921.454
## 95% Confidence Interval for Mean Difference: 395.406 to 28238.314
##
## Compare Y across X levels Group2 and Group1
##
## --- Describe ---
##
## Y for X Group2: n.miss = 0, n = 37, mean = 81.000, sd = 11.593
## Y for X Group1: n.miss = 0, n = 37, mean = 78.784, sd = 12.037
##
## Mean Difference of Y: 2.216
## Weighted Average Standard Deviation: 11.817
## Standardized Mean Difference of Y: 0.188
##
## --- Infer ---
##
## t-cutoff for 95% range of variation: tcut = 1.993
## Standard Error of Mean Difference: SE = 2.747
##
## Hypothesis Test of 0 Mean Diff: t = 0.807, df = 72, p-value = 0.423
##
## Margin of Error for 95% Confidence Level: 5.477
## 95% Confidence Interval for Mean Difference: -3.261 to 7.693
Analysis of variance applies to the inferential analysis of means across groups. The lessR function ANOVA(), abbreviated av(), provides this analysis, based on the base R function aov().
The data for these examples are the warpbreaks data set included with the R datasets package. The data are from a weaving device called a loom for a fixed length of yarn. The response variable is the number of times the yarn broke during the weaving. The independent variables are the type of wool (A or B) and the level of tension (L, M, or H).
Because warpbreaks is not the default data frame, d, specify it with the data parameter (or set d equal to warpbreaks).
First, for illustrative purposes, ignore the type of wool and only examine the impact of tension on breaks.
The output includes descriptive statistics, ANOVA table, effect size indices, Tukey’s multiple comparisons of means, and residuals, as well as the scatterplot of the response variable with the levels of the independent variable, and a visualization of the mean comparisons.
## BACKGROUND
##
## Response Variable: breaks
##
## Factor Variable: tension
## Levels: L M H
##
## Number of cases (rows) of data: 54
## Number of cases retained for analysis: 54
##
##
## DESCRIPTIVE STATISTICS
##
## n mean sd min max
## L 18 36.39 16.45 14.00 70.00
## M 18 26.39 9.12 12.00 42.00
## H 18 21.67 8.35 10.00 43.00
##
## Grand Mean: 28.148
##
##
## BASIC ANALYSIS
##
## df Sum Sq Mean Sq F-value p-value
## tension 2 2034.26 1017.13 7.21 0.0018
## Residuals 51 7198.56 141.15
##
##
## R Squared: 0.22
## R Sq Adjusted: 0.19
## Omega Squared: 0.19
##
## Cohen's f: 0.48
##
##
## TUKEY MULTIPLE COMPARISONS OF MEANS
##
## Family-wise Confidence Level:
## -------------------------------
## diff lwr upr p adj
## M-L -10.00 -19.56 -0.44 0.04
## H-L -14.72 -24.28 -5.16 0.00
## H-M -4.72 -14.28 4.84 0.46
##
##
## RESIDUALS
##
## Fitted Values, Residuals, Standardized Residuals
## [sorted by Standardized Residuals, ignoring + or - sign]
## [res_rows = 20, out of 54 cases (rows) of data, or res_rows="all"]
## -------------------------------------------
## tension breaks fitted residual z-resid
## 5 L 70.00 36.39 33.61 2.91
## 9 L 67.00 36.39 30.61 2.65
## 29 L 14.00 36.39 -22.39 -1.94
## 24 H 43.00 21.67 21.33 1.85
## 3 L 54.00 36.39 17.61 1.53
## 31 L 19.00 36.39 -17.39 -1.51
## 35 L 20.00 36.39 -16.39 -1.42
## 37 M 42.00 26.39 15.61 1.35
## 6 L 52.00 36.39 15.61 1.35
## 7 L 51.00 36.39 14.61 1.27
## 14 M 12.00 26.39 -14.39 -1.25
## 19 H 36.00 21.67 14.33 1.24
## 41 M 39.00 26.39 12.61 1.09
## 44 M 39.00 26.39 12.61 1.09
## 23 H 10.00 21.67 -11.67 -1.01
## 4 L 25.00 36.39 -11.39 -0.99
## 8 L 26.00 36.39 -10.39 -0.90
## 40 M 16.00 26.39 -10.39 -0.90
## 1 L 26.00 36.39 -10.39 -0.90
## 18 M 36.00 26.39 9.61 0.83
##
##
## ----------------------------------------
## Plot 1: Scatterplot with Cell Means
## Plot 2: 95% family-wise confidence level
## ----------------------------------------
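Since ANOVA() is based on the base R function aov(), the ANOVA table and the Tukey comparisons above can be cross-checked directly in base R. This is a minimal sketch using only base R functions; the object names fit, tab, and tukey are chosen here for illustration.

```r
# One-way ANOVA of breaks across tension with base R
fit <- aov(breaks ~ tension, data = warpbreaks)

# ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab <- summary(fit)[[1]]
tab

# Tukey multiple comparisons at the default 95% family-wise level
tukey <- TukeyHSD(fit)
round(tukey$tension, 2)
```

The rows of tukey$tension are labeled M-L, H-L, and H-M, matching the order of the comparisons in the lessR output.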
The brief version forgoes the multiple comparisons and the residuals.
## BACKGROUND
##
## Response Variable: breaks
##
## Factor Variable: tension
## Levels: L M H
##
## Number of cases (rows) of data: 54
## Number of cases retained for analysis: 54
##
##
## DESCRIPTIVE STATISTICS
##
## n mean sd min max
## L 18 36.39 16.45 14.00 70.00
## M 18 26.39 9.12 12.00 42.00
## H 18 21.67 8.35 10.00 43.00
##
## Grand Mean: 28.148
##
##
## BASIC ANALYSIS
##
## df Sum Sq Mean Sq F-value p-value
## tension 2 2034.26 1017.13 7.21 0.0018
## Residuals 51 7198.56 141.15
##
##
## R Squared: 0.22
## R Sq Adjusted: 0.19
## Omega Squared: 0.19
##
## Cohen's f: 0.48
##
##
## TUKEY MULTIPLE COMPARISONS OF MEANS
##
## RESIDUALS
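The effect size indices in the one-way output can be recovered from the ANOVA table. The sketch below uses standard textbook formulas for R squared, adjusted R squared, and omega squared. The formula for Cohen's f, based on omega squared, is an assumption made here because it reproduces the printed value of 0.48; consult the lessR documentation for the definition it actually uses.

```r
# Values copied from the BASIC ANALYSIS table above
ss_between <- 2034.26; df_between <- 2
ss_within  <- 7198.56; df_within  <- 51

ss_total <- ss_between + ss_within
ms_error <- ss_within / df_within

r2     <- ss_between / ss_total
r2_adj <- 1 - (1 - r2) * (df_between + df_within) / df_within
omega2 <- (ss_between - df_between * ms_error) / (ss_total + ms_error)

# Assumption: Cohen's f derived from omega squared (matches the printed 0.48)
cohen_f <- sqrt(omega2 / (1 - omega2))

round(c(R2 = r2, R2_adj = r2_adj, omega2 = omega2, f = cohen_f), 2)
```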
Specify the second independent variable preceded by a * sign. The plot of the cell means is generated automatically.
## BACKGROUND
##
## Response Variable: breaks
##
## Factor Variable 1: tension
## Levels: L M H
##
## Factor Variable 2: wool
## Levels: A B
##
## Number of cases (rows) of data: 54
## Number of cases retained for analysis: 54
##
## The design is balanced
##
## Two-way Between Groups ANOVA
##
##
## DESCRIPTIVE STATISTICS
##
## Cell Sample Size: 9
##
##
## tension
## wool L M H
## A 44.56 24.00 24.56
## B 28.22 28.78 18.78
##
##
## tension
## ---------------------
## L M H
## 1 36.39 26.39 21.67
##
## wool
## ---------------
## A B
## 1 31.04 25.26
##
##
## 28.148
##
##
## tension
## wool L M H
## A 18.10 8.66 10.27
## B 9.86 9.43 4.89
##
##
## BASIC ANALYSIS
##
## df Sum Sq Mean Sq F-value p-value
## tension 2 2034.26 1017.13 8.50 0.0007
## wool 1 450.67 450.67 3.77 0.0582
## tension:wool 2 1002.78 501.39 4.19 0.0210
## Residuals 48 5745.11 119.69
##
##
## Partial Omega Squared for tension: 0.22
## Partial Omega Squared for wool: 0.05
## Partial Omega Squared for tension & wool: 0.11
##
## Cohen's f for tension: 0.53
## Cohen's f for wool: 0.23
## Cohen's f for tension_&_wool: 0.34
##
##
## TUKEY MULTIPLE COMPARISONS OF MEANS
##
## Family-wise Confidence Level:
##
## Factor: tension
## -------------------------------
## diff lwr upr p adj
## M-L -10.00 -18.82 -1.18 0.02
## H-L -14.72 -23.54 -5.90 0.00
## H-M -4.72 -13.54 4.10 0.40
##
## Factor: wool
## -----------------------------
## diff lwr upr p adj
## B-A -5.78 -11.76 0.21 0.06
##
## Cell Means
## ------------------------------------
## diff lwr upr p adj
## M:A-L:A -20.56 -35.86 -5.25 0.00
## H:A-L:A -20.00 -35.31 -4.69 0.00
## L:B-L:A -16.33 -31.64 -1.03 0.03
## M:B-L:A -15.78 -31.08 -0.47 0.04
## H:B-L:A -25.78 -41.08 -10.47 0.00
## H:A-M:A 0.56 -14.75 15.86 1.00
## L:B-M:A 4.22 -11.08 19.53 0.96
## M:B-M:A 4.78 -10.53 20.08 0.94
## H:B-M:A -5.22 -20.53 10.08 0.91
## L:B-H:A 3.67 -11.64 18.97 0.98
## M:B-H:A 4.22 -11.08 19.53 0.96
## H:B-H:A -5.78 -21.08 9.53 0.87
## M:B-L:B 0.56 -14.75 15.86 1.00
## H:B-L:B -9.44 -24.75 5.86 0.46
## H:B-M:B -10.00 -25.31 5.31 0.39
##
##
## RESIDUALS
##
## Fitted Values, Residuals, Standardized Residuals
## [sorted by Standardized Residuals, ignoring + or - sign]
## [res_rows = 20, out of 54 cases (rows) of data, or res_rows="all"]
## ------------------------------------------------
## tension wool breaks fitted residual z-resid
## 5 L A 70.00 44.56 25.44 2.47
## 9 L A 67.00 44.56 22.44 2.18
## 4 L A 25.00 44.56 -19.56 -1.90
## 8 L A 26.00 44.56 -18.56 -1.80
## 1 L A 26.00 44.56 -18.56 -1.80
## 24 H A 43.00 24.56 18.44 1.79
## 36 L B 44.00 28.22 15.78 1.53
## 23 H A 10.00 24.56 -14.56 -1.41
## 2 L A 30.00 44.56 -14.56 -1.41
## 29 L B 14.00 28.22 -14.22 -1.38
## 37 M B 42.00 28.78 13.22 1.28
## 34 L B 41.00 28.22 12.78 1.24
## 40 M B 16.00 28.78 -12.78 -1.24
## 14 M A 12.00 24.00 -12.00 -1.16
## 18 M A 36.00 24.00 12.00 1.16
## 19 H A 36.00 24.56 11.44 1.11
## 16 M A 35.00 24.00 11.00 1.07
## 41 M B 39.00 28.78 10.22 0.99
## 44 M B 39.00 28.78 10.22 0.99
## 39 M B 19.00 28.78 -9.78 -0.95
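The two-way table above can likewise be checked against base R aov(), on which ANOVA() is based. This is a short sketch with illustrative object names; the * in the formula requests both main effects and their interaction.

```r
# Two-way between-groups ANOVA with interaction, base R
fit2 <- aov(breaks ~ tension * wool, data = warpbreaks)

# Rows: tension, wool, tension:wool, Residuals
tab2 <- summary(fit2)[[1]]
tab2
```

Because the design is balanced, the sums of squares do not depend on the order in which tension and wool enter the formula.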
The cell mean plot can also be obtained directly from the means. Here, use the lessR pivot() function to compute the cell means of breaks across tension and wool.
data(warpbreaks)
dm <- pivot(warpbreaks, mean, breaks, c(tension, wool))
Plot(tension, breaks, by=wool, segments=TRUE, size=2, data=dm, main="Cell Means")
##
## >>> Note
## The integrated Violin/Box/Scatterplot (VBS) for breaks
## at each level of tension is only obtained if the categorical
## variable is the variable listed second, that is, the y-variable.
##
## This ordering with tension listed first yields the
## scatterplot and the associated means, but no VBS plot.
## >>> Suggestions
## Plot(tension, breaks, data=dm, by=wool, size=2, segments=TRUE, main="Cell Means", means=FALSE) # do not plot means
## Plot(tension, breaks, data=dm, by=wool, size=2, segments=TRUE, main="Cell Means", stat="mean") # only plot means
## ANOVA(breaks ~ tension) # inferential analysis
##
##
## breaks
## - by levels of -
## tension
##
## n miss mean sd min mdn max
## L 2 0 36.388889 11.549411 28.222222 36.388889 44.555556
## M 2 0 26.388889 3.378399 24.000000 26.388889 28.777778
## H 2 0 21.666667 4.085506 18.777778 21.666667 24.555556
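For comparison, the same cell means can be computed in base R without pivot(). The sketch below uses tapply(), and the object name cell_means is chosen here for illustration. It also shows where the summary above comes from: each tension level has n = 2 because the input data are the two cell means (wool A and wool B), so the sd column is the standard deviation of those two means.

```r
# Cell means of breaks for each wool by tension combination
cell_means <- with(warpbreaks,
                   tapply(breaks, list(wool = wool, tension = tension), mean))
round(cell_means, 2)

# Mean and standard deviation of the two cell means at each tension level
colMeans(cell_means)
apply(cell_means, 2, sd)
```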
The randomized block design has a treatment variable, usually administered over time, and a blocking variable. The values of the treatment variable are measured across each instance of the blocking variable. In this example, repetitions are measured across four different workout sessions. The person takes one of four supplements before each session. Person is the blocking variable, and Supplement is the treatment variable. Repetitions is the response variable.
The data are presented in a wide-form data table, a single row for each person.
d <- read.csv(header=TRUE, text="
Person,sup1,sup2,sup3,sup4
p1,2,4,4,3
p2,2,5,4,6
p3,8,6,7,9
p4,4,3,5,7
p5,2,1,2,3
p6,5,5,6,8
p7,2,3,2,4")
The ANOVA, however, requires the data in long form. Reshape the data from wide form to long form with the base R reshape() function according to the following parameters. Each parameter either identifies existing variables in the given wide-form data or names newly created variables in the long-form data. This R function refers to a time variable, which in the context of ANOVA is the treatment variable, whose values occur over time: first treatment, second treatment, and so on.
The reshaping from a wide-form to a long-form data table creates two new variables: the variable whose values are collected over time, here the treatment variable, Supplement, and the response variable, here Reps.
idvar: Identify the existing blocking (within) variable in the wide-form data
varying: Identify the wide-form variables, which occur over time, to be gathered into a single variable in the long-form data
timevar: Name the corresponding treatment (factor) variable in the created long-form data
v.names: Name the response variable in the created long-form data
There are many ways to identify the names of the wide-form variables to be gathered into a single time-oriented long-form variable. The most general is to specify a vector of the names, here
c("sup1", "sup2", "sup3", "sup4")
In this example, use the lessR to() function, here to("sup", 4), to create that vector without needing to list each variable individually.
## [1] "sup1" "sup2" "sup3" "sup4"
d <- reshape(d, direction="long",
idvar="Person", varying=list(to("sup", 4)),
timevar="Supplement", v.names="Reps")
The row names are not needed, so remove them before displaying the new long-form data.
## Person Supplement Reps
## 1 p1 1 2
## 2 p2 1 2
## 3 p3 1 8
## 4 p4 1 4
## 5 p5 1 2
## 6 p6 1 5
## 7 p7 1 2
## 8 p1 2 4
## 9 p2 2 5
## 10 p3 2 6
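The reshape can also be run with base R alone. The sketch below is a self-contained variant that assumes paste0() in place of the lessR to() function to build the vector of names. The object names d_wide, sup_names, and d_long are chosen here for illustration, and row names are dropped by assigning NULL.

```r
# Wide-form data: one row per person, one column per supplement
d_wide <- read.csv(header = TRUE, text = "
Person,sup1,sup2,sup3,sup4
p1,2,4,4,3
p2,2,5,4,6
p3,8,6,7,9
p4,4,3,5,7
p5,2,1,2,3
p6,5,5,6,8
p7,2,3,2,4")

# Base R equivalent of the name vector: "sup1" "sup2" "sup3" "sup4"
sup_names <- paste0("sup", 1:4)

# Gather the four supplement columns into Supplement (1 to 4) and Reps
d_long <- reshape(d_wide, direction = "long",
                  idvar = "Person", varying = list(sup_names),
                  timevar = "Supplement", v.names = "Reps")

rownames(d_long) <- NULL   # drop the generated row names such as p1.1
head(d_long, 10)
```

The result has 7 persons by 4 supplements, or 28 rows, with the first ten rows matching the listing above.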
To run the ANOVA, specify the blocking variable preceded by a + sign.
## BACKGROUND
##
## Response Variable: Reps
##
## Factor Variable 1: Supplement
## Levels: 1 2 3 4
##
## Factor Variable 2: Person
## Levels: p1 p2 p3 p4 p5 p6 p7
##
## Number of cases (rows) of data: 28
## Number of cases retained for analysis: 28
##
## The design is balanced
##
## Randomized Blocks ANOVA
## Factor of Interest: Supplement
## Blocking Factor: Person
##
## Note: For the resulting F statistic for Supplement to be distributed as F,
## the population covariances of Reps must be spherical.
##
##
## DESCRIPTIVE STATISTICS
##
## Supplement
## -----------------------
## X1 X2 X3 X4
## 1 3.57 3.86 4.29 5.71
##
## Person
## --------------------------------------
## p1 p2 p3 p4 p5 p6 p7
## 1 3.25 4.25 7.50 4.75 2.00 6.00 2.75
##
##
## 4.357
##
##
## BASIC ANALYSIS
##
## df Sum Sq Mean Sq F-value p-value
## Supplement 3 19.00 6.33 6.71 0.0031
## Person 6 88.43 14.74 15.61 0.0000
## Residuals 18 17.00 0.94
##
##
## Partial Omega Squared for Supplement: 0.38
## Partial Intraclass Correlation for Person: 0.79
##
## Cohen's f for Supplement: 0.78
## Cohen's f for Person: 1.91
##
##
## TUKEY MULTIPLE COMPARISONS OF MEANS
##
## Family-wise Confidence Level:
##
## Factor: Supplement
## ---------------------------
## diff lwr upr p adj
## 2-1 0.29 -1.18 1.75 0.95
## 3-1 0.71 -0.75 2.18 0.53
## 4-1 2.14 0.67 3.61 0.00
## 3-2 0.43 -1.04 1.90 0.84
## 4-2 1.86 0.39 3.33 0.01
## 4-3 1.43 -0.04 2.90 0.06
##
##
## RESIDUALS
##
## Fitted Values, Residuals, Standardized Residuals
## [sorted by Standardized Residuals, ignoring + or - sign]
## [res_rows = 20, out of 28 cases (rows) of data, or res_rows="all"]
## ---------------------------------------------------
## Supplement Person Reps fitted residual z-resid
## 22 4 p1 3 4.61 -1.61 -2.06
## 2 1 p2 2 3.46 -1.46 -1.88
## 3 1 p3 8 6.71 1.29 1.65
## 9 2 p2 5 3.75 1.25 1.60
## 8 2 p1 4 2.75 1.25 1.60
## 11 2 p4 3 4.25 -1.25 -1.60
## 10 2 p3 6 7.00 -1.00 -1.28
## 25 4 p4 7 6.11 0.89 1.15
## 15 3 p1 4 3.18 0.82 1.05
## 5 1 p5 2 1.21 0.79 1.01
## 14 2 p7 3 2.25 0.75 0.96
## 21 3 p7 2 2.68 -0.68 -0.87
## 27 4 p6 8 7.36 0.64 0.83
## 13 2 p6 5 5.50 -0.50 -0.64
## 12 2 p5 1 1.50 -0.50 -0.64
## 1 1 p1 2 2.46 -0.46 -0.60
## 17 3 p3 7 7.43 -0.43 -0.55
## 23 4 p2 6 5.61 0.39 0.50
## 26 4 p5 3 3.36 -0.36 -0.46
## 18 3 p4 5 4.68 0.32 0.41
##
##
## ------------------------
## Plot 1: Interaction Plot
## Plot 2: Fitted Values
## ------------------------
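The randomized blocks table can be cross-checked with base R aov(). The following self-contained sketch rebuilds the long-form data and fits the additive model. It assumes that Supplement must be converted to a factor explicitly for base R, since reshape() stores it as an integer; the object names are illustrative only.

```r
# Rebuild the wide-form data so that this example runs on its own
d_wide <- read.csv(header = TRUE, text = "
Person,sup1,sup2,sup3,sup4
p1,2,4,4,3
p2,2,5,4,6
p3,8,6,7,9
p4,4,3,5,7
p5,2,1,2,3
p6,5,5,6,8
p7,2,3,2,4")

d_long <- reshape(d_wide, direction = "long",
                  idvar = "Person", varying = list(paste0("sup", 1:4)),
                  timevar = "Supplement", v.names = "Reps")

# Treat both the treatment and the blocking variable as categorical
d_long$Supplement <- factor(d_long$Supplement)
d_long$Person     <- factor(d_long$Person)

# Additive model: treatment plus block, no interaction term
fit_rb <- aov(Reps ~ Supplement + Person, data = d_long)
tab_rb <- summary(fit_rb)[[1]]
tab_rb
```

With one observation per Person by Supplement cell, the interaction cannot be separated from error, which is why the model includes only the two main effects.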
Use the base R help() function to view the full manual for ttest() or ANOVA(). Simply enter a question mark followed by the name of the function.
?ttest
?ANOVA