11.4 – Two sample effect size

Introduction

An effect size is a measure of the strength of the difference between two samples. Effect size tells us the practical importance of a result, say, the difference in outcome between a control group and the treatment group. A combined effect size is a common goal of any meta-analysis (Chapter 20.15).

Note 1: To calculate a combined effect size in meta-analysis, pool individual study effect sizes into a single summary effect. If sample size differs widely across the studies, use a weighted average. Thus, studies with more precision get more weight.
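To make the weighting idea concrete, here is a minimal R sketch of an inverse-variance weighted average. The three effect sizes and variances below are hypothetical, invented for illustration only:

```r
# Hypothetical effect sizes (d) from three studies and their variances
d <- c(0.30, 0.55, 0.42)
v <- c(0.050, 0.012, 0.025)    # smaller variance = more precise study
w <- 1/v                       # inverse-variance weights
weighted.d <- sum(w*d)/sum(w)  # weighted average effect size
weighted.d                     # about 0.48, pulled toward the most precise study
```

The middle study, with the smallest variance, dominates the average; an unweighted mean of the three effect sizes would give about 0.42.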

By including effect sizes, we complement inferential statistics such as p-values. Importantly, effect size moves us beyond “significant” or “not significant.” Significance tells us a difference between the groups is unlikely to be due to chance alone, but offers no information about the practical importance of the observed difference. After all, large studies inherently have more statistical power, which by definition implies that even small differences can be statistically significant.

In statistics there are many estimates of effect size. For categorical data, the odds ratio and relative risk between exposed and unexposed groups are effect size measures; for quantitative data, the correlation and regression slope are effect size measures, in addition to the difference between group means. For differences in means between groups, Cohen’s d (and Hedges’ g) are common statistics. In ANOVA, eta-squared (\eta^{2}) and omega-squared (\omega^{2}) summarize the proportion of total variance in the dependent variable explained by the independent variables, while R^{2}, the coefficient of determination, represents the proportion of variance in the outcome explained by the predictors. Advice and help for understanding “effect size” is an enormously popular topic: nearly 3 million hits in Google Scholar and about forty-five thousand papers in PubMed. Two highly cited papers are those by Durlak (2009) and Ferguson (2009).

The effect size statistic, Cohen’s d, is calculated by subtracting one sample mean from the other and dividing by the pooled standard deviation.

    \begin{align*} d = \frac{\bar{X}_{1} - \bar{X}_{2}}{s_{pooled}} \end{align*}

where s_{pooled} is the pooled standard deviation for the two samples.

Hedges’ g, which uses the pooled sample standard deviation instead of the pooled population standard deviation, is written as

    \begin{align*} g = \frac{\bar{X}_{1} - \bar{X}_{2}}{s_{pooled}} \end{align*}

The s_{pooled} used in d comes from the equation for pooled standard deviation provided in Chapter 3.3, but we’ll give it again here. This form assumes equal sample sizes.

    \begin{align*} s_{pooled} = \sqrt{\frac{s_{1}^2 + s_{2}^2}{2}} \end{align*}

An alternative version of Cohen’s d can be calculated from the value of the t-test statistic

    \begin{align*} d = \frac{2t}{\sqrt{df}} \end{align*}
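Both routes to d give nearly the same answer for equal sample sizes. A quick check in R, with made-up summary values (the means, standard deviations, and sample sizes below are invented for illustration):

```r
# Hypothetical summary statistics for two groups of equal size
m1 <- 10; m2 <- 9; s1 <- 2; s2 <- 2; n1 <- 25; n2 <- 25
s.pooled <- sqrt((s1^2 + s2^2)/2)            # pooled SD (equal n)
d.direct <- (m1 - m2)/s.pooled               # d from the means: 0.5
t.stat <- (m1 - m2)/sqrt(s1^2/n1 + s2^2/n2)  # t statistic
d.from.t <- 2*t.stat/sqrt(n1 + n2 - 2)       # d from t: about 0.51
c(d.direct, d.from.t)
```

The two agree up to the small difference between df = n_{1} + n_{2} - 2 and N = n_{1} + n_{2}; with 2t/\sqrt{N} in place of 2t/\sqrt{df} the match is exact for equal sample sizes.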

A d of one (1) indicates the effect size is equal to one standard deviation; a d of two (2) indicates the difference between the two sample means is equal to two standard deviations, and so on.

The variance of Cohen’s d, V(d), can be estimated as

    \begin{align*} V(d)=\frac{n_{1} + n_{2}}{n_{1}n_{2}}+\frac{d^2}{2\left(n_{1}+n_{2}\right)} \end{align*}
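V(d) lets us attach an approximate standard error and confidence interval to the effect size. A sketch in R, plugging in d = 0.364 and n1 = n2 = 360 from the mouse example later in this section:

```r
# Approximate variance, standard error, and 95% CI for Cohen's d
d <- 0.364; n1 <- 360; n2 <- 360
Vd <- (n1 + n2)/(n1*n2) + d^2/(2*(n1 + n2))
se.d <- sqrt(Vd)                  # approximate standard error of d
ci <- d + c(-1, 1)*1.96*se.d
round(ci, 3)                      # roughly 0.217 to 0.511
```

Because the interval excludes zero, the effect size estimate is consistent with the significant t-test reported for these data.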

Hedges’ g (Hedges 1981), which includes a correction for bias in small samples (analogous to Bessel’s correction), has “−2” in the denominator of the pooled standard deviation calculation.

    \begin{align*} s_{pooled} = \sqrt{\frac{\left(n_{1}-1\right)s_{1}^2 + \left(n_{2}-1\right)s_{2}^2}{n_{1}+n_{2}-2}} \end{align*}

The bias correction makes g slightly smaller, and thus more conservative, than d, a difference that matters mainly for small sample studies.

A simple modification converts d to g. Multiply d by a small-sample correction factor J (Durlak 2009):

    \begin{align*} J \approx 1-\frac{3}{4N-9} \end{align*}

where N = n_{1} + n_{2}.
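As a quick R sketch, apply the correction to the chapter’s mouse example (d = 0.364, N = n1 + n2 = 720); at this sample size J is essentially 1 and g barely differs from d:

```r
# Convert Cohen's d to Hedges' g with the correction factor J
d <- 0.364; N <- 720            # N = n1 + n2
J <- 1 - 3/(4*N - 9)            # correction factor, close to 1 for large N
g <- J*d
c(J, g)                         # J ~ 0.999, g ~ 0.3636
```

For a small study, say N = 20, J = 1 − 3/71 ≈ 0.958, about a four percent shrinkage of d.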

Note 2: Fall 2025 update: this section is actively being modified to better reflect what Cohen, Hedges, and others actually meant. In particular, formulas presented here should be considered as provisional awaiting confirmation and improvements to better reflect historical definitions and modern implementations.

What makes a large effect size?

Cohen cautiously suggested that values of d can be interpreted as

0.2 – small effect size

0.5 – medium effect size

0.8 – large effect size

That is, if the two group means don’t differ by much more than 0.2 standard deviations, then the magnitude of the treatment effect is small and unlikely to be biologically important, whereas a d = 0.8 or more would indicate a difference of 0.8 standard deviations between the sample means and, thus, likely an important treatment effect. Cohen (1992) provided these guidelines based on the following argument. The small effect 0.2 comes from the idea that it is much worse to conclude there is an effect when in fact there is no effect of the treatment than the converse (conclude no effect when there is an effect). The ratio of the Type II error rate (0.2) divided by the Type I error rate (0.05) gives us the penalty of 4. Similarly, for a moderate effect, 0.5/0.05 equals 10. Clearly, these are only guidelines (see Lakens 2013).

Examples

The difference in average body size between six week old females of two strains of lab mice is 0.4 g (Table 1), and increases to 1.48 g by 16 weeks (Table 2).

Table 1. Average body weights of 6 week old female mice of two different inbred strains.

Strain     \bar{X}   s
C57BL/6J   18.5      0.9
CBA/J      18.1      1.27

†Source: Jackson Laboratories: C57BL/6J; CBA/J

Table 2. Average body weights of 16 week old female mice of two different inbred strains.

Strain     \bar{X}   s
C57BL/6J   23.9      2.3
CBA/J      25.38     3.76

†Source: Jackson Laboratories: C57BL/6J; CBA/J

The descriptive statistics are based on weights of 360 individuals in each strain (Jackson Labs).

The differences are both statistically significant by an independent sample t-test, i.e., p-value less than 0.05. I’ll show you how to calculate the independent sample t-test given summary statistics (means, standard deviations, sample sizes) for Table 1 data, then will ask you to do this on your own in Questions.

Write an R script, example data from Table 1

sdd1 <- 0.9
var1 <- sdd1^2
sdd2 <- 1.27
var2 <- sdd2^2
mean1 <- 18.5
mean2 <- 18.1
n1 <- 360
n2 <- 360
dff <- n1 + n2 - 2
# pooled standard deviation (equal sample sizes)
pooledSD <- sqrt((var1 + var2)/2)
# pooled standard error of the difference between means
pooledSEM <- sqrt(var1/n1 + var2/n2); pooledSEM
# t statistic for the difference between means
tdiff <- (mean1 - mean2)/pooledSEM; tdiff
# one-tailed p-value
pt(tdiff, df = dff, lower.tail = FALSE)
# two-tailed p-value
2*pt(tdiff, df = dff, lower.tail = FALSE)
# Cohen's d from the t statistic
2*tdiff/sqrt(dff)

The results we report (value of the test statistic, degrees of freedom, p-value) and the effect size are then

t = 4.875773, df = 718, p-value = 0.0000006675956
Cohen's d = 0.364

Now, when it comes to coding problems, I’m from the school of “don’t reinvent the wheel” or “someone has already solved your problems” (Freeman et al 2008). And, as you would expect, of course someone has written a function to calculate the t-test given summary statistics. In addition to base R and the pwr package (see Chapter 11.5), the package BSDA contains several nice functions for working from summary statistics.

To follow this example, install BSDA, then run the following code

require(BSDA)
tsum.test(mean1, sdd1, n1, mean2, sdd2, n2, alternative = "two.sided", mu = 0, var.equal = TRUE, conf.level = 0.95)

R output

Standard Two-Sample t-Test

data: Summarized x and y
t = 4.8758, df = 718, p-value = 0.000001335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2389364 0.5610636
sample estimates:
mean of x mean of y 
18.5 18.1

Similarly, Cohen’s d is available from a package called effsize.

Note 3: One reason to “reinvent the wheel”: I only needed the one function, while the BSDA package contains more than 330 different objects/functions. A simple way to check how many objects are in a package, e.g., BSDA, run

ls("package:BSDA")            # list the exported objects
length(ls("package:BSDA"))    # count them

BSDA stands for “Basic Statistics and Data Analysis,” and was intended to accompany the 2002 book of the same title by Larry Kitchens.

And of course, if using someone else’s code, give proper citation!

Questions

  1. We needed an equation to calculate pooled standard error of the mean (pooledSEM in the R code). Read the code and write the equation used to calculate the pooled SEM.
  2. Calculate the t-test and the effect size for the Table 1 data, but at three smaller sample sizes. Change from 360 for n1 = n2 = 20, repeat for n1 = n2 = 50, and finally, repeat for n1 = n2 = 100. Use your own code, or use the tsum.test function from the BSDA package.
  3. Calculate Cohen’s effect size d for each new calculation based on different sample size.
  4. Create a table to report the p-values from the t-tests and the effect sizes for each of the four sample sizes (n1 = n2 = 20, 50, 100, 360).
  5. True or false. The mean difference between sample means remains unaffected by sample size.
  6. True or false. The effect size between sample means remains unaffected by sample size.
  7. Based on comparisons in your table, what can you conclude about p-value and “statistical significance?” About effect size?
  8. Repeat questions 2 – 7 for Table 2.
