Journal of Modern Applied Statistical Methods
May 2017, Vol. 16, No. 1, 137-157.
doi: 10.22237/jmasm/1493597280
Copyright © 2017 JMASM, Inc.
ISSN 1538-9472
Ben Derrick is a Lecturer with the Applied Statistics Group. Email at
ben.derrick@uwe.ac.uk. Bethan Russ is an associate with the Applied Statistics Group at
UWE. Email at bethan.russ@ons.gsi.gov.uk. Dr. Toher is a Senior Lecturer with the
Applied Statistics Group. Email at deirdre.toher@uwe.ac.uk. Dr. White is an Associate
Professor and the academic lead for the Applied Statistics Group. Email at
paul.white@uwe.ac.uk.
Test Statistics for the Comparison of Means
for Two Samples That Include Both Paired
and Independent Observations
Ben Derrick
University of the West Of England
Bristol, England, UK
Deirdre Toher
University of the West Of England
Bristol, England, UK
Bethan Russ
Office for National Statistics
Newport, Wales, UK
Paul White
University of the West Of England
Bristol, England, UK
Standard approaches for analyzing the difference in two means, where partially
overlapping samples are present, are less than desirable. Two test statistics, making reference to the t-distribution, are introduced here. It is shown that these test statistics are Type I error robust and more powerful than standard tests.
Keywords: partially overlapping samples, test for equality of means, corrected z-test,
partially correlated data, partially matched pairs
Introduction
Hypothesis tests for the comparison of two population means, μ1 and μ2, with two samples of either independent observations or paired observations are well
established. When the assumptions of the test are met, the independent samples
t-test is the most powerful test for comparing means between two independent
samples (Sawilowsky and Blair, 1992). Similarly, when the assumptions of the
test are met, the paired samples t-test is the most powerful test for the comparison
of means between two dependent samples (Zimmerman, 1997). If a paired design
can avoid extraneous systematic bias, then paired designs are generally
considered to be advantageous when contrasted with independent designs.
There are scenarios where, in a paired design, some observations may be
missing. In the literature, this scenario is referred to as paired samples that are
either “incomplete” (Ekbohm, 1976) or with “missing observations” (Bhoj, 1978).
There are designs that do not have completely balanced pairings. Occasions where
there may be two samples with both paired observations and independent
observations include:
i) Two groups with some common element between both groups. For
example, in education when comparing the average exam marks for
two optional subjects, where some students take one of the two
subjects and some students take both.
ii) Observations taken at two points in time, where the population
membership changes over time but retains some common members.
For example, an annual survey of employee satisfaction may include
new employees that were unable to respond at time point one,
employees that left after time point one, and employees that
remained in employment throughout.
iii) When some natural pairing occurs. For example, in a survey taken
comparing views of males and females, there will be some matched
pairs (couples) and some independent individuals (single).
The examples given above can be seen as part of the wider missing data
framework. There is much literature on methods for dealing with missing data and
the proposals in this paper do not detract from extensive research into the area.
The simulations and discussion in this paper are done in the context of data
missing completely at random (MCAR).
Two samples that include both paired and independent observations are referred to using varied terminology in the literature. The example scenarios
outlined can be referred to as “partially paired data” (Samawi and Vogel, 2011).
However, this terminology has connotations suggesting that the pairs themselves
are not directly matched. Derrick et al. (2015) suggest that appropriate
terminology for the scenarios outlined gives reference to “partially overlapping
samples.” For work that has previously been done on a comparison of means
when partially overlapping samples are present, “the partially overlapping
samples framework… has been treated poorly in the literature” (Martínez-Camblor, Corral, and María de la Hera, 2013, p. 77). In this paper, the term
partially overlapping samples will be used to refer to scenarios where there are
two samples with both paired and independent observations.
When partially overlapping samples exist, the goal remains to test the null
hypothesis H0: μ1 = μ2. Standard approaches when faced with such a situation are
to perform the paired samples t-test, discarding the unpaired data, or alternatively
perform the independent samples t-test, discarding the paired data (Looney and
Jones, 2003). These approaches are wasteful and can result in a loss of power.
The bias created with these approaches may be of concern. Other solutions
proposed in a similar context are to perform the independent samples t-test on all
observations ignoring the fact that there may be some pairs, or alternatively
randomly pairing unpaired observations and performing the paired samples t-test
(Bedeian and Feild, 2002). These methods distort Type I error rates (Zumbo,
2002) and fail to adequately reflect the design. This emphasizes the need for
research into a statistically valid approach. A method of analysis that takes into
account any pairing but does not lose the unpaired information would be
beneficial.
One analytical approach is to separately perform both the paired samples t-
test on the paired observations and the independent samples t-test on the
independent observations. The results are then combined using Fisher’s (1925)
Chi-square method, or Stouffer’s weighted z-test (Stouffer et al., 1949). These
methods have issues with respect to the interpretation of the results. Other
procedures weighting the paired and independent samples t-tests, for the partially
overlapping samples scenario, have been proposed by Bhoj (1978), Kim et al. (2005), Martínez-Camblor, Corral, and María de la Hera (2013), and Samawi and
Vogel (2011).
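As an illustration of the combination approach described above, a minimal R sketch is given below. The two input p-values and the weights are hypothetical placeholders, and combining one-sided p-values (so that the direction of each effect is retained) is only one of several possible conventions.

# Illustrative sketch: combining a paired-samples p-value with an
# independent-samples p-value (hypothetical one-sided p-values).
p_paired <- 0.02   # one-sided p-value from the paired samples t-test
p_indep  <- 0.10   # one-sided p-value from the independent samples t-test

# Fisher's (1925) method: -2 * sum(log(p)) is chi-square with 2k df under H0
fisher_stat <- -2 * (log(p_paired) + log(p_indep))
fisher_p    <- pchisq(fisher_stat, df = 4, lower.tail = FALSE)

# Stouffer's weighted z-test: the weights here are arbitrary illustrative choices
z <- qnorm(c(p_paired, p_indep), lower.tail = FALSE)  # convert p-values to z-scores
w <- c(2, 1)                                          # hypothetical weights
stouffer_z <- sum(w * z) / sqrt(sum(w^2))
stouffer_p <- pnorm(stouffer_z, lower.tail = FALSE)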
Looney and Jones (2003) proposed a statistic making reference to the
z-distribution that uses all of the available data, without a complex weighting
structure. Their corrected z-statistic is simple to compute and it directly tests the
hypothesis H0: μ1 = μ2. They suggest that their test statistic is generally Type I
error robust across the scenarios that they simulated. However, they only consider
normally distributed data with a common variance of 1 and a total sample size of
50 observations. Their simulation results are therefore relatively limited; simulations across a wider range of parameters would help provide stronger
conclusions. Mehrotra (2004) indicates that the solution provided by Looney and
Jones (2003) may not be Type I error robust for small sample sizes.
Early literature for the partially overlapping samples framework focused on
maximum likelihood estimates, when data are missing by accident rather than by
design. Lin (1973) uses maximum likelihood estimates for the specific case where data are missing from one of the two groups, under assumptions such as a known variance ratio. Lin and Stivers (1974) apply maximum likelihood solutions to the more general case, but find that no single solution is applicable.
For normally distributed data, Ekbohm (1976) compared the Lin and Stivers (1974) tests with similar proposals based on maximum likelihood estimators.
Ekbohm (1976) found that maximum likelihood solutions do not always maintain
Bradley’s liberal Type I error robustness criteria. The results suggest that the
maximum likelihood approaches are of little added value compared to standard
methods. Furthermore, the proposals by Ekbohm (1976) are complex mathematical procedures and are unlikely to be considered as a first-choice
solution in a practical environment.
A solution available in most standard software is to fit a mixed model using all of the available data. In a mixed model, effects are assessed using Restricted Maximum Likelihood (REML) estimation. Mehrotra (2004) indicates that for positive correlation, REML is a Type I error robust and more powerful approach than that proposed by Looney and Jones (2003).
For small sample sizes, an intuitive solution to the comparison of means with partially overlapping samples would be a test statistic derived using concepts similar to those of Zumbo (2002), so that all available data are used
making reference to the t-distribution.
Here, two test statistics are proposed. The proposed solution for equal
variances acts as a linear interpolation between the paired samples t-test and the
independent samples t-test. The consensus in the literature is that Welch’s test is
more Type I error robust than the independent samples t-test, particularly with
unequal variances and unequal sample sizes (Derrick, Toher and White, 2016;
Fay and Proschan, 2010; Zimmerman and Zumbo, 2009). The proposed solution
for unequal variances is a test that acts as a linear interpolation between the paired
samples t-test and Welch’s test.
Standard tests and the proposal by Looney and Jones (2003) are given below.
This is followed by the definition of the presently proposed test statistics. A
worked example using each of these test statistics and REML is provided. The
Type I error rates and power for the test statistics and REML are then explored
using simulation, for partially overlapping samples simulated from a Normal
distribution.
Notation
Notation used in the definition of the test statistics is given in Table 1.
Table 1. Notation used in this paper.

na = number of observations exclusive to Sample 1
nb = number of observations exclusive to Sample 2
nc = number of pairs
n1 = total number of observations in Sample 1 (i.e. n1 = na + nc)
n2 = total number of observations in Sample 2 (i.e. n2 = nb + nc)
X̄1 = mean of all observations in Sample 1
X̄2 = mean of all observations in Sample 2
X̄a = mean of the independent observations in Sample 1
X̄b = mean of the independent observations in Sample 2
X̄c1 = mean of the paired observations in Sample 1
X̄c2 = mean of the paired observations in Sample 2
S1² = variance of all observations in Sample 1
S2² = variance of all observations in Sample 2
Sa² = variance of the independent observations in Sample 1
Sb² = variance of the independent observations in Sample 2
Sc1² = variance of the paired observations in Sample 1
Sc2² = variance of the paired observations in Sample 2
S12 = covariance between the paired observations
r = Pearson's correlation coefficient for the paired observations

All variances above are calculated using Bessel's correction, i.e. the sample variance with ni − 1 degrees of freedom (see Kenney and Keeping, 1951, p. 161).
As standard notation, random variables are shown in upper case, and derived sample values are shown in lower case.
Definition of Existing Test Statistics
Standard approaches for comparing two means making reference to the t-
distribution are given below. These definitions follow the structural form given by
Fradette et al. (2003), adapted to the context of partially overlapping samples.
To perform the paired samples t-test, the independent observations are
discarded so that
$$T_1 = \frac{\bar{X}_{c1} - \bar{X}_{c2}}{\sqrt{\frac{S_{c1}^2}{n_c} + \frac{S_{c2}^2}{n_c} - 2r\frac{S_{c1}S_{c2}}{n_c}}}$$

The statistic T1 is referenced against the t-distribution with v1 = nc − 1 degrees of freedom.
To perform the independent samples t-test, the paired observations are
discarded so that
$$T_2 = \frac{\bar{X}_a - \bar{X}_b}{S_p\sqrt{\frac{1}{n_a} + \frac{1}{n_b}}}
\quad \text{where} \quad
S_p = \sqrt{\frac{(n_a - 1)S_a^2 + (n_b - 1)S_b^2}{n_a + n_b - 2}}$$

The statistic T2 is referenced against the t-distribution with v2 = na + nb − 2 degrees of freedom.
To perform Welch’s test, the paired observations are discarded so that
$$T_3 = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{\frac{S_a^2}{n_a} + \frac{S_b^2}{n_b}}}$$

The statistic T3 is referenced against the t-distribution with degrees of freedom approximated by
$$v_3 = \frac{\left(\frac{S_a^2}{n_a} + \frac{S_b^2}{n_b}\right)^2}{\frac{(S_a^2/n_a)^2}{n_a - 1} + \frac{(S_b^2/n_b)^2}{n_b - 1}}$$
For large sample sizes, the test statistic for partially overlapping samples
proposed by Looney and Jones (2003) is
$$Z_{\text{corrected}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_a + n_c} + \frac{S_2^2}{n_b + n_c} - \frac{2n_c S_{12}}{(n_a + n_c)(n_b + n_c)}}}$$

The statistic Zcorrected is referenced against the standard Normal distribution. In the extremes of na = nb = 0, or nc = 0, Zcorrected defaults to the paired samples z-statistic and the independent samples z-statistic respectively.
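For concreteness, a minimal R sketch of these four statistics is given below. The function and argument names are illustrative and not taken from any package; the paired observations are supplied as two equal-length vectors (x1c, x2c) and the independent-only observations as xa and xb.

# Sketch of the standard test statistics defined above (illustrative only).
standard_tests <- function(x1c, x2c, xa, xb) {
  nc <- length(x1c); na <- length(xa); nb <- length(xb)

  # T1: paired samples t-test, unpaired observations discarded
  r  <- cor(x1c, x2c)
  t1 <- (mean(x1c) - mean(x2c)) /
        sqrt(var(x1c)/nc + var(x2c)/nc - 2 * r * sd(x1c) * sd(x2c) / nc)

  # T2: pooled-variance independent samples t-test, paired observations discarded
  sp <- sqrt(((na - 1) * var(xa) + (nb - 1) * var(xb)) / (na + nb - 2))
  t2 <- (mean(xa) - mean(xb)) / (sp * sqrt(1/na + 1/nb))

  # T3: Welch's test, paired observations discarded
  t3 <- (mean(xa) - mean(xb)) / sqrt(var(xa)/na + var(xb)/nb)
  v3 <- (var(xa)/na + var(xb)/nb)^2 /
        ((var(xa)/na)^2/(na - 1) + (var(xb)/nb)^2/(nb - 1))

  # Looney and Jones (2003) corrected z, using all available observations
  x1 <- c(xa, x1c); x2 <- c(xb, x2c)
  n1 <- na + nc;    n2 <- nb + nc
  zc <- (mean(x1) - mean(x2)) /
        sqrt(var(x1)/n1 + var(x2)/n2 - 2 * nc * cov(x1c, x2c) / (n1 * n2))

  list(T1 = t1, v1 = nc - 1, T2 = t2, v2 = na + nb - 2, T3 = t3, v3 = v3, Zcorr = zc)
}

Two-sided p-values then follow as 2 * pt(-abs(T), v) for the t-statistics and 2 * pnorm(-abs(Zcorr)) for the corrected z-statistic.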
Definition of Proposed Test Statistics
Two new t-statistics are proposed; Tnew1, assuming equal variances, and Tnew2,
when equal variances cannot be assumed. The test statistics are constructed as the
difference between two means taking into account the covariance structure. The
numerator is the difference between the means of the two samples and the
denominator is a measure of the standard error of this difference. Thus the test
statistics proposed here are directly testing the hypothesis H0: μ1 = μ2.
The test statistic Tnew1 is derived so that in the extremes of na = nb = 0, or nc = 0, Tnew1 defaults to T1 or T2 respectively, thus

$$T_{\text{new1}} = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2} - \frac{2r\,n_c}{n_1 n_2}}}
\quad \text{where} \quad
S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
The test statistic Tnew1 is referenced against the t-distribution with degrees of freedom derived by linear interpolation between v1 and v2 so that
$$v_{\text{new1}} = (n_c - 1) + \left(\frac{n_a + n_b + n_c - 1}{n_a + n_b + 2n_c}\right)(n_a + n_b)$$

In the extremes, when na = nb = 0, vnew1 defaults to v1; or when nc = 0, vnew1 defaults to v2.
Given the superior Type I error robustness of Welch's test when variances are not equal, a test statistic is derived making reference to Welch's approximate degrees of freedom. This test statistic makes use of the sample variances S1² and S2². The test statistic Tnew2 is derived so that in the extremes of na = nb = 0, or nc = 0, Tnew2 defaults to T1 or T3 respectively, thus
$$T_{\text{new2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} - 2r\,\frac{S_1 S_2 n_c}{n_1 n_2}}}$$
The test statistic Tnew2 is referenced against the t-distribution with degrees of freedom derived as a linear interpolation between v1 and v3 so that
$$v_{\text{new2}} = (n_c - 1) + \left(\frac{n_a + n_b}{n_a + n_b + 2n_c}\right)(\gamma - n_c + 1)
\quad \text{where} \quad
\gamma = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}$$
In the extremes, when na = nb = 0, vnew2 defaults to v1; or when nc = 0, vnew2 defaults to v3.
Note that the proposed statistics, Tnew1 and Tnew2, use all available observations in the respective variance calculations. The statistic Zcorrected only uses the paired observations in the calculation of covariance.
Worked Example
An applied example is given to demonstrate the calculation of each of the test
statistics defined. In education, for credit towards an undergraduate Statistics
course, students may take optional modules in either Mathematical Statistics, or
Operational Research, or both. The program leader is interested in whether the exam marks for the two optional modules differ. The exam marks attained for a single
semester are given in Table 2.
Table 2. Exam marks for students studying on an undergraduate Statistics course.

Student                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Mathematical Statistics  73  82  74  59  49   -  42  71   -  39   -   -   -   -  59  85
Operational Research     72   -  89  78  64  83  42  76  79  89  67  82  85  92  63   -
As per standard notation, the derived sample values are given in lower case. In the calculation of the test statistics, x̄1 = 63.300, x̄2 = 75.786, s1² = 263.789, s2² = 179.874, na = 2, nb = 6, nc = 8, n1 = 10, n2 = 14, v1 = 7, v2 = 6, v3 = 6, γ = 17.095, vnew1 = 12, vnew2 = 10.365, r = 0.366, s12 = 78.679.
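The sketches given in the previous section reproduce these quantities from the Table 2 data; the snippet below is illustrative, and the commented lme4/lmerTest call is one way the mixed model analysis described below could be fitted.

# Exam marks from Table 2, split into paired and module-only observations.
x1c <- c(73, 74, 59, 49, 42, 71, 39, 59)   # Mathematical Statistics, students taking both
x2c <- c(72, 89, 78, 64, 42, 76, 89, 63)   # Operational Research, students taking both
xa  <- c(82, 85)                           # Mathematical Statistics only
xb  <- c(83, 79, 67, 82, 85, 92)           # Operational Research only

standard_tests(x1c, x2c, xa, xb)           # T1, T2, T3 and Zcorrected; compare with Table 3
t_new1(x1c, x2c, xa, xb)                   # statistic, df and p-value; compare with Table 3
t_new2(x1c, x2c, xa, xb)

# One way to fit the REML mixed model (requires lme4 and lmerTest):
# marks   <- data.frame(mark    = c(x1c, xa, x2c, xb),
#                       module  = rep(c("MS", "OR"), times = c(10, 14)),
#                       student = factor(c(1,3,4,5,7,8,10,15, 2,16,
#                                          1,3,4,5,7,8,10,15, 6,9,11,12,13,14)))
# library(lmerTest)
# summary(lmer(mark ~ module + (1 | student), data = marks))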
For the REML analysis, a mixed model is performed with “Module” as a repeated measures fixed effect and “Student” as a random effect. Table 3 gives
the calculated test statistics, degrees of freedom and corresponding p-values.
Table 3. Test statistic values and resulting p-values (two-sided test).

                              T1        T2      T3     Zcorrected   REML      Tnew1     Tnew2
estimate of mean difference  -13.375    2.167   2.167  -12.486     -12.517   -12.486   -12.486
t-value                       -2.283    0.350   0.582   -2.271      -2.520    -2.370    -2.276
degrees of freedom             7.000    6.000   6.000      -        11.765    12.000    10.365
p-value                        0.056    0.739   0.579    0.023       0.027     0.035     0.045
With the exception of REML, the estimates of the mean difference are
simply the difference in the means of the two samples, based on the observations
used in the calculation. It can quickly be seen that the conclusions differ
depending on the test used. It is of note that only the tests using all of the
available data result in the rejection of the null hypothesis at αnominal = 0.05. Also
note that the results of the paired samples t-test and the independent samples t-test
have sample effects in different directions. This is only one specific example
given for illustrative purposes; investigation is required into the power of the test
statistics over a wide range of scenarios. Conclusions based on the proposed tests
cannot be made without a thorough investigation into their Type I error robustness.
Simulation Design
Under normality, Monte-Carlo methods are used to investigate the Type I error
robustness of the defined test statistics and REML. Power should only be used to
compare tests when their Type I error rates are equal (Zimmerman and Zumbo,
1993). Monte-Carlo methods are used to explore the power for the tests that are
Type I error robust under normality.
Unbalanced designs are frequent in psychology (Sawilowsky and Hillman,
1992), thus a comprehensive range of values for na, nb and nc are simulated. These
values offer an extension to the work done by Looney and Jones (2003). Given
the identification of separate test statistics for equal and unequal variances,
multiple population variance parameters {σ1², σ2²} are considered. Correlation has
an impact on Type I error and power for the paired samples t-test (Fradette et al.,
2003), hence a range of correlations {ρ} between two normal populations are
considered. Correlated normal variates are obtained as per Kenney and Keeping
(1951). A total of 10,000 replicates of each of the scenarios in Table 4 are
performed in a factorial design.
All simulations are performed in R version 3.1.2. For the mixed model
approach utilizing REML, the R package lme4 is used. Corresponding p-values
are calculated using the R package lmerTest, which uses the Satterthwaite
approximation adopted by SAS (Goodnight, 1976).
For each set of 10,000 p-values, the proportion of times the null hypothesis
is rejected, for a two-sided test with αnominal = 0.05, is calculated.
Table 4. Summary of simulation parameters.

Parameter   Values
μ1          0
μ2          0 (under H0); 0.5 (under H1)
σ1²         1, 2, 4, 8
σ2²         1, 2, 4, 8
na          5, 10, 30, 50, 100, 500
nb          5, 10, 30, 50, 100, 500
nc          5, 10, 30, 50, 100, 500
ρ           -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75
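A condensed sketch of one cell of this simulation design is shown below. It is illustrative only: it draws the correlated pairs with MASS::mvrnorm rather than the Kenney and Keeping (1951) construction, and it returns the rejection proportion for whichever test function is supplied (for example the t_new2 sketch above).

# Sketch of one simulation cell: Type I error (mu2 = 0) or power (mu2 = 0.5).
library(MASS)  # for mvrnorm

sim_cell <- function(reps = 10000, na = 10, nb = 10, nc = 10,
                     mu1 = 0, mu2 = 0, var1 = 1, var2 = 1, rho = 0.5,
                     test = t_new2, alpha = 0.05) {
  Sigma <- matrix(c(var1, rho * sqrt(var1 * var2),
                    rho * sqrt(var1 * var2), var2), nrow = 2)
  pvals <- replicate(reps, {
    pairs <- mvrnorm(nc, mu = c(mu1, mu2), Sigma = Sigma)   # correlated paired observations
    xa    <- rnorm(na, mu1, sqrt(var1))                     # independent observations, Sample 1
    xb    <- rnorm(nb, mu2, sqrt(var2))                     # independent observations, Sample 2
    test(pairs[, 1], pairs[, 2], xa, xb)["p.value"]
  })
  mean(pvals < alpha)   # proportion of rejections at the nominal level
}

# Example: estimated Type I error rate for one equal-variance cell
# sim_cell(reps = 1000, na = 5, nb = 5, nc = 5, rho = 0.25)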
Type I Error Robustness
For each of the test statistics, Type I error robustness is assessed against Bradley’s
(1978) liberal criteria. These criteria are widely used in many studies analyzing the validity of t-tests and their adaptations. Bradley's (1978) liberal criteria state that the Type I error rate α should be within αnominal ± 0.5αnominal. For αnominal = 0.05, Bradley's liberal interval is [0.025, 0.075].
Type I error robustness is firstly assessed under the condition of equal
variances. Under the null hypothesis, 10,000 replicates are obtained for the
4 × 6 × 6 × 6 × 7 = 6,048 scenarios where σ1² = σ2². Figure 1 shows the Type I
error rates for each of the test statistics under equal variances for normally
distributed data.
Figure 1. Type I error rates where σ1² = σ2²; reference lines show Bradley's (1978) liberal criteria.
Figure 1 indicates that when variances are equal, the statistics T1, T2, T3, Tnew1 and Tnew2 remain within Bradley's liberal Type I error robustness criteria throughout the entire simulation design. The statistic Zcorrected is not Type I error robust, thus confirming the smaller simulation findings of Mehrotra (2004).
Figure 1 also shows that REML is not Type I error robust throughout the entire
simulation design. A review of our results shows that for REML the scenarios that
are outside the range of liberal Type I error robustness are predominantly those
that have negative correlation, and some where zero correlation is specified.
Given that negative correlation is rare in a practical environment, the REML
procedure is not necessarily unjustified.
Type I error robustness is assessed under the condition of unequal variances.
Under the null hypothesis, 10,000 replicates were obtained for the
4 × 3 × 6 × 6 × 6 × 7 = 18,144 scenarios where σ1² ≠ σ2². For assessment against
Bradley’s (1978) liberal criteria, Figure 2 shows the Type I error rates for unequal
variances for normally distributed data.
Figure 2. Type I error rates when σ1² ≠ σ2²; reference lines show Bradley's (1978) liberal criteria.
Figure 2 illustrates that the statistics defined using a pooled standard deviation, T2 and Tnew1, do not provide Type I error robust solutions when equal variances cannot be assumed. The statistics T1, T3 and Tnew2 retain their Type I error robustness under unequal variances throughout all conditions simulated. The statistic Zcorrected maintains similar Type I error rates under equal and unequal variances. The statistic Zcorrected was designed to be used only in the case
of equal variances. For unequal variances, we observe that the statistic Zcorrected results in an unacceptable number of false positives when ρ ≤ 0.25 or when max{na, nb, nc} − min{na, nb, nc} is large. In addition, the statistic Zcorrected is conservative when ρ is large and positive. The largest observed deviations from Type I error robustness for REML are when ρ ≤ 0 or when max{na, nb, nc} − min{na, nb, nc} is large. Further insight into the Type I error rates
for REML can be seen in Figure 3 showing observed p-values against expected p-
values from a uniform distribution.
Figure 3. P-P plots for simulated p-values using REML procedure. Selected parameter combinations (na, nb, nc, σ1², σ2², ρ) are as follows: A = (5,5,5,1,1,-0.75), B = (5,10,5,8,1,0), C = (5,10,5,8,1,0.5), D = (10,5,5,8,1,0.5).
If the null hypothesis is true, for any given set of parameters the p-values
should be uniformly distributed. Figure 3 gives indicative parameter combinations
where the p-values are not uniformly distributed when applying a mixed model
assessed using REML. It can be seen that REML is not Type I error robust when
the correlation is negative. In addition, caution should be exercised if using
REML when the larger variance is associated with the smaller sample size.
REML maintains Type I error robustness for positive correlation and equal
variances or when the larger sample size is associated with the larger variance.
Power of Type I Error Robust Tests under Equal Variances
The test statistics that do not fail to maintain Bradley’s Type I error liberal
robustness criteria are assessed under H1. REML is included in the comparisons for ρ ≥ 0. The power of the test statistics is assessed where σ1² = σ2² = 1, followed by an assessment of the power of the test statistics where σ1² > 1 and σ2² = 1.
Table 5 shows the power of T1, T2, T3, Tnew1, Tnew2 and REML, averaged over all sample size combinations where σ1² = σ2² = 1.
Table 5. Power of Type I error robust test statistics, σ1² = σ2² = 1, α = 0.05, μ2 − μ1 = 0.5.

           ρ       T1      T2      T3      Tnew1   Tnew2   REML
na = nb    0.75    0.785   0.567   0.565   0.887   0.886   0.922
           0.50    0.687   0.567   0.565   0.865   0.864   0.880
           0.25    0.614   0.567   0.565   0.842   0.841   0.851
           0       0.558   0.567   0.565   0.818   0.818   0.829
           <0      0.481   0.567   0.565   0.778   0.778   -
na ≠ nb    0.75    0.784   0.455   0.433   0.855   0.847   0.907
           0.50    0.687   0.455   0.433   0.840   0.832   0.861
           0.25    0.615   0.455   0.433   0.823   0.816   0.832
           0       0.559   0.455   0.433   0.806   0.799   0.816
           <0      0.482   0.455   0.433   0.774   0.766   -
Table 5 shows that REML and the test statistics proposed in this paper, Tnew1 and Tnew2, are more powerful than the standard approaches, T1, T2 and T3, when variances are equal. Consistent with the paired samples t-test, T1, the power of Tnew1 and Tnew2 is relatively lower when there is zero or negative correlation between the two populations. Similar to contrasts of the independent samples t-test, T2, with Welch's test, T3, for equal variances but unequal sample sizes, Tnew1 is marginally more powerful than Tnew2, but not to any practical extent. For each of the test statistics making use of paired data, as the correlation between the paired samples increases, the power increases.
paired samples increases, the power increases.
As the correlation between the paired samples increases, the power
advantage of the proposed test statistics relative to the paired samples t-test
becomes smaller. Therefore the proposed statistics T
new1
and T
new2
may be
especially useful when the correlation between the two populations is small.
To show the relative increase in power for varying sample sizes, Figure 4
shows the power for selected test statistics for small-medium sample sizes,
averaged across the simulation design for equal variances.
Figure 4. Power for Type I error robust test statistics, averaged across all values of ρ where σ1² = σ2² and μ2 − μ1 = 0.5. The sample sizes (na, nb, nc) are as follows: A = (10,10,10), B = (10,30,10), C = (10,10,30), D = (10,30,30), E = (30,30,30).
From Figure 4 it can be seen that for small-medium sample sizes, the power
of the proposed test statistics Tnew1 and Tnew2 is superior to standard test statistics.
Power of Type I Error Robust Tests under Unequal Variances
For the Type I error robust test statistics under unequal variances, Table 6
describes the power of T1, T3, Tnew2 and REML, averaged over the simulation design where μ2 − μ1 = 0.5. Table 6 shows that Tnew2 has superior power properties to both T1 and T3 when variances are not equal. In common with the performance of Welch's test for independent samples, T3, the power of Tnew2 is higher when the larger variance is associated with the larger sample size. In common with the performance of the paired samples t-test, T1, the power of Tnew2 is relatively lower when there is zero or negative correlation between the two populations.
Table 6. Power of Type I error robust test statistics where σ1² > 1, σ2² = 1, α = 0.05, μ2 − μ1 = 0.5. Within this table, na > nb represents the larger variance associated with the larger sample size, and na < nb represents the larger variance associated with the smaller sample size.

           ρ       T1      T3      Tnew2   REML
na = nb    0.75    0.555   0.393   0.692   0.645
           0.50    0.481   0.393   0.665   0.588
           0.25    0.429   0.393   0.640   0.545
           0       0.391   0.393   0.619   0.515
           <0      0.341   0.393   0.582   -
na > nb    0.75    0.555   0.351   0.715   0.589
           0.50    0.481   0.351   0.688   0.508
           0.25    0.429   0.351   0.665   0.459
           0       0.391   0.351   0.642   0.422
           <0      0.341   0.351   0.604   -
na < nb    0.75    0.555   0.213   0.559   0.693
           0.50    0.481   0.213   0.539   0.649
           0.25    0.429   0.213   0.522   0.620
           0       0.391   0.213   0.507   0.603
           <0      0.341   0.213   0.480   -
The apparent power gain for REML when the larger variance is associated
with the larger sample size can be explained by the pattern in the Type I error
rates. REML follows a similar pattern to the independent samples t-test, which is
liberal when the larger variance is associated with the larger sample size, thus
giving the perception of higher power.
To show the relative increase in power for varying sample sizes, Figure 5
shows the power for selected test statistics for small-medium sample sizes,
averaged across the simulation design for unequal variances.
Figure 5. Power for Type I error robust test statistics, σ1² > σ2² and μ2 − μ1 = 0.5. The sample sizes (na, nb, nc) are as follows: A = (10,10,10), B1 = (10,30,10), B2 = (30,10,10), C = (10,10,30), D1 = (10,30,30), D2 = (30,10,30), E = (30,30,30).
Figure 5 shows a relative power advantage when the larger variance is
associated with the larger sample size, as per B2 and D2. A comparison of Figure 4
and Figure 5 shows that for small-medium sample sizes, power is adversely
affected for all test statistics when variances are not equal.
Discussion
The statistic Tnew2 is Type I error robust across all conditions simulated under normality. The greater power observed for Tnew1, compared to Tnew2, under equal variances is likely to be of negligible consequence in a practical environment. This is in line with empirical evidence for the performance of Welch's test, when only independent samples are present, which leads many observers to recommend the routine use of Welch's test under normality (e.g. Ruxton, 2006).
The Type I error rates and power of Tnew2 follow the properties of its counterparts, T1 and T3. Thus Tnew2 can be seen as a trade-off between the paired samples t-test and Welch's test, with the advantage of increased power across all conditions, due to using all available data.
The partially overlapping samples scenarios identified in this paper could be
considered as part of the missing data framework and all simulations have been
performed under the assumption of MCAR.
The statistics proposed in this paper form less computationally intensive
competitors to REML. The REML procedure does not directly calculate the
difference between the two sample means; in a practical environment this makes
its results hard to interpret. The statistics proposed in this paper also lend
themselves far more easily to the development of non-parametric tests.
Conclusion
A commonly occurring scenario when comparing two means is a combination of
paired observations and independent observations in both samples; this scenario is
referred to as partially overlapping samples. Standard procedures for analyzing
partially overlapping samples involve discarding observations and performing
either the paired samples t-test, or the independent samples t-test, or Welch’s test.
These approaches are less than desirable. In this paper, two new test statistics
making reference to the t-distribution are introduced and explored under a
comprehensive set of parameters, for normally distributed data. Under equal
variances, Tnew1 and Tnew2 are Type I error robust. In addition, they are more powerful than the standard Type I error robust approaches considered in this paper. When variances are equal, there is a slight power advantage of using Tnew1 over Tnew2, particularly when sample sizes are not equal. Under unequal variances, Tnew2 is the most powerful Type I error robust statistic considered in this paper.
We recommend that when faced with a research problem involving partially
overlapping samples and MCAR can be reasonably assumed, the statistic Tnew1 could be used when it is known that variances are equal. Otherwise, under the same conditions, when equal variances cannot be assumed the statistic Tnew2 could be used.
A mixed model procedure using REML is not fully Type I error robust. In
those scenarios in which this procedure is Type I error robust, the power is similar
to that of Tnew1 and Tnew2.
The proposed test statistics for partially overlapping samples provide a real
alternative method for analysis for normally distributed data, which could also be
used for the formation of confidence intervals for the true difference in two means.
References
Bedeian, A. G., & Feild, H. S. (2002). Assessing group change under
conditions of anonymity and overlapping samples. Nursing Research, 51(1), 63-65.
doi: 10.1097/00006199-200201000-00010
Bhoj, D. (1978). Testing equality of means of correlated variates with
missing observations on both responses. Biometrika, 65(1), 225-228. doi:
10.1093/biomet/65.1.225
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and
Statistical Psychology, 31(2), 144-152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Derrick, B., Dobson-McKittrick, A., Toher, D., & White, P. (2015). Test
statistics for comparing two proportions with partially overlapping samples.
Journal of Applied Quantitative Methods, 10(3).
Derrick, B., Toher, D., & White, P. (2016). Why Welch’s test is Type I error
robust. The Quantitative Methods for Psychology, 12(1), 30-38. doi:
10.20982/tqmp.12.1.p030
Ekbohm, G. (1976). On comparing means in the paired case with
incomplete data on both responses. Biometrika, 63(2), 299-304. doi:
10.1093/biomet/63.2.299
Fay, M. P., & Proschan, M. A. (2010). Wilcoxon-Mann-Whitney or t-test?
On assumptions for hypothesis tests and multiple interpretations of decision rules.
Statistics Surveys, 4(1). doi: 10.1214/09-SS051
Fisher, R. A. (1925). Statistical methods for research workers. New Delhi,
India: Genesis Publishing Pvt. Ltd.
Fradette, K., Keselman, H. J., Lix, L., Algina, J., & Wilcox, R. (2003).
Conventional and robust paired and independent samples t-tests: Type I error and
power rates. Journal of Modern Applied Statistical Methods, 2(2), 481-496. doi:
10.22237/jmasm/1067646120
Kenney, J. F., & Keeping, E. S. (1951). Mathematics of statistics, Pt. 2 (2nd
Ed). Princeton, NJ: Van Nostrand.
Kim, B. S., Kim, I., Lee, S., Kim, S., Rha, S. Y., & Chung, H. C. (2005).
Statistical methods of translating microarray data into clinically relevant
diagnostic information in colorectal cancer. Bioinformatics, 21(4), 517-528. doi:
10.1093/bioinformatics/bti029
Lin, P. E. (1973). Procedures for testing the difference of means with
incomplete data. Journal of the American Statistical Association, 68(343), 699-
703. doi: 10.1080/01621459.1973.10481407
Lin, P. E., & Stivers, L. E. (1974). Difference of means with incomplete data.
Biometrika, 61(2), 325-334. doi: 10.1093/biomet/61.2.325
Looney, S., & Jones, P. (2003). A method for comparing two normal means
using combined samples of correlated and uncorrelated data. Statistics in
Medicine, 22, 1601-1610. doi: 10.1002/sim.1514
Martínez-Camblor, P., Corral, N., & María de la Hera, J. (2013). Hypothesis
test for paired samples in the presence of missing data. Journal of Applied
Statistics, 40(1), 76-87. doi: 10.1080/02664763.2012.734795
Mehrotra, D. (2004). Letter to the editor, a method for comparing two
normal means using combined samples of correlated and uncorrelated data.
Statistics in Medicine, 23(7), 1179-1180. doi: 10.1002/sim.1693
R Core Team (2014). R: A language and environment for statistical computing (Version 3.1.2). Vienna, Austria: R Foundation for Statistical Computing. www.R-project.org
Ruxton, G. (2006). The unequal variance t-test is an underused alternative to
Student's t-test and the Mann-Whitney U test. Behavioral Ecology, 17(4), 688.
doi: 10.1093/beheco/ark016
Goodnight, J. H. (1976). General linear models procedure. SAS Institute Inc.
Samawi, H. M., & Vogel, R. (2011). Tests of homogeneity for partially
matched-pairs data. Statistical Methodology, 8(3), 304-313. doi:
10.1016/j.stamet.2011.01.002
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the
robustness and type II error properties of the t-test to departures from population
normality. Psychological Bulletin, 111(2), 352. doi: 10.1037/0033-
2909.111.2.352
Sawilowsky, S. S., & Hillman, S. B. (1992). Power of the independent
samples t-test under a prevalent psychometric measure distribution. Journal of
Consulting and Clinical Psychology, 60(2), 240-243. doi: 10.1037/0022-
006X.60.2.240
Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams Jr,
R. M. (1949). The American soldier: adjustment during army life (Studies in
Social Psychology in World War II, Vol. I). Princeton, NJ: Princeton University
Press.
Zimmerman, D. W. (1997). A note on the interpretation of the paired
samples t-test. Journal of Educational and Behavioral Statistics, 22(3), 349-360.
doi: 10.3102/10769986022003349
Zimmerman, D. W., & Zumbo, B. D. (1993). Significance testing of
correlation using scores, ranks, and modified ranks. Educational and
Psychological Measurement, 53(4), 897-904. doi:
10.1177/0013164493053004003
Zimmerman, D. W., & Zumbo, B. D. (2009). Hazards in choosing between
pooled and separate-variances t tests. Psicológica: Revista de Metodología y
Psicología Experimental, 30(2), 371-390.
Zumbo, B. D. (2002). An adaptive inference strategy: The case of auditory
data. Journal of Modern Applied Statistical Methods, 1(1), 60-68. doi:
10.22237/jmasm/1020255000