Clear-Sighted Statistics

Chapter 12: Estimating Sample Sizes

“Once upon a time, there was a little girl named Goldilocks. She went for a walk in the forest. Pretty soon, she came upon a house. She knocked and, when no one answered, she walked right in. At the table in the kitchen, there were three bowls of porridge. Goldilocks was hungry. She tasted the porridge from the first bowl. ‘This porridge is too hot!’ she exclaimed. So, she tasted the porridge from the second bowl. ‘This porridge is too cold,’ she said. So, she tasted the last bowl of porridge. ‘Ahhh, this porridge is just right,’ she said happily and she ate it all up.”1

-- “Goldilocks and the Three Bears”

I. Introduction

Researchers are like Goldilocks when it comes to the size of their samples. They want them sized just right: not too large, not too small. When samples are too large, money and time are wasted. But when they are too small, researchers are concerned that they might miss an important effect. We call this a Type II error. Alternatively, the effect “discovered” may be due only to random sampling error, which we call a Type I error. Type I and Type II errors will be discussed in detail in the chapters on Null Hypothesis Significance Testing. We will see that this is often an issue of statistical power, the ability of the selected hypothesis test to uncover important findings, given the variability of the data and the size of the sample. In essence, properly determining the size of the sample helps researchers conduct tests that have sufficient power to find an important effect when one actually exists while avoiding results that show an effect when one does not exist.

In this chapter we will use z-values to determine the sample size for studies about the population mean, μ, and the population proportion, π. The two techniques covered, however, are of limited use. They cannot be used for two-sample tests, and they cannot be used for the variety of tests that do not follow a normal distribution. These tests include t-tests, F-tests, and chi-square tests. After completing this chapter, you will:

• Understand the basic method to estimate the necessary sample size for the mean using z-values.

• Understand the basic method to estimate the necessary sample size for the proportion using z-values.

• Be cognizant of the problem of shrinkage.

• Be aware that there are more sophisticated methods for estimating sample size.

This chapter is accompanied by two Excel files that you should download and use:

• Chapter12_Estimating_n.xlsx: Excel workbook for estimating sample sizes for the mean and proportion.

• Chapter12_Exercises.xlsx: Excel workbook to be used for the end-of-chapter exercises.

II. Factors Affecting the Size of the Sample

Three factors determine sample size:

1. The Chosen Confidence Level

A 95 percent confidence level is used most frequently. Occasionally we see sample sizes based on 90, 98, 99, or even 99.9 percent confidence levels. The higher the confidence level, the larger the required sample. Here are the z-values for these confidence levels, found using a printed Area Under the Curve table and Microsoft Excel. The Excel values are rounded to three decimal places:

Table 1: Common Confidence Levels Used in Determining Sample Size

Please note: Excel’s calculations for the critical value are more accurate than those found on a paper critical values table.
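For readers who prefer code to a printed table, the critical values can also be computed with a short Python sketch. It assumes a two-tailed confidence interval and uses the standard library's `statistics.NormalDist`; the function name `z_for_confidence` is our own label, not something defined in this chapter or its Excel workbook.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level: float) -> float:
    """Return the two-tailed critical z-value for a level such as 0.95."""
    # The area outside the interval (alpha) is split evenly between two tails.
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.98, 0.99, 0.999):
        z = z_for_confidence(level)
        print(f"{level:.1%} confidence level: z = {z:.3f}")
```

Rounded to three decimal places, this gives 1.645, 1.960, 2.326, 2.576, and 3.291 for the five confidence levels mentioned above.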

2. The Maximum Allowable Error (E)

The maximum allowable error (E) is also known as error tolerance. It can be the same value as the margin of error for the confidence interval. The allowable error is the amount added to and subtracted from the sample mean, X̅, or the sample proportion, p, to locate the endpoints of the confidence interval. The smaller the allowable error, the narrower the confidence interval and the larger the required sample size; the larger the allowable error, the smaller the sample size.

Without knowing the sample size, it is not possible to calculate the margin of error. Researchers, therefore, have to estimate the amount of error they can tolerate.

3. Variability of the Data

The variability of the data, as measured by population standard deviation, σ, also affects the size of the sample. The more variable the data, the larger the required sample size. The problem we face is that the population standard deviation is usually not known. Typically, researchers use one of three methods to estimate the population standard deviation:

a. Estimate the population standard deviation based on available studies on the topic under investigation.

b. Conduct a pilot study, and use the sample standard deviation, s, as the estimate for the population standard deviation.

c. Base the estimate of the population standard deviation on the empirical rule, which states that nearly all observations will lie within plus or minus three standard deviations of the mean. Our estimate of the population standard deviation would be the difference between the highest and lowest values divided by six: (H – L)/6.
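Method c can be sketched in a few lines of Python. The function name and the high and low values below are hypothetical and chosen only for illustration; they do not come from the chapter's examples.

```python
def estimate_sigma_from_range(highest: float, lowest: float) -> float:
    """Rough estimate of the standard deviation: (H - L) / 6."""
    if highest < lowest:
        raise ValueError("highest must be greater than or equal to lowest")
    # Nearly all observations fall within three standard deviations on
    # either side of the mean, so the range spans about six of them.
    return (highest - lowest) / 6.0


if __name__ == "__main__":
    # Hypothetical figures: the largest value seen is 130, the smallest 10.
    print(estimate_sigma_from_range(highest=130.0, lowest=10.0))  # 20.0
```

This is only a rough starting point; a pilot study or prior research will usually give a better estimate when one is available.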

A better option is to use the t-distribution instead of the normal distribution. With a t-distribution, we can use the sample standard deviation.

As we shall see, the variability of the data can also be measured when we are estimating the sample size for a proportion.

III. Estimating the Sample Size for Means

Here is the formula for determining the sample size for means:

Equation 1: Formula for Calculating the Sample Size for the Mean

$$n = \left(\frac{z\sigma}{E}\right)^{2}$$

Where: z = z-value that corresponds to the selected confidence level

σ = Population standard deviation (or an estimate of it)

E = Allowable error

n = Sample size
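One way to see where the sample size formula for the mean comes from, assuming the usual z-based margin of error for a mean, is to set that margin equal to the allowable error and solve for n:

```latex
E = z \cdot \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n} = \frac{z\,\sigma}{E}
\quad\Longrightarrow\quad
n = \left(\frac{z\,\sigma}{E}\right)^{2}
```

Because E sits in the denominator and the whole ratio is squared, cutting the allowable error in half quadruples the required sample size.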

Here is an example of estimating the sample size for the mean. The Metropolitan Transit Authority wants to determine how much the average passenger spends on public transportation during a 90-day period. Available data suggest that the standard deviation is $20. The allowable error is $4.50. How large a sample is required?

Table 2: Sample Size Calculations for the Mean at 95% and 99% Confidence Levels

The required sample size is 76 passengers when using a 95 percent confidence level and 132 when using a 99 percent confidence level. Please note: Whenever the answer is not a whole number, we round up to the next highest whole number. Never round down: Doing so will result in a sample size that is too small.
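The transit example can be checked with a minimal Python sketch. It assumes the formula n = (zσ/E)², rounds up with `math.ceil`, and computes z with the standard library rather than reading it from a table; the function name `sample_size_for_mean` is our own.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma: float, error: float, confidence_level: float) -> int:
    """Minimum n for estimating a mean to within +/- error."""
    # Two-tailed critical value for the chosen confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    n = (z * sigma / error) ** 2
    # Always round up: rounding down would leave the sample too small.
    return math.ceil(n)


if __name__ == "__main__":
    # Inputs from the transit example: sigma = $20, E = $4.50.
    for level in (0.95, 0.99):
        n = sample_size_for_mean(sigma=20.0, error=4.50, confidence_level=level)
        print(f"{level:.0%} confidence level: n = {n}")
```

With these inputs the sketch returns 76 at the 95 percent confidence level and 132 at the 99 percent confidence level, matching the figures above.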

While Microsoft Excel does not have built-in sample size functions, it is still a very useful tool for calculating sample sizes. Figure 1 shows the sample size calculations done using Excel.

Figure 1: Sample Size Calculations for the Mean in Microsoft Excel

IV. Estimating the Sample Size for Proportions

Here is the formula for determining the sample size for proportions:

Equation 2: Formula for Calculating the Sample Size for the Proportion

$$n = p(1 - p)\left(\frac{z}{E}\right)^{2}$$

Where: z = z-value that corresponds to the selected confidence level

p = Estimate of the proportion based on available data or a pilot study

E = the allowable error

n = Sample size
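As with the mean, the formula for the proportion can be recovered from the z-based margin of error, here assuming the standard error of a proportion is the square root of p(1 − p)/n:

```latex
E = z \sqrt{\frac{p(1 - p)}{n}}
\quad\Longrightarrow\quad
E^{2} = z^{2}\,\frac{p(1 - p)}{n}
\quad\Longrightarrow\quad
n = p(1 - p)\left(\frac{z}{E}\right)^{2}
```

The term p(1 − p) plays the role that σ² plays for the mean: it measures the variability of the data.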

A major pet food company wants to survey dog owners in large metropolitan areas about their new raw meat line of dog food. Available data suggest that 25 percent of households have at least one dog. Researchers at the company want the sample to have an allowable error of 2.5 percent. They also intend to use a 95 percent confidence level. How large a sample is needed?

Here are the inputs for our formula:

p = 0.25

z = 1.96 for a 95 percent CL, 2.58 for a 99 percent CL

E = 0.025

Table 3: Sample Size Calculations for the Proportion at 95% and 99% Confidence Levels

The answer: The required sample size is 1,153 when using a 95 percent confidence level and 1,997 when using a 99 percent confidence level. Remember: Whenever the result of the calculation is not a whole number, round up any fractional number to the next highest whole number. Never round down: Doing so will result in a sample size that is too small.
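The pet food example can be reproduced with a short Python sketch. It assumes the formula n = p(1 − p)(z/E)² and, to match the results above, takes the rounded z-values listed in the inputs (1.96 and 2.58) as arguments; the function name is our own.

```python
import math


def sample_size_for_proportion(p: float, error: float, z: float) -> int:
    """Minimum n for estimating a proportion to within +/- error."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    n = p * (1.0 - p) * (z / error) ** 2
    # Always round up to the next whole number.
    return math.ceil(n)


if __name__ == "__main__":
    # Inputs from the pet food example: p = 0.25, E = 0.025.
    for label, z in (("95%", 1.96), ("99%", 2.58)):
        n = sample_size_for_proportion(p=0.25, error=0.025, z=z)
        print(f"{label} confidence level: n = {n}")
```

This returns 1,153 and 1,997. Note that the 99 percent result is sensitive to how z is rounded: with the more precise value of about 2.576, the same calculation gives 1,991.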

Figure 2 shows the sample size calculations for proportions at various confidence levels.

Figure 2: Calculation for Sample Size for the Proportion

Sample Size Estimation for the Proportion using Excel

When we have no estimate of the population proportion, we use 50 percent, 0.50. This will result in the largest possible sample size at the selected level of confidence.

Table 4: Sample Size Calculation for the Proportion When There Is No Estimate of the Proportion

The required sample size is 1,537 at a 95 percent confidence level with an allowable error of 2.5 percent.
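A brief Python sketch shows why 0.50 is the safe choice. It assumes z = 1.96 and E = 0.025, as in the pet food example, and tries a handful of illustrative values of p; the function name and the list of values are our own.

```python
import math


def sample_size_for_proportion(p: float, error: float, z: float = 1.96) -> int:
    """Minimum n for a proportion, rounded up to a whole number."""
    return math.ceil(p * (1.0 - p) * (z / error) ** 2)


if __name__ == "__main__":
    # p(1 - p) is largest at p = 0.50, so n peaks there as well.
    for p in (0.10, 0.25, 0.40, 0.50, 0.60, 0.75, 0.90):
        n = sample_size_for_proportion(p, error=0.025)
        print(f"p = {p:.2f}: n = {n}")
```

The required sample size rises as p approaches 0.50 from either side and reaches its maximum of 1,537 at p = 0.50, so a sample sized for 0.50 is large enough whatever the true proportion turns out to be.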

V. Shrinkage

The formulas shown above provide the minimum sample sizes for the assumptions used in the calculations. Remember that whenever we conduct research with human subjects, we get non-response errors because people drop out of the study or fail to answer important questions for a variety of reasons. Non-response error, in effect, reduces the size of the sample. This is called shrinkage. Good researchers must ensure that the sample size accounts for non-response errors. Techniques used to do this are beyond the level of an introductory statistics course.

VI. Limitations of these Methods

The two methods shown above for determining sample size are limited to one-sample studies of the mean and the proportion in which the normal distribution, and therefore the use of z-values, is appropriate. As we will discuss in future chapters on Null Hypothesis Significance Testing (NHST), using z-values is not always appropriate. In fact, there are many significance tests for which the two methods just covered are not appropriate.

In addition, important considerations for NHST like statistical power and the probability of a Type II error are not considered. Statistical power is the ability of a test to detect an effect when one exists. A Type II error is a false negative or the failure to reject the null hypothesis when there is an effect. These terms will be explained in the next chapter.

There are many options when setting sample size. In the following chapters, we will focus on two of the simplest: G*Power and the sample size calculators on a website called Statistics Kingdom.

G*Power is a free software package that calculates sample size requirements for a wide variety of analyses: z-tests, t-tests, F-tests, chi-square tests, linear regression, logistic regression, as well as a variety of nonparametric tests. G*Power reports the sample size needed to achieve a certain level of statistical power, the ability of a statistical analysis to detect an important effect when one actually exists. We will discuss statistical power in our next chapter, and we will use G*Power to select sample size in the context of NHST.

A shortcoming of G*Power is that it only runs on Windows and Macintosh devices. If we are using a Chromebook, iPad, or smart phone, we can determine sample size for a wide variety of significance tests using the calculators on the Statistics Kingdom website.

VII. Summary

We have completed our discussion of calculating the sample size for analyses of the mean and proportion for a one-sample test. In Chapter 13, we will begin our discussion of null hypothesis significance testing.

VIII. Exercises

Complete the following problems. You can use Excel to calculate your solutions by using the file titled Chapter12_Exercises.xlsx.

Exercise 1: The Association of Used Car Dealers wants to study the maintenance costs for the first year after cars have come off the manufacturer’s warranty. How large a sample is needed when s = $575 and E = $100? Calculate the sample size using 90%, 95%, 98%, and 99% confidence levels.

Exercise 2: The national law firm of Dewey, Cheatem, and Howe intends to conduct a study of salaries for newly graduated lawyers who have just passed a state bar exam. How large a sample is needed if we assume the standard deviation is $20,555 and the allowable error is $3,500? Calculate sample sizes using 90%, 95%, 98%, and 99% confidence levels.

Exercise 3: The American Association of Realtors and the U.S. Association for the Paranormal are interested in conducting a survey among people who believe that ghosts are real. The results of a HuffPost/YouGov poll suggest that 43% of the population thinks ghosts are real. How large a sample is required if you set the proportion at 43% and the allowable error at 3%? Calculate the sample size using 90, 95, 98, and 99% levels of confidence. Then conduct the calculations assuming that you have no information about the proportion of the population that believes in ghosts.

Exercise 4: The American Association of Realtors and the US Association for the Paranormal are interested in conducting a survey among people who believe ghosts are real and not dangerous. The results of a HuffPost/YouGov poll suggest that 30% of the population thinks ghosts are real and not dangerous. These people would be willing to buy a home that is haunted. How large a sample is required? Calculate the sample size using 90, 95, 98, and 99% levels of confidence with an allowable error of 0.03.

Except where otherwise noted, Clear-Sighted Statistics is licensed under a
Creative Commons License. You are free to share derivatives of this work for
non-commercial purposes only. Please attribute this work to Edward Volchok.

References

1 “Goldilocks and the Three Bears,” https://www.dltk-teach.com/rhymes/goldilocks_story.htm.
