Clear-Sighted Statistics
Chapter 13: Introduction to Null Hypothesis Significance Testing (NHST)
…I shall not require of a scientific system that it shall be capable of being singled out once and for all, in a positive sense; but I shall require that its logical form shall be such that it can be singled out, by means of empirical tests, in a negative sense: it must be possible for an empirical scientific system to be refuted by experience [italics added].¹
-- Karl R. Popper in The Logic of Scientific Discovery
I. Introduction
Karl Popper, the twentieth-century philosopher of science, argued that any system of ideas that cannot be falsified—refuted or nullified—by empirical tests is not a science. For Popper and the framers of Null Hypothesis Significance Testing (NHST), scientists do not prove the “truth” of scientific propositions; they use empirical evidence to falsify or disprove them. The fictional detective, Sherlock Holmes, made this point to his sidekick Dr. Watson in the 1946 movie Dressed to Kill when he declared, “The truth is only arrived at by the painstaking process of eliminating the untrue.”²
NHST is a widely used method of falsifying scientific propositions. It does not, however, verify or prove propositions. This is why we should avoid saying any hypothesis is true or correct. In this chapter, we will review the basics of Null Hypothesis Significance Testing. In subsequent chapters a variety of null hypothesis significance tests will be covered in detail.
The null hypothesis is a proposition that is tested through a process of nullification or falsification. The phrase “null hypothesis” may sound odd today. The word “null” means without value, effect, consequence, or significance. As we shall see, the null hypothesis means that there is nothing of importance, that the data are merely the result of random sampling error or that there is no important effect or difference.
II. The Origins of NHST
In 1925, the British statistician Ronald A. Fisher published the first edition of Statistical Methods for Research Workers. In his ground-breaking book, Fisher laid the foundation for “significance tests.”³ In 1928, Jerzy Neyman and Egon Pearson, the son of Karl Pearson, wrote a series of papers to correct what they considered flaws in Fisher’s approach.⁴ They called their approach “hypothesis testing.” Among their innovations were the introduction of the alternate hypothesis, and errors of the first and second type. Type I or alpha errors (α) are false positives, and Type II or beta (β) errors are false negatives. The innovations of R. A. Fisher and Neyman and Pearson provide much of the structure of NHST used today. Fisher and Neyman waged an acrimonious debate about the merits of their competing positions until Fisher’s death in 1962.
Over the years, textbook authors have combined Fisher’s significance testing and Neyman-Pearson’s hypothesis testing into a unified approach even though some commentators have argued that these approaches are inherently contradictory. In particular, some argue that Fisher’s concept of p-values is not compatible with the Neyman-Pearson hypothesis test in which it has become embedded.⁵ Geoff Cumming, the author of Understanding the New Statistics, whom we will discuss in Chapter 19, argued that the consolidation of the Fisher and Neyman-Pearson approaches is a muddled amalgam:
To some extent students may be expected to learn one rationale and procedure (Neyman-Pearson), but see a quite different one (Fisher) modeled in the journal articles they read…. It might be tempting to regard a mixture of the two approaches as possibly combining the best of both worlds, but the two frameworks are based on incompatible conceptions of probability. The mixture is indeed incoherent, and so it’s not surprising that misconceptions about NHST are so widespread.⁶
For better or worse, we will follow convention and consider the Fisher and Neyman-Pearson approaches as a unified approach. In Chapter 19, we will discuss the shortcomings of NHST as practiced since the middle of the twentieth century.
NHST is a set of procedures that uses sample data and probability theory to determine whether a proposition about a population should be rejected, which is to say, nullified. At the final step of a hypothesis test, one of two decisions will be reached:
1. There is insufficient evidence to reject the null hypothesis. In such cases, any difference between the sample statistic and the population parameter, between the samples, or between the observed and expected frequencies is considered merely random sampling error and, therefore, not statistically significant.
2. There is sufficient evidence to reject the null hypothesis. In these cases, we declare the results “statistically significant.” This means there is a low probability that the results are due to random sampling error; that is, the p-value, or the probability of committing a Type I error, is very low.
III. What a Hypothesis is Not
In the early sixteenth century under the watchful eye of an impatient Pope Julius II, the great renaissance artist Michelangelo labored non-stop for four years painting the ceiling of the Sistine Chapel in Vatican City. Michelangelo’s frescoes are among the world’s greatest artistic achievements. A section of his masterpiece, shown in Figure 1, illustrates God’s creation of Adam. The creation story found in Genesis, the first book of the Old Testament, is not a hypothesis because it contains propositions that the Divine Being exists, formed Adam from dust, and then gave him the breath of life.⁷ Because the existence of the God of the Old and New Testament and the life of Adam are not subject to empirical verification, the existence of God and Adam are matters of faith, not science.
Figure 1: God Creating Adam in Michelangelo’s The Creation of Adam
IV. What a Hypothesis is and How It Differs from a Theory
A hypothesis is often considered an “educated guess.” This is not wrong. Hypotheses can be guesses, educated or otherwise. They can even be based on mean-spirited and ill-informed bigotry. But hypotheses are more than everyday opinions, be they benign or malicious. A hypothesis is a preliminary proposition that provides an explanation of some phenomenon, which can be tested. In statistics, a hypothesis is generally a tentative statement about a population developed from a sample. The null hypothesis is a preliminary inference based on limited evidence that can be refuted through testing. This is what the movie version of Sherlock Holmes meant by the elimination of the untrue.
The Difference Between Theories and Hypotheses
A theory is a hypothesis that has withstood repeated attempts at falsification and has become a well-established principle. Theories are propositions that provide a unified explanation of phenomena and that have withstood repeated attempts to refute them. Common theories in the natural sciences include Albert Einstein’s theory of relativity, quantum mechanics, and Nicolaus Copernicus’ heliocentric model of the solar system, in which the Earth revolves around the Sun. In economics, we have John Maynard Keynes’ theories concerning macroeconomics. In psychology, there is Ivan Petrovich Pavlov’s theory of classical conditioning.
The theory of natural selection was once a mere hypothesis developed by Charles Darwin, who devoted his life to collecting evidence that would elevate his hypothesis to a theory. Darwin’s work inspired British statisticians in the late nineteenth and early twentieth centuries. In 1901, Darwin’s younger cousin, Francis Galton⁸, who created the concept of correlation, launched Biometrika along with the statistician Karl Pearson and the biologist Raphael Weldon. This journal, which still exists today, took the statistical study of Darwin’s theory of natural selection as its starting point.⁹ In the first issue of Biometrika, the editors asked, “…may we not ask how it came about that the founder of our modern theory of descent [Charles Darwin] made so little appeal to statistics?”¹⁰ The goal of this journal was to give Darwin’s natural selection stronger statistical support.
Like hypotheses, theories can be falsified, but this happens infrequently, and the implications are far more important than the mere rejection of a null hypothesis. When a theory is falsified, a paradigm shift may result.¹¹ Paradigm shifts open new approaches to understanding phenomena that previously have not been considered. Darwin’s theory of evolution based on natural selection, which he presented in 1859 in On the Origin of Species, ushered in a paradigm shift that changed our view of humanity’s place in nature. After Darwin, many people began to think that humanity is a part of nature, not above it.¹²
A paradigm shift is a concept developed by the twentieth century philosopher of science Thomas S. Kuhn.¹³ A paradigm shift is a scientific revolution in which the core concepts underlying a scientific discipline fundamentally change. In Chapter 19, we will briefly consider whether the science of statistics and the notion of NHST is in the midst of a paradigm shift.
V. A Non-Mathematical NHST
You will recall that John W. Tukey likened descriptive statistics, or exploratory data analysis, to detective work and inferential statistics to a trial before a jury or judge.¹⁴ A detective working in law enforcement gathers evidence about a crime. The detective may collect sufficient evidence to warrant an indictment, which is a formal initiation of a criminal case against a person. In an indictment, the person charged, the defendant, stands trial before a jury or a judge. The central proposition of our criminal justice system is that the defendant is presumed innocent of the crime until the prosecutor can convince the jury or judge beyond a reasonable doubt of the defendant’s guilt.
A criminal trial and NHST share the same premise. In the trial, the defendant is presumed “not guilty.” This is, in essence, what statisticians call the null hypothesis, H0, which is pronounced H-zero, H-oh, H-null, or H-nought. The assumption underlying it is that the sample statistic (X-bar, p, s², or r) equals the corresponding population parameter (μ, π, σ², or ρ). In the case of chi-square tests, we compare observed distributions with expected distributions. The null hypothesis states that any observed difference between the statistic and the parameter, or between the observed and expected distributions, is merely the result of random sampling error or chance. The null hypothesis is based on the assumption that there is no difference, nothing happened, there is no effect, or there is no credible evidence. It is a provisional explanation that will be tested and possibly refuted. In Chapters 14 through 18, we will deal with a variety of hypothesis tests. With most tests, the null and alternate hypotheses show only the population parameter, never the sample statistic. The null hypothesis also includes an equal sign: =, ≤, or ≥.
The prosecutor presents evidence to the court that he or she hopes will prove the case against the defendant beyond a reasonable doubt.¹⁵ The prosecutor’s case is what statisticians call the alternate hypothesis, which is symbolized as H1 or HA. The alternate hypothesis is sometimes called the research hypothesis because it embodies the research question. With NHST there can only be two hypotheses. Taken together, the null hypothesis and alternate hypothesis are an “either/or” proposition. The two hypotheses are mutually exclusive and collectively exhaustive. When the jury decides that the defendant is guilty beyond a reasonable doubt, the null hypothesis has been falsified and the defendant is declared guilty.
With NHST, the alternate hypothesis is “accepted” only when there is sufficient evidence to falsify the null hypothesis. As Ronald Fisher wrote in 1935, “…the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”¹⁶ When the null hypothesis is not rejected, we do not say that it has been proven. We make a more cautious statement by saying “we fail to reject the null hypothesis.” In addition, when the null hypothesis is rejected, we never say the alternate hypothesis is true.
Like the null hypothesis, the alternate hypothesis deals with parameters: μ, π, σ², or ρ. It, like the null hypothesis, never shows the symbol for the sample statistic. In addition, with z and t tests, the alternate hypothesis has one of three signs, each the opposite of the sign in the null hypothesis: ≠, >, or <. These signs point to the rejection region or regions on the normal or t-distribution. With a one-way ANOVA test, the alternate hypothesis does not use mathematical symbols. It uses a variant of this simple phrase: “The treatment means are not all equal.” With chi-square tests, the alternate hypothesis states that the observed set of frequencies does not match the expected set of frequencies.
The verdict in a jury trial is based on the “beyond a reasonable doubt” standard, which cannot be quantified. In contrast, the decision to reject the null hypothesis is based on a quantitative measure called a level of significance, which is often called an alpha (α) level. As we discussed when we reviewed confidence levels, significance levels are found by (1 – the confidence level). Fisher introduced this concept in his book, Statistical Methods for Research Workers. Neyman and Pearson modified it. Today we set the significance level when we begin testing. Both Fisher and Neyman-Pearson gave tacit approval to using five percent significance levels. The significance level is the researchers’ tolerance of committing a Type I error. The 5 percent significance level is still the most commonly used significance level today, although, as we shall see in Chapter 19, the habit of using five percent significance levels has come under increased scrutiny. Sometimes a one percent significance level may be used, which makes it harder to reject the null hypothesis. When the test is a preliminary or pilot study, a 10 percent significance level may be used, which makes it easier to reject the null hypothesis.
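To illustrate the relationship between confidence levels and significance levels, the two are simple complements:

α = 1 – 0.95 = 0.05 for a 95 percent confidence level
α = 1 – 0.99 = 0.01 for a 99 percent confidence level
α = 1 – 0.90 = 0.10 for a 90 percent confidence level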
To repeat: Whenever we reject the null hypothesis, we do not consider the alternate hypothesis to be “true.” The alternate hypothesis becomes the new null hypothesis, and is subject to falsification. When we do not reject the null hypothesis, we never say that it is “true.” In fact, we should never say we “accept the null hypothesis.” We say, “we failed to reject the null hypothesis.” We always doubt the “truth” of our hypotheses. Similarly, in a jury trial, when the defendant is not convicted, the judgment is that the defendant is “not guilty.” The defendant is never declared “innocent.”
As with jury trials, null hypothesis tests are subject to two kinds of errors. Søren Kierkegaard, the nineteenth century Danish existentialist philosopher who was not a statistician, got very close to the essence of Neyman-Pearson’s Type I and Type II errors when he wrote, “…one can be deceived in believing what is untrue [Type I Error], but on the other hand, one is also deceived in not believing what is true [Type II Error].”¹⁷
We can represent criminal trials and NHST with a 2 by 2 matrix to delineate the range of decisions regarding the null hypothesis and the two types of errors that can be made. See Table 1.
Table 1: Type I and Type II Errors
A jury or a research analyst can make one of two correct decisions.
Correct Decision #1: Correctly fail to reject the null hypothesis. A man who did not commit the crime is acquitted. The difference between the population parameter and sample statistic is considered random sampling error or the expected and observed frequencies are found to be roughly equal.
Correct Decision #2: Correctly convict a guilty man, or the analysts correctly decide that the parameter and statistic are not equal or the expected and observed frequencies are unequal.
A jury or research analysts can also make one of two incorrect decisions.
Type I Errors - False Positives: A Type I error occurs when we reject the null hypothesis when we should have failed to reject it. In a jury trial, a Type I error occurs when the jury convicts an innocent man. To repeat, the researchers’ tolerance of committing a Type I error is set by the level of significance, or α. For any hypothesis test, however, we cannot tell whether we have committed a Type I error. All we can know is the probability of committing such an error. The calculated probability of committing a Type I error is called the p-value, which we will discuss shortly. Here is an example of a Type I error. Suppose a person has a physical examination. At the end of the exam, the doctor tells his patient that he needs immediate surgery to remove a cancerous tumor. After the surgery, the pathology report shows the growth to be noncancerous. The doctor has committed a Type I error.
Type II Errors - False Negatives: In a jury trial, a Type II error occurs when the jury fails to convict a guilty man. In NHST, a Type II error is often due to a lack of statistical power. Statistical power is the ability of a test to correctly reject the null hypothesis, which is to say, to find a significant result when one exists. Statistical power is found by 1 – β, or 1 minus the probability of a Type II error. The maximum statistical power is 1, or 100 percent, and the minimum is zero. A power of 1 means that there is a 100 percent chance of finding a statistically significant effect, and a power of zero means that it is impossible to find such an effect. If a test had 100 percent statistical power, it would be impossible not to reject the null hypothesis. Such tests are considered over-powered. The generally accepted minimum level of statistical power is 80 percent. An 80 percent power means that the test has an 80 percent probability of correctly rejecting the null hypothesis.
Figure 2 shows the relationship between Type II errors and Type I errors, population variance (σ²), and sample size (n). Holding everything else constant, when the probability of a Type II error goes up, the probability of a Type I error goes down. The probability of a Type II error also increases when the variance goes up and when the sample size, n, goes down.
Figure 2: The Relationship of Type II Errors to Type I Errors, Variance, and Sample Size
Sadly, most introductory textbooks devote little, if any, attention to calculating statistical power. Worse still, statisticians like John Ioannidis have demonstrated that scholarly literature is plagued with underpowered studies that lead to false findings.¹⁸ In Clear-Sighted Statistics, we will review how to calculate statistical power, the probability of a Type II error, and effect size (ES), which, broadly speaking, measures the strength of the relationship under investigation. We will do this using a popular program called G*Power, which is available for free for Windows and Macintosh computers. We will also use online power calculators.
Four factors affect statistical power.
- The lower the selected significance level (α)—for example, one percent rather than five percent—the lower the statistical power.
- The larger the sample size, n, the greater the statistical power.
- The more variable the data, the lower the statistical power.
- The smaller the effect, the lower the statistical power.
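The effect of these four factors can be illustrated with a short calculation. The following is a minimal sketch in Python, assuming a two-tailed, one-sample z-test with a known population standard deviation; the function name and the numbers plugged into it are purely illustrative, not values taken from this chapter:

from math import sqrt
from scipy.stats import norm

def z_test_power(effect, sigma, n, alpha=0.05):
    # Approximate power of a two-tailed, one-sample z-test.
    # effect: true difference between the population mean and the hypothesized mean
    # sigma: population standard deviation; n: sample size; alpha: significance level
    z_crit = norm.ppf(1 - alpha / 2)      # critical value for a two-tailed test
    shift = effect / (sigma / sqrt(n))    # true difference measured in standard errors
    # Probability the test statistic lands in either rejection region
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)

print(round(z_test_power(effect=2, sigma=10, n=50), 3))    # about 0.29
print(round(z_test_power(effect=2, sigma=10, n=200), 3))   # about 0.81

In this sketch, raising the sample size from 50 to 200 lifts power from roughly 29 percent to roughly 81 percent; lowering α, shrinking the effect, or increasing σ would push power in the opposite direction.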
VI. Parametric vs. Nonparametric Tests
There are two broad categories of null hypothesis significance tests: parametric tests and nonparametric tests. With parametric tests, we assume that the data come from a known family of distributions, such as normal distributions, t-distributions, and F-distributions. With nonparametric tests, the data are not assumed to follow a known family of distributions and no assumptions about the population parameters are made. Chi-square tests are the only nonparametric tests covered in Clear-Sighted Statistics.
VII. Statistical Power
Statisticians like John Ioannidis have demonstrated that scholarly literature is plagued with underpowered studies that lead to false findings.¹⁸ Sadly, most introductory textbooks devote little, if any, attention to calculating statistical power and the probability of a Type II error. In Clear-Sighted Statistics, we will review how to calculate statistical power and effect size (ES), which, broadly speaking, measures the strength of the relationship under investigation.
Statistical power is the ability of a null hypothesis significance test to detect an important effect. Or, stated another way, statistical power is the probability of avoiding a Type II error. Type II errors, as we have discussed, are false negatives. A false negative occurs when the null hypothesis is not rejected when it should have been. In essence, a false negative means that the test was too weak to detect a real effect.
Here is an example of a non-statistical false negative. Imagine that a friend of yours is having chest pains. He goes to the emergency room and tells the doctors that he thinks he is having a heart attack. The doctors run a battery of tests and conclude that his heart is fine. They tell him that the most likely cause of his chest pain, given his medical history, is acid reflux. Relieved, he thanks the doctors and walks out of the emergency room. As soon as he steps onto the street he collapses and dies of a massive heart attack. The emergency room doctors have just committed a Type II error. As we shall see, some false negatives in statistics can have very serious consequences. This is why we should always calculate the statistical power of our significance tests.
Statistical Power is Related to the Probability of a Type II Error
Statistical power is the complement of the probability of a Type II Error. The relationship between statistical power and a Type II Error can be expressed with these simple formulas:
Equation 1: The relationship of Statistical Power and Type II Errors
Statistical Power = 1 – P(Type II)
P(Type II) = 1 – Statistical Power
Statistical Power + P(Type II) = 1.00 or 100%
Statistical power and the probability of a Type II error, therefore, are mutually exclusive and collectively exhaustive.
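For example, if the probability of a Type II error is 0.20, statistical power is 1 – 0.20 = 0.80, or 80 percent. Conversely, a test with 90 percent power carries a 1 – 0.90 = 0.10, or 10 percent, probability of a Type II error.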
The Problem of Under-Powered and Over-Powered Tests
Technically, under-powered tests are unreliable estimators because the results from under-powered tests are more widely distributed and have thicker tails than tests with sufficient statistical power. Data analysts say the results of under-powered tests are extremely “noisy.” Noisy data means the data have a distorted signal that may lead to erroneous conclusions. See Figure 3 for a graphic representation of this phenomenon:
Figure 3: Distribution of Test Results – Under-Powered test vs. Tests with Sufficient Power
Under-powered tests can have unfortunate consequences because such tests have low probability of detecting an effect of practical importance. When an underpowered test fails to uncover an important effect, there is a risk that promising research may be abandoned prematurely.
But there is a more serious problem with under-powered tests. When the null hypothesis is rejected with an under-powered test, the estimated effect size is likely to be inflated compared with the effect sizes found by tests with greater statistical power. Essentially, the effect uncovered may be merely the result of random sampling error. As a consequence, the test results are unreliable, and other researchers will have difficulty reproducing the study’s findings. This means that researchers will waste their time and energy trying to replicate the results of dubious research.
In addition, under-powered studies with statistically significant results are likely to get published due to publication bias. Peer review is a process in which articles are reviewed by people with subject matter expertise before being accepted or rejected for publication. Publication bias occurs when authors are more likely to submit studies with positive results for publication and editors of peer-reviewed journals are more likely to accept them. Another consequence of publication bias is that studies without statistically significant results end up in the researchers’ file drawers. The difficulty of getting non-significant results published is called the “file drawer problem.”
Over-powered tests also pose serious problems because they make it almost certain that the null hypothesis will be rejected even when the uncovered effect size is negligible and lacks practical importance. Practical importance is often called practical significance or clinical significance. Practical significance and statistical significance are not the same thing. Practical significance means the results offer useful guidance on policies and practices. Overpowered tests waste respondents’ and researchers’ time and the research sponsors’ resources. Consequently, over-powered tests are considered unethical. When a test with high power has a non-negligible effect size, it should not be considered overpowered.
Good researchers seek to avoid the pitfalls of both under-powered and over-powered tests. They seek to balance the risk of Type I errors, false positives, and Type II errors, false negatives. The general rule of thumb is that the probability of a Type II error should be around 20 percent (80 percent statistical power). The tolerance of a Type I error is set when the researchers select the significance level. For research in the social sciences and business, the significance level is typically set at 5 percent. This means that researchers are more tolerant of Type II errors than Type I errors because a false positive is usually considered a more serious error than a false negative. Following this one-to-four ratio, should the significance level be set at 1 percent, the tolerance of a Type II error would be about 4 percent.
When reporting the findings of a NHST, good researchers include the calculated probability of a Type I error or p-value, the achieved effect size, and statistical power.
A Priori and Post Hoc Statistical Power
The two most common types of statistical power analyses are a priori, or before the study, and post hoc, or after the study. An a priori power analysis is extremely important. It is used to determine the sample size necessary to achieve the desired level of statistical power. An a priori power analysis, therefore, should always be performed at the initial stage of the research process, before data are collected. This type of power analysis depends heavily on the researchers’ tolerance of Type I and Type II errors and the estimated effect size. Researchers estimate effect size based on their expertise and by reviewing similar studies. When the estimated effect size turns out to be higher than the achieved effect size, the test will have less statistical power than planned. When the estimated effect size is lower than the achieved effect size, the test will have more statistical power than planned.
As discussed in Chapter 12, researchers must allow for shrinkage when drawing a sample from human subjects. Shrinkage means that the sample size may shrink due to nonresponse errors. As mentioned in Chapter 3, a non-response error occurs when respondents fail to answer important questions. Researchers will use their expertise to estimate shrinkage and adjust their sample size accordingly.
The second kind of statistical power calculation is post hoc statistical power, which is also called “retrospective” or “achieved statistical power.” Advocates of post hoc power calculations argue that it is useful for studies of data that have already been collected. As noted in Chapter 3, these studies are called “retrospective studies.” The calculation of post hoc statistical power is based on the achieved effect size, the researchers’ tolerance of Type I and Type II errors, and whether the test is a left-tail, two-tail, or right-tail test.
There is a widespread and strongly held view among contemporary statisticians that post hoc statistical power analyses are not useful. Andrew Gelman, Professor of Statistics and Political Science at Columbia University, writes “post-hoc power calculation is like a shit sandwich.”¹⁹ While I do not recall Columbia University faculty using such language when I took my doctorate there, John M. Hoenig and Dennis M. Heisey offered a more restrained critique of post hoc statistical power in their article published in The American Statistician. Hoenig and Heisey write:
“There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result. Advocates of such post-experiment power calculations [post hoc statistical power] claim the calculations should be used to aid in the interpretation of the experimental results. This approach, which appears in various forms, is fundamentally flawed.”²⁰
So, while the intent of post hoc statistical power is to show that test results lacking statistical significance are the consequence of too small a sample, this type of statistical power calculation is problematic at best. Without getting into the technical details, the problem with post hoc statistical power is that it is “…completely determined by the p-value. High p-values (i.e., non-significance) will always have low power. Low p-values will always have high power. Nothing is learned from post hoc power calculations.”²¹
What Affects Statistical Power?
In addition to the p-value, other factors affect statistical power. The first is effect size. Broadly speaking, effect size is a quantitative measure of the strength of the relationship between or among the variables in the population or sample. As the effect size increases, statistical power increases. As the effect size decreases, statistical power decreases.
Figure 4: Effect Size and Statistical Power
Throughout Chapters 14 through 18, a number of effect size measures will be used.
Statistical power is affected by sample size. Large samples have more statistical power than small samples.
Figure 5: Sample Size and Statistical Power
The chosen significance level affects statistical power. The lower the significance level, the lower the statistical power.
Figure 6: Significance Level and Statistical Power
The variability of the data also affects statistical power. The more variable the data the lower the statistical power.
Figure 7: Data Variability and Statistical Power
In addition, two-tail tests have lower statistical power than one-tail tests.
How to Calculate Statistical Power
When reviewing null hypothesis significance tests, we will calculate a priori statistical power using G*Power and online power calculators like the ones found on the Statistics Kingdom website. To repeat, a priori power analyses are used to determine the sample size required to achieve the desired level of statistical power.
What is G*Power?
G*Power is a free statistical power calculator published by Heinrich Heine Universität Düsseldorf. It is widely used because it can calculate statistical power for a variety of z-tests, t-tests, F-tests, Chi-Square tests, and Exact Tests. Exact tests are a collection of tests for measures of association. Exact tests are not covered in Clear-Sighted Statistics. G*Power runs on Windows and Macintosh computers. It cannot be used with Chromebooks or tablets and other mobile devices that do not run on the Windows or Macintosh operating systems. Apple’s iPad and iPhone do not use the Macintosh or Windows operating systems, so these devices cannot run G*Power.
G*Power offers five types of power calculations:
1. A Priori Power: Sample size or n is calculated as a function of the desired level of statistical power, the significance level, and the estimated effect size.
2. Compromise Power: Both α and 1 – β are computed as a function of effect size, sample size or n, and the error probability ratio q = α/β.
3. Criterion Power: α and the associated decision criterion are computed as a function of statistical power (1 - β), the effect size, and the sample size or n.
4. Post Hoc Power: Statistical power (1 - β) is determined as a function of effect size and sample size or n.
5. Sensitivity Power: Population effect size is computed as a function of α, 1 – β, and sample size or n.
We will focus only on a priori statistical power.
There are five basic steps to calculating a priori statistical power with G*Power:
1. Choose the test family: There are five options: z tests, t tests, F tests, Chi-Square (χ²) tests, and Exact tests.
2. Select the statistical test from the drop-down menu.
3. Select the type of power analysis: a priori.
4. Enter the parameters: Depending on the test, G*Power requires the number of tails, degrees of freedom, effect size, the level of significance, which G*Power calls α err prob, and the desired level of statistical power, which G*Power calls Power (1 – β err prob).
5. Click on the calculate button.
For some tests, G*Power can calculate the effect size for you. Click on the “Determine” button just under “Input parameters.” In the window that opens enter the appropriate data and click on the button that reads “Calculate and transfer to the main window.” G*Power will enter the calculated effect size. This effect size, in fact, would be the actual effect size, not the estimated effect size. Of course, you cannot calculate actual effect size before the data are collected.
What is Statistics Kingdom?
Statistics Kingdom is a website that offers a variety of free statistical tools. Because it is a website, you can use it with Chromebooks, tablets, and smart phones as well as on devices running the Windows and Macintosh operating systems. Statistics Kingdom offers a variety of sample size calculators for a priori statistical power: https://www.statskingdom.com/sample_size_all.html.
We will explore how to use the G*Power and Statistics Kingdom statistical power calculators in subsequent chapters.
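Readers who prefer to script an a priori calculation can get comparable answers from general-purpose statistical libraries. The following is a minimal sketch using Python’s statsmodels package, assuming a two-tailed, two-independent-sample t-test, an estimated medium effect size of d = 0.5, a 5 percent significance level, and 80 percent desired power; all of these inputs are illustrative assumptions, not values taken from this chapter:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # estimated Cohen's d (an assumption)
    alpha=0.05,               # tolerance for a Type I error
    power=0.80,               # desired statistical power
    alternative='two-sided')  # two-tailed test
print(round(n_per_group))     # roughly 64 subjects per group

As with G*Power, the answer is driven entirely by the estimated effect size, the significance level, and the desired power, so a smaller estimated effect size or a stricter significance level would call for a much larger sample.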
VIII. Null Hypothesis Testing: Step-By-Step
All null hypothesis tests use six basic steps. Figure 8 shows the steps in the NHST cycle.
Figure 8: NHST Cycle
Step 1: Test Set-Up
As with any research project, we must state the problem in the form of a question. This will help us with the third step: stating the null and alternate hypotheses. Once we have articulated the research question, we spell out the research design, which is to say, we specify what procedures we will follow to answer the research question. A research design has six basic components:
1. The Research Method: What techniques will be used to collect the data: Secondary data, surveys, experiments, etc.
2. Operational Definitions: How the variables of interest will be measured.
3. The Data Collection Process: The way the data will be collected: Interviews, surveys, experiments, etc.
4. Sampling Methods: What sampling methods will be employed.
5. Determining Sample Size: How large a sample is needed to achieve sufficient statistical power? The sample size needs to be large enough to uncover an effect should one actually exist. An a priori statistical power analysis should be conducted before the data are collected. As previously stated, this power analysis depends on the estimated effect size (ES). In the social sciences, most effect sizes are small. Clear-Sighted Statistics will review the most commonly used effect size measures.
6. Collect data.
At this stage, we should select the test statistic. The term test statistic has a dual meaning: it refers both to the formula for the test and to the number that results from applying that formula to the data. In the remaining chapters, we will review the following test statistics:
A. One-Sample Tests (Parametric Tests)
1. z-test for the mean
2. t-test for the mean
3. z-test for the proportion
B. Two-Sample Tests (Parametric Tests)
1. z-test for two independent means
2. z-test for two independent proportions
3. F-test for two independent variances
4. Pooled variance t-test for two independent means
5. Unequal variance t-test for two independent means
6. t-test for two dependent samples
C. One-Way ANOVA test for two or more independent means (Parametric Tests)
D. Chi-Square Tests (Nonparametric Tests)
1. Goodness-of-fit test
2. Contingency table test
3. Test for Normality
E. Tests for Correlation and Regression (Parametric Tests)
Step 2: Select the Level of Significance, α
We select the level of significance, symbolized by the lower-case Greek letter alpha, α. Be careful not to confuse the significance level with the p-value. Remember: the significance level is the researchers’ tolerance of a Type I error, while the p-value is the calculated probability of committing a Type I error given the value of the test statistic. The significance level is used to determine whether the value of the test statistic is statistically significant, which is to say, unlikely to be the result of random sampling error. The null hypothesis is rejected whenever the results are statistically significant. Results are considered statistically significant when the p-value is equal to or less than the significance level.
It is important to distinguish between statistical significance and practical significance. Practical or clinical significance refers to whether the test findings have important implications for policymakers. We say something has practical significance when its application affects current practices or policies. Having statistical significance does not mean that you will have practical significance, and a test can lack statistical significance yet still point to something of practical importance. As sample sizes increase, the hypothesis test has more power to uncover an effect. Large samples can uncover negligible effects that are statistically significant, leading to rejection of the null hypothesis. The effect uncovered, however, may be so small that it has no practical significance. In these cases, the test may be over-powered.
Closely aligned with the significance level is the critical value. The critical value is the value or values of the test statistic that separates the rejection region of the probability distribution—the area where the null hypothesis is rejected—from the region where the null hypothesis is not rejected. Critical values can be z-values, t-values, F-values, or chi-square values.
Step 3: Write the Null Hypothesis (H0) and Alternate Hypothesis (H1)
As previously stated, with parametric tests the null and alternate hypotheses are always about population parameters. With the chi-square nonparametric tests covered in Chapter 17, the null and alternate hypotheses are about the observed and expected frequencies. The null and alternate hypotheses are mutually exclusive and collectively exhaustive. The null hypothesis is often considered a straw man that the researcher seeks to reject or nullify. It states that there is no statistically significant difference or effect. The alternate hypothesis is sometimes called the research hypothesis. It states that there is a statistically significant difference between the sample statistic and the population parameter, between the samples, or between the observed and expected frequencies. This means that the test results are greater than what we would expect from random sampling error.
With parametric tests, the null and alternate hypotheses refer to population parameters, not sample statistics. They are written using population symbols, not sample symbols: μ, π, σ², and ρ, not X-bar, p, s², or r. Remember: population parameters are symbolized with Greek letters while sample statistics use Latin letters.
Tests using z-values and t-values are directional. This means that the rejection region can be placed on the left or lower tail, on both tails, or on the right or upper tail. You can tell the direction of the test by looking at the sign in the alternate hypothesis. A “less than” sign, <, signals a left-tail test; a “not equal” sign, ≠, marks a two-tailed test; and a “greater than” sign, >, indicates a right-tail test. The rejection regions for a two-tailed test are marked by dividing the level of significance into two equal parts, with half placed on the right tail and half on the left tail. One-tailed tests have only one rejection region, which is marked by placing the entire significance level in the appropriate tail.
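For example, if the hypothesized value of the population mean were 50 (a purely illustrative number), the three forms would be written as follows:

Left-tail test: H0: μ ≥ 50; H1: μ < 50
Two-tail test: H0: μ = 50; H1: μ ≠ 50
Right-tail test: H0: μ ≤ 50; H1: μ > 50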
Drawing these curves will help you visualize the difference between left-tail, two-tail, and right-tail tests and the location of the rejection regions. Table 2 shows examples of null and alternate hypotheses and the curves for left-tail, two-tail, and right-tail z or t tests. The shaded areas of the curves are the rejection regions. The critical value or values are the z-values or t-values at the lines that separate the rejection region from the rest of the curve. Two-tail tests have two critical values, one positive and one negative. Left-tail tests have one critical value, which is negative. Right-tail tests have one positive critical value.
Table 2: Syntax for Left-Tailed, Two-Tailed, and Right-Tailed Tests
F-Tests and chi-square tests are right-tailed tests.
Step 4: Write the Decision Rule
Decision rules state the criterion for rejecting the null hypothesis, which is based on the critical value or values for the test statistic. The critical value is determined using the appropriate critical value tables for z, t, F, and chi-square or by Microsoft Excel.
All decision rules follow a simple structure: “Reject the null hypothesis if [name of the test statistic] is ‘less than,’ ‘greater than,’ or ‘less than or greater than’ the critical value or values.” Decision rules should not be longer than a single sentence. Critical values may vary slightly depending upon whether you find them using a printed table or Microsoft Excel. Table 3 shows the difference in z-values for left, right, and two-tailed tests using paper tables and Excel. The values used by Excel are in parentheses:
Table 3: Critical Values for z for left, right, and two-tailed tests
The values found using Excel are more precise than those found on paper critical values tables.
Figure 9 shows the decision rules for left-tailed and right-tailed tests for z and t distributions using a 5 percent level of significance. For the t-test, there are 20 degrees of freedom, found by the sample size, n, minus the number of independent samples. Writing the decision rule and drawing a curve showing the rejection region, or regions, by hand will help you visualize when to reject the null hypothesis.
Figure 9: Decision Rules for One-Tailed z and t tests using a 5% α
Figure 10 shows the decision rules for two-tailed tests for z and t distributions using a 5 percent level of significance.
Figure 10: Decision Rules for Two-Tailed z and t tests using a 5% α
Please note: The critical values for t-tests are more extreme than the critical values for z-tests. This makes it more difficult to reject the null hypothesis with t-tests, and it means that t-tests have less statistical power than z-tests. In addition, two-tailed tests have more extreme critical values than one-tailed tests. This makes it harder to reject the null hypothesis with two-tailed tests, which therefore have lower statistical power than one-tailed tests.
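These critical values can also be computed with a statistical library rather than a printed table or Excel. The following is a minimal sketch using Python’s scipy package; the 5 percent significance level and the 20 degrees of freedom match the illustrations in Figures 9 and 10, and the values in the comments are rounded:

from scipy.stats import norm, t

alpha = 0.05
z_one_tail = norm.ppf(1 - alpha)          # about 1.645; one-tailed z critical value
t_one_tail = t.ppf(1 - alpha, df=20)      # about 1.725; one-tailed t critical value, 20 df
z_two_tail = norm.ppf(1 - alpha / 2)      # about 1.960; two-tailed z critical value
t_two_tail = t.ppf(1 - alpha / 2, df=20)  # about 2.086; two-tailed t critical value, 20 df
print(z_one_tail, t_one_tail, z_two_tail, t_two_tail)

Note that the t critical values are more extreme than the corresponding z critical values, and the two-tailed critical values are more extreme than the one-tailed ones, exactly as described above.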
Many researchers no longer use critical values for the selected test statistic. They use p-values instead. When using the p-value, the decision rule is: Reject the null hypothesis if the p-value is less than or equal to the significance level. The value of the significance level should be identified. For example, reject the null hypothesis if the p-value is less than or equal to 0.05.
Step 5: Calculate the value of the test statistic, p-value, and Effect Size
Each null hypothesis significance test has its own test statistic, or formula. For most tests, the test statistic is a complex fraction; that is, a fraction whose numerator or denominator itself contains a fraction. The numerator measures sampling error: the sample statistic minus the population parameter, which for a one-sample test is X-bar – μ or p – π. The denominator is the standard error. The standard error of the mean for a one-sample test is written as:
Equation 2: Standard Error of the Mean (SEM)
SEM = σ/√n (or s/√n when σ is unknown)
The standard error of the proportion for a one-sample test is written as:
Equation 3: Standard Error of the Proportion (SEP)
SEP = √(π(1 – π)/n)
Once we have calculated the value of the test statistic, we find the p-value using Microsoft Excel. In some cases, we may be able to find or estimate the p-value using the critical values table. To repeat, the p-value, or probability value, is the calculated probability of committing a Type I error. The p-value is a slippery concept that many people get wrong. It tells you the probability of getting a value for the test statistic that is as extreme, or more extreme, than the one you just calculated. It is a measure of how compatible your result is with the null hypothesis.
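To make Step 5 concrete, here is a minimal sketch of a one-sample z-test in Python. All of the numbers are illustrative assumptions, not data from this chapter: a sample mean of 52 from n = 40 observations, a hypothesized population mean of 50, and a known population standard deviation of 8:

from math import sqrt
from scipy.stats import norm

x_bar, mu_0, sigma, n = 52, 50, 8, 40
sem = sigma / sqrt(n)                    # standard error of the mean (Equation 2)
z = (x_bar - mu_0) / sem                 # sampling error divided by the standard error
p_value = 2 * (1 - norm.cdf(abs(z)))     # two-tailed p-value
print(round(z, 3), round(p_value, 3))    # about 1.581 and 0.114

Because this illustrative p-value of about 0.114 is greater than a 5 percent significance level, we would fail to reject the null hypothesis.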
At this stage the achieved Effect Size should also be calculated.
Step 6: Make a decision regarding the H0, and Report Results
After we calculate the value of the test statistic and find the p-value, we make a decision regarding the null hypothesis and report the results of the test. It is never sufficient to merely state that we reject or fail to reject the null hypothesis. We must state what the decision regarding the null hypothesis means in the context of the research question.
We can make this decision on the basis of the decision rule. It is strongly recommended, however, that the decision to reject or fail to reject the null hypothesis be based on the p-value because it tells us the probability of getting a test statistic as extreme, or more extreme, than the one we found. To repeat yet again, the p-value is the calculated probability of committing a Type I error.
Here is how we interpret p-values: When the p-value is greater than the level of significance, we fail to reject the null hypothesis and the higher the p-value the more confident we are in this decision. For example, a p-value of 0.051 or 0.50 would lead us to not reject the null hypothesis if the significance level was set at 0.05. Yet, we would be far more confident in this decision if the p-value were 0.50 than when it is only 0.051, or barely above the significance level.
When the p-value is less than or equal to the significance level, we reject the null hypothesis. The smaller the p-value, the more confident we are in our decision to reject the null hypothesis. At a 0.05 significance level, we reject the null hypothesis when the p-value is 0.05 or lower. We would not, however, have a high level of confidence in our decision to reject the null hypothesis when the p-value is exactly 0.05, but with a p-value of less than 0.001 we would be very confident that our findings are statistically significant. Please note: We report tiny p-values like 0.000000006 as less than 0.001.
Figure 11 shows how to use the p-value to decide whether to reject the null hypothesis.
Figure 11: How to Interpret p-values
Over the years there has been a lot of confusion about what p-values are and how to use them. In March 2016, the American Statistical Association (ASA) addressed this confusion in a statement on p-values and statistical significance.²² The ASA listed six principles on the use of p-values:
1. P-values can indicate how incompatible the [sample] data are with a specified statistical model [or the null hypothesis].
2. P-values do not measure the probability that the studied [null] hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based solely on whether a p-value passes a specific threshold [significance level].
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. [A p-value cannot tell you whether your results have practical significance.]
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Jessica Utts, who served as the ASA’s president in 2016, concluded the ASA’s statement by writing:
The contents of the ASA statement and the reasoning behind it are not new—statisticians and other scientists have been writing on the topic for decades. But this is the first time that the community of statisticians, as represented by the ASA Board of Directors, has issued a statement to address these issues.²³
We should consider p-values as a measure of how surprising our test statistic is. But remember: While we reject the null hypothesis when the p-value is less than or equal to the significance level, a low p-value does not tell us:
1. The alternate hypothesis is true, or
2. Whether the test results have any practical significance.
IX. Summary
Let’s review: You may consider the NHST process a cycle. It may require repeated testing as hypotheses are continuously refined through the process of falsification. It is through this process that a hypothesis may eventually become a theory. The six steps are:
- Test set-up
- Select the level of significance
- State the Null and Alternate Hypotheses
- Compose the decision rule
- Calculate the test statistic, effect size, and p-value
- Decide on whether or not to reject a null hypothesis and report the results
We will use this six-step process whenever we conduct a NHST. Throughout Clear-Sighted Statistics, we will assume that the first step, test set-up—setting up the research design—has been properly conducted. We will, however, conduct an a priori statistical power analysis. The issue of practical significance will be discussed when appropriate.
NHST is a process that uses probability and sample statistics to determine whether the difference between the statistic and the parameter, the difference between or among samples, or the difference between the expected and observed frequencies is the result of random sampling error. With the significance level, the researcher sets the long-run tolerance for mistakenly rejecting the null hypothesis. Only the null hypothesis is subject to falsification. With NHST, we do not test the alternate hypothesis.
When the p-value is less than or equal to the significance level, the null hypothesis is rejected and the results are considered statistically significant. To repeat, the p-value indicates the probability of making a Type I error, given the test results. The smaller the p-value, the more likely it is that the results are statistically significant and not due to random sampling error.
When we fail to reject the null hypothesis, we do not conclude that it is true. We also must be aware that we could be committing a Type II error. The goal of good hypothesis testing is to have sufficient statistical power. Generally speaking, we aim for at least 80 percent statistical power, or the probability of a Type II error of 20 percent or less when the significance level is set at 5 percent. Type I errors are considered a more serious flaw than Type II errors, which is why researchers set the significance level, or tolerance for a Type I error, at a lower level than their tolerance for a Type II error.
X. What is Next
We have reviewed the essential features of NHST, defined key terms, and outlined the steps needed to conduct such tests. In Chapter 14, we will cover one-sample tests of hypothesis using normal and t-distributions. In Chapter 15, we will explore two-sample tests of hypothesis using normal and t-distributions. We will distinguish between independent and dependent, or conditional, samples. We will also introduce F-distributions and the F-test for equality of variance between two independent samples. In Chapter 16, the One-Way ANOVA test will be introduced. This test, which uses the F-distribution, allows you to simultaneously compare two or more population means. In Chapter 17, chi-square tests, the only nonparametric tests we will cover, will be reviewed. Unlike parametric tests, nonparametric tests make no assumptions about the parameters of the population under investigation. Chapter 18 covers linear correlation and regression. We will examine a variety of hypothesis tests used with linear correlation and regression. Figure 12 shows the types of null hypothesis significance tests we will cover in Clear-Sighted Statistics.
Figure 12: Types of NHST
XI. Exercises
Answers to the following questions can be found by carefully reading this chapter.
Exercise 1: What are the meanings of the terms null hypothesis (H0) and alternate hypothesis (H1)?
Exercise 2: What is the statistical significance level, α?
Exercise 3: What does the term “statistically significant” mean? How is it different from practical or clinical significance?
Exercise 4: What is the difference between Type I (α) and Type II (β) errors and what factors affect their probability?
Exercise 5: Which errors are considered more serious? Type I or Type II errors?
Exercise 6: What is Effect Size (ES)?
Exercise 7: How are Type II errors and Statistical Power related and why are low powered and overly powered tests a problem?
Exercise 8: What is the difference between a priori and post hoc statistical power? Why is a priori statistical power considered far more important than post hoc statistical power? What tools are available to calculate a priori and post hoc statistical power?
Exercise 9: What are p-values? What is the difference between the significance level and a p-value?
Exercise 10: P-values are widely misused. What are the American Statistical Association’s key guidelines on p-values?
Except where otherwise noted, Clear-Sighted Statistics is licensed under a
Creative Commons License. You are free to share derivatives of this work for
non-commercial purposes only. Please attribute this work to Edward Volchok.
Endnotes
¹ Karl R. Popper, The Logic of Scientific Discovery, (Mansfield Centre, CT: Martino Publishing, 2014), pp. 40-41.
² “Dressed to Kill,” Universal Pictures, 1946, 15:42-15:48. This movie is available on Amazon Prime.
³ Ronald A. Fisher, Statistical Methods for Research Workers. (London: Oliver & Boyd, 1925). The 14th and final edition was published in 1970, 8 years after Fisher’s death.
⁴ Jerzy Neyman and Egon S. Pearson, “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I.” Biometrika, Volume 20A, 1928, pp. 175–240.
⁵ E. L. Lehmann, “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association, Volume 88, No. 424, December 1993, p. 1248. Steven N. Goodman, “P Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate.” American Journal of Epidemiology, Vol. 137, No. 5. 1993, pp. 485-496.
⁶ Geoff Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. (New York: Routledge, 2012), p. 25.
⁷ Genesis, 2:7.
⁸ Galton made many contributions in a wide variety of areas. He created the concept of regression toward the mean. He was among the first to use the standard deviation. He promoted the use of fingerprints to identify individuals. He devised the first weather map. He was, like Karl Pearson, an early advocate of eugenics, the idea that the human race can be improved by getting more fit people to have more children and unfit people to have fewer children. Eugenics was espoused by many people across the political spectrum in the late nineteenth and early twentieth centuries. The eugenics movement inspired the forced sterilization programs in Nazi Germany, and these programs led to the demise of the movement.
⁹ “The Scope of Biometrika,” Biometrika, Volume 1, No. 1, October 1901, pp. 1-2.
¹⁰ “The Spirit of Biometrika,” Biometrika, Volume 1, No. 1, October 1901, p. 3.
¹¹ Thomas S. Kuhn, The Structure of Scientific Revolutions, (Chicago: University of Chicago Press, 2012).
¹² Tim M. Berra, “Charles Darwin’s Paradigm Shift,” The Beagle, Records of the Museums and Art Galleries of the Northern Territory, Volume 5, 2008, pp. 1-5.
¹³ Thomas S. Kuhn, The Structure of Scientific Revolutions. (Chicago: University of Chicago Press, 2012).
¹⁴ John W. Tukey, Exploratory Data Analysis, (Reading, MA: Addison-Wesley, 1977), pp. 1-3, 21.
¹⁵ http://www.nycourts.gov/judges/cji/5-SampleCharges/SampleCharges.shtml. The phrase “beyond a reasonable doubt” is a “term of art” that is difficult to define. In New York State, a statement like this is read to juries before they decide to convict or acquit a defendant:
We now turn to the fundamental principles of our law that apply in all criminal trials–the presumption of innocence, the burden of proof, and the requirement of proof beyond a reasonable doubt.
Throughout these proceedings, the defendant is presumed to be innocent. As a result, you must find the defendant not guilty, unless, on the evidence presented at this trial, you conclude that the People [who are represented by the prosecutor] have proven the defendant guilty beyond a reasonable doubt.
What does our law mean when it requires proof of guilt “beyond a reasonable doubt”?
The law uses the term, “proof beyond a reasonable doubt,” to tell you how convincing the evidence of guilt must be to permit a verdict of guilty. The law recognizes that, in dealing with human affairs, there are very few things in this world that we know with absolute certainty. Therefore, the law does not require the People to prove a defendant guilty beyond all possible doubt. On the other hand, it is not sufficient to prove that the defendant is probably guilty. In a criminal case, the proof of guilt must be stronger than that. It must be beyond a reasonable doubt.
A reasonable doubt is an honest doubt of the defendant’s guilt for which a reason exists based upon the nature and quality of the evidence. It is an actual doubt, not an imaginary doubt. It is a doubt that a reasonable person, acting in a matter of this importance, would be likely to entertain because of the evidence that was presented or because of the lack of convincing evidence.
Proof of guilt beyond a reasonable doubt is proof that leaves you so firmly convinced of the defendant’s guilt that you…
¹⁶ Ronald Aylmer Fisher, The Design of Experiments, (Edinburgh, UK: Oliver and Boyd, 1935), p. 19. https://archive.org/details/in.ernet.dli.2015.502684/page/n31.
¹⁷ Søren Kierkegaard, Works of Love: Some Christian Reflections in the Form of Discourses. Translated by Howard V. Hong and Edna H. Hong, (New York: Harper Torchbooks, 1962), p. 23.
¹⁸ John P. A. Ioannidis, “Why Most Published Research Findings are False,” PLoS Medicine, August 2005, pp. 696-701. https://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.0020124&type=printable
¹⁹ Andrew Gelman, “How post-hoc power calculation is like a shit sandwich.” Statistical Modeling, Causal Inference, and Social Science. January 13, 2019. https://statmodeling.stat.columbia.edu/2019/01/13/post-hoc-power-calculation-like-shit-sandwich/.
²⁰ John M. Hoenig and Dennis M. Heisey, “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.” The American Statistician, Volume 55, No. 1, February 2001.
²¹ “Post Hoc Power Calculations are Not Useful,” University of Virginia Library, Research Data Services + Sciences. https://library.virginia.edu/post-hoc-power-calculations-are-not-useful/.
²² “American Statistical Association Releases Statement on Statistical Significance and P-Values,” March 7, 2016. https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf.
²³ “American Statistical Association Releases Statement on Statistical Significance and P-Values,” March 7, 2016. https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf.