Reproducible Research Using R

7 Correlation

7.1 Introduction

In everyday communication, the word correlation gets thrown around a lot, but what does it actually mean? In this lesson, we will explore how to use correlations to better understand how our variables interact with each other.

Today we’ll be trying to answer the question:

How do anxiety scores and studying time influence exam scores?

We’ll use a dataset called Exam_Data.xlsx, which includes students’ exam scores, anxiety levels just before the test, and total hours spent studying.

7.1.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Interpret the direction and strength of correlations.
  • Compute correlations in R using cor() and cor.test().
  • Explain and calculate R² as variance explained.
  • Run and interpret partial and point-biserial correlations.
  • Visualize relationships using scatterplots and correlation matrices.

7.2 Loading Our Data

We start by importing the necessary packages, reading in our Excel file, and quickly getting an overview of the data. For a quick review on loading data into R, refer back to Section 1.7.

library(readxl)
library(tidyverse)

examData <- read_xlsx("Exam_Data.xlsx")

library(skimr)

skim(examData)
Table 7.1: Data summary

Name                     examData
Number of rows           100
Number of columns        5
_______________________
Column type frequency:
  character              2
  numeric                3
________________________
Group variables          None

Variable type: character

skim_variable                    n_missing complete_rate min max empty n_unique whitespace
Student Name                             0             1   3   9     0       99          0
First Generation College Student         0             1   2   3     0        2          0

Variable type: numeric

skim_variable  n_missing complete_rate  mean    sd   p0   p25   p50   p75   p100 hist
Studying Hours         0             1 20.15 18.29 0.00  8.00 15.00 24.25  98.00 ▇▃▁▁▁
Exam Score             0             1 57.38 25.72 5.00 40.00 61.50 80.00 100.00 ▃▃▅▇▅
Anxiety Score          0             1 74.05 17.33 0.06 69.37 78.24 84.69  97.58 ▁▁▁▆▇
colnames(examData)
#> [1] "Student Name"                    
#> [2] "Studying Hours"                  
#> [3] "Exam Score"                      
#> [4] "Anxiety Score"                   
#> [5] "First Generation College Student"

Our data has 5 columns:

  1. Student Name: The name of the student
  2. Studying Hours: The number of hours they studied for the exam
  3. Exam Score: The score they earned on their exam
  4. Anxiety Score: The measure of their anxiety levels before the exam
  5. First Generation College Student: If the student is or is not a first generation college student (like me)

While we have all of the data needed to begin answering our question, there is one flaw: there are spaces in our column headers.

7.3 Cleaning our data

Not only does R love it when our column headers do not have spaces, it is also best practice for making our analyses reproducible.

We could use the rename() command from the dplyr package like in Section 2.4.7, but since more than one column needs fixing, we can streamline the process. Using the clean_names() function from the janitor package, we can easily remove the spaces from our column names.

library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test
colnames(examData) # Before we clean the data
#> [1] "Student Name"                    
#> [2] "Studying Hours"                  
#> [3] "Exam Score"                      
#> [4] "Anxiety Score"                   
#> [5] "First Generation College Student"

examData <- clean_names(examData)

colnames(examData) # After we clean the data
#> [1] "student_name"                    
#> [2] "studying_hours"                  
#> [3] "exam_score"                      
#> [4] "anxiety_score"                   
#> [5] "first_generation_college_student"

Looking at the before and after, we see that the clean_names() function replaced every space in our column headers with an underscore. Notice how it also converted all of the letters to lowercase (also best practice for reproducibility). Now we can begin our journey.

7.4 Visualizing Relationships

Before we start running any statistical analyses, we always want to begin by graphing our data, which gives us a visual sense of the story it is telling. With correlations, we want to focus on scatterplots. As a reminder, a scatterplot needs two numerical variables.

Since we are trying to better understand exam data, let’s create three scatterplots: one showing the relationship between studying and exam scores, one between anxiety and exam scores, and one between studying and anxiety. If you haven’t already, take some time to review scatterplots in Section 3.4.2.

p1 <- ggplot(examData, aes(x = studying_hours, y = exam_score)) +
  geom_point(aes(color = first_generation_college_student), size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Studying Time vs Exam Performance",
    subtitle = "Do students who study more score higher?",
    x = "Studying Time (hours)", y = "Exam Score (%)"
  )

p1
#> `geom_smooth()` using formula = 'y ~ x'

Figure 7.1: Scatterplot showing the relationship between studying time and exam performance, with points colored by if they’re a first generation college student and a fitted linear trend line. The upward trend suggests a positive linear association, indicating that students who spend more time studying tend to achieve higher exam scores. This visualization motivates the use of a correlation coefficient to quantify the strength of the relationship.

p2 <- ggplot(examData, aes(x = anxiety_score, y = exam_score)) +
  geom_point(aes(color = first_generation_college_student), size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Exam Anxiety vs Exam Performance",
    subtitle = "Does anxiety relate to exam performance?",
    x = "Exam Anxiety (0–100)", y = "Exam Score (%)"
  )

p2
#> `geom_smooth()` using formula = 'y ~ x'

Figure 7.2: Scatterplot illustrating the relationship between exam anxiety and exam performance, with a fitted linear trend line. The downward trend indicates a negative linear association, suggesting that higher anxiety levels are associated with lower exam scores. This visualization supports the use of correlation to formally assess the direction and strength of the relationship.

p3 <- ggplot(examData, aes(x = studying_hours, y = anxiety_score)) +
  geom_point(aes(color = first_generation_college_student), size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Studying vs Exam Anxiety",
    subtitle = "Do students who study more have more anxiety?",
    x = "Studying Time (hours)", y = "Exam Anxiety (0–100)"
  )

p3
#> `geom_smooth()` using formula = 'y ~ x'

Figure 7.3: Scatterplot displaying the relationship between studying time and exam anxiety, with points colored by if they’re a first generation college student and a fitted linear trend line. The negative linear pattern suggests that increased study time is associated with lower anxiety levels, providing visual evidence of a linear relationship prior to computing a correlation coefficient.

Intuitively, these graphs make sense (which is what we want to see). From a visual perspective:

  1. Graph 1 indicates that as someone studies more for the exam, they do better on the exam (a Christmas miracle!).
  2. Graph 2 indicates that the more anxiety someone has, the worse they’ll perform on the exam.
  3. Graph 3 indicates that the more time you spend studying for an exam, the less anxiety you will have before the exam.

One of the main reasons why we visualize our data is to see if it is linear. We are only going to be talking about linear relationships in this chapter.

Note: If your scatterplot ever shows a curve, use a non-parametric alternative like Spearman’s rank correlation instead.
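
Spearman's rank correlation works on the ranked values, so it only assumes a monotonic (consistently increasing or decreasing) relationship rather than a strictly linear one. As a minimal sketch, it can be requested through the method argument of cor() and cor.test(), using the same examData columns as above:

# Spearman's rank correlation: swap the default Pearson method for "spearman"
cor(examData$studying_hours, examData$exam_score, method = "spearman")

# cor.test() accepts the same argument and also returns a p-value
cor.test(examData$studying_hours, examData$exam_score, method = "spearman")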

For today, we are focusing on studying time, anxiety scores, and exam scores. If we did not want to manually create 3 different scatterplots, we could utilize the pairs command.

pairs(examData[, c("studying_hours", "exam_score", "anxiety_score")])

Figure 7.4: Scatterplot matrix displaying pairwise relationships among studying time, exam performance, and exam anxiety. Each panel shows the relationship between two variables, allowing for simultaneous assessment of direction, strength, and linearity prior to formal correlation analysis.

This creates a scatterplot for every pair of columns we specify. It may take a while to get used to reading this, but the panel where two variable names would intersect is the scatterplot of those two variables. For example, the exam vs studying scatterplot is the top middle panel.

We can get even fancier and use the ggpairs command from the GGally package.

library(GGally)

ggpairs(examData[, c("studying_hours", "exam_score", "anxiety_score")])

Figure 7.5: Enhanced scatterplot matrix displaying pairwise relationships among studying time, exam performance, and exam anxiety. The diagonal panels show variable distributions, while off-diagonal panels display scatterplots and correlation coefficients. This visualization allows for simultaneous assessment of linearity, direction, strength of association, and distributional properties prior to formal correlation analysis.

This not only creates the three scatterplots, but also draws a density plot (a smoothed, line-style histogram) for each variable along the diagonal and, spoiler, the correlation coefficient for every pair!

7.5 Running Correlations (r)

Since the cat is out of the bag, it is now time for us to officially run some correlations.

The end result of running a correlation is to get the correlation coefficient (r), which is a number between -1 and 1. There are two things we look for in a correlation coefficient:

  • Direction: this is denoted by whether the coefficient is negative or positive. A negative number does not mean the relationship is bad, and a positive number does not mean it is good. It simply tells us that if the coefficient is:
    • Negative: when one variable increases, the other decreases (Anxiety vs Exam)
    • Positive: when one variable increases, the other also increases (Studying vs Exam)
  • Strength: this is how close to -1 or 1 it is. The closer to -1 or 1, the stronger the correlation between the two variables. The inverse is also true: the closer to 0 the number is, the weaker the correlation is.

By definition, the correlation coefficient measures the strength and direction of a linear relationship between two variables.
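
Concretely, Pearson's r is the covariance of the two variables rescaled by both of their standard deviations, which is what pins it between -1 and 1. A minimal sketch verifying this by hand with our examData columns:

# r is covariance scaled by the two standard deviations
x <- examData$studying_hours
y <- examData$exam_score

cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)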

We can use the following general guidelines for interpreting correlation strength. However, different disciplines may have different guidelines.

Absolute value of r    Strength of relationship
r < 0.25               No relationship
0.25 < r < 0.5         Weak relationship
0.5 < r < 0.75         Moderate relationship
r > 0.75               Strong relationship
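
If you want R to apply these cutoffs for you, a small helper built on base R's cut() will do it. This is just a convenience sketch: the function name interpret_r and its labels are our own, not from any package.

# Hypothetical helper mapping |r| onto the guideline labels above
interpret_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.25, 0.5, 0.75, 1),
      labels = c("none", "weak", "moderate", "strong"),
      include.lowest = TRUE)
}

interpret_r(c(0.39, -0.44, -0.71))  # weak, weak, moderate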

Now that we understand what the correlation coefficient (r) represents, let’s calculate it! We can use the cor command for this.

cor(examData$studying_hours, examData$exam_score)     # Study time vs performance
#> [1] 0.3900419

cor(examData$anxiety_score, examData$exam_score)    # Anxiety vs performance
#> [1] -0.4449023

cor(examData$studying_hours, examData$anxiety_score)  # Study time vs anxiety
#> [1] -0.7122237

Success! We have three different correlation coefficients. Let us look at all three:

  1. A positive, weak correlation (studying vs exam, r = .39)
  2. A negative, weak correlation (anxiety vs exam, r = −.44)
  3. A negative, moderate correlation, bordering on strong (studying vs anxiety, r = −.71)

Now, we can also utilize the cor.test command, which not only will give us the correlation coefficient, but also the p-value, so we can identify if the correlation is statistically significant or not.

cor.test(examData$studying_hours, examData$exam_score)     # Study time vs performance
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  examData$studying_hours and examData$exam_score
#> t = 4.1933, df = 98, p-value = 6.034e-05
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.2096883 0.5447277
#> sample estimates:
#>       cor 
#> 0.3900419

cor.test(examData$anxiety_score, examData$exam_score)    # Anxiety vs performance
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  examData$anxiety_score and examData$exam_score
#> t = -4.9178, df = 98, p-value = 3.524e-06
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.5897813 -0.2722777
#> sample estimates:
#>        cor 
#> -0.4449023

cor.test(examData$studying_hours, examData$anxiety_score)  # Study time vs anxiety
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  examData$studying_hours and examData$anxiety_score
#> t = -10.044, df = 98, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.7971286 -0.5996997
#> sample estimates:
#>        cor 
#> -0.7122237

Turns out all three are statistically significant! There's something just as important as the r value and the p value: the method used to calculate it. By default, cor and cor.test use Pearson's method. This is why we visualized our data first: Pearson's method is only appropriate when the relationship is linear (otherwise, switch to Spearman's method, as noted in Section 7.4).

Question: What if we do not want to run all the correlations one by one…

7.6 Correlation Matrix

Right now, we only have three variables we want to run correlations with. But what if we had 50? Are we going to run 50 lines of code? No! We can instead run a correlation matrix, which computes the correlation between every pair of numeric columns. The key here is that the data you pass in must contain only numeric columns, so make sure you clean your data first. Again, as before, we can utilize the cor command.

# Selecting numeric variables only (if dataset contains non-numeric columns)

examData_numeric <- examData %>% select(where(is.numeric))

cor(examData_numeric)
#>                studying_hours exam_score anxiety_score
#> studying_hours      1.0000000  0.3900419    -0.7122237
#> exam_score          0.3900419  1.0000000    -0.4449023
#> anxiety_score      -0.7122237 -0.4449023     1.0000000

# Hint: the use = "pairwise.complete.obs" argument handles missing values safely
corr_matrix <- cor(examData_numeric, use = "pairwise.complete.obs")

corr_matrix
#>                studying_hours exam_score anxiety_score
#> studying_hours      1.0000000  0.3900419    -0.7122237
#> exam_score          0.3900419  1.0000000    -0.4449023
#> anxiety_score      -0.7122237 -0.4449023     1.0000000

Boom! We were able to calculate the correlation coefficients in one line of code for all three variables. We are always aiming to get the most done with as little code as possible.
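
With many variables, a picture can be easier to scan than a wall of numbers. The GGally package we loaded earlier also offers ggcorr(), which draws the correlation matrix as a color-coded heatmap; a minimal sketch:

# Correlation heatmap; label = TRUE prints r in each cell
ggcorr(examData_numeric, label = TRUE)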

7.7 Coefficient of Determination (R²)

Once you have a correlation coefficient, what's next? Well, with the correlation coefficient, we can then calculate R², otherwise known as the coefficient of determination, which measures the proportion of variance in one variable that is explained by the other. In simple correlations, R² is just r², the square of the correlation coefficient.

R² tells us how much of the variance in Y is explained by X. We can calculate it with a combination of the cor command and base R.

# R^2 tells us the percentage of variance shared between two variables.

# Calculating the R^2 value
cor(examData$anxiety_score, examData$exam_score)^2
#> [1] 0.197938

# Making it look pretty
round(cor(examData$anxiety_score, examData$exam_score)^2*100,2)
#> [1] 19.79

# What about the others?
cor(examData$studying_hours, examData$exam_score)^2*100
#> [1] 15.21327

cor(examData$studying_hours, examData$anxiety_score)^2*100
#> [1] 50.72625

The above values are saying:

  1. 19.79% of all the variation in exam scores is associated with anxiety.
  2. 15.21% of all the variation in exam scores is associated with studying time.
  3. 50.73% of all the variation in anxiety scores is associated with studying time.

Number three seems particularly strong. There may be more to investigate here.

7.8 Partial Correlations

When we were looking at our R² values, we saw that a decent percentage of the variation in exam scores was associated with both anxiety scores and studying time. We also saw that half of the variation in anxiety scores was shared with studying time. So, maybe, there is some overlap in what anxiety and studying time each tell us about exam scores.

To account for this, we can utilize the ppcor library and run a partial correlation using the pcor.test command.

Hint: when running a partial correlation using pcor.test, the last variable in the command is the one being controlled for.

library(ppcor)

# Partial correlation between Anxiety and Exam controlling for Studying
pcor.test(examData$anxiety_score, examData$exam_score, examData$studying_hours)
#>     estimate    p.value statistic   n gp  Method
#> 1 -0.2585344 0.00977182 -2.635883 100  1 pearson

# Uno reverse: controlling for Anxiety instead of Studying
pcor.test(examData$studying_hours, examData$exam_score, examData$anxiety_score)
#>    estimate   p.value statistic   n gp  Method
#> 1 0.1163946 0.2512544  1.154199 100  1 pearson

After controlling for studying time, we now see that:

  1. The correlation coefficient between anxiety scores and exam scores is still negative, but it is cut by nearly half (from −.44 to −.26)!
  2. The correlation coefficient between studying time and exam scores is still positive, but it is now much weaker and no longer statistically significant (from .39 to .12, p = .25).

This confirms our suspicion that the relationship between anxiety and exam performance partly overlaps with study time. Once we control for study time, the unique relationship between anxiety and exam scores becomes weaker — showing that part of the correlation was actually due to study habits.

Logically, this makes sense: the more you study, the less anxiety you will have! So maybe, just maybe, you’ll consider studying for your next exam.

7.9 Biserial and Point-Biserial Correlations

Now, what if you have dichotomous variables? T/F, Yes/No, M/F, etc.? How can we run a correlation on them if they are not numeric? Great question! We can recode them as numeric 0's and 1's and then run the correlation.

# Point-biserial: one variable is a true dichotomy (you're either dead or alive)
# Biserial: the dichotomy is cut from an underlying continuum
# (failed by 1 pt vs failed by 42 pts vs passed by 4 pts, collapsed to fail/pass)
examData$college_binary <- ifelse(examData$first_generation_college_student=="No",0,1)

cor.test(examData$exam_score, examData$college_binary)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  examData$exam_score and examData$college_binary
#> t = 0.30181, df = 98, p-value = 0.7634
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.1669439  0.2255416
#> sample estimates:
#>       cor 
#> 0.0304735

From this, we can see that being a first generation college student does not significantly correlate with exam scores.
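
As a side note, the same 0/1 coding can be produced in one step by coercing a logical comparison to numeric; a minimal equivalent sketch:

# TRUE/FALSE coerces to 1/0, matching the ifelse() recode above
examData$college_binary <- as.numeric(examData$first_generation_college_student == "Yes")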

7.10 Grouped Correlations

While being a first generation college student may not correlate strongly with exam scores, maybe the relationships differ within the groups, in which case a grouped correlation should be conducted. To do this, all we need to do is first group by first generation college student, and then run correlations on our desired variables.

# This lets us see if the relationships differ by first generation status
examData %>%
  group_by(first_generation_college_student) %>%
  summarise(
    r_study_exam = cor(studying_hours, exam_score),
    r_anx_exam = cor(anxiety_score, exam_score))
#> # A tibble: 2 × 3
#>   first_generation_college_student r_study_exam r_anx_exam
#>   <chr>                                   <dbl>      <dbl>
#> 1 No                                      0.446     -0.386
#> 2 Yes                                     0.340     -0.511

Our results show that the direction does not change within the groups, but the strength of the correlations does. For instance, for students who are first generation college students, anxiety scores have a stronger negative association with exam scores than for students who are not.
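
To see these group-level differences rather than just compute them, a faceted scatterplot (one panel per group) is a handy sketch, reusing the ggplot2 tools from Section 3.4:

# One panel per group, each with its own trend line
ggplot(examData, aes(x = anxiety_score, y = exam_score)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ first_generation_college_student) +
  theme_minimal() +
  labs(x = "Exam Anxiety (0–100)", y = "Exam Score (%)")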

In Chapter 8, we'll use what we learned here to build predictive models, moving from describing relationships to forecasting outcomes.

7.11 Conclusion

Congratulations! You have visualized correlations, obtained the correlation coefficient, and even computed how much variability is shared between two variables.

In the next chapter, we will use what we learned today to not only measure how related two variables are, but figure out ways to use one variable to predict another!

Key Terms

  • Correlation coefficient (r): Measures the strength and direction of a linear relationship.
  • Coefficient of determination (R²): Proportion of variance in Y explained by X.
  • Partial correlation: Correlation between two variables after controlling for another.
  • Point-biserial correlation: Correlation between a continuous and dichotomous variable.

7.12 Key Takeaways

  • Always visualize relationships before interpreting numbers!!!
  • Pearson correlations measure linear relationships between numeric variables.
  • Correlation ≠ causation
  • R² tells us how much variance in one variable is shared with another.
  • Partial correlations show unique relationships after controlling for other variables.
  • Point-biserial correlations apply when one variable is dichotomous.

7.13 Checklist

When running correlations, have you:

7.14 Key Functions & Commands

The following functions and commands are introduced or reinforced in this chapter to support correlation analysis and related exploratory techniques.

  • cor() (stats)
    • Computes the Pearson correlation coefficient between two numeric variables.
  • cor.test() (stats)
    • Calculates the correlation coefficient and tests whether the relationship is statistically significant.
  • pairs() (base R)
    • Creates a matrix of scatterplots for exploring relationships among multiple numeric variables.
  • ggpairs() (GGally)
    • Produces an enhanced scatterplot matrix that includes distributions and correlation coefficients.
  • pcor.test() (ppcor)
    • Computes partial correlations while controlling for one or more additional variables.
  • ifelse() (base R)
    • Recodes variables for use in point-biserial or conditional correlation analyses.
  • clean_names() (janitor)
    • Renames column headers by replacing all spaces with underscores and making all letters lowercase.

7.15 Example APA-style Write-up

The following examples demonstrate one acceptable way to report correlation results in APA style.

7.15.1 Bivariate Correlation

A Pearson correlation was conducted to examine the relationship between exam anxiety and exam performance. There was a statistically significant negative correlation between anxiety scores and exam scores, r = −.44, p < .001, indicating that higher anxiety levels were associated with lower exam performance. Approximately 19% of the variance in exam scores was shared with anxiety levels (r² = .19).

7.15.2 Positive Correlation

A Pearson correlation analysis revealed a statistically significant positive relationship between study time and exam performance, r = .40, p < .001. This indicates that students who spent more time studying tended to earn higher exam scores. Study time accounted for approximately 16% of the variance in exam performance (r² = .16).

7.15.3 Partial Correlation

A partial correlation was conducted to examine the relationship between exam anxiety and exam performance while controlling for study time. After controlling for study time, the negative association between anxiety and exam performance was reduced but remained statistically significant, r = −.26, p = .010. This indicates that part of the relationship between anxiety and exam scores overlaps with students’ study habits.

7.16 💡 Reproducibility Tip:

When interpreting correlations, statistical strength alone is not enough—results must also make theoretical and real-world sense.

With the right dataset, it is easy to find strong correlations between variables that have no meaningful connection (for example, volcanic eruptions and how often my grandma goes to the supermarket). While such relationships may be statistically strong, they are not scientifically meaningful.

For correlations to be reproducible, the variables involved should have a plausible relationship grounded in theory, logic, or prior evidence. Always ask whether the correlation makes sense before interpreting or reporting it.
