Reproducible Research Using R: 4 Comparing Two Groups: Data Wrangling, Visualization, and t-Tests

4 Comparing Two Groups: Data Wrangling, Visualization, and t-Tests

4.1 Introduction

A tale as old as time: you have two different groups, and you want to figure out not only whether there is a difference, but also which is better. In New York City, this question materializes as Mets vs. Yankees (Mets, obviously).

When you have two different groups with a numeric measurement, one of the first (and best) things to do is compare the two means. But there is more to uncover than simply seeing whether the means differ, and that is exactly what we are going to cover in this chapter.

In this chapter, we’ll walk through a complete workflow for comparing two groups. Starting from creating and reshaping data, we will move to visualizing differences, then to formally testing whether those differences are statistically meaningful. While this chapter uses baseball teams as the example, the same workflow applies to any two groups you might want to compare.

4.2 Learning Objectives {means-objectives}

By the end of this chapter, you will be able to:

  • Create simple data frames in R to represent grouped numeric data
  • Combine datasets by binding rows and joining tables using base R and tidyverse tools
  • Calculate and interpret group means using mean()
  • Visualize differences between two groups using appropriate plots (e.g., boxplots with jittered points)
  • Conduct and interpret an independent samples t-test using t.test()
  • Evaluate statistical significance using p-values and distinguish statistical significance from differences in averages

4.3 Creating a Sample Dataset

There are many different ways to get data into R. The most common are:

  • Loading an Excel file or CSV
  • Calling data into R using an API
  • Loading data from a particular package (see chapter 7)

What happens if you do not have data to insert into R? What if you want to create your own data in R to run analyses, merge, and graph?

R makes it very simple to do exactly that! Using the built-in data.frame command, you can create your very own data frame with whatever you want inside. Here, we are going to create two datasets:

  1. mets: This is going to be a data frame with 10 rows, each row being a different game, and the number of runs scored by the Mets in that game.
  2. yankees: This is going to be a data frame with 10 rows, each row being a different game, and the number of runs scored by the Yankees in that game.

In this scenario, we can imagine that, for whatever reason, we are unable or unwilling to upload data into R, and instead create the data ourselves. We will create our tables using 2025 Major League Baseball (MLB) data, and each table will have two columns:

  1. Game: A number representing which game it is.
  2. Score: The total runs the respective team scored in that game.

To summarize, we will take the scores from the first 10 Yankees games and the first 10 Mets games and create a data frame for each, using techniques similar to those introduced in Section 1.6.1.

mets <- data.frame(
  Game= c(1,2,3,4,5,6,7,8,9,10),
  Mets_Score= c(4,20,12,5,3,9,9,10,4,2)) # This is where we manually put our numbers in

head(mets) # This calls the first 6 rows of the data
#>   Game Mets_Score
#> 1    1          4
#> 2    2         20
#> 3    3         12
#> 4    4          5
#> 5    5          3
#> 6    6          9

yankees <- data.frame(
  Game= c(1:10), # 1:10 generates the sequence of integers from 1 to 10
  Yankees_Score= c(3,8,2,3,2,6,5,3,2,7))

tail(yankees) # This calls the last 6 rows of the data
#>    Game Yankees_Score
#> 5     5             2
#> 6     6             6
#> 7     7             5
#> 8     8             3
#> 9     9             2
#> 10   10             7

Perfect! We created two data frames, mets and yankees, each with the two columns we were looking for.

To get more practice, let’s do it again, but instead, we will do it with the next ten games of the 2025 season.

mets_second_ten <- data.frame(
  Game= c(11:20),
  Mets_Score= c(10,0,7,1,8,5,3,3,4,1))

yankees_second_ten <- data.frame(
  Game= c(11:20),
  Yankees_Score= c(0,4,1,8,4,4,4,4,6,1))

Now we have datasets covering games 1–10 and games 11–20 of the 2025 MLB season for both the Mets and the Yankees.

4.4 Merging Data

What if we want our two datasets together?

4.4.1 Binding our data

We have data for the first 20 games for the Mets and Yankees, but it is split into the first 10 and second 10 games. What if we want all 20 games for each team in a single data frame? We can use a base R command called rbind. The r stands for "rows," so this command is literally telling R to "bind the rows."

Note: rbind only works when the column names in the datasets being bound are identical.
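Because mismatched names are a common source of rbind errors, it can help to confirm the names line up first. A quick sketch using the data frames we just built:

```r
# Compare the column names of the two data frames before binding
names(mets)            # "Game" "Mets_Score"
names(mets_second_ten) # "Game" "Mets_Score"

identical(names(mets), names(mets_second_ten)) # TRUE means rbind() is safe to use
```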

When combining datasets, no matter which method, it is imperative that we check our data before and after to make sure the combining process worked how we anticipated. In this case, each of our original datasets contain 10 rows, which means our combined datasets should have 20 rows. We will be checking using nrow, and any number besides 20 means that there is an issue.

# Mets
nrow(mets) # Seeing how many rows are in mets
#> [1] 10

nrow(mets_second_ten) # Seeing how many rows are in mets_second_ten
#> [1] 10

mets_all<- rbind(mets, mets_second_ten) # Combining the rows of mets and mets_second_ten

nrow(mets_all) # Seeing how many rows are in mets_all
#> [1] 20

# Yankees
nrow(yankees)
#> [1] 10

nrow(yankees_second_ten)
#> [1] 10

yankees_all<- rbind(yankees, yankees_second_ten)

nrow(yankees_all) # Seeing how many rows are in yankees_all
#> [1] 20

Success! Both of our combined datasets have 20 rows.

4.4.2 Joining Data

Our two datasets, mets_all and yankees_all, now contain the data from the first 20 games of the 2025 MLB season, since we successfully combined our data. Now, what if we want a single dataset that has the scores of each of the first 20 games for both the Mets and the Yankees?

There are a few different ways to do this. The first is the merge command, which is part of base R. To check that our data merged successfully, we need to make sure there are 20 rows and 3 columns: Game, Mets_Score, and Yankees_Score.

baseball_data <- merge(yankees_all, mets_all, by = "Game")

nrow(baseball_data) # Checks the number of rows
#> [1] 20

ncol(baseball_data) # Checks the number of columns
#> [1] 3

baseball_data
#>    Game Yankees_Score Mets_Score
#> 1     1             3          4
#> 2     2             8         20
#> 3     3             2         12
#> 4     4             3          5
#> 5     5             2          3
#> 6     6             6          9
#> 7     7             5          9
#> 8     8             3         10
#> 9     9             2          4
#> 10   10             7          2
#> 11   11             0         10
#> 12   12             4          0
#> 13   13             1          7
#> 14   14             8          1
#> 15   15             4          8
#> 16   16             4          5
#> 17   17             4          3
#> 18   18             4          3
#> 19   19             6          4
#> 20   20             1          1

Fantastic! We now have one complete dataset with the scores of the Mets and Yankees in the first 20 games.

4.4.2.1 Using SQL-Style Joins

We previously used the merge command; however, the tidyverse (specifically dplyr) also provides SQL-style join functions! Here is a quick overview of the main joins:

  • inner_join: Keeps only rows that match in both datasets
    • Overlap
  • full_join: Keeps all rows from both datasets, matching when possible
    • Everything
  • left_join: Keeps all rows from the left dataset, and matches from the right when possible
    • All left
  • right_join: Keeps all rows from the right dataset, and matches from the left when possible
    • All right
  • anti_join: Keeps rows from the left dataset that do NOT appear in the right
    • What’s missing
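To see how these joins differ in practice, here is a small self-contained sketch. The mini data frames left_tbl and right_tbl below are hypothetical, made up purely to illustrate the behavior (they are not part of our baseball data):

```r
library(dplyr)

# Two tiny tables that only partially overlap on the key column "Game"
left_tbl  <- data.frame(Game = c(1, 2, 3), A = c(10, 20, 30))
right_tbl <- data.frame(Game = c(2, 3, 4), B = c(200, 300, 400))

inner_join(left_tbl, right_tbl, by = "Game") # Games 2 and 3: the overlap
full_join(left_tbl, right_tbl, by = "Game")  # Games 1-4: everything, NAs where data is missing
left_join(left_tbl, right_tbl, by = "Game")  # Games 1-3: all left; Game 1 gets NA for B
anti_join(left_tbl, right_tbl, by = "Game")  # Game 1 only: what's missing from the right
```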

In this case, we are going to use inner_join, since we know that both of our datasets have the same 20 games.

library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.1     ✔ stringr   1.6.0
#> ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.2.0     
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

baseball_data_sql <- inner_join(yankees_all, mets_all, by = "Game")

nrow(baseball_data_sql) # Checks the number of rows
#> [1] 20

ncol(baseball_data_sql) # Checks the number of columns
#> [1] 3

baseball_data_sql
#>    Game Yankees_Score Mets_Score
#> 1     1             3          4
#> 2     2             8         20
#> 3     3             2         12
#> 4     4             3          5
#> 5     5             2          3
#> 6     6             6          9
#> 7     7             5          9
#> 8     8             3         10
#> 9     9             2          4
#> 10   10             7          2
#> 11   11             0         10
#> 12   12             4          0
#> 13   13             1          7
#> 14   14             8          1
#> 15   15             4          8
#> 16   16             4          5
#> 17   17             4          3
#> 18   18             4          3
#> 19   19             6          4
#> 20   20             1          1

One of the beautiful things in R is that you can get the same results using different code - just like we did here. Our two baseball datasets are identical even though we used two different commands to get them. This is an example of how R puts the r in Artist.

4.4.3 Wide Format

Right now, our data is considered wide format, that is, there are 20 rows, and separate columns for the scores of the Yankees and scores of the Mets. In some cases, we may need our data to be in a long format. In this case, there will be 40 rows instead of 20, since each Yankees score and each Mets score will represent a row, instead of each game representing a row.

Using the pivot_longer function from the tidyverse package, we can turn our data from wide to long.

# Converting our data from wide format into long format
baseball_data_long <- baseball_data %>% pivot_longer(cols = c(Yankees_Score, Mets_Score),
                                                    names_to = "Team",
                                                    values_to = "Score")

nrow(baseball_data_long)
#> [1] 40

baseball_data_long
#> # A tibble: 40 × 3
#>     Game Team          Score
#>    <int> <chr>         <dbl>
#>  1     1 Yankees_Score     3
#>  2     1 Mets_Score        4
#>  3     2 Yankees_Score     8
#>  4     2 Mets_Score       20
#>  5     3 Yankees_Score     2
#>  6     3 Mets_Score       12
#>  7     4 Yankees_Score     3
#>  8     4 Mets_Score        5
#>  9     5 Yankees_Score     2
#> 10     5 Mets_Score        3
#> # ℹ 30 more rows

4.4.4 Long Format (Reverse Demo)

Our original baseball_data (and baseball_data_sql) are already in wide format, so we don’t have to do this. But just in case you ever start with long data and want to convert it to wide, here is some code to help. Below, we turn baseball_data_long back into wide format.

# Converting our data from long format back into wide format
baseball_data_wide <- baseball_data_long %>% pivot_wider(names_from = "Team",
                                                    values_from = "Score")
nrow(baseball_data_wide)
#> [1] 20

baseball_data_wide
#> # A tibble: 20 × 3
#>     Game Yankees_Score Mets_Score
#>    <int>         <dbl>      <dbl>
#>  1     1             3          4
#>  2     2             8         20
#>  3     3             2         12
#>  4     4             3          5
#>  5     5             2          3
#>  6     6             6          9
#>  7     7             5          9
#>  8     8             3         10
#>  9     9             2          4
#> 10    10             7          2
#> 11    11             0         10
#> 12    12             4          0
#> 13    13             1          7
#> 14    14             8          1
#> 15    15             4          8
#> 16    16             4          5
#> 17    17             4          3
#> 18    18             4          3
#> 19    19             6          4
#> 20    20             1          1

4.5 Comparing Means

Now that we have our data all together, in both wide and long format, we are ready to start comparing means. First things first: we need to calculate the means!

4.5.1 Calculating the means

Using the mean function, we can easily calculate the mean(s) of our data. Just know that "mean" and "average" mean (pun intended) the same thing and are interchangeable.

# Calculating the individual means of the teams' scores; we can use our wide format
yankees_average_score <- mean(baseball_data$Yankees_Score) 

# Note, we are saving this as a variable so it can be called anytime, but you could just run mean(baseball_data$Yankees_Score)
yankees_average_score
#> [1] 3.85

mets_average_score <- mean(baseball_data$Mets_Score)
# If we want to round our numbers, we can use the code below
# mets_average_score<- round(mean(baseball_data$Mets_Score),0)
mets_average_score
#> [1] 6

# Calculating the mean of all the scores; we can use our long format
overall_average <- mean(baseball_data_long$Score)

overall_average
#> [1] 4.925

Here is a breakdown of the means:

  • Yankees Average Score: 3.85
  • Mets Average Score: 6
  • Overall Average Score: 4.925

It isn’t necessary, but this is also a natural point to round our means for easier reporting.
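The round() function trims a number to a chosen number of decimal places; a small optional sketch using the means we just computed:

```r
# round(x, digits) keeps the requested number of decimal places
round(yankees_average_score, 1)
round(overall_average, 1)
```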

Right now, we can see that the Mets’ average runs scored is higher than both the Yankees’ average and the overall average. While we see that there is a difference, we do not know whether this difference is statistically significant. This is where a t-test comes into play.

Before running a t-test, it’s often helpful to visualize the data to see how much the two groups overlap. An excellent way to do this is to create a boxplot, first introduced in Section 3.4.5.

library(ggplot2)
ggplot(baseball_data_long, aes(Team, Score)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
  coord_flip() +
  labs(
  title = "Runs Scored by Team Across the First 20 Games",
  x = "Team",
  y = "Runs Scored"
  ) +
  theme_classic()

Figure 4.1: Distribution of runs scored by the New York Yankees and New York Mets across the first 20 games of the 2025 season. Boxplots summarize the central tendency and spread of scores for each team, while individual points represent runs scored in each game. This visualization highlights the overlap between groups, motivating the use of an independent samples t-test to evaluate whether observed differences in mean runs scored are statistically significant.

There has already been a lot of work done with our data, and through both visualizations and summary statistics, we have an idea of where our groups differ. The next step of comparing two means is to dive even deeper: statistics.

4.5.2 t.test

A t-test compares the difference between two group means relative to the variability in the data. Depending on your data, you will run either a paired or an unpaired t-test:

  • Paired t-test: use when measurements come in matched pairs (e.g., the same participant measured twice).
    • Example: Ten people first run a race just drinking water. Then, those same ten people run the same race but drinking coffee.
  • Unpaired t-test: use when the two groups are independent (different participants, no pairing).
    • Example: Ten people drink water, then ten different people drink coffee, and all of them run a race at the same time.
  • Pairing is about the design of the data, not how the table looks.
    • Rule of thumb: If you can draw a line connecting observations across groups (same person, matched pair, same unit measured twice), it’s paired.

For our data, we need to conduct an unpaired t-test: although the game numbers line up, the Mets’ and Yankees’ games are separate, unrelated events (the teams are not playing each other), so the two groups are independent rather than matched pairs.

# Unpaired t-test
baseball_t_test <- t.test(baseball_data$Yankees_Score, baseball_data$Mets_Score, paired = FALSE)
# From a structural standpoint, if we needed to run a paired t-test, all we would need to do is change "paired = FALSE" to "paired = TRUE"

baseball_t_test
#> 
#>  Welch Two Sample t-test
#> 
#> data:  baseball_data$Yankees_Score and baseball_data$Mets_Score
#> t = -1.823, df = 27.274, p-value = 0.07928
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -4.568735  0.268735
#> sample estimates:
#> mean of x mean of y 
#>      3.85      6.00

The output when we call baseball_t_test tells us something extremely important: the p-value. When looking at a p-value, we check whether it falls above or below the 0.05 mark.

  • If p < 0.05: reject the null hypothesis (evidence of a difference)
  • If p ≥ 0.05: fail to reject the null (not enough evidence of a difference)

So, while the two averages are not the same, our p-value is 0.07928, which is above the 0.05 threshold. This means there is no statistically significant difference between the average runs scored by the Yankees and the Mets in their first 20 games. Even though the Mets scored more runs on average, the difference was not statistically significant, reminding us that magnitude and statistical significance are not the same thing. In real research, this distinction matters: decisions should not be made on differences in averages alone.
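As a side note, the object returned by t.test() is a list, so the numbers in the printed output can also be pulled out individually by name. A quick sketch:

```r
# t.test() returns an "htest" list object; components are accessible with $
baseball_t_test$p.value   # the p-value (about 0.079)
baseball_t_test$estimate  # the two sample means (3.85 and 6.00)
baseball_t_test$conf.int  # the 95 percent confidence interval
```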

4.6 Key Takeaways

  • To create data frames in R, use the data.frame command.
  • To merge data in R, you can:
    • Bind rows (rbind())
    • Join (merge or any of the SQL commands, depending on the desired outcome)
  • Use pivot_longer() to turn wide formatted data into long format
  • Use pivot_wider() to turn long formatted data into wide format
  • Just because the means/averages of two variables are different does not mean that they are statistically different.
  • To check whether the difference between two means is statistically significant, perform a t-test using the t.test() command
    • Depending on how the data is structured, this will be either a paired or an unpaired t-test.

4.7 Checklist

4.7.1 Data Creation & Import

4.7.2 Comparing Two Means

4.8 Key Functions & Commands

The following functions and commands are introduced or reinforced in this chapter to support data restructuring, joining, and basic statistical comparison.

  • data.frame() (base R)
    • Creates a data frame object for storing tabular data.
  • rbind() (base R)
    • Combines multiple data frames by binding rows together.
  • merge() (base R)
    • Joins two data frames together based on a shared key variable.
  • inner_join() (dplyr)
    • Performs a SQL-style inner join, keeping only rows that match in both datasets.
  • pivot_longer() (tidyr)
    • Converts data from wide format to long format.
  • pivot_wider() (tidyr)
    • Converts data from long format back to wide format.
  • mean() (base R)
    • Calculates the average value of a numeric variable.
  • t.test() (stats)
    • Performs a hypothesis test to evaluate whether the means of two groups differ significantly.

4.9 Example APA-style Write-up

The following example demonstrates one acceptable way to report a comparison of two means in APA style.

Independent Samples t-Test

An independent samples t-test was conducted to examine whether the average number of runs scored differed between the New York Yankees and the New York Mets across the first 20 games of the season. The Mets scored more runs on average (M = 6.00, SD = X.XX) than the Yankees (M = 3.85, SD = X.XX); however, this difference was not statistically significant, t(27.27) = -1.82, p = .079. Although the Mets had a higher mean score, this result indicates insufficient evidence that the teams differed in average runs scored.

Note: Placeholder values (e.g., X.XX) are used here to emphasize APA formatting rather than specific results.
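In a real write-up, those SD placeholders would come straight from the data. Assuming the baseball_data frame from this chapter, sd() fills them in:

```r
# sd() computes the sample standard deviation of each team's scores
sd(baseball_data$Yankees_Score)
sd(baseball_data$Mets_Score)
```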

4.10 💡 Reproducibility Tip:

Merging data is a crucial step in many analysis projects. Whether you are using SQL-style joins or base R functions, it is essential to check the structure of your data before and after every merge.

A simple but powerful habit is to verify the number of rows using functions like nrow(). Joins that are not specified correctly can lead to what are often called "data explosions," where a dataset unexpectedly grows. An incorrectly specified merge can easily take a dataset from 1,000 rows to 3,000,000 when only 1,000 were expected.

Checking row counts before and after a merge helps ensure that the join behaved as intended and can save your analysis (and your computer) from serious errors.
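One way to turn this habit into an automatic safeguard is to make the script stop whenever a merge produces an unexpected row count. A minimal sketch using stopifnot(), based on the merge from this chapter:

```r
# Halt the script immediately if the merged data does not have the expected 20 rows
expected_rows <- 20
baseball_check <- merge(yankees_all, mets_all, by = "Game")
stopifnot(nrow(baseball_check) == expected_rows) # silent if TRUE, error otherwise
```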

Your analysis cannot be reproducible if its underlying data are incorrect. Developing the habit of validating merges is especially important here, as this is the first chapter where data merging is introduced.
