Reproducible Research Using R: 8 Linear Regression

8 Linear Regression

8.1 Introduction

In this lesson, we will explore how to use linear regression not only to understand but also to predict relationships between variables.

Everyone loves pizza (cowabunga!), and like many pizza establishments, we will try to answer the question:

Can we predict the amount of pizza sold based on its price?

We’ll use a dataset called Pizza_Prices.xlsx, which contains weekly pizza sales and pricing data.

8.2 Learning Objectives {lin-reg-objectives}

By the end of this chapter, you will be able to:

  • Explain the purpose of linear regression as a tool for modeling and prediction
  • Visualize and assess linear relationships between numeric variables using scatterplots
  • Use correlation to evaluate whether variables are appropriate for regression modeling
  • Fit a simple linear regression model using lm() and interpret the slope and intercept
  • Interpret key model outputs including coefficients, p-values, R², adjusted R², and the F-statistic
  • Generate predicted values and calculate residuals from a fitted regression model
  • Diagnose model assumptions by visually and statistically evaluating residuals
  • Test for heteroscedasticity using the Breusch–Pagan test
  • Extend simple regression to multiple regression by adding additional predictors
  • Compare competing regression models using adjusted R² and AIC
  • Select a parsimonious model that balances explanatory power and complexity

8.3 Loading Our Data

We start by importing the necessary packages, reading in our Excel file, and getting a quick overview of the data.

library(tidyverse)
library(readxl)

pizza_sales<- read_xlsx("Pizza_Prices.xlsx")

library(skimr)

skim(pizza_sales)
Table 7.1: Data summary

Name: pizza_sales
Number of rows: 156
Number of columns: 3
Column type frequency: numeric (3)
Group variables: None

Variable type: numeric

skim_variable   n_missing  complete_rate  mean      sd        p0        p25       p50       p75       p100       hist
Week            0          1              78.50     45.18     1.00      39.75     78.50     117.25    156.00     ▇▇▇▇▇
Total Volume    0          1              88023.50  22863.94  55031.00  74347.00  83889.00  95113.00  227177.00  ▇▃▁▁▁
Total Price     0          1              2.66      0.13      2.31      2.58      2.67      2.75      2.96       ▁▃▇▆▂

8.4 Cleaning Our Data

In Section 7.3 we had to clean our data because of the spaces in the column headers. Our current data has the same issue, as “Total Volume” and “Total Price” both have a space in them. Looks like the clean_names() command will come in handy again. Let’s also do some other cleaning while we are at it.

library(janitor)

# Let's get rid of any spaces in the column names
pizza_sales<- clean_names(pizza_sales)

# We just want to see the relationship between volume and price, so week does not matter
pizza_sales<- pizza_sales %>% select(-1)

# Let's rename our column names to make things easier
pizza_sales <- pizza_sales %>%
  rename(price = total_price, volume = total_volume)

We now have a cleaner dataset we can use!
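As a quick optional check (and a good reproducibility habit), we can confirm that the columns are now named the way we expect before moving on. This short sketch uses only functions already loaded above.

# Confirm the cleaned and renamed column names
names(pizza_sales)

# Peek at the structure of the cleaned dataset
glimpse(pizza_sales)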

8.5 Visualizing Relationships

As emphasized in every chapter, it is imperative that we first graph our data to see what it looks like before we run any statistical analyses. Visuals really help us understand our data. Is our data linear? Can we see any patterns? Let’s find out!

For regression models, we use scatterplots. For a scatterplot, all we need are two numerical variables.

# Now, let us graph our data using ggplot
# As always with ggplot, we have our data, our aesthetics, and our geometry.

pizza_plot <- ggplot(pizza_sales, aes(x = price, y = volume)) +
  geom_point(size = 3)  +
  geom_smooth(method = "lm", se = FALSE) + # this adds a "line of best fit"
  theme_minimal() +
  labs(
    title = "Analysis of Pizza Sales",
    subtitle = "You know the rule, one bite...",
    x = "Price ($)", y = "Volume"
  )

pizza_plot

Figure 8.1: Scatterplot showing the relationship between pizza price and weekly sales volume, with a fitted linear trend line. The downward slope indicates a negative linear association, suggesting that higher prices are associated with lower sales volume. This visualization is used to assess linearity and motivate the use of correlation and linear regression.

If we wanted to use another package instead of ggplot, we could also use the library ggformula.

library(ggformula)
pizza_plot_ggformula <- gf_point(volume ~ price, data = pizza_sales) %>% gf_lm(color ="purple")

pizza_plot_ggformula

Figure 8.2: Scatterplot of pizza price and weekly sales volume created using the ggformula package, with a fitted linear regression line. This plot conveys the same information as the ggplot version, demonstrating that different visualization frameworks can be used to explore linear relationships.

Here is a side-by-side comparison of what they look like. Very similar indeed.

library(patchwork)

pizza_plot + pizza_plot_ggformula
#> `geom_smooth()` using formula = 'y ~ x'

Figure 8.3: Side-by-side comparison of scatterplots generated using ggplot2 and ggformula. Both visualizations display the same negative linear relationship between pizza price and sales volume, illustrating that different plotting systems can yield equivalent analytical insights.

Our data is already telling us a lot. From both graphics, we can see that as price goes up, volume goes down, indicating a negative correlation. The points also cluster fairly tightly around the fitted line, suggesting a moderately strong relationship.

8.6 Understanding Correlation

Before we get to a linear regression, it is best to first understand how (if at all) our variables are correlated with each other. For a review on correlation, go back to chapter 7.

We are first trying to figure out what variables, if any, should be included in our linear regression model, and this is exactly where correlation comes into play.

cor.test(pizza_sales$volume,pizza_sales$price)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  pizza_sales$volume and pizza_sales$price
#> t = -10.8, df = 154, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.7375315 -0.5567684
#> sample estimates:
#>        cor 
#> -0.6564736

When we run a correlation, we are looking for three things:

  1. The direction: is the relationship positive or negative?
  2. The strength: how close is the coefficient to -1, 0, or 1?
  3. The p-value: Is it statistically significant?

The code above answers all three questions:

  1. The direction is negative.
  2. The strength is about −0.66, indicating a moderately strong negative correlation.
  3. The p-value is well below 0.05, indicating that the relationship between price and volume is unlikely to be due to chance.

With a correlation coefficient of -0.66, we have a moderately strong negative correlation between price and volume. As price goes up, volume goes down - just what we saw in our graphs!
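If you want to pull these numbers out programmatically rather than reading them off the printed output (handy when writing up results), the result of cor.test() can be saved as an object and its pieces extracted directly. A minimal sketch:

# Save the test result so we can extract pieces of it
pizza_cor <- cor.test(pizza_sales$volume, pizza_sales$price)

pizza_cor$estimate   # the correlation coefficient, r
pizza_cor$p.value    # the p-value
pizza_cor$conf.int   # the 95% confidence interval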

8.7 Linear Regression Model

We have identified that there is a significant negative correlation between the two variables. Since we want to be able to predict the amount (volume) of pizza sold using price, our next step is to run a linear regression model.

With a linear regression model, we are trying to build the line of best fit and figure out the equation for the line. The formula for a line is y = mx + b, where:

  • y is the predicted value (in this case, volume)
  • m is the slope
  • x is the given value (in this case, price)
  • b is the y intercept (the y value when x is 0)

When we are able to get the slope and the intercept, we can then begin to predict values.

To run a linear regression model, we can use the lm command. It follows the structure:

  • lm(y ~ x, data = dataset)
  • lm(what we want to predict ~ the predictor, data = our dataset)
pizza_lm <- lm(volume ~ price, data=pizza_sales)
# The code above is saving our linear regression model as pizza_lm

summary(pizza_lm)
#> 
#> Call:
#> lm(formula = volume ~ price, data = pizza_sales)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -28602 -11685  -1824   8379 110857 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   385442      27575   13.98   <2e-16 ***
#> price        -111669      10340  -10.80   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 17300 on 154 degrees of freedom
#> Multiple R-squared:  0.431,  Adjusted R-squared:  0.4273 
#> F-statistic: 116.6 on 1 and 154 DF,  p-value: < 2.2e-16
# When we call summary on our model, we can find out our key insights

Congratulations! We have just run our first linear regression model! We have some key insights, including:

  • Intercept: 385,442. This is the b value in our line equation. It means that when the price of pizza is 0 (free pizza would be nice, but really just a dream), the predicted volume would be 385,442.
  • price: -111,669. This is our slope, the m value in our line equation. It tells us that for every one-unit ($1) increase in price, predicted sales volume decreases by approximately 111,669 units. This makes sense, since we have a negative correlation.
  • Multiple R-squared: As we have seen before, this tells us how much of the variability our model explains. For our example, about 43% of the variance in pizza sales volume is explained by price alone. Warning: Multiple R-squared always increases with each new predictor added to a model, regardless of how useful that predictor actually is.
  • Adjusted R-squared: Similar to Multiple R-squared, but it penalizes each additional predictor, so it does not go up just because a new predictor was added to the model.
  • p-value: There are three p-values here. The first two, on the intercept and price lines, test whether each coefficient is statistically different from zero; both are significant. The bottom one tests whether the model as a whole is significant, which it is.
  • F-statistic: This tests whether the model explains significantly more variance than a model with no predictors; 116.6 is a large value.

All of these insights together mean that this is a good model for predicting the volume of pizza sold.
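If you prefer to pull the coefficients out of the fitted model object directly, rather than reading them from the printed summary, base R provides simple accessors. A short sketch:

# Extract the intercept and slope from the fitted model
coef(pizza_lm)

# 95% confidence intervals for the intercept and slope
confint(pizza_lm)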

In R, we always want to explore different packages. The broom package is very helpful for turning statistical objects into tidy tibbles.

# If we want to get the same information from our lm model using a different package
library(broom)

tidy(pizza_lm)
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)  385442.    27575.      14.0 3.44e-29
#> 2 price       -111669.    10340.     -10.8 1.36e-20

glance(pizza_lm)
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared  sigma statistic  p.value    df
#>       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>
#> 1     0.431         0.427 17303.      117. 1.36e-20     1
#> # ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>, nobs <int>

Our finalized equation is y = (-111,669 × price) + 385,442. All we need to do is plug in any price value, and we can predict the volume of pizza sold.
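To make this concrete, here is a small sketch that plugs a hypothetical price of $2.75 into the fitted model (the $2.75 value is just an illustration, not a figure from the chapter):

# Predicted weekly volume at a hypothetical price of $2.75
new_price <- data.frame(price = 2.75)
predict(pizza_lm, newdata = new_price)

# The same prediction by hand, using the rounded slope and intercept
# (this will differ slightly from predict() because the printed coefficients are rounded)
-111669 * 2.75 + 385442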

8.8 Checking the residuals

Now, we need to make sure that our data for the line looks good, answering the question:

How close are our predicted values to our actual values?

To answer this question, we need to calculate the predicted values and then the residuals (the difference between the actual and predicted values).

We can do this in R with the two commands below.

pizza_sales <- pizza_sales %>%
  mutate(predicted = predict(pizza_lm),
         residuals = residuals(pizza_lm))

To see the difference between the actual and predicted values visually, we can create a scatterplot similar to the one before, but add segments connecting each point to the line of best fit.

# What if we want to see actual vs predicted on our plot?
pizza_plot_residuals <- ggplot(pizza_sales, aes(x = price, y = volume)) +
  geom_point(size = 3)  +
  geom_smooth(method = "lm", se = FALSE) +
  geom_segment(aes(xend = price, yend = predicted), color = "gray", linewidth = 0.7) +
  theme_minimal() +
  labs(
    title = "Analysis of Pizza Sales",
    subtitle = "Gray lines show residuals (differences between actual and model predictions)",
    x = "Price ($)", y = "Volume"
  )

pizza_plot_residuals
#> `geom_smooth()` using formula = 'y ~ x'

Figure 8.4: Scatterplot of pizza price and sales volume with residual lines connecting observed values to model predictions. Each gray line represents a residual, illustrating the difference between actual sales and values predicted by the linear regression model.

With our residuals, we want them to be random. Visually, that means the residual values should be scattered around zero with no obvious pattern. To check this, we can graph the residuals themselves.

# Let us graph our residuals to make sure they are random
ggplot(pizza_sales, aes(x = predicted, y = residuals)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, color = "red", linetype = "dotted")+
  labs(
    title = "Residuals Plot",
    x = "Predicted Volume",
    y = "Residuals"
  )
#> `geom_smooth()` using method = 'loess' and formula = 'y ~
#> x'

Figure 8.5: Residuals plotted against predicted sales volume from the linear regression model. Random scatter around the horizontal zero line indicates homoscedasticity, while systematic patterns would suggest violations of regression assumptions.

The red line in the graph above has a slight curve, showing that our model is less accurate at the extremes (low and high prices). Thankfully, this pattern isn’t very pronounced.

To accompany the visual, we want to statistically test for heteroscedasticity, meaning:

Are the residuals roughly equal in variance across all predicted values?

We will either see:

  • Homoscedasticity: residuals are equally spread (good)
  • Heteroscedasticity: residuals get wider or narrower as predictions change (bad)

The Breusch–Pagan test lets us check this formally.

library(lmtest)

bptest(pizza_lm)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  pizza_lm
#> BP = 7.7392, df = 1, p-value = 0.005403

The main thing to look at here is the p-value:

  • p ≥ 0.05: Fail to reject H₀ → residuals are homoscedastic (good)

  • p < 0.05: Reject H₀ → residuals are heteroscedastic (not ideal)

Since our p-value is less than .05, our residuals are heteroscedastic.

This does not invalidate our model, and for now we do not need to change it. It simply tells us the model is not perfect and that not all prices predict equally well. In practice, this might happen if higher-priced pizzas sell less frequently, giving us fewer data points and more variability at those prices.

Congratulations, we have just completed our first linear regression model!

8.9 Adding more variables

Throughout this chapter, we have looked at only two variables, price and volume, to see if the former can predict the latter. But what if we have more variables? Can we also use them in our regression model?

Let’s find out!

Let us create two new variables to add to our data: one indicating whether Dave Portnoy visited the pizza shop in a given week, and another recording how much was spent on advertising.

# Now, let's add some other variables to our data
set.seed(42)

# 1) Dave Portnoy visit.
pizza_sales <- pizza_sales %>%
  mutate(portnoy_visit = rbinom(n(), size = 1, prob = 0.06))  # visits in about 6% of weeks

# 2) Ad spend
pizza_sales <- pizza_sales %>%
  mutate(ad_spend = round(runif(n(), 800, 6000), 0))

Now, since we have multiple predictors, we can run what’s called a multiple regression model. To add more predictors, we simply use the + sign in our lm() command after the first predictor.

m1 <- lm(volume ~ price, data = pizza_sales)                      # baseline

m2 <- lm(volume ~ price + portnoy_visit, data = pizza_sales)      # add portnoy visit

m3 <- lm(volume ~ price + ad_spend, data = pizza_sales)           # add ad spending

m4 <- lm(volume ~ price + ad_spend + portnoy_visit, data = pizza_sales)  # both

Boom! In the code above, we have now run 4 regression models, each adding a different combination of predictors to our price baseline.

Now, we could run the summary or glance command on each model individually and compare the results by hand, but that would require a lot of mental bookkeeping on our end. Instead, we can use a few R commands to make our lives much easier.

# How do we know which model is the best?
library(AICcmodavg)

AIC(m1, m2, m3, m4) %>% arrange(AIC)
#>    df      AIC
#> m2  4 3486.244
#> m4  5 3488.239
#> m1  3 3491.395
#> m3  4 3493.327
# This code provides us with an AIC (Akaike Information Criterion) value.
# The smaller the AIC, the better the model’s trade-off between complexity and fit.

# Let's say we want to use the broom function again to do a model comparison
models <- list(m1, m2, m3, m4)
# In the code above, we are taking all of our models and putting them into a "list" of models

names(models) <- c("m1","m2","m3","m4")
# In the code above, we are just giving each item in the list a name

# In the code below, we are saying "Hey, run glance on every single item in the models list and create a data frame."

map_df(models, ~glance(.x), .id = "model") %>% select(model, r.squared,adj.r.squared,p.value,AIC)
#> # A tibble: 4 × 5
#>   model r.squared adj.r.squared  p.value   AIC
#>   <chr>     <dbl>         <dbl>    <dbl> <dbl>
#> 1 m1        0.431         0.427 1.36e-20 3491.
#> 2 m2        0.456         0.449 5.57e-21 3486.
#> 3 m3        0.431         0.424 1.79e-19 3493.
#> 4 m4        0.456         0.446 5.07e-20 3488.

With both of these, we now have:

  1. R²: A higher R² means the model explains more variance. Remember, R² always goes up with each added predictor, no matter how useful that predictor actually is.
  2. Adjusted R²: A higher adjusted R² also indicates a better model, but it accounts for the number of predictors and does not automatically increase when more are added.
  3. p-value: Whether each model as a whole is statistically significant.
  4. AIC: Summarizes the trade-off between fit and complexity; a lower number means a better model.

Overall, the goal is to build as parsimonious a regression model as possible, that is, the model with the fewest necessary predictors. Here, the most parsimonious model is m1, the model with just price. Yes, other models have slightly higher adjusted R² values and lower AIC numbers, but the gains are small relative to the added complexity. In this case, m1 is the model to go with.

8.9.1 Bonus code

If you want to compare multiple regression models more quickly, you can also use the code below.

full_model <- lm(volume ~ price + ad_spend + portnoy_visit, data = pizza_sales)

step_model <- step(full_model, direction = "both")
#> Start:  AIC=3043.53
#> volume ~ price + ad_spend + portnoy_visit
#> 
#>                 Df  Sum of Sq        RSS    AIC
#> - ad_spend       1 1.4575e+06 4.4042e+10 3041.5
#> <none>                        4.4041e+10 3043.5
#> - portnoy_visit  1 2.0473e+09 4.6088e+10 3048.6
#> - price          1 3.1199e+10 7.5240e+10 3125.1
#> 
#> Step:  AIC=3041.54
#> volume ~ price + portnoy_visit
#> 
#>                 Df  Sum of Sq        RSS    AIC
#> <none>                        4.4042e+10 3041.5
#> + ad_spend       1 1.4575e+06 4.4041e+10 3043.5
#> - portnoy_visit  1 2.0659e+09 4.6108e+10 3046.7
#> - price          1 3.1252e+10 7.5294e+10 3123.2

8.10 Conclusion

Congratulations! You have run correlations (to help decide which variables to include in a regression model), fit simple and multiple regression models, compared competing models, and even analyzed the differences between actual and predicted values.

Don’t underestimate how powerful regression models can be. With a simple line equation, you can predict an outcome from variables you already have!

8.11 Key Takeaways

  • Linear regression helps us predict one variable (Y) from another (X) using a line of best fit.
  • The equation of the line is y = mx + b, where:
    • m = slope (how much Y changes for every 1-unit change in X)
    • b = intercept (the value of Y when X = 0)
  • The slope tells us the direction of the relationship and the size of the change in Y per unit of X.
  • Residuals = actual - predicted values; smaller residuals = better model fit.
  • A good model has random, evenly scattered residuals (homoscedasticity).
  • R² tells us how much variance in Y is explained by X.
  • Adjusted R² penalizes unnecessary predictors in multiple regression models.
  • AIC helps compare models: lower AIC = better balance between fit and simplicity.
  • Parsimonious models (simpler ones that still explain the data well) are preferred.
  • Stepwise regression automatically selects the most parsimonious model using AIC.
  • Correlation shows association; regression goes further by predicting and quantifying impact.
  • Always visualize both your model fit and your residuals before interpreting results!

8.12 Checklist

When running linear regressions, have you:

8.13 Key Functions & Commands

The following functions and commands are introduced or reinforced in this chapter to support linear regression modeling, diagnostics, and model selection.

  • lm() (stats)
    • Fits linear and multiple linear regression models using a formula interface.
  • summary() (base R)
    • Summarizes regression model results, including coefficients, R², F-statistic, and p-values.
  • cor.test() (stats)
    • Computes and tests correlations to assess whether predictors are suitable for regression.
  • predict() (stats)
    • Generates predicted values from a fitted regression model.
  • residuals() (stats)
    • Extracts residuals (actual − predicted values) from a regression model.
  • tidy() (broom)
    • Converts regression coefficients into a clean, tidy data frame.
  • glance() (broom)
    • Extracts model-level statistics (e.g., R², adjusted R², AIC) into a single-row summary.
  • bptest() (lmtest)
    • Performs the Breusch–Pagan test to assess heteroscedasticity of residuals.
  • AIC() (stats)
    • Compares regression models using Akaike Information Criterion to balance fit and complexity.
  • step() (stats)
    • Performs stepwise model selection to identify a parsimonious regression model.

8.14 Example APA-style Write-up

The following example demonstrates one acceptable way to report the results of a simple linear regression in APA style.

Simple Linear Regression

A simple linear regression was conducted to examine whether pizza price predicted weekly pizza sales volume. The overall regression model was statistically significant, F(1, 154) = 116.60, p < .001, and explained approximately 43% of the variance in pizza sales volume (R² = .43).

Pizza price was a significant negative predictor of sales volume, b = −111,669, SE = 10,340, t = −10.80, p < .001, indicating that higher pizza prices were associated with lower weekly sales volume. Specifically, for each one-dollar increase in pizza price, weekly pizza sales decreased by approximately 111,669 units.

8.15 💡 Reproducibility Tip:

Linear regression models involve many analytic choices, including which variables to include, how to assess relationships, and how to evaluate model fit. To support reproducibility, make each of these decisions explicit and evidence-based.

Before fitting a model, visualize relationships and examine correlations to justify why predictors are included. After fitting the model, report diagnostics—such as residual plots and measures of fit—to demonstrate that assumptions were checked.

Remember: Adding more variables does not always lead to a better model.

Regression results are only reproducible when both the code and the reasoning behind the model are transparent.
