The 2025 Brooklyn Open Data Collection: Analyst Portfolios: 7 Social Infrastructure & Well-Being

7 Social Infrastructure & Well-Being

Author: Jonah Dratfield

NYC Open Data provides numerous datasets about the city “as part of an initiative to improve the accessibility, transparency, and accountability of City government.” My presentation focuses on how public data can help ordinary citizens better understand, and potentially improve, the quality of life in New York City. My analysis centers on two existing datasets and the relationship between them, but it is equally concerned with how future data collection could better serve that goal.

Many NYC Open Data datasets, such as 311 service request logs, provide valuable information for policymakers, administrators, or individuals with substantial financial or political power. However, these datasets are often difficult for ordinary residents to act upon. The majority of New Yorkers, for example, do not have the capacity to meaningfully influence the housing market.

That said, certain types of information (i) can be acted upon directly by individuals and (ii) can be translated into concrete, low-barrier actions. Positive psychology, which consistently finds that strong social relationships are among the most reliable predictors of well-being, provides one framework for identifying such information. With this research in mind, one might ask:

Can publicly available data be used to explore the conditions that best facilitate social connectedness, and thereby, most enhance quality of life?

The answer, at the moment, is a tentative yes. At present, NYC Open Data does not include the validated measures psychologists typically use to assess constructs like social connectedness and well-being. Instead, researchers and citizens must rely on rough proxies, such as economic metrics. Over time, however, the range of datasets amenable to the kind of analysis I propose could be expanded.

In this exploratory analysis, I examine whether the number of permitted events in a community district (i.e., gatherings, such as street fairs, that require city permits) predicts the number of monthly SNAP recipients in that district (i.e., low-income individuals who receive Supplemental Nutrition Assistance Program benefits, which can be used to purchase food). I treat permitted events as a rough measure of social connectedness and the number of SNAP recipients per month as a rough measure of economic health and, thereby, overall well-being. Rather than treat these variables as definitive measures, I use them to demonstrate how fruitful this mode of research can be. I conclude with several suggestions for how data collection in this area could be improved.

7.1 Libraries Used

library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.1     ✔ stringr   1.6.0
#> ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.2.0     
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycOpenData)
library(dplyr)
library(stringr)

7.2 Data Loading

First, I loaded records of NYC permitted events and NYC borough community reports using nycOpenData, the R package that my professor (Christian Martinez) created. Each call requests up to 10,000 rows.

Events <- nyc_permit_events_historic(limit = 10000, filters = list())
BoroReport <- nyc_borough_community_report(limit = 10000, filters = list())

knitr::kable(
  head(Events, 25),
  caption = "First 25 rows of Events"
)
Table 7.1: First 25 rows of Events

| event_agency | event_id | event_name | start_date_time | end_date_time | event_type | event_borough | event_location | street_closure_type | community_board | police_precinct |
|---|---|---|---|---|---|---|---|---|---|---|
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 43, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| N/A | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| N/A | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| N/A | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| N/A | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | 21124 | Ganando Almas Para Cristo’\|08/28/10 01:00 PM\|08/28/10 06:00 PM\|Street Activity Permit Office\|Religious Event\|Bronx\| MORRIS AVENUE between EAST 196 STREET and EAST KINGSBRIDGE ROAD\|Full\|Full Street Closure \|7, \|52, \| | NA | NA | NA | NA | NA | NA | NA | NA |
| 67, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 66, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 13, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 05, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 60, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 44, | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 886939 | Summer on the Hudson Holiday on the Hudson | 2026-12-05T16:30:00.000 | 2026-12-05T18:00:00.000 | Special Event | Manhattan | West Harlem Piers: Marginal Street Between 125th 123rd St. | N/A | 9, | 26, |
| Parks Department | 899434 | Junior Volunteer Corps | 2026-12-05T13:00:00.000 | 2026-12-05T15:00:00.000 | Special Event | Brooklyn | Prospect Park: Bandshell South | N/A | 55, | 78, |
| Parks Department | 899434 | Junior Volunteer Corps | 2026-12-05T13:00:00.000 | 2026-12-05T15:00:00.000 | Special Event | Brooklyn | Prospect Park: Bandshell South | N/A | 55, | 78, |

knitr::kable(
  head(BoroReport, 25),
  caption = "First 25 rows of BoroReport"
)
Table 7.1: First 25 rows of BoroReport

| month | borough | community_district | bc_snap_recipients | bc_snap_households | bc_ca_recipients | bc_ca_cases | bc_ma_only_enrollees | bc_total_ma_enrollees |
|---|---|---|---|---|---|---|---|---|
| 2025-09-01T00:00:00.000 | Staten_Island | S03 | 14469 | 9014 | 3561 | 2058 | 7510 | 13710 |
| 2025-09-01T00:00:00.000 | Staten_Island | S02 | 19401 | 11080 | 4635 | 2612 | 9603 | 18324 |
| 2025-09-01T00:00:00.000 | Staten_Island | S01 | 41291 | 22547 | 16929 | 8311 | 12480 | 37025 |
| 2025-09-01T00:00:00.000 | Queens | Q14 | 32329 | 18077 | 14228 | 7051 | 10278 | 31230 |
| 2025-09-01T00:00:00.000 | Queens | Q13 | 24177 | 15506 | 8303 | 4954 | 13676 | 25712 |
| 2025-09-01T00:00:00.000 | Queens | Q12 | 53866 | 32609 | 23653 | 12689 | 22253 | 54341 |
| 2025-09-01T00:00:00.000 | Queens | Q11 | 10421 | 6997 | 2037 | 1314 | 8253 | 11976 |
| 2025-09-01T00:00:00.000 | Queens | Q10 | 19401 | 11975 | 5338 | 3009 | 9588 | 17649 |
| 2025-09-01T00:00:00.000 | Queens | Q09 | 26116 | 15665 | 6911 | 4053 | 11865 | 22301 |
| 2025-09-01T00:00:00.000 | Queens | Q08 | 23079 | 14038 | 6065 | 3527 | 12821 | 22782 |
| 2025-09-01T00:00:00.000 | Queens | Q07 | 38431 | 25919 | 7700 | 5278 | 26503 | 40712 |
| 2025-09-01T00:00:00.000 | Queens | Q06 | 14111 | 9139 | 2813 | 1685 | 7676 | 13557 |
| 2025-09-01T00:00:00.000 | Queens | Q05 | 20634 | 12888 | 4587 | 2866 | 9926 | 17464 |
| 2025-09-01T00:00:00.000 | Queens | Q04 | 28606 | 17742 | 5285 | 3318 | 13979 | 23492 |
| 2025-09-01T00:00:00.000 | Queens | Q03 | 25425 | 15738 | 5632 | 3288 | 13002 | 22126 |
| 2025-09-01T00:00:00.000 | Queens | Q02 | 13205 | 8772 | 4072 | 2577 | 8064 | 14152 |
| 2025-09-01T00:00:00.000 | Queens | Q01 | 26630 | 16700 | 9699 | 5355 | 12358 | 27336 |
| 2025-09-01T00:00:00.000 | Manhattan | M12 | 47829 | 33714 | 12617 | 8120 | 18676 | 40620 |
| 2025-09-01T00:00:00.000 | Manhattan | M11 | 41971 | 27242 | 16403 | 10048 | 12306 | 36987 |
| 2025-09-01T00:00:00.000 | Manhattan | M10 | 30823 | 21013 | 12758 | 8484 | 8749 | 26825 |
| 2025-09-01T00:00:00.000 | Manhattan | M09 | 22012 | 15017 | 7668 | 4856 | 7413 | 19109 |
| 2025-09-01T00:00:00.000 | Manhattan | M08 | 6913 | 5373 | 2295 | 1586 | 4574 | 8492 |
| 2025-09-01T00:00:00.000 | Manhattan | M07 | 17042 | 12767 | 6099 | 4150 | 7977 | 17923 |
| 2025-09-01T00:00:00.000 | Manhattan | M06 | 7464 | 5934 | 3252 | 2531 | 3301 | 7887 |
| 2025-09-01T00:00:00.000 | Manhattan | M05 | 5531 | 4317 | 5186 | 2747 | 3064 | 9537 |

7.3 Cleaning

7.3.1 Basic Events Cleaning

After this, I removed all non-numeric characters from the community board listings in events and made the community board listings numeric.

Community boards refer to community districts within the five boroughs (and, as a result, function as geographical subdivisions of New York City). There are 59 community boards, as well as a number of so-called “joint-interest areas.” I removed non-numeric characters – such as letters, commas and quotation marks – to standardize the community board notation in the dataset.

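eventscleaner <- Events %>%
  mutate(
    # Strip everything except digits (letters, commas, quotation marks),
    # then convert the remaining characters to a number
    cd_id = community_board %>%
      str_replace_all("[^0-9]", "") %>%
      as.numeric()
  )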
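
The regular expression above deletes every non-digit character, so a field that listed more than one board would collapse into a single number. The sketch below contrasts that behavior with keeping only the first run of digits. The sample values are hypothetical, modeled on the "9," and "55," entries visible in the Events table; whether multi-board entries actually occur in the data is an assumption worth checking before relying on either approach.

```r
library(stringr)

# Hypothetical community_board values (not taken from the real dataset)
boards <- c("9,", "55,", "9, 10,", NA)

# Approach used in this chapter: drop all non-digits, then convert
strip_all <- as.numeric(str_replace_all(boards, "[^0-9]", ""))
strip_all
#> [1]   9  55 910  NA

# Alternative sketch: keep only the first run of digits
first_board <- as.numeric(str_extract(boards, "[0-9]+"))
first_board
#> [1]  9 55  9 NA
```

Keeping the first board is only one possible rule; an analyst might instead split such rows so that each listed board receives the event.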

7.3.2 BoroReport Cleaning

In the borough report, I separated the community district field into a borough identifier and a numeric community board. I then recoded the borough identifiers as numeric prefixes and combined these with the community board numbers to create a standardized community district ID. The goal of this transformation was to make the notation in the BoroReport dataset equivalent to that in the Events dataset.

BoroReport <- BoroReport %>%
  
  mutate(
    snap_borough = str_extract(community_district, "^[A-Za-z]") |> str_to_upper(),
    snap_cb      = str_extract(community_district, "[0-9]+") |> as.numeric()
  ) %>%
  
  mutate(
    snap_borough_num = case_when(
      snap_borough == "M" ~ 100,  # Manhattan
      snap_borough == "B" ~ 200,  # Bronx
      snap_borough == "K" ~ 300,  # Brooklyn
      snap_borough == "Q" ~ 400,  # Queens
      snap_borough == "S" ~ 500,  # Staten Island
      TRUE ~ NA_real_
    ),
    
    cd_id = snap_borough_num + snap_cb
  )
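
The ID scheme can also be written as a small standalone function, which makes it easy to spot-check a few codes by hand. This is a minimal sketch rather than the code used in the chapter: `borough_prefix` and `make_cd_id()` are illustrative names, and the letter-to-prefix mapping is copied from the `case_when()` call above.

```r
library(stringr)

# Borough letter -> numeric prefix, matching the case_when() mapping above
borough_prefix <- c(M = 100, B = 200, K = 300, Q = 400, S = 500)

# Convert codes such as "Q07" into numeric IDs such as 407
make_cd_id <- function(code) {
  letter <- str_to_upper(str_extract(code, "^[A-Za-z]"))
  board  <- as.numeric(str_extract(code, "[0-9]+"))
  # Unknown letters index to NA, so the resulting ID is NA as well
  unname(borough_prefix[letter]) + board
}

make_cd_id(c("S03", "Q14", "M05", "X01"))
#> [1] 503 414 105  NA
```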

7.3.3 Final Events Cleaning

Finally, I applied the same numbering pattern to the Events dataset. I replaced the borough names with the same numeric prefixes and added them to the community board numbers.

eventscleaner <- eventscleaner %>%
  mutate(
    # Same borough prefixes as in BoroReport
    borough_num = case_when(
      event_borough == "Manhattan" ~ 100,
      event_borough == "Bronx"     ~ 200,
      event_borough == "Brooklyn"  ~ 300,
      event_borough == "Queens"    ~ 400,
      # The borough name may appear with an underscore or a space
      event_borough %in% c("Staten_Island", "Staten Island") ~ 500,
      TRUE ~ NA_real_
    ),
    # Combine prefix and board number, e.g. Brooklyn board 55 becomes 355
    cd_id = borough_num + cd_id
  )

7.4 Events Count

Next, I counted the number of events per community district to get a better sense of the data.

events_cd <- eventscleaner %>%
  count(cd_id, name = "n_events")

knitr::kable(
  head(events_cd, 30),
  caption = "Number of Events Per CD"
)
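Table 7.2: Number of Events Per CD

| cd_id | n_events |
|---|---|
| 101 | 2 |
| 107 | 6 |
| 108 | 277 |
| 109 | 12 |
| 111 | 96 |
| 164 | 235 |
| 211 | 11 |
| 228 | 213 |
| 301 | 16 |
| 302 | 13 |
| 305 | 19 |
| 306 | 21 |
| 307 | 33 |
| 310 | 89 |
| 311 | 6 |
| 312 | 78 |
| 315 | 21 |
| 316 | 13 |
| 318 | 135 |
| 355 | 6290 |
| 377 | 24 |
| 401 | 48 |
| 402 | 140 |
| 405 | 628 |
| 407 | 151 |
| 408 | 747 |
| 411 | 392 |
| 412 | 84 |
| 413 | 22 |
| 481 | 80 |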

Across community districts, the mean number of permitted events was 312.5, with a median of 63. (Note: The right skew is driven by the large event counts in joint-interest areas, such as the 6,290 events recorded for Prospect Park. These areas were excluded from the later analysis because BoroReport contains no SNAP figures for them.)
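
To see how a single large count separates the mean from the median, the sketch below uses a handful of counts copied from Table 7.2 rather than the full dataset, so its mean will not match the 312.5 reported above.

```r
# Seven counts taken from Table 7.2 (cd_id 101, 107, 108, 109, 111, 164, 355)
n_events <- c(2, 6, 277, 12, 96, 235, 6290)

mean(n_events)    # pulled upward by the single very large count
median(n_events)  # 96: the middle value, unaffected by the size of the largest

# Dropping the largest count shows how much one area moves the mean
mean(n_events[n_events < 1000])
```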

I then created a graph to display the number of events per district, in descending order:

  • Community district numbers correspond to the final two digits shown on the y-axis.
  • District numbers starting with 1 indicate Manhattan.
  • District numbers starting with 2 indicate the Bronx.
  • District numbers starting with 3 indicate Brooklyn.
  • District numbers starting with 4 indicate Queens.
  • District numbers starting with 5 indicate Staten Island.
  • New York City has 59 community districts in total:
    • Manhattan: 12 districts
    • The Bronx: 12 districts
    • Brooklyn: 18 districts
    • Queens: 14 districts
    • Staten Island: 3 districts
  • District numbers that do not follow this schema (for example, 55 and 64) refer to joint-interest areas rather than standard community districts.
    • District 55 corresponds to Prospect Park.
    • District 64 corresponds to Central Park.

A full list of community districts and joint-interest areas is available here.
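
The numbering rules in the list above can be sketched as a small decoder. This is an illustration under stated assumptions, not code from the chapter: `decode_cd_id()` is a hypothetical helper, and it flags an ID as a joint-interest area whenever the final two digits exceed the number of standard districts listed above for that borough.

```r
decode_cd_id <- function(cd_id) {
  borough_names <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")
  n_districts   <- c(12, 12, 18, 14, 3)  # standard districts per borough (from the list above)

  idx      <- match(cd_id %/% 100, 1:5)  # leading digit -> borough index (NA if invalid)
  district <- cd_id %% 100               # final two digits

  data.frame(
    cd_id          = cd_id,
    borough        = borough_names[idx],
    district       = district,
    joint_interest = district > n_districts[idx]
  )
}

decoded <- decode_cd_id(c(108, 355, 164, 503))
decoded
```

Here 355 (Prospect Park) and 164 (Central Park) are flagged as joint-interest areas, while 108 and 503 are treated as standard districts.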


events_cd %>%
  slice_max(n_events, n = 25) %>%
  ggplot(aes(x = reorder(cd_id, n_events), y = n_events)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Number of Events by Community District (Top 25)",
    x = "Community District",
    y = "Number of Events"
  ) +
  theme_minimal()
Figure 7.1: Number of permitted events by community district, ordered from highest to lowest for the top 25. Each horizontal bar represents a distinct community district. Community district identifiers are displayed on the y-axis and event counts are displayed on the x-axis.

7.5 SNAP Benefits Count

I also looked over the number of SNAP recipients per district.

The chart below shows the mean number of SNAP recipients per month for each community district. (Note: Community districts differ in population, so a raw count of SNAP recipients is not a poverty rate and cannot be compared proportionally across districts. That said, it still offers a rough snapshot of where economic need is concentrated.)

BoroReport <- BoroReport %>%
  mutate(
    bc_snap_recipients = as.numeric(bc_snap_recipients)
  )

snap_plot_data <- BoroReport %>%
  group_by(cd_id) %>%
  summarise(
    bc_snap_recipients = mean(bc_snap_recipients,na.rm = TRUE),
    .groups = "drop"
  )

snap_plot_data %>%
  slice_max(bc_snap_recipients, n = 25) %>%
  ggplot(
    aes(
      x = reorder(cd_id, bc_snap_recipients),
      y = bc_snap_recipients
    )
  ) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "SNAP Recipients by Community District (Top 25)",
    x = "Community District",
    y = "SNAP Recipients"
  ) +
  theme_minimal()
Figure 7.2: Mean number of SNAP recipients by community district, ordered from highest to lowest for the top 25. Each horizontal bar represents a distinct community district. Community district identifiers are displayed on the y-axis and SNAP recipients are displayed on the x-axis.

Across community districts, the mean number of SNAP recipients per month was 28,580, with a median of 25,456.
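
Because districts differ in population, one way to make the counts comparable is to convert them to a rate per 1,000 residents. The sketch below shows the arithmetic only. The SNAP counts are taken from the BoroReport table above, but the `population` values are placeholders invented for illustration, not census figures, and `pop_toy` stands in for a population dataset that this chapter does not load.

```r
library(dplyr)

# SNAP counts for two districts, copied from the BoroReport table
snap_toy <- data.frame(
  cd_id = c(503, 412),
  bc_snap_recipients = c(14469, 53866)
)

# Placeholder populations (hypothetical values, for illustration only)
pop_toy <- data.frame(
  cd_id = c(503, 412),
  population = c(150000, 250000)
)

rates <- snap_toy %>%
  left_join(pop_toy, by = "cd_id") %>%
  mutate(snap_per_1000 = 1000 * bc_snap_recipients / population)

rates
```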

7.6 Merging

Finally, I merged the two datasets on the community district ID (cd_id) created earlier, keeping every BoroReport row and attaching the event count for its district where one exists.

merged <- BoroReport %>%
  left_join(events_cd, by = "cd_id")

knitr::kable(
  head(merged, 25),
  caption = "First 25 rows of merged"
)
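Table 7.3: First 25 rows of merged

| month | borough | community_district | bc_snap_recipients | bc_snap_households | bc_ca_recipients | bc_ca_cases | bc_ma_only_enrollees | bc_total_ma_enrollees | snap_borough | snap_cb | snap_borough_num | cd_id | n_events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-09-01T00:00:00.000 | Staten_Island | S03 | 14469 | 9014 | 3561 | 2058 | 7510 | 13710 | S | 3 | 500 | 503 | NA |
| 2025-09-01T00:00:00.000 | Staten_Island | S02 | 19401 | 11080 | 4635 | 2612 | 9603 | 18324 | S | 2 | 500 | 502 | NA |
| 2025-09-01T00:00:00.000 | Staten_Island | S01 | 41291 | 22547 | 16929 | 8311 | 12480 | 37025 | S | 1 | 500 | 501 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q14 | 32329 | 18077 | 14228 | 7051 | 10278 | 31230 | Q | 14 | 400 | 414 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q13 | 24177 | 15506 | 8303 | 4954 | 13676 | 25712 | Q | 13 | 400 | 413 | 22 |
| 2025-09-01T00:00:00.000 | Queens | Q12 | 53866 | 32609 | 23653 | 12689 | 22253 | 54341 | Q | 12 | 400 | 412 | 84 |
| 2025-09-01T00:00:00.000 | Queens | Q11 | 10421 | 6997 | 2037 | 1314 | 8253 | 11976 | Q | 11 | 400 | 411 | 392 |
| 2025-09-01T00:00:00.000 | Queens | Q10 | 19401 | 11975 | 5338 | 3009 | 9588 | 17649 | Q | 10 | 400 | 410 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q09 | 26116 | 15665 | 6911 | 4053 | 11865 | 22301 | Q | 9 | 400 | 409 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q08 | 23079 | 14038 | 6065 | 3527 | 12821 | 22782 | Q | 8 | 400 | 408 | 747 |
| 2025-09-01T00:00:00.000 | Queens | Q07 | 38431 | 25919 | 7700 | 5278 | 26503 | 40712 | Q | 7 | 400 | 407 | 151 |
| 2025-09-01T00:00:00.000 | Queens | Q06 | 14111 | 9139 | 2813 | 1685 | 7676 | 13557 | Q | 6 | 400 | 406 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q05 | 20634 | 12888 | 4587 | 2866 | 9926 | 17464 | Q | 5 | 400 | 405 | 628 |
| 2025-09-01T00:00:00.000 | Queens | Q04 | 28606 | 17742 | 5285 | 3318 | 13979 | 23492 | Q | 4 | 400 | 404 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q03 | 25425 | 15738 | 5632 | 3288 | 13002 | 22126 | Q | 3 | 400 | 403 | NA |
| 2025-09-01T00:00:00.000 | Queens | Q02 | 13205 | 8772 | 4072 | 2577 | 8064 | 14152 | Q | 2 | 400 | 402 | 140 |
| 2025-09-01T00:00:00.000 | Queens | Q01 | 26630 | 16700 | 9699 | 5355 | 12358 | 27336 | Q | 1 | 400 | 401 | 48 |
| 2025-09-01T00:00:00.000 | Manhattan | M12 | 47829 | 33714 | 12617 | 8120 | 18676 | 40620 | M | 12 | 100 | 112 | NA |
| 2025-09-01T00:00:00.000 | Manhattan | M11 | 41971 | 27242 | 16403 | 10048 | 12306 | 36987 | M | 11 | 100 | 111 | 96 |
| 2025-09-01T00:00:00.000 | Manhattan | M10 | 30823 | 21013 | 12758 | 8484 | 8749 | 26825 | M | 10 | 100 | 110 | NA |
| 2025-09-01T00:00:00.000 | Manhattan | M09 | 22012 | 15017 | 7668 | 4856 | 7413 | 19109 | M | 9 | 100 | 109 | 12 |
| 2025-09-01T00:00:00.000 | Manhattan | M08 | 6913 | 5373 | 2295 | 1586 | 4574 | 8492 | M | 8 | 100 | 108 | 277 |
| 2025-09-01T00:00:00.000 | Manhattan | M07 | 17042 | 12767 | 6099 | 4150 | 7977 | 17923 | M | 7 | 100 | 107 | 6 |
| 2025-09-01T00:00:00.000 | Manhattan | M06 | 7464 | 5934 | 3252 | 2531 | 3301 | 7887 | M | 6 | 100 | 106 | NA |
| 2025-09-01T00:00:00.000 | Manhattan | M05 | 5531 | 4317 | 5186 | 2747 | 3064 | 9537 | M | 5 | 100 | 105 | NA |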
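
The NA values in n_events come from the left join: a BoroReport district with no matching row in events_cd keeps its SNAP figures but receives no event count. A minimal sketch of how one might audit the join keys follows. The toy data frames reuse a few values from the tables above, and `snap_toy`, `events_toy`, `no_events`, and `no_snap` are illustrative names that do not appear in the chapter's code.

```r
library(dplyr)

# Toy stand-ins for BoroReport and events_cd (values copied from the tables above)
snap_toy   <- data.frame(cd_id = c(503, 413, 414),
                         bc_snap_recipients = c(14469, 24177, 32329))
events_toy <- data.frame(cd_id = c(413, 355),
                         n_events = c(22, 6290))

# Left join: every SNAP row is kept; unmatched districts get NA for n_events
merged_toy <- left_join(snap_toy, events_toy, by = "cd_id")

# Districts with SNAP data but no event count (these become NA above)
no_events <- anti_join(snap_toy, events_toy, by = "cd_id")

# Areas with events but no SNAP data (these never enter the merged table)
no_snap <- anti_join(events_toy, snap_toy, by = "cd_id")

merged_toy
```

In this toy example, district 355 (Prospect Park) has events but no SNAP row, so it drops out of the merged table, which mirrors how joint-interest areas leave the analysis.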

7.7 Linear Regression

I then fit a linear regression to test whether the number of permitted events predicts the number of SNAP recipients. The model was statistically significant, F(1, 723) = 45.34, p < .001, but explained only about 6% of the variance in SNAP recipients (R² = .059). The number of events was a significant negative predictor, b = −21.30, SE = 3.16, t(723) = −6.73, p < .001: each additional permitted event was associated with roughly 21 fewer monthly SNAP recipients.

(Note: Each row of merged is one district in one month, whereas n_events is a single total per district, so the same event count repeats across a district's months. Joint-interest areas do not appear in BoroReport and therefore never enter the model. In addition, lm() dropped the 986 rows with a missing event count, that is, rows for districts with no permitted events in the sample.)
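
As a quick consistency check on the reported statistics, a regression with a single predictor satisfies F = t² and F = R² / (1 − R²) × residual df. The sketch below plugs in the rounded values from the model summary; small discrepancies are expected from rounding.

```r
# Values as reported in the model summary (rounded)
r2       <- 0.059
df_resid <- 723
t_val    <- -6.733

# F implied by R-squared, for one predictor
f_from_r2 <- (r2 / (1 - r2)) * df_resid

# F implied by the slope's t statistic
f_from_t <- t_val^2

c(from_r2 = f_from_r2, from_t = f_from_t)  # both should be close to 45.34
```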

model1 <- lm(bc_snap_recipients ~ n_events, data = merged)
summary(model1)
#> 
#> Call:
#> lm(formula = bc_snap_recipients ~ n_events, data = merged)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -28081 -12534  -2646   8391  44572 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30043.800    715.843  41.970  < 2e-16 ***
#> n_events      -21.303      3.164  -6.733 3.38e-11 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 16210 on 723 degrees of freedom
#>   (986 observations deleted due to missingness)
#> Multiple R-squared:  0.059,  Adjusted R-squared:  0.0577 
#> F-statistic: 45.34 on 1 and 723 DF,  p-value: 3.385e-11

ggplot(
  merged,
  aes(
    x = n_events,
    y = bc_snap_recipients
  )
) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Permitted Events and SNAP Recipients by Community District",
    x = "Permitted Events",
    y = "SNAP Recipients"
  ) +
  theme_minimal()
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 986 rows containing non-finite outside the scale
#> range (`stat_smooth()`).
#> Warning: Removed 986 rows containing missing values or values
#> outside the scale range (`geom_point()`).
Figure 7.3: Relationship between permitted events and SNAP recipients. Each point represents one community district in one month, and the line shows the fitted linear association between event counts and SNAP recipients.
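
Because n_events does not vary within a district, the monthly rows are repeated observations of the same districts rather than independent data points. One hedged alternative is to collapse the data to one row per district before fitting. The sketch below illustrates the idea on simulated data: the district IDs and event counts are borrowed from Table 7.2, but the SNAP values are generated at random and the names `toy`, `district_level`, `fit_rows`, and `fit_districts` are hypothetical. It is not a re-analysis of the real data.

```r
library(dplyr)

set.seed(1)

# Simulated panel: 5 districts x 12 months, with a constant event count per district
toy <- data.frame(
  cd_id    = rep(c(401, 402, 405, 407, 408), each = 12),
  n_events = rep(c(48, 140, 628, 151, 747), each = 12)
)
toy$bc_snap_recipients <- 30000 - 20 * toy$n_events + rnorm(nrow(toy), sd = 2000)

# Collapse to one row per district
district_level <- toy %>%
  group_by(cd_id) %>%
  summarise(
    n_events  = first(n_events),
    mean_snap = mean(bc_snap_recipients),
    .groups   = "drop"
  )

fit_rows      <- lm(bc_snap_recipients ~ n_events, data = toy)
fit_districts <- lm(mean_snap ~ n_events, data = district_level)

# Residual degrees of freedom: 58 for the row-level fit, 3 for the district-level fit
c(rows = df.residual(fit_rows), districts = df.residual(fit_districts))
```

In this balanced toy example the slope is the same in both fits; what changes is the number of independent observations behind it, and with it the standard error. If the real panel is balanced, the 725 retained rows would be consistent with 25 districts observed over 29 months, although the source does not state this.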

7.8 Conclusion

Although the regression is statistically significant, the analysis has several limitations. As mentioned in the introduction, the number of SNAP recipients is an imperfect measure of economic well-being (not to mention holistic well-being). Likewise, permitted events are an imperfect indicator of social gatherings in an area. At a more granular level, the counts are not normalized by district population, and major hubs of social activity, such as parks, are excluded from the regression.

However, these limitations point to ways in which data collection could be improved. Below, I outline several possibilities:

Better Dependent Variables

To meaningfully assess quality of life in NYC, future datasets should include more varied indicators of well-being and capture outcomes across the income distribution. Ideally, validated population-level measures of well-being and social connectedness would be available for use as dependent variables. In addition, economic proxies for well-being (such as median income) should be collected. Diverse datasets of this sort would provide a more complete picture of the psychological and economic well-being of NYC residents.

More Information about Social Gatherings

Currently, NYC Open Data has information about permitted events. Yet, there are countless other social gatherings that could be quantified as well. These include volunteer opportunities, Meetup groups, Eventbrite activities, Reddit meetups, and more. While an exhaustive catalog of social gatherings is not feasible, expanded coverage of accessible, low-barrier events would strengthen any analyses of social life in the city. It would also allow analysts to subdivide events in meaningful ways.

Geographic Information

Community districts provide a useful organizational unit, but many NYC datasets do not record them. More detailed neighborhood-level data on events might also reveal areas with a shortage (or surplus) of social activity, and identifying such areas could support more strategic intervention. Finally, the prominence of parks in the event data suggests that social life is often organized around specific hubs. Data on how far residents are willing to travel, and on how accessible these hubs are in practice, would be valuable as well.

Concrete Suggestions

There is no “control New York City.” As such, causality cannot be established through the analyses I describe. Nevertheless, if evidence were to suggest that certain types of social activities were associated with positive psychological outcomes, it would then be possible to recommend concrete actions to citizens who wished to improve civic and social life in New York. In this way, improved data infrastructure could help foster a stronger sense of civic autonomy among New Yorkers – as well as a happier, healthier New York City.
