A methodology for exploring food insecurity in NYC, using Apache Spark and Google Cloud Dataproc to parse ~10 GB (9.2 million rows) of data in under 2 minutes.
Overview
The goal of these notebooks is to study food insecurity by comparing the listed prices of food items across neighborhoods in NYC. Our hypothesis is that people living in areas with higher food insecurity pay more for the same items than people in more food-secure areas. To limit the scope, we only assess food products from KeyFoods supermarkets, one of the top 4 Supermarket Leaders in Metro New York (according to the Food Trade News 2021 report). This is Task 1.
Additionally, we will determine the distance people traveled to grocery stores, by census block group (CBG), using SafeGraph data. In particular, for each CBG we would like to know the average distance its residents traveled to the listed grocery stores in March 2019, October 2019, March 2020, and October 2020. (We select March and October to avoid the summer and holiday periods, which carry more noise from tourists and festive shopping.) Coordinates will be projected to the NAD83 / New York Long Island State Plane coordinate system (EPSG:2263) before distances are computed, to increase the accuracy of the calculations. This is Task 2.
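As a rough illustration of the projection step, here is a minimal sketch using pyproj. It assumes the inputs are WGS84 longitude/latitude pairs (EPSG:4326), which is an assumption about the data and not something stated above. The function name `distance_miles` and the sample coordinates are hypothetical; the notebooks may structure this differently.

```python
from math import hypot

from pyproj import Transformer

# EPSG:2263 coordinates are expressed in feet, so divide by 5280 to get miles.
FEET_PER_MILE = 5280.0

# always_xy=True fixes the argument order to (longitude, latitude).
# Assumption: source coordinates are WGS84 (EPSG:4326).
to_state_plane = Transformer.from_crs("EPSG:4326", "EPSG:2263", always_xy=True)


def distance_miles(lon1, lat1, lon2, lat2):
    """Straight-line distance in miles between two points, measured in the EPSG:2263 plane."""
    x1, y1 = to_state_plane.transform(lon1, lat1)
    x2, y2 = to_state_plane.transform(lon2, lat2)
    return hypot(x2 - x1, y2 - y1) / FEET_PER_MILE


# Hypothetical example: two points in Midtown Manhattan.
d = distance_miles(-73.9855, 40.7580, -73.9857, 40.7484)
print(f"{d:.2f} miles")
```

Building the `Transformer` once and reusing it matters at scale, since constructing it for every row would dominate the runtime.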
Technical Details:
Tech Stack: Google Cloud Dataproc & Storage, Apache Spark (PySpark), Python pyproj
- For each task, a Jupyter Notebook is used to develop the initial logic. The logic uses either PySpark Resilient Distributed Datasets (RDDs) or PySpark DataFrames, so that it can scale out to large datasets.
- Small files are loaded from a local directory; large files are loaded from Google Cloud Platform Buckets.
- The initial logic developed in the notebook is then converted into a Python script that can be run on Google Cloud Dataproc Clusters.
- The outputs are written to Buckets for review.
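The workflow above can be sketched as a small PySpark script. This is only an illustrative skeleton, not the project's actual code: the input layout (`cbg,distance_miles` per line), the helper names, and the app name are all hypothetical. It shows the RDD pattern of carrying a (sum, count) pair per key so that the per-CBG average can be computed with `reduceByKey`.

```python
import sys


def parse_line(line):
    """Turn 'cbg,distance_miles' into (cbg, (distance, 1)). Hypothetical input layout."""
    cbg, distance = line.split(",")
    return cbg, (float(distance), 1)


def merge(a, b):
    """Combine two (sum, count) pairs. Associative, so it is safe to use with reduceByKey."""
    return a[0] + b[0], a[1] + b[1]


def mean(pair):
    """Convert a (sum, count) pair into an average."""
    total, count = pair
    return total / count


def main(input_path, output_path):
    # Imported here so the helpers above can be exercised without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cbg-average-distance").getOrCreate()
    (
        spark.sparkContext.textFile(input_path)  # a local path or a bucket path
        .map(parse_line)
        .reduceByKey(merge)
        .mapValues(mean)
        .map(lambda kv: f"{kv[0]},{kv[1]:.2f}")
        .saveAsTextFile(output_path)
    )
    spark.stop()


# Only run the job when both paths are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

A script of this shape could be submitted with `gcloud dataproc jobs submit pyspark`, passing the input and output bucket paths as the two arguments after `--`; the cluster, region, and bucket names depend on your own setup.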
Results
Leveraging Google Cloud Platform and Apache Spark, parsing through ~10 GB (9.2 million rows) of data and computing the results for the objectives took just 1 minute and 55 seconds. Please review the results for each task within their respective notebooks below.
Food Insecurity Analysis Task 1: Python Notebook
Food Insecurity Analysis Task 2: Python Notebook
Want to connect?
Connect with me through LinkedIn, or reach out to me via email or phone number.