Created an S3 bucket and uploaded the "survey_results_public.csv" file so EMR can access it for data processing.
Connected over Secure Shell (SSH) from the local machine to the EMR cluster's master node, an Amazon Linux 2 Elastic Compute Cloud (EC2) instance, and issued Linux commands there.
Overview
Created an EMR cluster using the Cluster launch mode; an initial S3 bucket was created automatically to store the cluster logs. A roughly equivalent AWS CLI invocation is sketched after the configuration list below.
Software Configuration
emr-5.36.0
Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
Hardware Configuration
m5.xlarge
Number of instances: 3
Security and access
EC2 key pair (used Amazon EC2 to create an RSA key pair)
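For reference, a minimal AWS CLI sketch of the key pair and cluster creation under the configuration above; the key pair name and log bucket are assumed placeholders, not the actual values used:

# Create an RSA key pair for SSH access (key name is a placeholder)
aws ec2 create-key-pair --key-name emr-cluster-key --query 'KeyMaterial' --output text > emr-cluster-key.pem

# Launch the cluster with the software/hardware configuration listed above
aws emr create-cluster \
    --name 'StackoverflowCluster' \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Zeppelin \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=emr-cluster-key \
    --use-default-roles \
    --log-uri s3://aws-logs-placeholder/elasticmapreduce/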
Set up a new S3 bucket and uploaded the file, "survey_results_public.csv", so EMR can access it for data processing.
Added a new folder called "data-source" within the same S3 bucket to contain the CSV file.
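For reference, a minimal AWS CLI sketch of these two steps, assuming the bucket name stackoverflow-123456 used in main.py below and a local copy of the CSV file:

# Create the bucket and upload the CSV into the data-source folder
aws s3 mb s3://stackoverflow-123456
aws s3 cp survey_results_public.csv s3://stackoverflow-123456/data-source/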
Created a Spark application in Python, "main.py", for the Spark job that processes the data and stores the results.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'

def main():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()
    # Read the survey CSV from S3, treating the first row as the header
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())
    # Keep only US respondents who work more than 45 hours a week
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())
    # Write the filtered rows back to S3 as Parquet, replacing any previous output
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)

if __name__ == '__main__':
    main()
Opened port 22, connected to the EMR cluster's master node over SSH using its public IP address, and ran the spark-submit command to process the data with "main.py".
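For reference, a sketch of that connection and submission; the key file and master node address are placeholders, and copying main.py from S3 is an assumed transfer method (the write-up does not state how the script reached the cluster):

# From the local machine: restrict key permissions and SSH to the master node
chmod 400 emr-cluster-key.pem
ssh -i emr-cluster-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the master node: fetch the script (assumed location) and submit it to Spark
aws s3 cp s3://stackoverflow-123456/main.py .
spark-submit main.py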
After executing "main.py", a new folder called "data-output" containing the Parquet files was created in the same S3 bucket.
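The output can be verified from the local machine, for example:

aws s3 ls s3://stackoverflow-123456/data-output/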