Stack Overflow Big Data Processing

Project Description

  • Spun up an Elastic MapReduce (EMR) cluster running Spark and created a Spark application written in Python.
  • Created an S3 bucket and uploaded the "survey_results_public.csv" file so EMR could access it for data processing.
  • Issued Linux commands (Amazon Linux 2) on the EMR cluster's master node by connecting to the underlying Elastic Compute Cloud (EC2) instance over a Secure Shell (SSH) connection.

Overview

  • Created an EMR cluster in the cluster launch mode; an initial S3 bucket was created automatically to store logs (an equivalent AWS CLI launch is sketched after this list).
    • Software Configuration
      • emr-5.36.0
      • Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
    • Hardware Configuration
      • m5.xlarge
      • Number of instances: 3
    • Security and access
      • EC2 key pair (used Amazon EC2 to create an RSA key pair)
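
The cluster above was created through the console. A minimal AWS CLI sketch of an equivalent launch, assuming hypothetical cluster, key-pair, and log-bucket names (not the actual values used):

  # Hypothetical names; substitute your own key pair and log bucket
  aws emr create-cluster \
    --name "StackOverflowBigData" \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Zeppelin \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles \
    --log-uri s3://my-emr-logs/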
  • Set up a new S3 bucket and uploaded the file, "survey_results_public.csv", so EMR could access it for data processing.
  • Added a folder named "data-source" within the same S3 bucket to hold the CSV file (CLI equivalents are sketched below).
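
A minimal AWS CLI sketch of the same bucket setup, assuming the bucket name used in "main.py":

  # Create the bucket and upload the survey data into the data-source folder
  aws s3 mb s3://stackoverflow-123456
  aws s3 cp survey_results_public.csv s3://stackoverflow-123456/data-source/survey_results_public.csv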
  • Created a Spark application in Python, "main.py", that runs the Spark job to filter the data and write the results back to S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'


def main():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()

    # Read the survey CSV from S3, using the first row as column headers
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())

    # Keep only US respondents who work more than 45 hours a week
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())

    # Write the filtered rows back to S3 as Parquet, replacing any previous output
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)


if __name__ == '__main__':
    main()
  • Opened port 22 in the master node's security group, connected to the EMR cluster's master node over SSH using its public IP address, and ran the spark-submit command to run "main.py" (a command sketch follows).
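
A sketch of the SSH and job-submission commands, assuming a hypothetical key-pair file and master-node address (EMR's default SSH user on Amazon Linux 2 is "hadoop"):

  # Connect to the master node (placeholder key file and host)
  ssh -i ~/my-key-pair.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

  # On the master node: fetch the application from S3 and submit it to Spark
  aws s3 cp s3://stackoverflow-123456/main.py .
  spark-submit main.py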
  • After the job ran, a new folder called "data-output" containing the Parquet files appeared in the same S3 bucket.
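
The output can be confirmed with a listing, assuming the same bucket name as in "main.py":

  # List the Parquet part files written by the job
  aws s3 ls s3://stackoverflow-123456/data-output/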

Language & Tools

  • Python
  • Spark (PySpark)
  • AWS (EMR, EC2, S3)
  • Bash (Amazon Linux 2)

View Source