Created an S3 bucket and uploaded the "survey_results_public.csv" file so EMR can access it for data processing.
Connected over Secure Shell (SSH) from the local machine to the EMR cluster's master node, an Amazon Linux 2 Elastic Compute Cloud (EC2) instance, and issued Linux commands there.
Overview
Created an EMR cluster using the Cluster launch mode; an initial S3 bucket was created automatically to store the cluster logs. A roughly equivalent AWS CLI invocation is sketched after the configuration list below.
Software Configuration
emr-5.36.0
Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
Hardware Configuration
m5.xlarge
Number of instances: 3
Security and access
EC2 key pair (used Amazon EC2 to create an RSA key pair)
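For reference, a minimal AWS CLI sketch of the key pair and cluster creation under the configuration above; the key pair name and log bucket are assumed placeholders, not the actual values used:

# Create an RSA key pair for SSH access (key name is a placeholder)
aws ec2 create-key-pair --key-name emr-cluster-key --query 'KeyMaterial' --output text > emr-cluster-key.pem

# Launch the cluster with the software/hardware configuration listed above
aws emr create-cluster \
    --name 'StackoverflowCluster' \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Zeppelin \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=emr-cluster-key \
    --use-default-roles \
    --log-uri s3://aws-logs-placeholder/elasticmapreduce/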
Set up a new S3 bucket and uploaded the file, "survey_results_public.csv", so EMR can access it for data processing.
Added a new folder called "data-source" within the same S3 bucket to contain the CSV file.
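For reference, a minimal AWS CLI sketch of these two steps, assuming the bucket name stackoverflow-123456 used in main.py below and a local copy of the CSV file:

# Create the bucket and upload the CSV into the data-source folder
aws s3 mb s3://stackoverflow-123456
aws s3 cp survey_results_public.csv s3://stackoverflow-123456/data-source/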
Created a Spark application in Python, "main.py", for the Spark job that processes the data and stores the results.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'

def main():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()
    # Read the survey CSV from S3, treating the first row as the header
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())
    # Keep only US respondents who work more than 45 hours a week
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())
    # Write the filtered rows back to S3 as Parquet, replacing any previous output
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)

if __name__ == '__main__':
    main()
Opened port 22, connected to the EMR cluster's master node over SSH using its public IP address, and ran the spark-submit command to process the data with "main.py".
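For reference, a sketch of that connection and submission; the key file and master node address are placeholders, and copying main.py from S3 is an assumed transfer method (the write-up does not state how the script reached the cluster):

# From the local machine: restrict key permissions and SSH to the master node
chmod 400 emr-cluster-key.pem
ssh -i emr-cluster-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the master node: fetch the script (assumed location) and submit it to Spark
aws s3 cp s3://stackoverflow-123456/main.py .
spark-submit main.py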
After executing "main.py", a new folder called "data-output" containing the Parquet files was created in the same S3 bucket.
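The output can be verified from the local machine, for example:

aws s3 ls s3://stackoverflow-123456/data-output/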