PERFORMANCE PREDICTION OF SPARK APPLICATIONS USING MACHINE LEARNING TECHNIQUES

2021 COE Engineering Design Project (OD01)


Faculty Lab Coordinator

Olivia Das

Topic Category

Software Systems

Preamble

Data-intensive applications are proliferating rapidly in several critical sectors; public health, the stock market, transportation, and social media are just a few examples. These applications are complex and are often executed on software frameworks installed in clouds. One such framework is Apache Spark, which can be installed on a cloud such as Google Cloud Platform (GCP). Applications that run on Apache Spark are commonly called Spark applications. Predicting the execution time of a Spark application is important because it helps users estimate how the execution time will be affected by the software resources allocated to run the application, before actually acquiring those resources. The question is: can we develop a model, based on machine learning (ML) techniques, that predicts the execution time of a Spark application on GCP?

Objective

This project will involve running multiple Spark benchmarks on GCP, gathering data, developing multiple machine learning (ML) models based on the gathered data, using each model to predict the execution time of every benchmark, and finally comparing the outputs of the models.
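The workflow described in the objective (gather data, fit several models, predict, compare) might be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic rather than measured on GCP, the two features (executor count and input size) and the formula that generates the execution times are made up, and the choice of linear regression and a random forest is arbitrary. The project itself does not prescribe any of these.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for measured benchmark runs (NOT real GCP data).
rng = np.random.default_rng(0)
n_runs = 200
executors = rng.integers(1, 9, size=n_runs)      # hypothetical feature: executor count (1..8)
input_gb = rng.uniform(1.0, 50.0, size=n_runs)   # hypothetical feature: input size in GB
# Made-up relationship plus noise, used only so the models have something to learn.
exec_time_s = 20.0 + 30.0 * input_gb / executors + rng.normal(0.0, 5.0, size=n_runs)

X = np.column_stack([executors, input_gb])
y = exec_time_s

# Hold out a quarter of the runs to evaluate predictions on unseen configurations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two candidate models; any other scikit-learn regressor could be added here.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its mean absolute error on the held-out runs.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = mean_absolute_error(y_test, predictions)

for name, mae in results.items():
    print(f"{name}: mean absolute error = {mae:.1f} s")
```

With real measurements, the synthetic arrays would be replaced by the recorded run configurations and execution times, and the comparison could use whichever error metric suits the report.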

Partial Specifications

• The evaluation should use Google Cloud Platform (GCP).
• The evaluation should use a machine learning (ML) library, for example, Scikit-learn.
• A background in Python programming is necessary.

Suggested Approach

• Study research papers on how machine learning is used to estimate the performance of Spark applications.
• Get familiar with GCP.
• Develop machine learning (ML) models to predict the execution times of Spark Applications.
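Before any model can be developed, each benchmark run has to be recorded in a form a model can learn from. A minimal sketch of such a record-keeping step is shown below. Everything in it is an assumption made for illustration: `time_run` and `placeholder_job` are hypothetical helpers, the column names are invented, and the placeholder simply sleeps instead of launching a Spark benchmark on GCP.

```python
import csv
import io
import time


def time_run(job, **config):
    """Run job(**config) once and return the config plus the elapsed seconds."""
    start = time.perf_counter()
    job(**config)
    elapsed = time.perf_counter() - start
    return {**config, "exec_time_s": elapsed}


def placeholder_job(executors, input_gb):
    # Hypothetical stand-in: a real version would launch a Spark benchmark
    # with this configuration and block until it finishes.
    time.sleep(0.01)


# One row per (configuration, measured time) pair.
rows = [
    time_run(placeholder_job, executors=e, input_gb=g)
    for e in (2, 4)
    for g in (1, 10)
]

# Written to an in-memory buffer here; swap in
# open("runs.csv", "w", newline="") to write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["executors", "input_gb", "exec_time_s"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping one row per run, with the configuration as feature columns and the measured time as the target column, lets the resulting table be loaded directly as training data.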

Group Responsibilities

The group members should accomplish the items listed in the “Suggested Approach” section above.

Student A Responsibilities

• Learn tools needed for development of machine learning (ML) models.
• Learn to run Spark applications on GCP (using actual software resources).
• Develop machine learning (ML) models to predict the execution times of the Spark applications.
• Develop the report.

Student B Responsibilities

• Learn tools needed for development of machine learning (ML) models.
• Learn to run Spark applications on GCP (using actual software resources).
• Develop machine learning (ML) models to predict the execution times of the Spark applications.
• Develop the report.

Student C Responsibilities

• Learn tools needed for development of machine learning (ML) models.
• Learn to run Spark applications on GCP (using actual software resources).
• Develop machine learning (ML) models to predict the execution times of the Spark applications.
• Develop the report.

Student D Responsibilities

• Learn tools needed for development of machine learning (ML) models.
• Learn to run Spark applications on GCP (using actual software resources).
• Develop machine learning (ML) models to predict the execution times of the Spark applications.
• Develop the report.

Course Co-requisites

To ALL EDP Students

Due to the COVID-19 pandemic, in the event that the University is not open for in-class/in-lab activities during the Winter term, your EDP topic specifications, requirements, implementations, and assessment methods will be adjusted by your FLCs at their discretion.

 


OD01: PERFORMANCE PREDICTION OF SPARK APPLICATIONS USING MACHINE LEARNING TECHNIQUES | Olivia Das | Sunday September 5th 2021 at 03:19 PM