Olivia Das
Software Systems
Data-intensive applications are proliferating rapidly in several critical sectors, including public health, the stock market, transportation, and social media. These applications are complex and are often executed on software frameworks installed in clouds. One such framework is Apache Spark, which can be installed on a cloud such as Google Cloud Platform (GCP). Applications that run on Apache Spark are commonly called Spark applications. Predicting the execution time of a Spark application is important because it helps application users estimate how the execution time will be affected by the software resources allocated to run the application, before actually acquiring those resources. The question is: can we develop a model, based on machine learning (ML) techniques, that predicts the execution time of a Spark application on GCP?
This project will involve running multiple Spark benchmarks on GCP, gathering data, developing multiple machine learning (ML) models based on the gathered data, using each model to predict the execution time of every benchmark, and finally comparing the outputs of the models.
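As a rough illustration of the data-gathering step, the sketch below times a job and records the configuration alongside the measured time in CSV form. Everything in it is an assumption made for illustration: the `RunRecord` fields, the helper names, and the `placeholder_job` function, which only stands in for a real Spark benchmark submitted on GCP. The features actually worth recording would be decided during the project.

```python
import csv
import io
import time
from dataclasses import dataclass, asdict, fields


@dataclass
class RunRecord:
    """One benchmark run: the configuration used and the measured time.

    All field names are hypothetical; they are not prescribed by the project.
    """
    benchmark: str
    num_executors: int
    executor_cores: int
    executor_memory_gb: int
    input_size_gb: float
    execution_time_s: float


def timed(run, *args):
    """Call run(*args) and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    run(*args)
    return time.perf_counter() - start


def write_records(records, out):
    """Write run records as CSV rows to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(RunRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def placeholder_job(n):
    """Stand-in for a Spark benchmark; it only burns CPU on the local machine."""
    return sum(i * i for i in range(n))


# Collect one record per (hypothetical) executor count.
records = []
for executors in (2, 4):
    elapsed = timed(placeholder_job, 100_000 // executors)
    records.append(RunRecord("placeholder", executors, 2, 4, 1.0, elapsed))

# Serialize to an in-memory buffer; a real run would write to a file instead.
buffer = io.StringIO()
write_records(records, buffer)
csv_text = buffer.getvalue()
```

Keeping one row per run, with the resource settings and the measured time side by side, makes the gathered data directly usable as a training table for the ML models.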
• The evaluation should use Google Cloud Platform (GCP).
• The evaluation should use a machine learning (ML) library, for example, Scikit-learn.
• A background in Python programming is necessary.
• Study research papers on how machine learning is used to estimate the performance of Spark applications.
• Get familiar with GCP.
• Develop machine learning (ML) models to predict the execution times of Spark applications.
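The model-development step could be prototyped along the following lines with Scikit-learn. This is a minimal sketch under stated assumptions, not a prescribed method: the data is synthetic and generated from a made-up formula, the three features (executor count, cores per executor, input size) are hypothetical, and linear regression and a random forest are chosen only as two example model families to compare.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for measured benchmark runs (NOT real measurements).
n_runs = 200
executors = rng.integers(1, 9, size=n_runs)      # hypothetical feature
cores = rng.integers(1, 5, size=n_runs)          # hypothetical feature
input_gb = rng.uniform(1.0, 50.0, size=n_runs)   # hypothetical feature
X = np.column_stack([executors, cores, input_gb])

# Made-up relation: fixed overhead plus work divided across cores, plus noise.
y = 20.0 + 30.0 * input_gb / (executors * cores) + rng.normal(0.0, 1.0, n_runs)

# Hold out a quarter of the runs to evaluate predictions on unseen configurations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its mean absolute percentage error on the held-out runs.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_absolute_percentage_error(y_test, model.predict(X_test))

# Print the models from lowest to highest error.
for name, err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAPE = {err:.3f}")
```

With real measurements, `X` and `y` would come from the gathered benchmark data, and the same loop could be extended to any other regressor that follows the Scikit-learn `fit`/`predict` interface.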
The group members should accomplish the items listed in the “Suggested Approach” section above.
• Learn tools needed for development of machine learning (ML) models.
• Learn to run Spark applications on GCP (using actual software resources).
• Develop machine learning (ML) models to predict the execution times of the Spark applications.
• Develop the report.
OD01: PERFORMANCE PREDICTION OF SPARK APPLICATIONS USING MACHINE LEARNING TECHNIQUES | Olivia Das | Sunday September 5th 2021 at 03:19 PM