Raspberry Pi 4 Hadoop/Spark Distributed Machine Learning Cluster

From Fall 2019 to Spring 2020, I designed and built a small-scale Apache Spark cluster running on Hadoop YARN, using Raspberry Pi 4 (4 GB RAM) single-board computers. The end goal was to use Spark's machine learning library, MLlib, to build models that classify astronomical objects into one of three groups: stars, galaxies, or quasars. The data came from the Sloan Digital Sky Survey, a long-term project to construct three-dimensional maps of the universe and measure the spectra of the astronomical objects in it. Working through the entire pipeline of designing a distributed machine learning cluster from scratch was a great learning experience. I cover the whole process, from getting the hardware and configuring passwordless SSH to obtaining the data and programming the final classifier with PySpark, in my dedicated GitHub walkthrough linked below.

Pi Cluster GitHub

Project Poster