Apache Spark is one of the latest data processing engines, supporting batch, interactive, iterative and graph processing. Its combination of elegant application programming interfaces (APIs) and a fast, in-memory, general-purpose cluster computing system makes it an attractive option for companies with a variety of data processing needs. It complements Hadoop in big data analytics applications. Apache Spark is written in Scala, a functional programming language, but its APIs are available in several programming languages, including Scala, Java and Python. This course focuses on Spark’s Scala API only.

The course introduces Apache Spark, its architecture and its execution model. It includes a short introduction to the functional programming language Scala, covering basic syntax, case classes and the collection APIs. You’ll learn how to work with the Resilient Distributed Dataset (RDD), the core abstraction of Apache Spark’s programming model, through its APIs for data processing, and understand how to build Spark applications with Scala. In addition to batch and iterative data processing, Apache Spark also supports stream processing, which lets companies extract business insight in near real time. The second half of the course covers Spark’s stream processing capabilities and the development of streaming applications with Apache Spark.

By the end of the course, you’ll have a good foundation in the Scala language and a strong understanding of Apache Spark’s architecture, execution model and programming model. In addition, you’ll be able to manipulate RDDs through Apache Spark’s API and develop Apache Spark applications in Scala for batch, interactive and stream processing. Because this course offers only a short introduction to Scala, you should have prior object-oriented programming experience.

Topics include:

  • Big data processing ecosystem

  • Introduction to Apache Spark architecture and execution model

  • Introduction to the Scala programming language

  • Apache Spark programming model with RDD

  • Data processing with Apache Spark RDD Scala APIs

  • How to develop Apache Spark applications with Scala

  • Introduction to stream processing with Apache Spark

  • How to develop stream processing applications with Apache Spark

Skills Needed: Programming experience with Java is required. Knowledge of Hadoop is recommended.