The first half of the course includes an overview of two data processing frameworks, MapReduce 2 (MR2) and Spark, and of the Hadoop Distributed File System (HDFS). You will learn how to write MapReduce code and how to optimize data processing applications.
The second half of the course covers Hadoop's ecosystem, including the data warehouse and query service Hive, important libraries for SQL processing and machine learning, distributed scalable NoSQL databases, and the distributed coordination service ZooKeeper. Together, these topics form a comprehensive introduction to the Hadoop framework for distributed data processing applications.
Upon completion of this course, you will possess a strong understanding of the Hadoop platform and be able to develop distributed data processing applications using the MapReduce framework, Hive, and SparkML. You will also become familiar with additional components of the Hadoop ecosystem. The course consists of interactive lectures, hands-on lab exercises, and programming assignments.
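To give a flavor of the MapReduce programming model covered in the first half, here is a minimal pure-Python sketch of the classic word-count job. The map, shuffle, and reduce phases mirror what the Hadoop framework performs across a cluster; the function names here are illustrative and are not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

In a real Hadoop job, only the map and reduce logic is written by the developer; the framework handles the shuffle, data placement on HDFS, and parallel execution.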
- Overview of Hadoop
- Understanding Hadoop Distributed File System (HDFS)
- How MapReduce framework works
- Hadoop IO
- Developing MapReduce programs and algorithms
- Introduction to HBase
- Introduction to SparkSQL queries and Spark libraries for Machine Learning
- Introduction to Hive and developing Hive queries
- Introduction to Data Pipelines and ZooKeeper
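As a preview of the Hive material, the query below sketches the kind of SQL aggregation a Hive lesson typically covers. Because Hive itself requires a cluster, this sketch runs equivalent SQL against an in-memory SQLite database; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for a Hive table on HDFS; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, duration_sec INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "/home", 12), ("u1", "/docs", 30), ("u2", "/home", 5), ("u2", "/docs", 45)],
)

# A HiveQL-style aggregation: total time spent per page, busiest pages first.
rows = conn.execute(
    """
    SELECT page, SUM(duration_sec) AS total_sec
    FROM page_views
    GROUP BY page
    ORDER BY total_sec DESC
    """
).fetchall()
```

In Hive, a query like this is compiled into MapReduce (or Spark) jobs that run over data stored in HDFS rather than executing locally.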
Students are required to bring laptops with a minimum of 8 GB of memory.
Skills Needed: Basic SQL skills are required, as is the ability to write simple programs in any modern programming language. Some familiarity with databases and with parallel or distributed computing is recommended.