Hadoop is an open-source platform for distributed processing of large datasets across clusters of servers. It supports data-intensive distributed applications operating on data at up to exabyte scale, with a high degree of fault tolerance. Internet companies such as Facebook, Twitter, eBay, LinkedIn, and Yahoo! have adopted Hadoop and contributed to its ongoing improvement. Hadoop is steadily maturing, and many enterprises are finding a place for it in their data architecture plans. This course will bring you up to speed on the Hadoop platform and its ecosystem.
The first half of the course provides an overview of two processing frameworks, MapReduce 2 (MR2) and Spark, and of the Hadoop Distributed File System (HDFS). You will learn how to write MapReduce code and how to optimize data processing applications.
The second half of the course covers Hadoop's ecosystem, including the Hive data warehouse and query service, important libraries for SQL processing and machine learning, distributed scalable NoSQL databases, and the distributed coordination service ZooKeeper. Together, these topics form a comprehensive introduction to the Hadoop framework for distributed data processing applications.
Upon completion of this course, you will possess a strong understanding of the Hadoop platform and be able to develop distributed data processing applications using the MapReduce framework, Hive, and Spark ML. You will also become familiar with additional components of the Hadoop ecosystem. The course consists of interactive lectures, hands-on lab exercises, and programming assignments.
- Overview of Hadoop
- Understanding Hadoop Distributed File System (HDFS)
- How the MapReduce framework works
- Hadoop I/O
- Developing MapReduce programs and algorithms
- Introduction to HBase
- Introduction to Spark SQL queries and Spark libraries for machine learning
- Introduction to Hive and developing Hive queries
- Introduction to data pipelines and ZooKeeper
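To give a taste of the MapReduce topics above, here is a minimal pure-Python sketch of the classic word-count pattern: a map phase emitting key/value pairs, a shuffle grouping values by key, and a reduce phase aggregating each group. This is illustrative only; in the course itself you will use Hadoop's actual MapReduce and Spark APIs rather than this toy simulation.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle_phase(pairs))
print(counts["the"])  # "the" appears in both input lines
```

In a real Hadoop job, the mapper and reducer run in parallel across the cluster and the framework handles the shuffle, sorting, and fault tolerance automatically.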
Note(s): Cloudera Hadoop is used in this course. A discount is available to students who wish to take the Cloudera Certified Big Data Professional (CCP) exam after completing the course.
Students are required to bring laptops with a minimum of 8 GB of memory.
Skills Needed: Basic SQL skills are required, as is the ability to write simple programs in any modern programming language. An understanding of databases and of parallel or distributed computing is recommended.