Hadoop: Distributed Processing of Big Data

Hadoop is an open-source platform for distributed processing of large amounts of data across clusters of servers. It supports data-intensive distributed applications that process up to exabytes of data with a high degree of fault tolerance. Internet companies such as Facebook, Twitter, eBay, LinkedIn, and Yahoo! have adopted Hadoop and contributed to its ongoing improvement. Hadoop is steadily maturing, and many enterprises are finding a place for it in their data architecture plans. This course will bring you up to speed on the Hadoop platform and its ecosystem.

The first half of the course gives an overview of two distributed processing frameworks, MapReduce 2 (MR2) and Spark, and of the Hadoop Distributed File System (HDFS). You will learn how to write MapReduce code and how to optimize data processing applications.

The second half of the course covers Hadoop's ecosystem, including the data warehouse and query service Hive, important libraries for SQL processing and machine learning, distributed scalable NoSQL databases, and the distributed coordination service ZooKeeper. Taken together, the two halves present a comprehensive introduction to the Hadoop framework for distributed data processing applications.

Upon completion of this course, you will have a strong understanding of the Hadoop platform and be able to develop distributed data processing applications using the MapReduce framework, Hive, and Spark ML. You will also become familiar with additional components of the Hadoop ecosystem. The course consists of interactive lectures, hands-on lab exercises, and programming assignments. Experience with Java programming is required for this course.

Topics Include:

  • Overview of Hadoop

  • Understanding Hadoop Distributed File System (HDFS)

  • How the MapReduce framework works

  • Hadoop I/O

  • Developing MapReduce programs and algorithms

  • Introduction to HBase

  • Introduction to Spark SQL queries and Spark libraries for machine learning

  • Introduction to Hive and developing Hive queries

  • Introduction to data pipelines and ZooKeeper

Note(s): Cloudera Hadoop is used in this course. A discount is available to students who want to take the Cloudera Certified Big Data Professional (CCP) exam after completing the course.

Skills Needed: Java programming experience is required for this course. Assignments must be written in Java. An understanding of databases, SQL, and parallel or distributed computing is recommended.


Offering code    Offering title
CMPR.X413        Java Programming, Comprehensive

Sections:

Section        Start Date   Time       Location      Cost   Instructor Name
DBDA.X407.(3)  4/7/2018     09:00 AM   SANTA CLARA   960    Marilson Campos