Hands-On Data Engineering | DBDA.X424
Formerly: Data Engineering with Hadoop
Big Data platforms are distributed systems that can process large amounts of data across clusters of servers. They are used across industries, from internet startups to established enterprises. In this comprehensive course, you will get up to speed on the use of current Big Data platforms and gain insights into cloud-based Big Data architectures. We will cover Hadoop, Spark, Kafka, and SQL-based Big Data platforms such as Hive.
The first half of the course surveys the MapReduce, Spark, Kafka, and Hive frameworks, along with relevant aspects of Python programming. You will learn how to write MapReduce and Spark jobs, optimize data processing applications, and become familiar with SQL-based tools for Big Data. We use Hive to build ETL jobs. The course also covers the fundamentals of NoSQL databases such as HBase, as well as the Apache Kafka messaging platform.
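The structure of a MapReduce job can be illustrated without a cluster. Below is a minimal sketch in plain Python (illustrative only, not actual Hadoop code) of the three phases of the classic word count: map each input line to (word, 1) pairs, shuffle the pairs by key, then reduce each group to a total.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "data pipelines"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}
```

On a real cluster, the map and reduce phases run in parallel across many machines and the shuffle moves data over the network; the logic per phase, however, looks much like the above.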
The second half of the course covers stream processing and developing streaming applications with Apache Spark. You'll learn how to process large amounts of data using DataFrame, Apache Spark's structured data processing programming model, which provides simple, powerful APIs. In addition to batch and iterative data processing, Apache Spark supports stream processing, which enables companies to extract useful business insights in near real time.
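The micro-batch idea behind Spark's stream processing can be sketched in plain Python (this is a conceptual toy, not the Spark Structured Streaming API): each arriving batch of events updates a running aggregate, so results stay current as data streams in.

```python
from collections import Counter

def update_running_counts(state, batch):
    # Each micro-batch updates the running aggregate, much as a streaming
    # engine maintains state between processing triggers.
    state.update(event["word"] for event in batch)
    return state

# Two micro-batches arriving over time (hypothetical event stream).
batches = [
    [{"word": "spark"}, {"word": "kafka"}],
    [{"word": "spark"}, {"word": "hive"}],
]

running = Counter()
for batch in batches:
    running = update_running_counts(running, batch)
    print(dict(running))  # running totals after each batch
```

In Spark, the same pattern is expressed declaratively with DataFrame operations (e.g., a streaming `groupBy().count()`), and the engine handles state management, fault tolerance, and distribution for you.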
The course consists of interactive lectures, hands-on labs in class, and take-home practice exercises. Upon completion, you will have a strong understanding of the tools used to build Big Data applications with MapReduce, Spark, and Hive.
At the conclusion of the course, you should be able to:
- Describe the role Hadoop plays in the analysis of big data
- Discuss the inner workings of Hadoop's computing framework, including MapReduce processing and Hadoop's file system (HDFS)
- Develop programs/small applications in Spark and Hive
- Use Hive and NoSQL databases for data analysis
- Leverage the Hadoop ecosystem to become productive in analyzing data
Topics include:
- Big Data application architecture
- Understanding the Hadoop Distributed File System (HDFS)
- How the MapReduce framework works
- Introduction to HBase (Hadoop NoSQL database)
- Introduction to Apache Kafka
- Introduction to Spark and SparkSQL
- Developing Spark/SparkSQL and Hive applications
- Managing tables and query development in Hive
- Introduction to data pipelines
Basic SQL skills and the ability to write simple programs in a modern programming language, such as Python, are required. An understanding of databases and parallel or distributed computing is helpful.
This course uses AWS EMR and Databricks for Spark, Hive, and HDFS programming. Students are required to have accounts with both AWS and Databricks.