Course Description
Formerly "Big Data, Introduction."
In the era of big data and compute-intensive analytics, the ability to write high-performance Python code is essential. This course is designed for learners with basic Python knowledge who want to handle large volumes of data efficiently and optimize their workflows. We will explore how to make Python performant, moving beyond basic pandas use, by introducing tools, techniques, and tradeoffs for improving execution speed, memory use, and scalability.
You will learn strategies such as vectorization, avoiding unnecessary loops, leveraging data structures like NumPy arrays, and using multithreading/multiprocessing. We will also explore distributed computing with PySpark and Dask, and introduce Polars as a cutting-edge alternative to pandas. These skills will be placed in the broader context of big data frameworks and architectures, including Apache Spark, Apache Kafka, and modern NoSQL databases like MongoDB and Cassandra. GPU optimization techniques will also be discussed at an introductory level.
The final project will integrate these concepts into the design of a high-performance data processing pipeline, giving you hands-on experience with tools and methods to analyze large datasets efficiently.
Topics
- Introduction to performance optimization in Python for data analytics
- Tradeoffs in compute time, memory, latency, and scalability
- Vectorization and avoiding inefficient loops
- Working with NumPy arrays and alternative data structures
- Multithreading and multiprocessing in Python
- Distributed computing with PySpark and Dask
- Introduction to Polars for high-speed data processing
- Apache Kafka for real-time data streams
- NoSQL databases: MongoDB and Cassandra
- GPU acceleration for Python workloads
- Designing a high-performance data pipeline
Prerequisites / Skills Needed
- Basic Python programming knowledge and familiarity with Python data analysis libraries such as pandas, or completion of a course such as DBDA.X420 - "Python for Data Analysis."
Additional Information
AI* - This course focuses on leveraging AI and AI-powered tools to enhance learning, write code in Python, and design big data pipelines.
This course applies to these programs: