Course Description
This course explains where large data sets come from and how they are stored and managed. It also examines data sizes, accessibility approaches, and how data are transformed and used for AI consumption. You will examine the challenges and considerations when choosing data for training sets.
By the end of the course, you will understand the types of data used in bioinformatics; how the data are collected, stored, managed, and searched; and how the data are transformed for further processing and analysis. You will also develop skills in aggregating and normalizing data for use in machine learning and/or AI training sets.
Topics
- Pipeline Design
- Workflow management systems and workflow analysis with open-source tools
- Documentation skills / proof of concept with foresight
- Using SQL for bioinformatics data
- Data lakes (e.g., Databricks, Redshift, and/or Snowflake)
- Large data sets
- Databases: how to store and move data, and how to decide which AI models to use
Flexible: Attend in person or via Zoom at scheduled times.
This class meets simultaneously in a classroom and remotely via Zoom. Students are expected to attend and participate in the course, either in person or remotely, on the days and at the times specified on the course schedule. Students attending remotely are strongly encouraged to keep their cameras on to get the most out of the remote learning experience. Students attending in person are expected to bring a laptop to each class meeting.
You will be granted access in Canvas to your course site and course materials approximately 24 hours prior to the published start date of the course.
