ORIE 4741: Learning with Big Messy Data

Logistics

COVID-19: In Fall 2020, this course will be available online. Students will be able to complete the course exclusively online, in any time zone, for full credit.

Schedule:

  • Class on Tuesday and Thursday 9:55 – 11:10am on Zoom

  • Discussion sections are optional. You can enroll in one for the 4-credit option, or omit it for 3 credits.

  • All course events (lectures, sections, and office hours) are listed on the course calendar with their times

All lectures and discussion sections, and most office hours, will be remote and fully online. Students are strongly encouraged to attend synchronously (at the scheduled times), but the course will be accessible to students in any time zone: all course components (lectures and sections) will be recorded, and office hours will be scheduled so that students anywhere on earth can attend at least one office hour synchronously (i.e., not in the middle of the night). To receive full participation credit for the lectures, students attending lectures asynchronously must respond to questions on each lecture before the following lecture; see the participation tab for details.

Discussion forum: Campuswire. Get access by enrolling or completing the course survey.

Enrollment: We expect that everyone who wants to take the class will be able to enroll. Note that you can enroll without the discussion (for 3 credits), and that you can enroll in a discussion section even if you plan to attend it asynchronously.

Overview

Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable robust methods for learning from big messy data. We will cover techniques for learning with data that is messy  —  consisting of measurements that are continuous, discrete, boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, with missing entries and with outliers  —  and that is big  —  which means we can only use algorithms whose complexity scales linearly in the size of the data. We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course will culminate in a final project in which students extract useful information from a big messy data set.

Prerequisites: Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, Matlab, Julia, R), and basic complexity and O(n) notation. More formally, we strongly recommend

  • Linear Algebra (MATH 2940 or equivalent). Important topics: inner products, matrix multiplication, singular value decomposition.

  • Probability (ENGRD 2700 or equivalent). Important topics: random sampling, maximum likelihood estimation.

  • Programming (ENGRD/CS 2110 or equivalent). Important topics: basic comfort in a scripting language, iteration, functions.

Students familiar with these prerequisites in past years found that homework assignments took about ten hours a week. Students without these prerequisites found that homework assignments took up to thirty hours a week as they caught up on background knowledge. Hence, if you are tempted to ignore these prerequisites, please make sure you can afford to spend thirty hours a week on this class.

Announcements

  • We've hit some enrollment bugs, but the class is not yet full. Rest assured that if you want to take the course, and you persist with the assignments, you will be able to enroll eventually.

  • Sign up for the class by completing the course survey, regardless of whether you've registered. Signing up will give you access to our Campuswire site for course Q&A.

  • Want to get a head start on learning with big messy data? You might try learning Julia, reviewing linear algebra, or reading the book Learning from Data. See the about page for more ideas.

  • Course materials on this website reflect the Fall 2019 course. Lectures and topics may change slightly in Fall 2020 to reflect student interest and accommodate students attending exclusively online.