ORIE 4741/5741: Learning with Big Messy Data

Logistics

COVID-19: In Fall 2021, this course will be available both in-person and online. Students will be able to complete the course exclusively online for full credit. If you are unvaccinated, feel sick, or have a known exposure, please attend class online.

Schedule:
Lectures and discussion sections will be held in-person and streamed online. Office hours will be mixed, some in-person and some online. Students are strongly encouraged to attend synchronously (at the scheduled times), but all course components (lectures and sections) will be recorded. To receive full participation credit for the lectures, students attending lectures asynchronously must respond to questions on each lecture before the following lecture; see the participation tab for details.

Discussion forum: Zulip.

Enrollment: We expect that everyone who wants to take the class will be able to enroll. Note that you can enroll without the discussion (for 3 credits), and that you can enroll in a discussion section even if you plan to attend it asynchronously. You're also welcome to attend the discussion even if you can't enroll in it.

4741 vs 5741: ORIE 4741 and 5741 are substantively similar. ORIE 5741 is designed for graduate students and has different project requirements, including a more business-oriented project and a final project presentation. The courses are graded separately.

Overview

Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable, robust methods for learning from big messy data.

We will cover techniques for learning with data that is messy (consisting of measurements that are continuous, discrete, boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, with missing entries and with outliers) and that is big (which means we can only use algorithms whose complexity scales linearly in the size of the data). We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course will culminate in a final project in which students extract useful information from a big messy data set.
Prerequisites: Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, Matlab, Julia, R), and basic complexity and O(n) notation. More formally, we strongly recommend
Students familiar with these prerequisites in past years found that homework assignments took about ten hours a week. Students without these prerequisites found that homework assignments took up to thirty hours a week as they caught up on background knowledge. Hence, if you are tempted to ignore these prerequisites, please make sure you can afford to spend thirty hours a week on this class.

Announcements