ORIE 4741: Learning with Big Messy Data

Logistics

  • Where: 165 Olin Hall

  • When:

      • Class on Tuesday and Thursday, 10:10am – 11:25am.

      • Discussion section on Monday 10:10am – 11am or Friday 1:25pm – 2:15pm. (Pick one; the Monday session repeats the previous Friday session.)

  • Discussion forum: Piazza. Sign up here.

Overview

Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable, robust methods for learning from big messy data. We will cover techniques for learning with data that is messy (consisting of measurements that are continuous, discrete, boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, with missing entries and with outliers) and that is big (which means we can only use algorithms whose complexity scales linearly in the size of the data). We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course will culminate in a final project in which students extract useful information from a big messy data set.

Prerequisites: Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, MATLAB, Julia, or R), and basic computational complexity and big-O notation.

Announcements

  • The practice final is posted. (The final will be similar, but longer.)

  • Review the project work of your team members (and your own) here by 4pm on December 7.

  • Peer review assignments are out. Submit your peer reviews for the project here, and submit comments on projects as an issue on the project GitHub repo by 4pm on December 7. See the projects page for more information.

  • Prof. Udell will not hold office hours on 12/6.

  • The project due date has been extended to 4pm on December 5 by popular demand. This means you will have 48 hours to turn in your peer reviews, by 4pm December 7, which is the final date for class deadlines set by the Cornell registrar.

  • Homework 5 is due at 10am on Tuesday, 11-29-16.

  • Update: Prof. Udell will not hold office hours on Tuesday 11-15. Instead, office hours will be held 2-4pm on Thursday 11-17.

  • Here's a quick link to the lecture on the limitations of predictive modeling, with data drawn from the 2016 presidential election.

  • Peer reviews of project midterm reports are due Friday, 11-4-16, at midnight.

  • Homework 4 is out! Due Saturday, 11-5-16, at 5pm.

  • The project midterm report is due by midnight on Friday, 10-28-16. A description of the requirements can be found on the projects page.

  • Discussion sections will be held only as needed and will be announced ahead of time on Piazza and on the website. If there's no announcement, there's no section. You can go to any discussion section you like, or to none.