ORIE 4741: Learning with Big Messy Data


  • Class on Tuesday and Thursday 10:10am – 11:25am in 165 Olin Hall.

  • Discussion sections. (Pick one; the Monday session repeats the previous Friday session.)

    • Monday 10:10am – 11am in Rhodes 453

    • Monday 1:25pm – 2:15pm in Rhodes 453

    • Tuesday 9:05am – 9:55am in Rhodes 453

    • Friday 1:25pm – 2:15pm in Rhodes 453

  • Discussion forum: Piazza. Sign up here.


Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable, robust methods for learning from big messy data. We will cover techniques for learning with data that is messy (consisting of measurements that are continuous, discrete, boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, with missing entries and with outliers) and that is big (which means we can only use algorithms whose complexity scales linearly in the size of the data). We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course will culminate in a final project in which students extract useful information from a big messy data set.

Prerequisites: Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, Matlab, Julia, or R), and basic complexity analysis and big-O notation. More formally, we strongly recommend:

  • Linear Algebra (MATH 2940 or equivalent)

  • Probability Theory (ENGRD 2700 or equivalent)

  • Programming (ENGRD/CS 2110 or equivalent)

Students familiar with these prerequisites in past years found that homework assignments took about ten hours a week. Students without these prerequisites found that homework assignments took up to thirty hours a week as they caught up on background knowledge. Hence, if you are tempted to ignore these prerequisites, please make sure you can afford to spend thirty hours a week on this class.


  • Please submit a Project Team Review by 12-10-17. This form gives you a chance to speak about individual team member contributions.

  • Please review the final report peer review assignments, due 12-10-17. Please submit comments on the report by opening an issue on the group's github repo, and submit a grade for the report via Google Forms.

  • Schedule of due dates through end of semester:

    • exam review during section times 12-1 and 12-4

    • project due 12-4

    • final exam 9am 12-6

    • project peer reviews due 12-10

  • A practice final has been posted. We will be going over the answers in section on Friday and (in both section times) on Monday.

  • We have added one more office hour on Dec 1 from 4:30 – 5:30pm.

  • On Dec 5, office hours will be held only from 2 – 3pm (and not 5:30 – 6:30pm).