ORIE 4740 Statistical Data Mining I (Spring 2021)

1. Basic Info
2. Prerequisites
3. Textbooks
4. Labs
5. Homework
6. Quizzes
7. Exams
8. Final Project
9. Websites
10. Grading
11. Topics
12. Academic Integrity

1. Basic Info

Lectures: MW 1:00pm–2:15pm, Online (Zoom link can be found on Canvas)
Labs: Online (Zoom link on Canvas; see below for times and other info)
Recordings of the lectures and labs can be found on Canvas.

Instructor: Yudong Chen (yudong.chen at cornell dot edu)
Office hours: Thursday 8:30-9:30pm, Zoom link on Canvas

TA: Matthew Ford (mtf62, office hours: Thu 9:30-10:30am, Zoom link on Canvas)
TA: Samuel Tan (sst76, office hours: Fri 3:30-4:30pm, Zoom link on Canvas)

1.1  Email Policy

Because this is a large class, I will typically not respond to individual emails from students. Instead, direct all your questions regarding the course to Ed Discussion. If you need to discuss a personal matter, please see me during my office hours. If you’re having trouble reaching me, please contact your TA and they can tell you whether it is necessary to meet with me in person.

2. Prerequisites

  • ORIE 2700 and 3500 (statistics and probability) or equivalent: Marginal probability, joint probability, conditional probability, Bayes’ theorem, multivariate Normal distributions, mean and variance. Point and interval estimation, hypothesis testing, p-values. Simple linear regression.
  • Math 2940 (linear algebra) or equivalent: Matrix/vector notation and operations, eigenvalues and eigenvectors, eigen and singular value decompositions, inverse, trace, norms.
  • Programming experience in R, Python, Matlab, C or Java.
  • Strongly recommended: Background in multiple linear regression and logistic regression (this will be taught, but prior knowledge would help).

3. Textbooks

  • Required: An Introduction to Statistical Learning (ISLR) by James, Witten, Hastie and Tibshirani, any edition. A pdf of the book (1st edition) is available for free from the authors' web page.
  • Optional:
    - The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman. Second Ed., 2009. This book is the advanced version of ISLR. Freely available here.
    - Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner by Shmueli, Patel, and Bruce. Second Ed., 2010. (book web page)

4. Labs

The discussion sessions will be a combination of recitations and computer labs on using R. They will be held weekly unless notified otherwise. Each student should register for one of the following sessions:

  • Thursday 1:30-2:20pm (TA: Matthew Ford)
  • Thursday 2:40-3:30pm (TA: Matthew Ford)
  • Friday 12:25-1:15pm (TA: Samuel Tan)
  • Friday 2:40-3:30pm (TA: Samuel Tan)

You will need to submit your work for each lab similarly to a homework assignment.

TAs will be responsible for holding the labs. Lab participation is crucial to prepare you for the final project. Questions are best addressed during office hours and labs, or on Ed Discussion (instead of email).

R is freely available here. You may consider using the RStudio environment.

5. Homework

There will be about 9 labs and homework assignments in total. Homework/lab is due at 11:59pm on Friday a week after it is given out (unless specified otherwise), and must be submitted electronically through Gradescope, NOT on Canvas or by email, etc.

Out of all your homework/lab grades, the 2 lowest ones will be dropped; this accommodates sickness, family emergency, religious holiday or other circumstances without a formal process. If you miss an assignment for one of these reasons, it must count as one of the dropped assignments.

You may discuss the content of the course with other students in your 4740 class, but you must complete your homework/lab independently and individually.

Submission instructions:

  • Submit your assignment on Gradescope as a pdf file (recommended) or images. Do NOT submit a .zip file.
  • Gradescope will ask you which page/image corresponds to each question.
  • You can export plots from RStudio to .pdf.
  • Your code should also be included in your pdf file. Do NOT submit a separate .R file.
  • There are multiple websites and tools that allow you to combine pdfs into one file. https://combinepdf.com is one example.
  • If you are handwriting your assignment, please scan it (or take clear pictures) and convert it to a .pdf file.
  • You may submit as many times as you want. Only the last submission before the deadline will be graded.
  • Grades will be available on Gradescope together with any feedback/comments.
  • No late homework/lab will be accepted.
  • For additional help on how to use Gradescope, see this step-by-step video guide or this website.

Regrade policy:

  • Regrade requests regarding homework/labs should be submitted on Gradescope within 1 week of grade posting.
  • Detail which section of the homework/lab you want regraded. Regrade requests without explanation will not be considered.
  • The grader will re-examine the entire homework before adjusting the final grade.

6. Quizzes

Students are expected to complete an online quiz once a week. The quiz will be posted to our Canvas site. It is primarily a tool for you to test your understanding of the course concepts.

7. Exams

  • Prelim 1: March 24 in class.
  • Prelim 2: May 12 in the last class of the semester. Prelim 2 will be cumulative.

Requests for special accommodations must be made at least 2 weeks prior to each exam. There is no final exam.

Regrade requests regarding exams should be submitted on Gradescope within 1 week of grade posting.

8. Final Project

In the final project, the techniques taught in the class are used to analyze a large dataset. Students work in teams of 2-4 students. Each team finds the necessary data, carries out the project, and writes a project report.

Detailed project information can be found here.

9. Websites

  • Canvas: We use Canvas for all course materials and communication. You should be automatically given access to the course Canvas site when you enroll in the course.
  • Gradescope: We use Gradescope for homework submission, grading and regrading. You should be automatically enrolled in Gradescope when you enroll in the course.
  • Ed Discussion: We will use Ed Discussion, where students can post and answer questions about the content. Participation will count towards your participation points. You should be automatically enrolled in Ed Discussion when you enroll in the course.
    - The instructor and TAs will monitor the forum, but it is primarily the responsibility of students to help each other.
    - The goal for such a forum is to encourage learning through peers: knowing how your peers are struggling with a problem can be useful in your learning, and answering your peers' questions can help identify gaps in your own understanding. This will not be achieved if the questions/comments are only seen by the instructor/TAs. If you believe your question or comment will be useful to the entire class, make it public. This should generally be the default position. I may make a question public if I think it's the more appropriate option.
    - Only rarely would the private option be appropriate. An example where a private note would make sense is a situation where you are far along in answering an assignment question and are unsure about some aspect of your approach.
    - You can keep yourself anonymous from your peers if you want; I will still see your identity.

10. Grading

  • Homework & labs: 40%
  • Quizzes: 5%
  • Exams: 30%
  • Project: 24%
  • Participation: 1%. Students are expected to submit and answer questions on Ed Discussion, and fill out the course evaluation.

These weights are approximate; we reserve the right to change them later.
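
As a rough illustration of how the weights above and the drop-the-2-lowest homework/lab rule might combine, here is a minimal Python sketch. It is not the official grading procedure: the function names, the 0-100 score scale, the equal weighting of assignments, and the simple averaging within each category are all assumptions made for the example, and the weights themselves may change.

```python
# Hypothetical illustration only: the syllabus does not specify how scores
# within a category are averaged, and the weights are approximate.
WEIGHTS = {
    "homework_labs": 0.40,
    "quizzes": 0.05,
    "exams": 0.30,
    "project": 0.24,
    "participation": 0.01,
}


def homework_average(scores, n_dropped=2):
    """Mean of the scores left after dropping the n_dropped lowest ones.

    Assumes every homework/lab is weighted equally and scored out of 100.
    """
    if len(scores) <= n_dropped:
        raise ValueError("need more scores than the number dropped")
    kept = sorted(scores)[n_dropped:]
    return sum(kept) / len(kept)


def course_score(components):
    """Weighted sum of category scores (each assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up scores for nine assignments, including one missed (0).
    hw = homework_average([100, 90, 80, 0, 50, 70, 60, 100, 90])
    total = course_score({
        "homework_labs": hw,
        "quizzes": 85,
        "exams": 78,
        "project": 92,
        "participation": 100,
    })
    print(f"homework/lab average: {hw:.1f}, course score: {total:.1f}")
```

In this sketch a missed assignment simply enters as a 0 and is removed by the drop rule, which mirrors the policy that missed work counts toward the dropped assignments.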

11. Topics

“[Data mining is] the process of discovering meaningful correlations, patterns, and trends by sifting through large amounts of data... It employs pattern recognition technologies, as well as statistical and mathematical techniques.” (The Gartner Group).

Data mining often involves datasets with many records and many variables. Frequently little is known about the distribution of any particular variable, or about the relationships between variables. Desirable approaches make few assumptions or are robust to the violation of those assumptions. They also must be computationally tractable on large datasets. By the end of this course, you will be able to take a large commercial or governmental dataset, decide on data mining techniques to answer your question of interest, apply those techniques, compare them, and draw conclusions. In order to cement your understanding, you will implement some techniques yourself, and modify or apply existing implementations of some more complex techniques.

We will cover most of the chapters in ISLR, including (tentatively): 

  • Linear Regression
  • Classification
  • Dimensionality Reduction/PCA
  • Clustering
  • Nonlinear Methods
  • Decision Trees and Random Forests
  • Support Vector Machines
  • Model Validation/Selection and Regularization

12. Academic Integrity

Each student in this course is expected to abide by the Cornell University Code of Academic Integrity. Any work submitted by a student in this course for academic credit will be the student’s own work. See above for the policy regarding homework. The Code is available here.