### ORIE 4740 Statistical Data Mining I (Spring 2017)

1. Basic info

2. Prerequisites

3. Textbooks

4. Labs

5. Homework

6. Exams

7. Final Project

8. Websites

9. Grading

10. Topics

11. Academic Integrity

Lectures: TR 1:25–2:40, Gates Hall G01

Labs: Rhodes Hall 453 (see below)

Instructor:

Yudong Chen (yudong.chen at cornell dot edu, Rhodes 223)

Office hours: Tuesday 5:15-6:15pm

TAs:

Sijia Ma (sm2462, office hours: Wednesday 2:30-4:00 pm, Rhodes 416)

Shuang Tao (st754, office hours: Thursday 4:30-6:00 pm, Rhodes 419)

- ORIE 2700 and 3500 (statistics and probability) or equivalent: Marginal probability, joint probability, conditional probability, Bayes’ theorem, multivariate Normal distributions, mean and variance. Point and interval estimation, hypothesis testing, p-values. Simple linear regression.
- Math 2940 (linear algebra) or equivalent: Matrix/vector notation and operations, eigenvalues and eigenvectors, eigenvalue and singular value decompositions, inverse, trace, norms.
- Programming experience in R, Python, Matlab, C, or Java.
- Strongly recommended: Background in multiple linear regression and logistic regression (these will be taught, but prior knowledge would help).

- Required: An Introduction to Statistical Learning (ISLR) by James, Witten, Hastie and Tibshirani. A pdf of the book is available for free from the authors' web page.

- Required: i>clicker: available from Cornell Bookstore. Please register your i>clicker on Blackboard. Participation in i>clicker polling will count towards your participation points.

- Optional:

  - Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner by Shmueli, Patel, and Bruce. Second Ed., 2010. (book web page)

  - The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman. Second Ed., 2009. This book is the advanced version of ISLR. Freely available here.

The discussion sessions will be computer labs on using R. They will
be held in Rhodes Hall 453. Each student should register for one of the
following sessions:

- Monday 2:30-3:20
- Monday 3:35-4:25
- Tuesday 10:10-11:00
- Wednesday 1:25-2:15

You will need to submit your work for each lab in the same way as a homework assignment; follow the instructions on the lab handouts.

TAs will be responsible for holding the labs. Lab participation is crucial to prepare you for the final project. Questions are best addressed during office hours and labs, or on Piazza (instead of email).

R is freely available here.

There will be about 9 labs and homework assignments in total.
Each homework/lab is due at 4:30pm on the Friday one week after it is given out
(unless specified otherwise), and must be submitted to the course
dropbox (2nd floor of Rhodes), NOT by email, under a door, etc.

You may discuss the content of the homework with other students in
your 4740 class, but the final product must be your own.

Your lowest 2 homework/lab grades will be dropped; this accommodates sickness, family emergencies, religious holidays, or other circumstances without a formal process. If you miss an assignment for one of these reasons, it must count as one of your dropped assignments.

- Prelim 1: March 30, in class

- Prelim 2: during last class

Requests for special accommodation must be made at least 2 weeks prior to each exam. There is no final exam.

In the final project, students use the techniques taught in the class to
analyze a large dataset of their choosing. Students work in teams
of 2-3. Each team writes a project proposal, finds the
necessary data, carries out the project, and writes a project report.

- Blackboard: We use Blackboard for all course materials and communication. You should be automatically given access to the course Blackboard site when you enroll in the course.

- Piazza: We will have a class Piazza forum where students can post and answer questions about the content. Piazza participation will count towards your participation points. Sign up for this course on Piazza using this link.

  - The instructor and TAs will monitor the forum, but it is primarily the responsibility of students to help each other.

  - The goal of such a forum is to encourage learning through peers: knowing how your peers are struggling with a problem can be useful in your own learning, and answering your peers' questions can help identify gaps in your understanding. This will not be achieved if the questions/comments are only seen by the instructor/TAs. If you believe your question or comment will be useful to the entire class, make it public. This should generally be the default. I may make a question public if I think it's the more appropriate option.
  - Only rarely would the private option be appropriate. An example where a private note would make sense is a situation where you are far along in answering an assignment question and are unsure about some aspect of your approach.
  - You can keep yourself anonymous from your peers if you want; I will still see your identity.

- Homework & labs: 37%

- Exams: 25%

- Project: 35%
- Participation: 3%. Students are expected to submit and answer questions on Piazza, participate in i>clicker polling in class, and fill out the course evaluation.

These weights are approximate; we reserve the right to change them later.
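To see how the weights above combine, here is a rough sketch in Python. It is an illustration only, not the official grade calculation: it assumes every component is scored on a 0-100 scale, that homework/labs count equally after the 2 lowest are dropped, and that the two prelims count equally. None of these details is stated in this syllabus, and the function name `course_score` is hypothetical.

```python
def course_score(hw_lab_scores, exam_scores, project, participation, n_dropped=2):
    """Approximate weighted course score on a 0-100 scale.

    Assumptions (not stated in the syllabus): all inputs are percentages,
    homework/labs are weighted equally, and exams are weighted equally.
    """
    if len(hw_lab_scores) <= n_dropped:
        raise ValueError("need more homework/lab scores than dropped scores")
    if not exam_scores:
        raise ValueError("need at least one exam score")

    # Drop the lowest n_dropped homework/lab scores, then average the rest.
    kept = sorted(hw_lab_scores)[n_dropped:]
    hw_lab_avg = sum(kept) / len(kept)

    # Simple average of the prelims.
    exam_avg = sum(exam_scores) / len(exam_scores)

    # Weights as listed above: 37% + 25% + 35% + 3% = 100%.
    return (0.37 * hw_lab_avg
            + 0.25 * exam_avg
            + 0.35 * project
            + 0.03 * participation)


# Example: nine homework/lab scores (two of them missed), two prelims.
score = course_score(
    hw_lab_scores=[90, 80, 100, 70, 0, 85, 95, 0, 100],
    exam_scores=[80, 90],
    project=90,
    participation=100,
)
print(round(score, 2))
```

In this example the two missed assignments (scored 0) are the ones dropped, so they do not affect the homework/lab average.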

“[Data mining is] the process of discovering meaningful correlations, patterns, and trends by sifting through large amounts of data... It employs pattern recognition technologies, as well as statistical and mathematical techniques.” (The Gartner Group).

Data mining often involves datasets with many records and many variables. Frequently little is known about the distribution of any particular variable, or about the relationships between variables. Desirable approaches make few assumptions or are robust to the violation of those assumptions. They also must be computationally tractable on large datasets. By the end of this course, you will be able to take a large commercial or governmental dataset, decide on data mining techniques to answer your question of interest, apply those techniques, compare them, and draw conclusions. To cement your understanding, you will implement some techniques yourself, and modify or apply existing implementations of some more complex techniques.

We will cover most of the chapters in ISLR, including (tentatively): Linear Regression, Classification, Dimensionality Reduction/PCA, Clustering, Nonlinear Methods, Decision Trees, Support Vector Machines, Model Validation/Selection and Regularization.

Each student in this course is expected to abide by the Cornell
University Code of Academic Integrity. Any work submitted by a student
in this course for academic credit will be the student’s own work. See
above for the policy regarding homework. The Code is available here.