ORIE 4740 Project Information  (Spring 2021)



In the final project, the techniques taught in the class are used to analyze a large dataset chosen by the students. Students work in teams of 2-4 students. Each team finds the necessary data, carries out the project, and writes a project report.

1. Due dates
2. Project Teams
3. Project Assignment
4. Submission
5. Grading
6. Sample Project Reports


1. Due dates

  • Team and dataset: Once you form your group and decide which dataset to use, email your TA Sam Tan (sst76) ASAP with the following information:
      i) the names and NetIDs of your group members,
      ii) the source of the dataset(s),
      iii) a brief description of the data, and
      iv) the numbers of observations and predictors in the dataset (to the best of your knowledge).
    The general rule is that no two groups may use the same dataset. In case of a conflict, the first group that emails Sam will have priority. We will maintain a list of chosen datasets on Ed Discussion.
  • Project report and peer evaluation form (24% of final grade): Due May 25th, to be submitted online via Gradescope.

2. Project Teams

You should work in a team of 2 to 4 students.  Please try to form a team yourself; if you have trouble finding teammates then let me know and I will help you find a team. You may not work alone. 

You may use the "Search for Project Teammates!" megathread on Ed Discussion.  


3. Project Assignment

You will apply tools that you have learned in 4740 to a dataset of your choice.

  • (a) Simulated (artificially generated) datasets are not allowed.
  • (b) You may NOT use a dataset used before in HW/labs, nor any of the datasets from ISLR.
  • (c) You may NOT use the UC Irvine Machine Learning Repository.
  • (d) You may NOT use CMU Statlib.

You may obtain a dataset from a company, for instance if you have had an internship at a company and they are willing to provide you with such a dataset for this purpose. You may use a dataset from a research project at this university or another university, with permission. I highly encourage you to look around for a dataset on a topic that particularly interests you, rather than using generic datasets from data mining websites. Example: say I am interested in doing a project related to beer. A web search on “beer dataset” brings up a dataset with over 1 million beer reviews, from BeerAdvocate. You are allowed to use datasets from other textbooks, but you cannot do an analysis similar to that done in the textbook on the same dataset.

Here are a few data sources:

Note that some datasets in the above links are not acceptable as per rules (a)-(d) above. Regardless of the source of the data, this source must be referenced in your report. If the dataset is not in the public domain, then you must obtain permission for its use in this class project. No two groups may use the same dataset; if two groups propose the same dataset by chance, the one that emails Sam Tan first will have priority.

What is required? Each team must find the necessary data, carry out the data analysis, and write a project report. The analysis should be motivated by one or two particular scientific / commercial goals, such as (for the veteran’s data example): “We seek to predict whether or not an individual will contribute, and the contribution amount if they do contribute, based on their demographic characteristics and contribution history. This prediction can be used to choose which individuals receive solicitations, or to estimate the total expected contributions in order to guide the organization’s financial planning.”

The data analysis that you perform needs to be more than a direct mapping of one of our lab analyses to another dataset. You will need to use more than one of the approaches that we have learned in the class. An example of a data analysis with sufficient scope is:

For a data set like the veteran’s data that has a continuous outcome variable and a binary outcome variable: Applying linear regression to predict the continuous outcome and applying logistic regression and decision trees to predict the binary outcome, while handling missing data. Comparing the results from logistic regression and decision trees, and recommending which should be used. 
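The comparison step in the example above can be sketched in a few lines. This is only an illustrative sketch, not a required workflow: the function names are hypothetical, mean imputation is just one simple way of handling missing values, and the predicted labels are assumed to come from models you have already fit (for example, a logistic regression and a decision tree) and evaluated on a held-out test set.

```python
from statistics import mean


def impute_with_mean(column):
    """Replace missing entries (None) with the mean of the observed values.

    Assumption: in a real analysis the fill value would be computed from
    the training data only and then reused on the test data.
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def misclassification_rate(y_true, y_pred):
    """Fraction of test observations whose predicted label is wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def recommend(y_test, predictions):
    """Return the model with the lowest test error, plus all error rates.

    predictions maps a (hypothetical) model name to its predicted labels
    on the same held-out test set.
    """
    rates = {name: misclassification_rate(y_test, pred)
             for name, pred in predictions.items()}
    best = min(rates, key=rates.get)
    return best, rates


# Made-up labels purely for illustration.
y_test = [1, 0, 1, 1]
best, rates = recommend(y_test, {
    "logistic": [1, 0, 0, 1],
    "tree": [1, 1, 0, 0],
})
```

A lower test error is only one basis for a recommendation; interpretability and the stated goal of the analysis may matter as well.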

The data analyses that you perform should be appropriate for the goal(s) that you have stated. You should choose one or several data sets that are appropriate to address the goal(s) you have stated. If you have more than one data set or more than one goal, the project should form a coherent whole, rather than being two or three unrelated data analyses. For instance, you may have a single scientific goal, and use two data sets to address this goal. Or, you might have a single data set, with which you address two related scientific questions.

If you analyze a single data set then it should have a reasonably large number of observations (at least 1000); otherwise, two smaller data sets suffice but they should each contain at least 600 observations. One of your data sets should have at least five predictors. (You should discuss with the professor or TA if you have a strong reason to use a dataset that does not satisfy the above requirements.)
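As a quick self-check before emailing your TA, the size requirements above can be expressed as a small function. This is a sketch under one reading of the rule, which is an assumption: the observation requirement is treated as met if some dataset has at least 1000 observations, or if at least two datasets each have at least 600. The function name and input format are hypothetical. When in doubt, ask the professor or TA.

```python
def meets_size_requirements(datasets):
    """Check the project's dataset size rules.

    datasets: list of (n_observations, n_predictors) tuples, one per dataset.
    Assumed reading of the rule:
      - one dataset with >= 1000 observations, OR
        at least two datasets with >= 600 observations each; and
      - at least one dataset with >= 5 predictors.
    """
    if not datasets:
        return False
    one_large = any(n >= 1000 for n, _ in datasets)
    two_medium = sum(1 for n, _ in datasets if n >= 600) >= 2
    enough_predictors = any(p >= 5 for _, p in datasets)
    return (one_large or two_medium) and enough_predictors


# Example: a single dataset with 1200 observations and 8 predictors.
ok = meets_size_requirements([(1200, 8)])
```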

Sometimes a data analysis yields negative or inconclusive results. For instance, perhaps none of the predictors were significant in the model, even though they seemed like reasonable predictors. Perhaps the predictions were poor, and the methods chosen, although they were a reasonable choice and had good promise, turned out to not work well. These are acceptable results, as long as all of the analyses and conclusions are correct. You might in this case suggest alternative approaches in your conclusion.

Work on the project is to be done entirely by the project group; communication between groups regarding project work is not allowed. You may not apply a technique that has previously been applied to the same data set in a published or unpublished work, if you are aware or could reasonably be expected to be aware of that work. You should cite in the bibliography any and all published or unpublished written works or spoken communications that have influenced your analysis.

You should employ at least one technique covered in class/ISLR, but are free to use any additional methods beyond class. You are encouraged to use R, but using another language (such as Python) is allowed.

Regarding time series and text data: 
Some problems may involve time series data, e.g., stock prices. Some problems may involve text data, e.g., tweets, news articles, comments on Yelp, etc. Techniques for time series analysis and natural language processing are not covered by this course. While the course staff are happy to provide general guidance, you may want to learn some of these techniques yourself if you want to work on such a dataset.

3.1 Project Plan

You may discuss your plan with the professor or TAs, including:

  • The proposed scientific/commercial goal(s) of the analysis;
  • The proposed data set(s) that will be used, including their source, number of variables and data points;
  • The proposed data analyses to be performed;
  • What figures or tables you might include;
  • Why you expect the data set(s) and analysis methods to successfully address your goal;
  • Any other details at your discretion.

3.2 Final Report

The report should be no more than 10 pages (double-spaced, 11+ pt font) and should contain:

  • Title page with authors and abstract
  • Introduction telling what the project is about, what your team has accomplished, and a brief statement of results and conclusions
  • One or more sections describing the project
  • Conclusions
  • Bibliography

Tables and figures can be interspersed in the text or at the end of the report. All tables and figures should be numbered and referred to by number. The report should not contain raw computer output. Rather, any computer output should be in a table or figure, with explanation in the main text. Do not hand in the code (R, Python, etc.) for your analysis, but the instructor reserves the right to ask for your code if he deems it necessary.

Given the page limit, you should present your results in a concise and informative manner. Highlight the most interesting/important findings; summarize and interpret your results rather than just providing a list of numbers. You do not need to explain a standard algorithm, but feel free to provide references for an algorithm not covered in class. On the other hand, you may want to provide some details if you use a less well-known algorithm, modify an existing one, or adopt a new approach.

If your report has more than 10 pages, then there is no guarantee that the extra pages will be read by the instructor or graders.

3.3 Peer evaluation

Each student is asked to fill out the peer evaluation form, which assesses each individual’s contribution to the group. This form is due the same day as the final project report.

4. Submission

The project report and peer evaluation form should be submitted electronically on Gradescope.

• Each student should submit their own peer evaluation form.
• Only one member of each team needs to submit the project report.

5. Grading

Grades will be based on:

  • Validity of the goal(s): Are the goal(s) of the project well-defined and well-explained?
  • Data sets and data analyses: Are the data sets and analyses selected appropriate to address the goal(s)?
  • Scope of the data analysis: Is the data analysis of a sufficient scope?
  • Conclusions: Are the conclusions comprehensive and valid?
  • Creativity: Are the problem formulation, methodology and analysis interesting and creative?
  • Clarity and conciseness of the report: A wordy report will get a lower grade than one saying the same amount in less space.
  • Team size: Projects done by larger teams are expected to be more extensive.
  • Individual’s contribution to the group: As assessed by peer evaluations.


6. Sample Project Reports

Sample project reports can be found on Canvas. Please do not circulate these reports outside this class.

Note: These reports are not necessarily among the ones that received the highest grades in previous years, and may even contain errors and flaws. They simply give you a sense of the scope and structure of the project, as well as the range of possible techniques and outcomes.

