ORIE 4740 Project Information  (Spring 2018)


In the final project, the techniques taught in the class are used to analyze a large dataset chosen by the students. Students work in teams of 2-3 students. Each team writes a project proposal, finds the necessary data, carries out the project, and writes a project report.

1. Due dates
2. Project Teams
3. Project Assignment
4. Grading
5. Sample Project Reports

1. Due dates

  • Team and dataset: Once you have formed your group and decided which dataset to use, email the instructor as soon as possible with the names and NetIDs of your group members and the source of the dataset(s). The general rule is that no two groups may use the same dataset. In case of a conflict, the first group that emails the instructor will have priority. On Piazza I will maintain a list of datasets that have been chosen.
  • Project proposal (2% of final grade): March 30th (Friday) 4:30PM, to the 4740 drop box.
  • Project report and peer evaluation form (20% of final grade): May 18th (Friday) 4:30PM, to the 4740 drop box.

2. Project Teams

You should work in a team of two to three students.  Please try to form a team yourself; if you have trouble finding teammates then let me know and I will help you find a team. You may not work alone. 

3. Project Assignment

You will apply tools that you have learned in 4740 to a dataset of your choice.

  • Simulated (artificially generated) datasets are not allowed.
  • You may NOT use a dataset used before in HW/labs, nor any of the datasets from ISLR.
  • You may NOT use the UC Irvine Machine Learning Repository.
  • You may NOT use CMU Statlib.

You may obtain a dataset from a company, for instance if you have had an internship at a company and they are willing to provide you with a dataset for this purpose. You may use a dataset from a research project at this university or another university, with permission. I highly encourage you to look around for a dataset on a topic that particularly interests you, rather than using generic datasets from data mining websites. Example: say I am interested in doing a project related to beer. A web search on “beer dataset” brings up a dataset with more than 1 million beer reviews, from BeerAdvocate. You are allowed to use datasets from other textbooks, but you cannot do an analysis similar to the one done in the textbook on the same dataset.

Here are a few data sources:

Regardless of the source of the data, this source must be referenced in your report. If the dataset is not in the public domain, then you must obtain permission for its use in this class project. No two groups may use the same dataset; if two groups propose the same dataset by chance, the one that emails me first will have priority.

What is required? Each team must write a project proposal, find the necessary data, carry out the data analysis, and write a project report. The analysis should be motivated by one or two particular scientific / commercial goals, such as (for the veteran’s data example): “We seek to predict whether or not an individual will contribute, and the contribution amount if they do contribute, based on their demographic characteristics and contribution history. This prediction can be used to choose which individuals receive solicitations, or to estimate the total expected contributions in order to guide the organization’s financial planning.”

The data analysis that you perform needs to be more than a direct mapping of one of our lab analyses to another dataset. You will need to use more than one of the approaches that we have learned in the class. An example of a data analysis with sufficient scope is:

For a data set like the veteran’s data, which has both a continuous outcome variable and a binary outcome variable: apply linear regression to predict the continuous outcome, and apply logistic regression and decision trees to predict the binary outcome, while handling missing data. Then compare the results from logistic regression and decision trees, and recommend which should be used.

The data analyses that you perform should be appropriate for the goal(s) that you have stated. You should choose one or several data sets that are appropriate to address the goal(s) you have stated. If you have more than one data set or more than one goal, the project should form a coherent whole, rather than being two or three unrelated data analyses. For instance, you may have a single scientific goal, and use two data sets to address this goal. Or, you might have a single data set, with which you address two related scientific questions.

If you analyze a single data set then it should have a reasonably large number of observations (at least 500); otherwise, two smaller data sets suffice but they should each contain at least 300 observations. One of your data sets should have at least five predictors.
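As a quick self-check before emailing the instructor, the size rule above can be sketched in a few lines of Python. This is only an illustration, not an official checker: the names `DatasetSummary` and `meets_size_requirements` are made up for this example, and it assumes that when a project uses more than one data set, every data set must have at least 300 observations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetSummary:
    """Hypothetical container for the basic dimensions of one data set."""
    name: str
    n_observations: int
    n_predictors: int


def meets_size_requirements(datasets: List[DatasetSummary]) -> bool:
    """Return True if the data sets satisfy the size rule stated above.

    Assumed reading of the rule:
      - one data set: at least 500 observations;
      - several data sets: each has at least 300 observations;
      - in both cases, at least one data set has five or more predictors.
    """
    if not datasets:
        return False

    if len(datasets) == 1:
        enough_rows = datasets[0].n_observations >= 500
    else:
        enough_rows = all(d.n_observations >= 300 for d in datasets)

    enough_predictors = any(d.n_predictors >= 5 for d in datasets)
    return enough_rows and enough_predictors


if __name__ == "__main__":
    # Made-up dimensions, used only to show how the function is called.
    single = [DatasetSummary("reviews", n_observations=1200, n_predictors=8)]
    pair = [
        DatasetSummary("survey_a", n_observations=350, n_predictors=6),
        DatasetSummary("survey_b", n_observations=280, n_predictors=4),
    ]
    print(meets_size_requirements(single))
    print(meets_size_requirements(pair))
```

Count observations and predictors after any cleaning you plan to do, since dropping rows with missing values can push a data set below the threshold.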

Sometimes a data analysis yields negative or inconclusive results. For instance, perhaps none of the predictors were significant in the model, even though they seemed like reasonable predictors. Perhaps the predictions were poor, and the methods chosen, although they were a reasonable choice and had good promise, turned out to not work well. These are acceptable results, as long as all of the analyses and conclusions are correct. You might in this case suggest alternative approaches in your conclusion.

Work on the project is to be done entirely by the project group; communication between groups regarding project work is not allowed. You may not apply a technique that has previously been applied to the same data set in a published or unpublished work, if you are aware, or could reasonably be expected to be aware, of that work. You should cite in the bibliography any and all published or unpublished written works or spoken communications that have influenced your analysis.

You should employ at least one technique covered in class/ISLR, but are free to use any additional methods beyond class. You are encouraged to use R, but using another language (such as Python) is allowed.

3.1 Proposal

The proposal should be one page (double-spaced, 11+ pt font) and include:

  • The proposed scientific/commercial goal(s) of the analysis;
  • A brief description of the proposed data set(s) that will be used, including their source, number of variables and data points;
  • The proposed data analyses to be performed;
  • What figures or tables you might include;
  • Why you expect the data set(s) and analysis methods to successfully address your goal;
  • Any other details at your discretion.

3.2 Final Report

The report should be no more than 15 pages (double-spaced, 11+ pt font) and should contain:

  • Title page with authors and abstract
  • Introduction telling what the project is about, what your team has accomplished, and a brief statement of results and conclusions
  • One or more sections describing the project
  • Conclusions
  • Bibliography

Tables and figures can be interspersed in the text or at the end of the report. All tables and figures should be numbered and referred to by number. The report should not contain raw computer output. Rather, any computer output should be in a table or figure, with explanation in the main text. Do not hand in the code (R, Python, etc.) for your analysis, but the instructor reserves the right to ask for your code if he deems it necessary.

If your report has more than 15 pages, then there is no guarantee that the extra pages will be read by the instructor or graders.

3.3 Peer evaluation

Each student will be asked to fill out this peer evaluation form, which assesses each individual’s contribution to the group. The form is due the same day as the final project report, but can be submitted separately (to the 4740 drop box).

4. Grading

Grades will be based on:

  • Validity of the goal(s)
  • Whether the data set(s) and data analyses selected are appropriate to address the goal(s)
  • Sufficient scope of the data analysis
  • Comprehensiveness and validity of the conclusions
  • Creativity
  • Clarity and conciseness of the report. A wordy report will get a lower grade than one saying the same amount in less space.
  • Number of students. Projects done by larger teams are expected to be more extensive.
  • Each individual’s contribution to the group, as assessed by peer evaluations

5. Sample Project Reports

Sample reports can be found on Blackboard. Please do not circulate these reports outside this class.
