Project description

The class will culminate in a final project. These projects will be completed in groups of 2–3 students, and can take one of two forms:

  1. Data analysis. Formulate an important question, and show how to use big messy data to answer (or try to answer) the question. The final product will be a paper suitable to present to a subject matter expert well versed in the problem domain (but not necessarily in big messy data analysis). This project will give students experience in the kind of work that a data scientist might perform in industry, or at a government or nonprofit agency.

  2. Algorithm development. Design a new method for analyzing big messy data. The final product will be a paper suitable for submission to NIPS, ICML, or KDD. This project will allow students to experience the kind of work that a researcher might perform in academia or in an industrial research lab.

Here are the class projects for the Fall 2016 term. Some of our favorites:

What makes a good data analysis project? Here are a few considerations:

  • Clear outcome to predict

  • Linear regression should do something interesting

  • New, interesting model; not a Kaggle competition

  • Avoid: images, time series, NLP

Project timeline

  • September 8. Form project groups. Please submit your choice of group here

  • September 22. Submit project proposal (problem statement and description of at least one data set)

  • September 29. Peer reviews of problem statements due (see review assignments)

  • October 27. Project midterm reports due (at midnight)

  • November 3. Peer reviews of project midterm reports due (see review assignments)

  • December 4. Project reports due

  • December 10. Peer reviews of project reports due

Detailed requirements

Project repository. Your project team should create a GitHub repository. Each team member should have push access to the repository. Add a file named README.md to the repository, in which you state the name of your project, list the names and NetIDs of the project members, and describe your project in a paragraph or two. Make a pull request (PR) to add a link to your repository to the list of ORIE 4741 projects. (See above link for detailed instructions.)

Project proposal. The project proposal should be no more than 1 page, written in LaTeX or markdown, and posted on your project repository with the filename “project_proposal”. (The file extension should be either .tex + .pdf, or just .md.) It should identify a question, and a data set that you'll use to answer the question. Justify why the problem is important, and why you think the data set will allow you to (begin to) answer the question.

Stylistically, the proposal should be written as though it were a memo to your manager (at whatever kind of enterprise might care about this question: either government, nonprofit, or industry). You should justify why it's worthwhile to this enterprise for you to work on the project for a few months, and why you think you're likely to succeed.

Proposal Peer review. Suppose you're a manager reviewing your employees’ proposal for an independent research project. What do you like about the proposal? What concerns you? Do you think you could use the results of this study? What other aspects of the question do you think the group should consider?

Submit a grade for the proposal via Google Forms, and submit comments on the proposal by opening an issue on the group's GitHub repo. Review assignments here.

Concretely, your comments should begin with a two- or three-sentence summary of the project you're reviewing: What's it about? What data are they using? What's their objective? Then detail at least three things you like about the proposal, and three areas for improvement. Make sure to back up your subjective assessments with reasoned, detailed explanations.

Project midterm report. By this time, you should have made some progress in cleaning up and understanding your data, and in running a few preliminary analyses. Your project midterm report should be no more than 3 pages, written in LaTeX or markdown, and posted in your project repository with the filename “midterm_report”. (The file extension should be either .tex + .pdf, or just .md.)

In the report, you should describe your data set in greater detail. Describe how you plan to avoid overfitting (and underfitting), and how you will test the effectiveness of the models you develop. Include a few histograms or other descriptive statistics about the data. How many features and examples are present? How much data is missing or corrupted? How can you tell? You should also run a few preliminary analyses on the data, including perhaps some regressions or other supervised models, describing how you chose which features (and transformations) to use. Finally, explain what remains to be done, and how you plan to develop the project over the rest of the semester.
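As one possible starting point for the questions above, the sketch below counts examples and features, reports the fraction of missing values per column, and sets aside a held-out test set for later model evaluation. It is only an illustration under stated assumptions: the sample data is made up, the set of strings treated as missing (MISSING_TOKENS) is a guess that you should adapt to your own data, and the helper names summarize and holdout_split are hypothetical, not part of any course material.

```python
import csv
import io
import random

# Assumption: these strings mark a missing value; real data sets may use others.
MISSING_TOKENS = {"", "NA", "NaN", "null"}


def summarize(rows):
    """Return (number of examples, number of features, fraction missing per column)."""
    columns = list(rows[0].keys()) if rows else []
    missing = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row[c]
            # DictReader yields None for fields absent from a short row.
            if value is None or value.strip() in MISSING_TOKENS:
                missing[c] += 1
    n = len(rows)
    fractions = {c: (missing[c] / n if n else 0.0) for c in columns}
    return n, len(columns), fractions


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a test set that is never used for fitting."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Tiny made-up data set, purely for illustration.
SAMPLE_CSV = """age,income,visits
34,52000,3
,61000,1
45,NA,2
29,48000,
51,75000,4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
n_examples, n_features, missing_fraction = summarize(rows)
train, test = holdout_split(rows)

print("examples:", n_examples, "features:", n_features)
print("fraction missing per column:", missing_fraction)
print("train size:", len(train), "test size:", len(test))
```

Evaluating every candidate model on the same held-out rows, which play no part in fitting, is one simple way to detect overfitting. Cross-validation is a common alternative when the data set is small.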

Midterm Peer review. Instructions are the same as the first peer review assignment, with one clarification: we expect your reviews to be at least two paragraphs, and expect that you will provide critical and useful feedback to the team you're reviewing. Think about what kind of feedback or ideas would be most useful to you, and try to give what you'd like to receive!

Submit a grade for the report via Google Forms, and submit comments on the report by opening an issue on the group's GitHub repo. Review assignments here.

Project final report. The final report should be no more than 8 pages long, including graphs and tables. (A bibliography of references you used may be listed on a final 9th page.) In your report, you should describe the problem, the data set, and how you tried to solve the problem. Describe the algorithms you used and the results you obtained, and discuss how confident you are in your results. Would you be willing to use them in production to change how your company or enterprise makes decisions? If not, why not?

Technically, your report should demonstrate that you tried at least three techniques from class on your data set, in addition to anything else you decided to do to achieve your goal. If you used techniques not discussed in class, be sure to describe how they work and provide references so that anyone reading the paper has the tools to understand it.

Final Peer review. Instructions are the same as the previous peer review assignment. We expect your reviews to be at least two paragraphs, and expect that you will provide critical and useful feedback to the team you're reviewing. Think about what kind of feedback or ideas would be most useful to you, and try to give what you'd like to receive!

Submit a grade for (and answer a few more specific questions about) the project, and submit comments on the final report by opening an issue on the group's GitHub repo.

Project ideas

  1. Who's ready to leave? Medical edition. Hospitals often find that patients take a turn for the worse and return to the hospital right after being released. The percentage of patients who return (within a short time window) is called the re-admission rate. Your goal is to help the hospital understand which patients are likely to return, and to recommend which patients are ready for release. Example data:

    1. MIMIC III is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with >40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.

  2. Who's ready to leave? Criminal edition. Prisoners are often released on parole for good behaviour, but some immediately commit crimes and return to prison. The percentage of prisoners who return (within a short time window) is called the recidivism rate. Your goal is to help the government understand which inmates are likely to commit a subsequent crime, and which inmates are ready for parole. Your solution must not discriminate on the basis of a protected class, such as race, sex, or national origin.

    1. No example data yet, but we're working on it (and could use your help!)

  3. Which hospital should I visit? Different hospitals can have startlingly different success rates for the same medical problem, and can have startlingly different costs for performing the same procedure. But remember that the best hospitals often attract the sickest patients, and so can appear to have the worst outcomes. Given a particular health condition, which hospital should I visit? Example data:

    1. The Statewide Planning and Research Cooperative System (SPARCS) Hospital Inpatient Discharges record treatment and cost details for a large number of hospitals in NY State.

    2. The Medical Expenditure Panel Survey (MEPS) is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS is the most complete source of data on the cost and use of health care and health insurance coverage.

    3. Practice Fusion is America's fastest growing Electronic Health Record (EHR) community, with more than 170,000 medical professional users treating 34 million patients in all 50 states. Practice Fusion’s EHR-driven research dataset is used to detect disease outbreaks, identify dangerous drug interactions and compare the effectiveness of competing treatments.

  4. Beat Nate Silver. The 2016 US presidential election is coming. One of the most critical tasks for any electoral campaign is to get out the vote, and that means getting out the vote for your candidate. Predict which districts (or better, which individuals) are likely to support your candidate, so that you can route resources to get out their vote. (If you choose this project, please identify a “before” analysis, performed before the election, and an “after” analysis, in which you'll analyze how good your predictions and recommendations were.) Example data:

    1. The Ohio Voter File lists every registered voter in Ohio and which elections they've voted in.

    2. Kaggle has data on votes in the 2016 primary election.

    3. The Voting Information Project has a list of all polling places in the US.

    4. The Census administers two surveys each year (the ACS and CPS) covering a range of economic and demographic questions.

  5. The 2017 IBM Service Analytics Challenge. This prize competition is sponsored by the INFORMS Service Science Section. More than 40 years of federal payroll records (nearly 30 gigabytes) are available here. Individual challengers or teams (up to 3 members per team) can participate in this global contest. Each submission should include at least two parts:

    1. Challengers should explore the dataset, clearly describing the data with exploratory findings.

    2. Is there any pattern for federal career trajectories over the last 40 years? How did they look across the board and on an individual agency basis (for at least three chosen agencies)?

    3. How do these career trajectories, and associated job descriptions and specifications, support the associated services being provided? (Based on your findings in last question, you should focus on three specific jobs you are interested in.)

    4. How is the presidential transition impacting federal employees?

More interesting data sets and project ideas: