ORIE 4741: Projects

Professor Madeleine Udell, Cornell University

Project description

The class will culminate in a final project. These projects will be completed in groups of 2–3 students, and can take one of two forms:

Data analysis. Formulate an important question, and show how to use big messy data to answer (or try to answer) the question. The final product will be a paper suitable to present to a subject matter expert well versed in the problem domain (but not necessarily in big messy data analysis). This project will give students experience in the kind of work that a data scientist might perform in industry, or at a government or nonprofit agency.
Algorithm development. Design a new method for analyzing big messy data. The final product will be a paper suitable for submission to NeurIPS, ICML, or KDD. This project will allow students to experience the kind of work that a researcher might perform in academia or in an industrial research lab.

Here are the class projects for the Fall 2020 term. Some of our favorites:

What makes a good data analysis project? Here are a few considerations:

Clear outcome to predict
Linear regression should do something interesting
New, interesting model; not a Kaggle competition

How might you come up with an algorithm development project?

Read a collection of several (maybe 5) recent papers from a top computer science conference (e.g., NeurIPS, ICML, or KDD) on the same topic.
Design some computational experiments to explore the performance of some of these methods on a collection of tasks.
Use what you learn from these experiments to design a new method that performs better. Performance might mean many things: accuracy, speed, computational resources, interpretability, fairness, robustness to corruptions in the data, …
Can you prove anything about how well your method works?
A good project would implement several methods, produce careful experiments comparing them, and would implement at least two tweaks to the published ideas and see how well they work. The end result might be recommendations for best practices when using these algorithms on a new dataset.
A great project would produce a new algorithm that works better than the previously published state of the art on a wide variety of datasets, but this is not necessary to get a good grade on the project for this class.
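As a rough illustration of the "careful experiments comparing methods" step, here is a minimal sketch of a comparison harness that scores several methods on several datasets with k-fold cross-validation. Everything in it is an assumption made for illustration: the two "methods" (predicting the training mean or the training median) are stand-ins for the published algorithms you would actually implement, and the two tiny datasets are invented.

```python
import statistics


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n examples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cv_mse(fit, y, k=3):
    """Average held-out squared error of the constant prediction returned by fit."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        prediction = fit([y[i] for i in train])
        errors.extend((y[i] - prediction) ** 2 for i in test)
    return sum(errors) / len(errors)


# Hypothetical stand-ins for the methods under comparison.
methods = {"mean": statistics.mean, "median": statistics.median}

# Invented toy datasets; a real project would use a collection of real tasks.
datasets = {
    "clean": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "with_outlier": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
}

# One score per (dataset, method) pair, ready to be put in a table.
results = {
    (data_name, method_name): cv_mse(fit, y)
    for data_name, y in datasets.items()
    for method_name, fit in methods.items()
}

for (data_name, method_name), score in sorted(results.items()):
    print(f"{data_name:>12}  {method_name:>6}  held-out MSE = {score:.2f}")
```

Keeping the methods and datasets in dictionaries makes it easy to add a tweak to a published idea as one more entry and rerun the whole comparison unchanged.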

Some algorithm development projects from Fall 2019:

5741 vs 4741

ORIE 5741 projects have slightly different requirements from ORIE 4741.

Data analysis projects should be more business-oriented, using techniques from the class to address a clear business problem.
Projects will require a final (video) presentation, in addition to the other requirements for ORIE 4741.
The project presentations should be uploaded to YouTube (make sure to set visibility to either Public or Unlisted so that peer reviewers and graders can view it) and the video link should be added to the “README.md” file on your team's public GitHub repository.

If any student in your group is taking ORIE 5741, your group will be required to fulfill the ORIE 5741 requirements.

Project timeline

All project deadlines are at 11:59pm ET, and most are on Sundays.

September 19. Form project groups. Please submit your choice of group here
October 3. Submit project proposal (problem statement and description of at least one data set)
October 8. (Friday) Peer reviews of project proposals due (review assignments)
November 1. Project midterm reports due
November 7. Peer reviews of project midterm reports due (review assignments)
December 3. For ORIE 5741, project presentations due (as prerecorded video). Upload to YouTube, and add a link to your video on your project's README.
December 5. Project reports due
December 12. Peer reviews of project reports due (review assignments)

Detailed requirements

Project repository. Your project team should create a public GitHub repository. Each team member should have push access to the repository. Add a file named README.md to the repository, in which you state the name of your project, list the names and NetIDs of the project members, and describe your project in a paragraph or two. Make a pull request (PR) to add a link to your repository to the list of ORIE 4741 projects. (See above link for detailed instructions.)

Project proposal. The project proposal should be no more than 1 page, written in LaTeX or markdown, and posted on your project repository with the filename “project_proposal”. (The file extension should be either .tex + .pdf, or just .md.) It should identify a question, and a data set that you'll use to answer the question. Justify why the problem is important, and why you think the data set will allow you to (begin to) answer the question.

Stylistically, the proposal should be written as though it were a memo to your manager (at whatever kind of enterprise might care about this question: either government, nonprofit, or industry). You should justify why it's worthwhile to this enterprise for you to work on the project for a few months, and why you think you're likely to succeed.

Proposal Peer review. Suppose you're a manager reviewing your employees’ proposal for an independent research project. What do you like about the proposal? What concerns you? Do you think you could use the results of this study? What other aspects of the question do you think the group should consider?

Submit a grade for the proposal via Google Forms, and submit comments on the proposal by opening an issue on the group's GitHub repo. Review assignment and link to form will be posted on our discussion forum.

Concretely, your comments should begin with a one paragraph (at least three sentence) summary of the project you're reviewing: What's it about? What data are they using? What's their objective? Then detail at least three things you like about the proposal, and three areas for improvement (at least one sentence each). Make sure to back up your subjective assessments with reasoned, detailed explanations.

Project midterm report. By this time, you should have made some progress in cleaning up and understanding your data, and in running a few preliminary analyses. Your project midterm report should be no more than 3 pages, written in LaTeX or markdown, and posted in your project repository with the filename “midterm_report”. (The file extension should be either .tex + .pdf, or just .md.)

In the report, you should describe your data set in greater detail. Describe how you plan to avoid over- (and under-)fitting, and how you will test the effectiveness of the models you develop. Include a few histograms or other descriptive statistics about the data. How many features and examples are present? How much data is missing or corrupted? How can you tell? You should also run a few preliminary analyses on the data, including perhaps some regressions or other supervised models, describing how you chose which features (and transformations) to use. Finally, explain what remains to be done, and how you plan to develop the project over the rest of the semester.
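As one way to answer the "how many features and examples" and "how much data is missing" questions, here is a minimal sketch that counts examples and features, reports the fraction of missing values per feature, and sets aside a held-out test set for checking overfitting later. The rows, the feature names, and the tokens treated as missing (`""`, `"NA"`, `None`) are all assumptions made for illustration; your data will need its own conventions.

```python
import random


def summarize(rows, missing_tokens=("", "NA", None)):
    """Return (n_examples, n_features, fraction of missing values per feature)."""
    if not rows:
        return 0, 0, {}
    features = sorted({key for row in rows for key in row})
    n = len(rows)
    missing = {
        f: sum(1 for row in rows if row.get(f) in missing_tokens) / n
        for f in features
    }
    return n, len(features), missing


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then hold out a fraction for final evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]


# Invented toy rows; in practice these might come from csv.DictReader.
rows = [
    {"age": "34", "income": "52000", "zip": "14850"},
    {"age": "", "income": "61000", "zip": "14853"},
    {"age": "51", "income": "NA", "zip": "14850"},
    {"age": "29", "income": "48000", "zip": ""},
    {"age": "NA", "income": "75000", "zip": "10001"},
]

n_examples, n_features, missing = summarize(rows)
train_rows, test_rows = train_test_split(rows)

print(f"{n_examples} examples, {n_features} features")
for feature, fraction in missing.items():
    print(f"{feature}: {fraction:.0%} missing")
```

Fixing the random seed keeps the split reproducible, so every model you try is evaluated against the same held-out rows.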

Midterm Peer review. Instructions are the same as the first peer review assignment. Again, we expect that you will provide critical and useful feedback to the team you're reviewing. Think about what kind of feedback or ideas would be most useful to you, and try to give what you'd like to receive!

Submit a grade for the midterm report via Google Forms, and submit comments on the midterm report by opening an issue on the group's GitHub repo. Review assignment and link to form will be posted on our discussion forum.

Project presentation. Submit a 5–10 minute video describing the problem your project sought to address, the techniques that you used to solve it, conclusions you were able to draw, and directions for future work. You might organize your presentation following the template for project reports, described below, or make it more creative!

Project final report. The final report should be no more than 8 pages long, including graphs and tables. (A bibliography of references you used may be listed on a final 9th page.)

Data analysis projects: In your report, you should describe the problem, the data set, and how you tried to solve the problem. Describe the algorithms you used, the results you obtained, and discuss how confident you are in your results. Would you be willing to use them in production to change how your company or enterprise makes decisions? If not, why not?

Technically, your report should demonstrate that you tried at least three techniques from class on your data set, in addition to anything else you decided to do to achieve your goal. If you used techniques not discussed in class, be sure to describe how they work and provide references so that anyone reading the paper has the tools to understand it.

Your report should also include some discussion about whether your project might produce a Weapon of Math Destruction (as defined in the lecture on the limitations of data science) and whether fairness is an important criterion to consider when choosing a model for your application (as we discussed in the lecture on fairness). If fairness is important, discuss what metrics might be appropriate to measure fairness in your application, and report their values for at least one of your classifiers.
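To make "report their values for at least one of your classifiers" concrete, here is a minimal sketch of two commonly used group fairness metrics: the demographic parity gap (difference in positive prediction rates between groups) and the equal opportunity gap (difference in true positive rates). Whether these are the right metrics for your application is a judgment call, and the lecture may emphasize others. The labels, predictions, and group memberships below are invented for illustration.

```python
def rate(values):
    """Fraction of ones in a list of 0/1 values (NaN if the list is empty)."""
    return sum(values) / len(values) if values else float("nan")


def demographic_parity_gap(y_pred, group):
    """Largest difference in positive prediction rate between any two groups."""
    rates = {
        g: rate([p for p, gi in zip(y_pred, group) if gi == g])
        for g in sorted(set(group))
    }
    return max(rates.values()) - min(rates.values()), rates


def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true positive rate between any two groups."""
    tprs = {
        g: rate([p for t, p, gi in zip(y_true, y_pred, group) if gi == g and t == 1])
        for g in sorted(set(group))
    }
    return max(tprs.values()) - min(tprs.values()), tprs


# Invented toy labels, classifier outputs, and protected-group membership.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

dp_gap, positive_rates = demographic_parity_gap(y_pred, group)
eo_gap, true_positive_rates = equal_opportunity_gap(y_true, y_pred, group)

print("positive prediction rate by group:", positive_rates)
print("demographic parity gap:", dp_gap)
print("true positive rate by group:", true_positive_rates)
print("equal opportunity gap:", eo_gap)
```

A gap of zero means the groups are treated identically under that metric; the two metrics can disagree, which is part of what your discussion should address.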

Algorithm development projects: your report should roughly follow the following outline.

introduction: topic + question you want to answer
important background: related work and papers you've read (at least 4 citations)
your methodology: how did you try to answer the question?
explanation and results of specific experiments you did (including a few plots or tables)
conclusion and future work

Final Peer review. Instructions are the same as the previous peer review assignment. We expect your reviews to be at least two paragraphs, and expect that you will provide critical and useful feedback to the team you're reviewing. Think about what kind of feedback or ideas would be most useful to you, and try to give what you'd like to receive!

Submit a grade for (and answer a few more specific questions about) the project via Google Forms: here for data analysis projects, here for algorithm design projects, and here for 5741 required presentations. Submit comments on the final report by opening an issue on the group's GitHub repo. Review assignment and link to form will be posted on our discussion forum.

Project ideas

COVID-19 symptom data challenge. Can you identify symptoms to help with early detection of COVID-19?
1. The COVID-19 symptom challenge can connect you to symptom data for this task; top prize is $50,000!
Who's ready to leave? Medical edition. Hospitals often find that patients take a turn for the worse and return to the hospital right after being released. The percentage of patients who return (within a short time window) is called the re-admission rate. Your goal is to help the hospital understand which patients are likely to return, and to recommend which patients are ready for release. Example data:
1. MIMIC III is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with >40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
Who's ready to leave? Criminal edition. Prisoners are often released on parole for good behaviour, but some immediately commit crimes and return to prison. The percentage of prisoners who return (within a short time window) is called the recidivism rate. Your goal is to help the government understand which inmates are likely to commit a subsequent crime, and which inmates are ready for parole. Your solution must not discriminate on the basis of a protected class, such as race, sex, or national origin. Example data:
1. COMPAS
Which hospital should I visit? Different hospitals can have startlingly different success rates for the same medical problem, and can have startlingly different costs for performing the same procedure. But remember that the best hospitals often attract the sickest patients, and so can appear to have the worst outcomes. Given a particular health condition, which hospital should I visit? Example data:
1. The Statewide Planning and Research Cooperative System (SPARCS) Hospital Inpatient Discharges record treatment and cost details for a large number of hospitals in NY State.
2. The Medical Expenditure Panel Survey (MEPS) is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS is the most complete source of data on the cost and use of health care and health insurance coverage.
3. Practice Fusion is America's fastest growing Electronic Health Record (EHR) community, with more than 170,000 medical professional users treating 34 million patients in all 50 states. Practice Fusion’s EHR-driven research dataset is used to detect disease outbreaks, identify dangerous drug interactions and compare the effectiveness of competing treatments.
Beat Nate Silver. The 2020 US presidential election is coming. One of the most critical tasks for any electoral campaign is to get out the vote — and that means getting out the vote for your candidate. Predict which districts (or better, which individuals) are likely to support your candidate, so that you can route resources to get out their vote. Example data:
1. The Ohio Voter File lists every registered voter in Ohio and which elections they've voted in.
2. Kaggle has data on votes in the 2016 primary election.
3. The Voting Information Project has a list of all polling places in the US.
4. The Census administers two surveys each year (the ACS and CPS) covering a range of economic and demographic questions.
Who gets credit? Decide which applications should be approved for credit. The goal for a financial firm would be to offer credit to customers who will pay back their loans. But how can you tell if your model is unfair, either to individuals or to (disadvantaged) groups? Example data:
1. The HMDA (Home Mortgage Disclosure Act) mortgage dataset.
Which algorithm should I use? Meta-learning, or learning-to-learn, is the process of borrowing knowledge from past similar learning tasks to improve performance on a new task in the future. It is an active area in modern-day machine learning research and has strong ties to automated machine learning (AutoML). One important subproblem is to predict how long it will take for a given model (like linear regression) to train on a given dataset. This would help us decide which models to try out when the time budget is limited, which is often the case in practice. Example data: we can collect data about how long it takes models to run on a variety of datasets, in order to fit a “meta-model” to predict how long a given model will run on a new dataset. Here is one such collection of data on runtimes (Cornell NetID login required). Talk to TA Chengrun Yang cy438 to discuss.
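For the runtime prediction idea above, here is a minimal sketch of what a very simple "meta-model" could look like: a least-squares fit of log runtime against log dataset size, which assumes runtime follows a power law in the number of examples. That functional form is an assumption, not something taken from the runtime collection mentioned above, and the runtimes below are synthetic (generated from t = 1e-6 · n²) purely so the fit can be checked.

```python
import math


def fit_power_law(sizes, runtimes):
    """Least-squares fit of log(runtime) = intercept + slope * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict_runtime(intercept, slope, size):
    """Predicted runtime (same units as the training runtimes) for a new size."""
    return math.exp(intercept + slope * math.log(size))


# Synthetic measurements: runtimes generated from t = 1e-6 * n^2, not real timings.
sizes = [100, 200, 400, 800]
runtimes = [1e-6 * n ** 2 for n in sizes]

intercept, slope = fit_power_law(sizes, runtimes)
predicted = predict_runtime(intercept, slope, 1600)

print(f"estimated exponent: {slope:.3f}")
print(f"predicted runtime at n = 1600: {predicted:.3f} seconds")
```

A real meta-model would likely use more dataset features (number of columns, sparsity, and so on) and noisy measured runtimes, so the fit would be approximate and would itself need held-out validation.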

More interesting data sets and project ideas:

https://www.drivendata.org/competitions/. Data science competitions to build a better world.
https://www.innocentive.com/. Problem solving competitions: some with data, some for which you'll need to be creative in hunting down your own datasets!
http://dreamchallenges.org/ Biomedical data science challenges (including one with access to a US hospital's COVID data)
https://www.data.gov The US Federal Government's compendium of data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. More than 200,000 datasets!
Kaggle data sets. Everything from peer-to-peer lending to speed dating to climate change to university rankings…!
Resource Watch. An open data platform under the World Resources Institute, a global environmental research nonprofit. This platform features over 250 datasets in 10 thematic areas: food, forests, water, energy, climate, cities, biodiversity, conflict, society, and commerce from sources such as NASA, NOAA, the UN, the World Bank, the FAO, and more.
Data Science for Social Good compiles a list of projects that use data to make the world a better place.
CornellTech Health Hack 2016. Data sets from a recent hackathon.
A curated list of weather and climate data sources, courtesy of NEON.
New York State data portal
National longitudinal survey of youth, a yearly survey dating back to 1979. You might use it to explore how public opinion changes over time.