Project description

The class will culminate in a final project in which you use the data visualization and data analysis skills you’ve learned to analyze an interesting dataset. This project will give you experience in the kind of work that a data scientist might perform in industry or at a government or nonprofit agency. Projects will be completed in groups of 2–3 students, except by special permission from the instructor.

There will be two major project milestones:

  1. Data visualization. You will describe your dataset, discuss a few interesting and important questions that you wish to answer using the dataset, present at least 3 different visualizations of the data, and discuss how these visualizations give insight into the questions you seek to answer. The report will be submitted as a PDF (max 4 pages) or HTML (of approximately equivalent length).
  2. Data analysis. You will use at least three tools learned in class, such as linear regression, logistic regression, cross-validation, QQ plots, or forecasting, to better answer the questions posed above about your dataset. You may also use other tools to achieve your goal, so long as they are clearly explained. The report will be submitted as a PDF (max 8 pages) or HTML (of approximately equivalent length).
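
As a rough illustration of how two of these tools can work together, the sketch below fits a one-predictor linear regression and estimates its out-of-sample error with 5-fold cross-validation. It is a minimal Python sketch using only the standard library. The function names (fit_line, k_fold_mse) and the synthetic data are made up for illustration and do not come from the course materials; your own analysis will likely use more predictors and a library of your choice.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average mean squared error on held-out folds (k-fold cross-validation)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            mean((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
        )
    return mean(fold_errors)


# Synthetic data, made up for illustration: y = 2 + 3x plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
cv_mse = k_fold_mse(xs, ys, k=5)
print(f"intercept={intercept:.2f} slope={slope:.2f} cv_mse={cv_mse:.2f}")
```

Comparing the cross-validated error of a few candidate models is one simple way to justify a model choice in your report.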

Both reports will be peer-graded and reviewed by your course instructors. Write a report you think your peers will enjoy reading!

Project rubric

The following criteria will be used to assess the second project milestone.

  • Is the project driven by asking and answering one (or a few) interesting questions?
  • Did the report address the questions? How well does the report answer the questions posed?
  • Are the visualizations easy to understand? Do they add value?
  • Is the report well-written and interesting?
  • Does the project use at least 3 tools from class?
    • Linear regression
    • Logistic regression
    • Checking assumptions of linear regression to ensure validity of p-values
    • Cross-validation or out-of-sample validation
    • Model selection
    • Assessing collinearity
    • Forecasting (with trend / with seasonality)
  • How creative are the analyses? Did this project surprise you? Did you learn something?
  • Are the techniques used well explained and easy to understand?
  • Does the project comply with the technical requirements (e.g., page limit)? Is it well-formatted and pretty?

How to submit your project

  1. Ensure that the members of your Canvas Final Project Group are correct by following the instructions given on Piazza.
  2. Designate one person from your group to submit your report to the appropriate assignment on Canvas. The submission will count for all members of your Canvas group. Submit milestone 1 here.

How to complete peer reviews

Milestone 1 peer review assignments will be posted on Canvas after class on 4/21/2020 and will be due on 4/26/2020 at 12:00pm Eastern time. You will review the submissions of two randomly assigned students and give feedback through Canvas. Feedback will consist of a rubric and the following questions, which you should answer in 3-5 sentences each in the comments field on Canvas:

  1. What is the project about?
  2. Is the report well-written and complete?
  3. Are the visualizations easy to understand? Do they add value to the report?
  4. What are 3 things you like about the project?
  5. What are 3 things that could be improved or explored in more detail for the next milestone?

Resources from Canvas to help you get started with peer reviews:

Suggested datasets

The instructors will post suggested datasets here. You are also welcome to choose your own.

  • COVID-19 datasets. These are being generated and updated rapidly! Feel free to pick your own. You might look for data from Singapore or Korea in particular, which have been very assiduously tracking cases.
    • Can you forecast the number of cases 1 week out? 2 weeks out?
    • Can you predict which people will be hospitalized?
    • Can you predict which countries will have the fastest increase based on data about that country? (You might need to join a few different datasets to answer this question!)
  • Bot or Human. Can online marketplaces distinguish bots from humans when they make offers on a product? This auction dataset consists of three files: “bids.csv” contains information about each bid, such as the bid id, the id of the person who made the offer, auction type, type of device the bid was made from, (a random hash of) the buyer’s IP address, etc.; “train.csv” records information on each bidder, including whether they are a bot or a human; and “test.csv” has the same data as train, but without the final designation, so that you may predict this yourself! Interesting questions include:
    • Where are the bot “hot spots”? Do certain auction types, or IPs, or countries, have more bots?
    • How does the number of bots evolve over time? Could we predict this?
    • How many offers on average does each bot place?
    • What variables best predict being a bot?
  • Airbnb. This Airbnb dataset consists of more than 50,000 Airbnb listings in New York City, categorized by neighborhood and housing type. The dataset has longitude/latitude information, which makes for compelling visualizations. Interesting questions include:
    • How is the price of an Airbnb listing influenced by the neighborhood, housing type, or amenities? You could refer to the Global Power City Index (GPCI), for example, to analyze different neighborhoods. http://mori-m-foundation.or.jp/english/ius2/gpci2/index.shtml
    • Is there a relationship between the minimum number of nights and the number of reviews?
    • How does the price change over time? Is the behaviour cyclic, or is there a trend? The website offers archived data for the same city for different time periods as well.
  • UNHCR Refugee Migration. This dataset records refugee migrations over 60 years, across hundreds of countries. Interesting questions include:
    • Are there any countries that are both large origin countries and asylum countries? Why might this be the case?
    • Do the countries with the most emigrants or immigrants change over time?
  • YouTube. YouTube calculates trending videos using a variety of factors. This dataset contains 200 trending videos across 205 days. Interesting questions include:
    • Can you forecast which videos will keep trending for several days?
    • What kind of tags occur frequently among trending videos?
    • Are there any similarities or differences across countries?
    • Are there channels that tend to produce many trending videos?
    • Do likes and dislikes affect views?
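
Many of the questions above start with joining two files and summarizing by group. As one example, the sketch below approaches the Bot or Human question "How many offers on average does each bot place?" by counting bids per bidder and averaging within each label. It is a minimal sketch, not a reference solution: the column names (bid_id, bidder_id, outcome), the coding of outcome as 1 for bot and 0 for human, and the tiny in-memory tables are all assumptions made for illustration, so check the headers of the real “bids.csv” and “train.csv” before adapting it.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Tiny made-up stand-ins for "bids.csv" and "train.csv". The column names
# and the 1 = bot / 0 = human coding are assumptions, not the real schema.
BIDS_CSV = """bid_id,bidder_id
1,a
2,a
3,a
4,b
5,c
6,c
"""
TRAIN_CSV = """bidder_id,outcome
a,1
b,0
c,1
"""


def mean_bids_by_label(bids_file, bidders_file):
    """Return {label: average number of bids per bidder with that label}."""
    # Step 1: count bids per bidder.
    bids_per_bidder = Counter(
        row["bidder_id"] for row in csv.DictReader(bids_file)
    )
    # Step 2: join to the labeled bidders and collect counts by label.
    counts_by_label = {}
    for row in csv.DictReader(bidders_file):
        # A bidder with no recorded bids contributes a count of zero.
        n_bids = bids_per_bidder[row["bidder_id"]]
        counts_by_label.setdefault(row["outcome"], []).append(n_bids)
    # Step 3: average within each label.
    return {label: mean(counts) for label, counts in counts_by_label.items()}


result = mean_bids_by_label(io.StringIO(BIDS_CSV), io.StringIO(TRAIN_CSV))
print(result)
```

To run this on the real data, pass file objects opened with open("bids.csv", newline="") and open("train.csv", newline="") in place of the io.StringIO stand-ins.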

Grading

Grades will be based on peer grades, course staff grades, and peer evaluation by your teammates.