Final Project

From Data to Decisions: A Reproducible Data Science Investigation with TidyTuesday

Important

This is an INDIVIDUAL project (not a team project).


Project Purpose

This final project assesses your ability to:

  • Work with real, messy, open datasets
  • Implement the entire data science lifecycle
  • Apply:
    • Regression
    • Classification
    • Clustering
  • Build reproducible machine learning pipelines
  • Write and validate testable, modular Python code
  • Interpret results in real-world terms
  • Reflect on ethical and societal consequences
  • Communicate findings professionally

You will select your own dataset from the TidyTuesday archive, justify your methodological approach, and build a fully reproducible, end-to-end data science investigation.


Repository Structure


.
├── data/                       # Your TidyTuesday CSV(s) (EDITABLE)
│   └── your_dataset.csv
│
├── src/
│   ├── ds.py                   # Main reproducible ML pipeline (YOU WRITE)
│   └── test_ds.py              # Unit tests for your pipeline (YOU WRITE)
│
├── proposal.qmd                # Project proposal & justification (EDITABLE)
├── index.qmd                   # Final scientific & ethical report (EDITABLE)
│
├── final_project.qmd           # Analysis, EDA, visuals & interpretation (EDITABLE)
│
├── requirements.txt            # Reproducible environment (EDITABLE)
├── .gitignore                  # Prevents junk files from being tracked (EDITABLE)
│
└── README.md                   # README overview - DO NOT EDIT
Important

Only modify files explicitly marked as editable.


What You Must Deliver

| Component | File | Points |
|---|---|---|
| Dataset Justification | proposal.qmd | 10 |
| Reproducible ML Pipeline | ds.py + test_ds.py | 30 |
| Analysis & Visualization | final_project.qmd | 20 |
| Final Written Report | index.qmd | 15 |
| Final Presentation | Google Slides + Panopto | 15 |
| Peer review + reflection + repo quality | GitHub Issues + repo-wide | 5 |
| TOTAL | | 100 |

Proposal

(Detailed prompts can be found in - proposal.qmd)

Your proposal should include:

  • A brief description of your dataset including its provenance, dimensions, etc. (Make sure to load the data and use inline code for some of this information.)

  • The reason why you chose this dataset.

  • The two questions you want to answer.

  • A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

  • A weekly “plan of attack” outlining the steps you will take to complete your project.

    • This should be in the following form:

      | Task Name | Status | Due | Priority | Summary |
      |---|---|---|---|---|
      |  |  |  | Low / Moderate / High |  |

    • Note that this is a living document and should be updated at least once a week.

Write-up

(Detailed prompts can be found in - index.qmd)

Your write-up should follow a structured, end-to-end data science workflow. It should be organized into the following sections:

1. Introduction (1–2 paragraphs)

Provide a brief introduction to your dataset.

  • Describe what the dataset contains and where it comes from (e.g., TidyTuesday).

  • Summarize key variables and context.

  • Write as if this is a standalone document—assume the reader has no prior knowledge of the dataset.

2. Data Understanding & Exploration

Overview (1–2 paragraphs)

  • Describe the size and structure of the dataset (rows, columns, variable types).

  • Identify any immediate data quality issues (missing values, duplicates, unusual values).

EDA Approach (1–2 paragraphs)

  • Explain how you explored the data (summary statistics, distributions, relationships).

  • Justify why these steps help build intuition about the dataset.

Analysis (code + visuals + comments)

  • Include summary tables and visualizations (e.g., histograms, pairplots).

  • Highlight:

    • Distributions

    • Outliers

    • Potential transformations

Discussion (1–2 paragraphs)

  • What patterns or issues did you discover?

  • What variables seem most important or problematic?
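As an illustrative sketch of the kind of EDA code expected in this section (the tiny synthetic frame is a stand-in; in your project you would read your TidyTuesday CSV from `data/` instead):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so rendering works anywhere
import matplotlib.pyplot as plt

# Tiny synthetic stand-in; replace with your TidyTuesday data
df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2021, 2021],
    "value": [3.1, 2.8, None, 4.0, 3.9, 12.5],  # one NaN, one outlier
    "group": ["a", "a", "b", "b", "a", "b"],
})

summary = df.describe(include="all")  # summary statistics for every column
missing = df.isna().sum()             # missing values per column

# Histograms of numeric columns reveal skew and outliers at a glance
df.select_dtypes("number").hist(bins=10)
plt.tight_layout()
```

Walking through `summary` and `missing` in prose, then pointing at the histograms, is exactly the "code + visuals + comments" pattern asked for here.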

3. Data Cleaning & Feature Engineering

Cleaning (1 paragraph)

  • Describe what cleaning steps were applied (e.g., handling missing values, filtering).

  • Explain why these steps are reasonable.

Feature Engineering (1 paragraph)

  • Describe new features created.

  • Explain how they improve your ability to model or analyze the data.
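A minimal sketch of feature engineering of this kind; the column names (`revenue`, `employees`, `founded`) are hypothetical and should be adapted to your dataset:

```python
import pandas as pd

REFERENCE_YEAR = 2024  # assumed reference point for the age feature

# Hypothetical columns; adapt to your TidyTuesday data
df = pd.DataFrame({
    "revenue": [100.0, 250.0, 80.0, 400.0],
    "employees": [10, 20, 8, 25],
    "founded": [1995, 2010, 2018, 2001],
})

# Ratio feature: scale-free comparison across differently sized units
df["revenue_per_employee"] = df["revenue"] / df["employees"]

# Age feature derived from a year column
df["company_age"] = REFERENCE_YEAR - df["founded"]

# Binned categorical feature, often easier to interpret than the raw number
df["size_band"] = pd.cut(df["employees"], bins=[0, 10, 20, 50],
                         labels=["small", "medium", "large"])
```

Each new column should come with a one-sentence justification of why it helps the model or the analysis, as the prompt above asks.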

4. Supervised Learning – Regression

Introduction (1 paragraph)

  • Define your regression task:

    • What are you predicting?

    • Why is it meaningful?

Approach (1 paragraph)

  • Describe:

    • Features used

    • Model choice

    • Train/test split strategy

Analysis (code + 1–2 figures)

  • Train your model and report metrics (e.g., R², RMSE).

  • Include at least one visualization (e.g., true vs. predicted plot).

Discussion (1–2 paragraphs)

  • Interpret model performance.

  • Discuss:

    • Bias (over/under prediction)

    • Where the model performs poorly

    • Whether results are meaningful
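The regression workflow above, sketched end to end on synthetic data (your features, model choice, and split strategy may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)          # fixed seed for reproducibility
X = rng.normal(size=(200, 3))            # stand-in features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Held-out test set so the metrics reflect generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
```

A scatter of `y_test` against `pred` with a 45° reference line is the simplest version of the true-vs-predicted plot required above; systematic departures from the line indicate bias.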

5. Supervised Learning – Classification (if applicable)

Introduction (1 paragraph)

  • Define the classification task and why it matters.

Approach (1 paragraph)

  • Describe:

    • Features

    • Model choice

    • Use of stratification

Analysis (code + metrics)

  • Train and evaluate your model.

  • Report metrics beyond accuracy if needed (e.g., precision, recall).

Discussion (1–2 paragraphs)

  • Interpret results.

  • Discuss:

    • Class imbalance

    • Practical or ethical implications of errors
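A sketch of the classification workflow above on a synthetic imbalanced dataset, showing why stratification and metrics beyond accuracy matter (model and feature choices here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a 90/10 class imbalance
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio identical in train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)  # can look good under imbalance
f1 = f1_score(y_test, pred)         # penalizes missing the rare class
```

With a 90/10 split, a classifier that always predicts the majority class already scores 90% accuracy, which is why F1 (or precision/recall) is required here.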

6. Unsupervised Learning – Clustering (if applicable)

Introduction (1 paragraph)

  • What structure are you trying to uncover?

Approach (1 paragraph)

  • Describe:

    • Features used

    • Choice of number of clusters (K)

Analysis (code + 1 figure)

  • Run clustering and visualize results.

Discussion (1–2 paragraphs)

  • Do clusters represent meaningful groups?

  • Are results stable or sensitive?
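One way to sketch the K-selection step above: score a range of K values with the Calinski–Harabasz index and look for a peak (the planted blob centers below are synthetic stand-ins for your features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in with 3 well-separated planted groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=7)

# Score a range of K values; a peak suggests a reasonable cluster count
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
```

Re-running with different seeds is a quick stability check: if `best_k` changes across seeds, the cluster structure is probably not robust.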

7. Error Analysis & Limitations (1–2 paragraphs)

Reflect on:

  • Where models perform poorly and why

  • Missing or biased data

  • Assumptions and limitations of your methods

8. Ethical & Societal Considerations (1–2 paragraphs)

Address explicitly:

  • Who is represented in the data—and who is not

  • Potential harms from misuse

  • Biases embedded in the dataset or modeling choices

  • What responsible use of this analysis looks like

9. Conclusion & Future Work (1 paragraph)

Summarize:

  • Key findings

  • What you learned about the data/system

  • What you would do next with:

    • More time

    • More data

    • More advanced methods

General Guidelines

  • Be concise: paragraphs should typically be ≤ 5 sentences

  • Use clear section titles

  • Include code, figures, and interpretation together

  • Focus on reasoning and justification, not just output

  • Statistical tests are optional but allowed

Presentation + slides

Slides

In addition to the written report, you will also create presentation slides and deliver a presentation that summarizes and showcases your project. Introduce your research question and dataset, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual companion to your written report and will be graded for content and quality.

You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that makes sharing and exporting easy, e.g., Google Slides. Whichever option you choose, save the slides as a PDF and upload it to your repo as presentation.pdf.

You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!

The slide deck should be roughly 6-7 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6-7 slides.

  • Title Slide

  • Slide 1: Introduce the topic and motivation

  • Slide 2: Introduce the data

  • Slide 3: Main Question

  • Slides 4–6: Analysis 1, 2, etc.

  • Slide 7: Conclusions + future work

Presentation

Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You will pre-record a video using one of the options below.

To pre-record your presentation, you may use any platform that works best for you. Below are a few resources on recording videos:

Once your video is ready, upload the video to Panopto or another video platform (e.g., YouTube), then add a link to your video in your repo README.

To create your video with Panopto:

  • Go to https://arizona.hosted.panopto.com/ and sign in via your NetID
  • Click the “+” and select “Upload media”.
  • Drag and drop your recorded video.
  • Once you’ve uploaded the video to Panopto, click to share the video and copy the video’s URL. You will need this URL when you add the video link to your repo README.

Grading Rubric:

Slides

Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?

Presentation

  • Time management: Did they go over time?
  • Professionalism: How well did they present? Does the presentation appear to be well practiced?
  • Narrative: Did they present a clear story, or did it seem like independent pieces of work patched together?
  • Creativity and critical thought: Is the project carefully thought out? Does it appear that time and effort went into the planning and implementation of the project?
  • Content: Including, but not limited to the following:
    • Is the question well articulated in the presentation?
    • Can the question be answered with the data?
    • Does the analysis answer the question?
    • Are the conclusion(s) made based on the analysis justifiable?
    • Are the limitations carefully considered and articulated?

Dataset Requirements (TidyTuesday Only)

You must choose one dataset from:

🔗 https://github.com/rfordatascience/tidytuesday

Your dataset must:

  • Come from one specific week
  • Include at least one numeric outcome
  • Support at least TWO of the following:
    • Regression
    • Classification
    • Clustering

You may combine multiple tables from the same TidyTuesday week only.


Required Files & Their Roles

proposal.qmd — Project Design & Justification (10 pts)

Must include:

  • Dataset name + TidyTuesday link
  • Why you chose this dataset
  • Your real-world research question(s)
  • Which modeling tasks you will apply:
    • Regression (target variable)
    • Classification (class definition)
    • Clustering (expected structure)
  • Potential bias, harm, and ethical risks
Warning

You may not begin modeling until this is approved.


src/ds.py — Reproducible ML Pipeline (30 pts)

This is the engineering backbone of your project.

Your pipeline must include all of the following categories:

Data Handling

  • load_data
  • initial_summary
  • clean_data
  • feature_engineering

Modeling (At Least 2 Required)

  • run_regression
  • run_classification
  • run_clustering

Evaluation

  • Regression: R², RMSE
  • Classification: Accuracy, F1, ROC-AUC
  • Clustering: BIC, Calinski–Harabasz

✅ Must run end-to-end from raw CSV → metrics
✅ Must be pure functions only (no prints, no plotting)
✅ Must be deterministic (fixed random seeds)
✅ Must be fully testable
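One possible shape for this file, sketched against the required function names (the bodies are illustrative placeholders; your cleaning and modeling choices will differ):

```python
"""Illustrative shape for src/ds.py: pure, deterministic, testable."""
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

SEED = 42  # one fixed seed makes every run reproducible


def load_data(path: str) -> pd.DataFrame:
    """Read the raw TidyTuesday CSV from disk."""
    return pd.read_csv(path)


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and rows with missing values; return a new frame."""
    return df.drop_duplicates().dropna().reset_index(drop=True)


def run_regression(df: pd.DataFrame, target: str) -> dict:
    """Fit a baseline model and return metrics only (no prints, no plots)."""
    X = df.drop(columns=[target]).select_dtypes("number")
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=SEED)
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {"r2": r2_score(y_te, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_te, pred)))}
```

Because each function takes data in and returns data or metrics out, the whole pipeline can be driven from the notebook and exercised by unit tests without any side effects.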


src/test_ds.py — Automated Testing (5 pts)

You must implement at least 8 real unit tests that validate:

  • Data loading
  • Cleaning behavior
  • Feature engineering outputs
  • Train/test splits
  • Model training & prediction
  • Metric correctness
  • Handling of edge cases:
    • NaNs
    • Small data
    • Class imbalance
Tip

These tests enforce professional engineering discipline.
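A sketch of what such tests can look like. The two stand-in functions below mimic a hypothetical `clean_data` / `run_regression` interface so the example is self-contained; in your real test_ds.py you would instead write `from ds import clean_data, run_regression`:

```python
"""Example unit tests in pytest style (run with `pytest`)."""
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def clean_data(df):  # stand-in for ds.clean_data
    return df.drop_duplicates().dropna().reset_index(drop=True)


def run_regression(df, target):  # stand-in for ds.run_regression
    if df.empty:
        raise ValueError("no data to model")
    X, y = df.drop(columns=[target]), df[target]
    pred = LinearRegression().fit(X, y).predict(X)
    return {"r2": r2_score(y, pred),
            "rmse": float(np.sqrt(mean_squared_error(y, pred)))}


def test_clean_data_removes_nans():
    df = pd.DataFrame({"x": [1.0, None, 3.0], "y": [2.0, 4.0, 6.0]})
    assert clean_data(df).isna().sum().sum() == 0


def test_clean_data_removes_duplicates():
    df = pd.DataFrame({"x": [1.0, 1.0], "y": [2.0, 2.0]})
    assert len(clean_data(df)) == 1


def test_regression_metrics_are_sane():
    df = pd.DataFrame({"x": range(50), "y": [2.0 * i for i in range(50)]})
    out = run_regression(df, target="y")
    assert 0.0 <= out["r2"] <= 1.0 and out["rmse"] >= 0.0


def test_regression_rejects_empty_frame():
    try:  # with pytest installed, prefer `with pytest.raises(ValueError):`
        run_regression(pd.DataFrame(), target="y")
        raised = False
    except ValueError:
        raised = True
    assert raised
```

Edge-case tests like the empty-frame one above are what the rubric means by "handling of edge cases": they pin down behavior your pipeline must not silently get wrong.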


final_project.qmd — Analysis & Visualization (20 pts)

Your notebook must:

  1. Load data using ds.py
  2. Perform EDA:
    • Distributions
    • Correlations
    • Missingness
  3. Run:
    • At least 1 regression
    • At least 1 classification or clustering
  4. Visualize:
    • Predictions vs truth
    • Cluster structure
    • Feature relationships
  5. Interpret results in Markdown
Warning

No model logic is allowed inside the notebook.
All models must come from ds.py.


index.qmd — Final Scientific & Ethical Report (15 pts)

Must include:

  • Research questions
  • Dataset background
  • Modeling choices & justification
  • Results & limitations
  • Ethical risks & societal implications
  • What you would do next with more time or data
Important

This is where your thinking is graded, not just your code.


Running the Project Locally

1. Install Dependencies

pip install -r requirements.txt

2. Run Unit Tests

pytest

✅ All tests must pass before submission.


3. Run Code Quality Check

pylint src/ds.py

✅ Minimum acceptable score: 9.0


4. Render Final Analysis

quarto render final_project.qmd

✅ Notebook must render without error.


Rules & Constraints

✅ You may use:

  • pandas
  • numpy
  • scikit-learn
  • matplotlib / seaborn
  • statsmodels (optional)

❌ You may NOT:

  • Use AutoML tools
  • Use LLMs to generate conclusions without verification
  • Copy existing TidyTuesday solutions
  • Fit models directly inside the notebook
  • Hardcode answers to pass tests
  • Ignore ethics and interpretability

Peer-review + Reflection + Professionalism (5 pts)

You will lose points for:

  • Poor peer-review and obvious GenAI usage
  • Poor reflection (e.g., “I did everything great”)
  • Tracking large datasets in Git
  • Poor Git history
  • Missing documentation
  • Non-reproducible results

✅ Submission Checklist

Before your final push:

  • ✅ Proposal approved
  • ✅ ds.py complete & clean
  • ✅ test_ds.py complete
  • ✅ pytest passes
  • ✅ pylint src/ds.py ≥ 9.0
  • ✅ final_project.qmd renders
  • ✅ index.qmd complete
  • ✅ Dataset placed in data/
  • ✅ Clean Git history

Why This Project Is Structured This Way

This mirrors real-world data science in:

  • Healthcare analytics
  • Climate modeling
  • Economic forecasting
  • Policy research
  • Responsible AI development

You are being evaluated on your ability to:

  1. Think like a scientist
  2. Build like an engineer
  3. Reflect like an ethicist
  4. Communicate like a professional analyst

Academic Integrity

This is an individual project.

You may:

  • Discuss ideas at a high level
  • Debug conceptually with classmates

You may NOT:

  • Share code
  • Share tests
  • Copy solutions
  • Reuse previous TidyTuesday analyses

All work must be your own.


If you can successfully complete this project, you are demonstrating real industry-ready applied data science skills.