Final Project
From Data to Decisions: A Reproducible Data Science Investigation with TidyTuesday
This is an INDIVIDUAL project (not a team project).
Project Purpose
This final project assesses your ability to:
- Work with real, messy, open datasets
- Implement the entire data science lifecycle
- Apply:
- Regression
- Classification
- Clustering
- Build reproducible machine learning pipelines
- Write and validate testable, modular Python code
- Interpret results in real-world terms
- Reflect on ethical and societal consequences
- Communicate findings professionally
You will select your own dataset from the TidyTuesday archive, justify your methodological approach, and build a fully reproducible, end-to-end data science investigation.
Repository Structure
```
.
├── data/                 # Your TidyTuesday CSV(s) (EDITABLE)
│   └── your_dataset.csv
│
├── src/
│   ├── ds.py             # Main reproducible ML pipeline (YOU WRITE)
│   └── test_ds.py        # Unit tests for your pipeline (YOU WRITE)
│
├── proposal.qmd          # Project proposal & justification (EDITABLE)
├── index.qmd             # Final scientific & ethical report (EDITABLE)
│
├── final_project.qmd     # Analysis, EDA, visuals & interpretation (EDITABLE)
│
├── requirements.txt      # Reproducible environment (EDITABLE)
├── .gitignore            # Prevents junk files from being tracked (EDITABLE)
│
└── README.md             # README overview - DO NOT EDIT
```
Only modify files explicitly marked as editable.
What You Must Deliver
| Component | File | Points |
|---|---|---|
| Dataset Justification | `proposal.qmd` | 10 |
| Reproducible ML Pipeline | `ds.py` | 30 |
| Automated Testing | `test_ds.py` | 5 |
| Analysis & Visualization | `final_project.qmd` | 20 |
| Final Written Report | `index.qmd` | 15 |
| Final Presentation | Google Slides + Panopto | 15 |
| Peer review + reflection + repo quality | GitHub Issues + repo-wide | 5 |
| **TOTAL** | | 100 |
Proposal
(Detailed prompts can be found in `proposal.qmd`.)
Your proposal should include:
A brief description of your dataset including its provenance, dimensions, etc. (Make sure to load the data and use inline code for some of this information.)
The reason why you chose this dataset.
The two questions you want to answer.
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
A weekly “plan of attack” outlining the steps you will take to complete your project.
This should be in the following form:

| Task Name | Status | Due | Priority (Low / Moderate / High) | Summary |
|---|---|---|---|---|

Note that this is a living document and should be updated at minimum once a week.
Write-up
(Detailed prompts can be found in `index.qmd`.)
Your write-up should follow a structured, end-to-end data science workflow. It should be organized into the following sections:
1. Introduction (1–2 paragraphs)
Provide a brief introduction to your dataset.
Describe what the dataset contains and where it comes from (e.g., TidyTuesday).
Summarize key variables and context.
Write as if this is a standalone document—assume the reader has no prior knowledge of the dataset.
2. Data Understanding & Exploration
Overview (1–2 paragraphs)
Describe the size and structure of the dataset (rows, columns, variable types).
Identify any immediate data quality issues (missing values, duplicates, unusual values).
EDA Approach (1–2 paragraphs)
Explain how you explored the data (summary statistics, distributions, relationships).
Justify why these steps help build intuition about the dataset.
Analysis (code + visuals + comments)
Include summary tables and visualizations (e.g., histograms, pairplots).
Highlight:
Distributions
Outliers
Potential transformations
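As a sketch of what these EDA chunks might look like in `final_project.qmd`, the snippet below computes summary statistics and per-column missingness. The tiny synthetic frame and its column names (`year`, `value`) are placeholders for your actual TidyTuesday variables.

```python
import pandas as pd

# Placeholder data -- swap in your TidyTuesday CSV via your own load_data()
df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, None],
    "value": [10.0, 12.5, 11.0, 30.0, 9.5],
})

# Summary statistics for all numeric columns (count, mean, quartiles, ...)
summary = df.describe()

# Missingness per column -- a quick first data-quality check
missing = df.isna().sum()
```

In the rendered notebook you would follow this with histograms or pairplots and a short Markdown interpretation of anything unusual (skew, outliers, gaps).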
Discussion (1–2 paragraphs)
What patterns or issues did you discover?
What variables seem most important or problematic?
3. Data Cleaning & Feature Engineering
Cleaning (1 paragraph)
Describe what cleaning steps were applied (e.g., handling missing values, filtering).
Explain why these steps are reasonable.
Feature Engineering (1 paragraph)
Describe new features created.
Explain how they improve your ability to model or analyze the data.
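A minimal sketch of a cleaning plus feature-engineering step, using hypothetical columns (`price`, `sqft`) as stand-ins for your own variables:

```python
import pandas as pd

# Hypothetical raw data; column names are placeholders for your dataset
raw = pd.DataFrame({
    "price": [100.0, None, 250.0, 400.0],
    "sqft": [500, 800, 1000, 1600],
})

# Cleaning: drop rows that are missing the outcome variable
clean = raw.dropna(subset=["price"]).copy()

# Feature engineering: a derived ratio feature for later modeling
clean["price_per_sqft"] = clean["price"] / clean["sqft"]
```

Whatever steps you choose, the justification in your prose (why dropping rather than imputing, why this ratio) matters as much as the code.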
4. Supervised Learning – Regression
Introduction (1 paragraph)
Define your regression task:
What are you predicting?
Why is it meaningful?
Approach (1 paragraph)
Describe:
Features used
Model choice
Train/test split strategy
Analysis (code + 1–2 figures)
Train your model and report metrics (e.g., R², RMSE).
Include at least one visualization (e.g., true vs. predicted plot).
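A sketch of the train/evaluate pattern on synthetic data (your pipeline should call the equivalent functions from `ds.py` rather than fitting inline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)              # fixed seed for reproducibility
X = rng.uniform(0, 10, size=(200, 1))        # synthetic predictor
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 200)  # linear signal + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE = sqrt(MSE)
```

A true-vs-predicted scatter of `y_test` against `pred` is the natural companion figure.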
Discussion (1–2 paragraphs)
Interpret model performance.
Discuss:
Bias (over/under prediction)
Where the model performs poorly
Whether results are meaningful
5. Supervised Learning – Classification (if applicable)
Introduction (1 paragraph)
- Define the classification task and why it matters.
Approach (1 paragraph)
Describe:
Features
Model choice
Use of stratification
Analysis (code + metrics)
Train and evaluate your model.
Report metrics beyond accuracy if needed (e.g., precision, recall).
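To illustrate stratification and metrics beyond accuracy, here is a sketch on synthetic imbalanced data (the 90/10 split is an assumption standing in for whatever imbalance your dataset shows):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90/10) as a stand-in for your dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
```

With imbalance this strong, accuracy alone is misleading (predicting the majority class already scores ~0.9), which is exactly why precision, recall, and F1 belong in the report.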
Discussion (1–2 paragraphs)
Interpret results.
Discuss:
Class imbalance
Practical or ethical implications of errors
6. Unsupervised Learning – Clustering (if applicable)
Introduction (1 paragraph)
- What structure are you trying to uncover?
Approach (1 paragraph)
Describe:
Features used
Choice of number of clusters (K)
Analysis (code + 1 figure)
- Run clustering and visualize results.
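A sketch of a clustering run with one of the required evaluation scores; the three well-separated synthetic blobs are an assumption, and real TidyTuesday features will rarely be this clean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs standing in for real feature columns
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in (0.0, 5.0, 10.0)])

# Standardize so no feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
score = calinski_harabasz_score(X_scaled, km.labels_)  # higher = tighter clusters
```

Re-running with different seeds or different K and comparing scores is one simple way to probe whether the clusters are stable or an artifact of the algorithm.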
Discussion (1–2 paragraphs)
Do clusters represent meaningful groups?
Are results stable or sensitive?
7. Error Analysis & Limitations (1–2 paragraphs)
Reflect on:
Where models perform poorly and why
Missing or biased data
Assumptions and limitations of your methods
8. Ethical & Societal Considerations (1–2 paragraphs)
Address explicitly:
Who is represented in the data—and who is not
Potential harms from misuse
Biases embedded in the dataset or modeling choices
What responsible use of this analysis looks like
9. Conclusion & Future Work (1 paragraph)
Summarize:
Key findings
What you learned about the data/system
What you would do next with:
More time
More data
More advanced methods
General Guidelines
Be concise: paragraphs should typically be ≤ 5 sentences
Use clear section titles
Include code, figures, and interpretation together
Focus on reasoning and justification, not just output
Statistical tests are optional but allowed
Presentation + slides
Slides
In addition to the written report, you will also create presentation slides and deliver a presentation that summarizes and showcases your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that’s easy to collaborate with, e.g., Google Slides. If you choose this option, save the slides as PDF and upload it to your repo as presentation.pdf.
You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
The slide deck should be roughly 6-7 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6-7 slides.
Title Slide
Slide 1: Introduce the topic and motivation
Slide 2: Introduce the data
Slide 3: Main Question
Slides 4-6: Analysis 1, 2, etc.
Slide 7: Conclusions + future work
Presentation
Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You will pre-record a video using one of the options below.
To pre-record your presentation, you may use any platform that works best for you. Below are a few resources on recording videos:
- Recording presentations in Zoom
- Apple Quicktime for screen recording
- Windows 10 built-in screen recording functionality
- Kap for screen recording
Once your video is ready, upload the video to Panopto or another video platform (e.g., YouTube), then add a link to your video in your repo README.
To create your video with Panopto:
- Go to https://arizona.hosted.panopto.com/ and sign in via your NetID
- Click the “+” and select “Upload media”.
- Drag and drop your recorded video.
- Once you’ve uploaded the video to Panopto, click to share the video and copy the video’s URL. You will need this link for your repo README.
Grading Rubric:
Slides
Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?
Presentation
- Time management: Did they go over time?
- Professionalism: How well did they present? Does the presentation appear to be well practiced?
- Narrative: Did they present a clear story, or did it seem like independent pieces of work patched together?
- Creativity and critical thought: Is the project carefully thought out? Does it appear that time and effort went into the planning and implementation of the project?
- Content: Including, but not limited to the following:
- Is the question well articulated in the presentation?
- Can the question be answered with the data?
- Does the analysis answer the question?
- Are the conclusion(s) made based on the analysis justifiable?
- Are the limitations carefully considered and articulated?
Dataset Requirements (TidyTuesday Only)
You must choose one dataset from:
🔗 https://github.com/rfordatascience/tidytuesday
Your dataset must:
- Come from one specific week
- Include at least one numeric outcome
- Support at least TWO of the following:
- Regression
- Classification
- Clustering
You may combine multiple tables, but only tables from the same TidyTuesday week.
Required Files & Their Roles
proposal.qmd — Project Design & Justification (10 pts)
Must include:
- Dataset name + TidyTuesday link
- Why you chose this dataset
- Your real-world research question(s)
- Which modeling tasks you will apply:
- Regression (target variable)
- Classification (class definition)
- Clustering (expected structure)
- Potential bias, harm, and ethical risks
You may not begin modeling until this is approved.
src/ds.py — Reproducible ML Pipeline (30 pts)
This is the engineering backbone of your project.
Your pipeline must include all of the following categories:
Data Handling
- `load_data`
- `initial_summary`
- `clean_data`
- `feature_engineering`
Modeling (At Least 2 Required)
- `run_regression`
- `run_classification`
- `run_clustering`
Evaluation
- Regression: R², RMSE
- Classification: Accuracy, F1, ROC-AUC
- Clustering: BIC, Calinski–Harabasz
✅ Must run end-to-end from raw CSV → metrics
✅ Must be pure functions only (no prints, no plotting)
✅ Must be deterministic (fixed random seeds)
✅ Must be fully testable
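The required function names above suggest a shape like the sketch below. The exact signatures are illustrative assumptions, not a specification; what matters is the pattern: pure functions, no prints or plots, one fixed seed, plain-data return values.

```python
"""Illustrative sketch of the pure-function style required in ds.py."""
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # one fixed seed keeps every run deterministic


def load_data(path: str) -> pd.DataFrame:
    """Load the raw TidyTuesday CSV; no side effects, no prints."""
    return pd.read_csv(path)


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; never mutate the input frame."""
    return df.dropna().reset_index(drop=True)


def run_regression(df: pd.DataFrame, features: list, target: str) -> dict:
    """Fit a baseline model and return metrics as plain data."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df[target], test_size=0.25, random_state=RANDOM_SEED
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return {"r2": r2_score(y_te, model.predict(X_te))}
```

Because each function takes data in and returns data out, the whole chain can be exercised by `test_ds.py` without touching the filesystem or a display.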
src/test_ds.py — Automated Testing (5 pts)
You must implement at least 8 real unit tests that validate:
- Data loading
- Cleaning behavior
- Feature engineering outputs
- Train/test splits
- Model training & prediction
- Metric correctness
- Handling of edge cases:
- NaNs
- Small data
- Class imbalance
These tests enforce professional engineering discipline.
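A sketch of what a few of these tests might look like. In your real `test_ds.py` you would `from ds import clean_data`; here a stand-in `clean_data` is defined inline (an assumption) so the example is self-contained and runnable with `pytest`.

```python
"""Illustrative test_ds.py sketch (stand-in clean_data; import yours instead)."""
import pandas as pd


def clean_data(df):  # stand-in for: from ds import clean_data
    return df.dropna().reset_index(drop=True)


def test_clean_data_drops_nans():
    df = pd.DataFrame({"value": [1.0, None, 3.0]})
    out = clean_data(df)
    assert out["value"].isna().sum() == 0
    assert len(out) == 2


def test_clean_data_does_not_mutate_input():
    df = pd.DataFrame({"value": [1.0, None]})
    clean_data(df)
    assert len(df) == 2  # original frame unchanged


def test_clean_data_handles_empty_frame():
    out = clean_data(pd.DataFrame({"value": []}))
    assert out.empty  # edge case: no rows, no crash
```

Tests for splits, model training, metric correctness, and class imbalance follow the same pattern: build a tiny frame whose correct answer you know, call one function, assert.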
final_project.qmd — Analysis & Visualization (20 pts)
Your notebook must:
- Load data using `ds.py`
- Perform EDA:
- Distributions
- Correlations
- Missingness
- Run:
- At least 1 regression
- At least 1 classification or clustering
- Visualize:
- Predictions vs truth
- Cluster structure
- Feature relationships
- Interpret results in Markdown
No model logic is allowed inside the notebook.
All models must come from ds.py.
index.qmd — Final Scientific & Ethical Report (15 pts)
Must include:
- Research questions
- Dataset background
- Modeling choices & justification
- Results & limitations
- Ethical risks & societal implications
- What you would do next with more time or data
This is where your thinking is graded, not just your code.
Running the Project Locally
1. Install Dependencies

```bash
pip install -r requirements.txt
```

2. Run Unit Tests

```bash
pytest
```

✅ All tests must pass before submission.

3. Run Code Quality Check

```bash
pylint ds.py
```

✅ Minimum acceptable score: 9.0

4. Render Final Analysis

```bash
quarto render final_project.qmd
```

✅ Notebook must render without error.
Rules & Constraints
✅ You may use:
- `pandas`
- `numpy`
- `scikit-learn`
- `matplotlib` / `seaborn`
- `statsmodels` (optional)
❌ You may NOT:
- Use AutoML tools
- Use LLMs to generate conclusions without verification
- Copy existing TidyTuesday solutions
- Fit models directly inside the notebook
- Hardcode answers to pass tests
- Ignore ethics and interpretability
Peer-review + Reflection + Professionalism (5 pts)
You will lose points for:
- Poor peer-review and obvious GenAI usage
- Poor reflection (e.g., I did everything great)
- Tracking large datasets in Git
- Poor Git history
- Missing documentation
- Non-reproducible results
✅ Submission Checklist
Before your final push:
- ✅ Proposal approved
- ✅ `ds.py` complete & clean
- ✅ `test_ds.py` complete
- ✅ `pytest` passes
- ✅ `pylint ds.py` ≥ 9.0
- ✅ `final_project.qmd` renders
- ✅ `index.qmd` submitted
- ✅ Dataset placed in `/data`
- ✅ Clean Git history
Why This Project Is Structured This Way
This mirrors real-world data science in:
- Healthcare analytics
- Climate modeling
- Economic forecasting
- Policy research
- Responsible AI development
You are being evaluated on your ability to:
- Think like a scientist
- Build like an engineer
- Reflect like an ethicist
- Communicate like a professional analyst
Academic Integrity
This is an individual project.
You may:
- Discuss ideas at a high level
- Debug conceptually with classmates
You may NOT:
- Share code
- Share tests
- Copy solutions
- Reuse previous TidyTuesday analyses
All work must be your own.
✅ If you can successfully complete this project, you are demonstrating real industry-ready applied data science skills.