Final Project
From Data to Decisions: A Reproducible Data Science Investigation with TidyTuesday
This is an INDIVIDUAL project (not a team project).
Project Purpose
This final project assesses your ability to:
- Work with real, messy, open datasets
- Implement the entire data science lifecycle
- Apply:
- Regression
- Classification
- Clustering
- Build reproducible machine learning pipelines
- Write and validate testable, modular Python code
- Interpret results in real-world terms
- Reflect on ethical and societal consequences
- Communicate findings professionally
You will select your own dataset from the TidyTuesday archive, justify your methodological approach, and build a fully reproducible, end-to-end data science investigation.
Repository Structure
.
├── data/ # Your TidyTuesday CSV(s) (EDITABLE)
│ └── your_dataset.csv
│
├── src/
│ ├── ds.py # Main reproducible ML pipeline (YOU WRITE)
│ └── test_ds.py # Unit tests for your pipeline (YOU WRITE)
│
├── proposal.qmd # Project proposal & justification (EDITABLE)
├── index.qmd # Final scientific & ethical report (EDITABLE)
│
├── final_project.qmd # Analysis, EDA, visuals & interpretation (EDITABLE)
│
├── requirements.txt # Reproducible environment (EDITABLE)
├── .gitignore # Prevents junk files from being tracked (EDITABLE)
│
└── README.md # README overview - DO NOT EDIT
Only modify files explicitly marked as editable.
What You Must Deliver
| Component | File | Points |
|---|---|---|
| Dataset Justification | proposal.qmd | 10 |
| Reproducible ML Pipeline | ds.py + test_ds.py | 35 |
| Analysis & Visualization | final_project.qmd | 20 |
| Final Written Report | index.qmd | 15 |
| Final Presentation | Google Slides + Panopto | 15 |
| Peer Review, Reflection & Repo Quality | GitHub Issues + repo-wide | 5 |
| TOTAL | | 100 |
Dataset Requirements (TidyTuesday Only)
You must choose one dataset from:
🔗 https://github.com/rfordatascience/tidytuesday
Your dataset must:
- Come from one specific week
- Include at least one numeric outcome
- Support at least TWO of the following:
- Regression
- Classification
- Clustering
You may combine multiple tables, but only tables from the same TidyTuesday week (see the merge sketch below).
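For illustration, here is a minimal sketch of combining two tables with `pandas`. The file names and the join key `id` are hypothetical placeholders, not part of the assignment:

```python
# Hypothetical example: joining two tables from the same TidyTuesday week.
# The file names and the join key "id" are placeholders.
import pandas as pd

left = pd.read_csv("data/week_table_a.csv")
right = pd.read_csv("data/week_table_b.csv")

# A left join keeps every row of the first table.
combined = left.merge(right, on="id", how="left")
```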
Required Files & Their Roles
proposal.qmd — Project Design & Justification (10 pts)
Must include:
- Dataset name + TidyTuesday link
- Why you chose this dataset
- Your real-world research question(s)
- Which modeling tasks you will apply:
- Regression (target variable)
- Classification (class definition)
- Clustering (expected structure)
- Potential bias, harm, and ethical risks
You may not begin modeling until this is approved.
src/ds.py — Reproducible ML Pipeline (30 pts)
This is the engineering backbone of your project.
Your pipeline must include all of the following categories:
Data Handling
- `load_data`
- `initial_summary`
- `clean_data`
- `feature_engineering`
Modeling (At Least 2 Required)
- `run_regression`
- `run_classification`
- `run_clustering`
Evaluation
- Regression: R², RMSE
- Classification: Accuracy, F1, ROC-AUC
- Clustering: BIC, Calinski–Harabasz
✅ Must run end-to-end from raw CSV → metrics
✅ Must be pure functions only (no prints, no plotting)
✅ Must be deterministic (fixed random seeds)
✅ Must be fully testable
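As a hedged illustration of this style, here is a minimal sketch covering the data-loading and regression pieces only. The exact signatures, the `target` parameter, and the seed value are assumptions, not requirements:

```python
# Minimal sketch of a pure, deterministic pipeline (illustrative only).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # a fixed seed keeps every run deterministic


def load_data(path: str) -> pd.DataFrame:
    """Read the raw TidyTuesday CSV; no prints, no side effects."""
    return pd.read_csv(path)


def run_regression(df: pd.DataFrame, target: str) -> dict:
    """Fit a regression and return metrics instead of printing them."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_SEED
    )
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    return {
        "r2": r2_score(y_test, preds),
        "rmse": mean_squared_error(y_test, preds) ** 0.5,
    }
```

Because the function returns a plain dictionary and takes only its inputs as arguments, it can be called from the notebook and asserted against in unit tests without any I/O tricks.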
src/test_ds.py — Automated Testing (5 pts)
You must implement at least 8 real unit tests that validate:
- Data loading
- Cleaning behavior
- Feature engineering outputs
- Train/test splits
- Model training & prediction
- Metric correctness
- Handling of edge cases:
- NaNs
- Small data
- Class imbalance
These tests enforce professional engineering discipline.
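For example, two hedged test sketches in pytest style. They assume the `ds.py` function signatures sketched above (adjust the import to your repository layout), and you will need at least eight such tests in total:

```python
# Hypothetical unit tests; pytest collects any function named test_*.
import numpy as np
import pandas as pd

from ds import clean_data, run_regression  # adjust the import path to your layout


def test_clean_data_removes_nans():
    # Assumes clean_data drops or imputes missing values.
    df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, 6.0]})
    cleaned = clean_data(df)
    assert not cleaned.isna().any().any()


def test_run_regression_returns_sane_metrics():
    rng = np.random.default_rng(0)  # seeded for determinism
    df = pd.DataFrame({"x": rng.normal(size=50)})
    df["y"] = 3.0 * df["x"] + rng.normal(scale=0.1, size=50)
    metrics = run_regression(df, target="y")
    assert metrics["rmse"] >= 0.0
    assert metrics["r2"] <= 1.0
```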
final_project.qmd — Analysis & Visualization (20 pts)
Your notebook must:
- Load data using `ds.py`
- Perform EDA:
- Distributions
- Correlations
- Missingness
- Run:
- At least 1 regression
- At least 1 classification or clustering
- Visualize:
- Predictions vs truth
- Cluster structure
- Feature relationships
- Interpret results in Markdown
No model logic is allowed inside the notebook.
All models must come from ds.py.
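A minimal sketch of what a notebook cell might look like under this rule (the CSV path and `target_col` are placeholders, and it assumes `src/` is on the import path):

```python
# Illustrative notebook cell: all modeling logic stays inside ds.py.
import ds

df = ds.load_data("data/your_dataset.csv")  # placeholder file name
metrics = ds.run_regression(df, target="target_col")  # placeholder target column
metrics  # display the returned metric dictionary
```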
index.qmd — Final Scientific & Ethical Report (15 pts)
Must include:
- Research questions
- Dataset background
- Modeling choices & justification
- Results & limitations
- Ethical risks & societal implications
- What you would do next with more time or data
This is where your thinking is graded, not just your code.
Running the Project Locally
1. Install dependencies: `pip install -r requirements.txt`
2. Run unit tests: `pytest`
   ✅ All tests must pass before submission.
3. Run the code quality check: `pylint ds.py`
   ✅ Minimum acceptable score: 9.0
4. Render the final analysis: `quarto render final_project.qmd`
   ✅ Notebook must render without error.
Rules & Constraints
✅ You may use:
- `pandas`
- `numpy`
- `scikit-learn`
- `matplotlib` / `seaborn`
- `statsmodels` (optional)
❌ You may NOT:
- Use AutoML tools
- Use LLMs to generate conclusions without verification
- Copy existing TidyTuesday solutions
- Fit models directly inside the notebook
- Hardcode answers to pass tests
- Ignore ethics and interpretability
Peer-review + Reflection + Professionalism (5 pts)
You will lose points for:
- Low-effort peer reviews or obvious, unverified GenAI usage
- Shallow reflection (e.g., "I did everything great")
- Tracking large datasets in Git
- Poor Git history
- Missing documentation
- Non-reproducible results
✅ Submission Checklist
Before your final push:
- ✅ Proposal approved
- ✅ `ds.py` complete & clean
- ✅ `test_ds.py` complete
- ✅ `pytest` passes
- ✅ `pylint ds.py` ≥ 9.0
- ✅ `final_project.qmd` renders
- ✅ `index.qmd` (final report) submitted
- ✅ Dataset placed in `data/`
- ✅ Clean Git history
Why This Project Is Structured This Way
This mirrors real-world data science in:
- Healthcare analytics
- Climate modeling
- Economic forecasting
- Policy research
- Responsible AI development
You are being evaluated on your ability to:
- Think like a scientist
- Build like an engineer
- Reflect like an ethicist
- Communicate like a professional analyst
Academic Integrity
This is an individual project.
You may:
- Discuss ideas at a high level
- Debug conceptually with classmates
You may NOT:
- Share code
- Share tests
- Copy solutions
- Reuse previous TidyTuesday analyses
All work must be your own.
✅ If you can successfully complete this project, you are demonstrating real industry-ready applied data science skills.