Final Project

Your task for this project is to delve deep into the realm of data mining.

This is intentionally vague: the aim is a project that best mirrors your unique interests and expertise.

Your project should incorporate an element or concept you’ve taught yourself. This could be an algorithm we haven’t explored in class, a specialized data mining library or tool, or perhaps a unique data preprocessing workflow. The sky’s the limit! If you’re uncertain whether your “newly acquired knowledge” counts, don’t hesitate to reach out.

Ideas

Your first task is to come up with a goal for your project. Here are a few ideas to help you get started thinking:

Time Series Analysis

  • Predict stock market prices using historical data and a combination of techniques like ARIMA, LSTM, or Prophet.

Anomaly Detection

  • Detect credit card fraud using transaction data.

  • Identify network intrusions from server logs.

Association Rule Mining

  • Find patterns from a supermarket transaction dataset (like market basket analysis).

Social Network Analysis

  • Identify influential nodes in a social network.

  • Study the spread of information (or misinformation) in a network.

Multimedia Mining

  • Develop a music recommendation system based on user listening history.

  • Use data mining techniques to automatically tag and categorize videos.

Interactive Data Visualization

  • Develop an interactive dashboard that allows users to explore a large dataset, finding patterns or outliers.

Ensemble Methods

  • Compare the performance of various ensemble techniques (like boosting, bagging) on a challenging classification problem.

And, of course, your project can revolve around mining a dataset that intrigues you. The only restriction on this dataset is that it shouldn’t be sourced from TidyTuesday. Instead, opt for a dataset that genuinely resonates with your team and offers avenues for thorough exploration.

Above all, engage in extensive brainstorming, sifting through various ideas until you converge on a topic that unites your team in enthusiasm. It should ideally reflect the skills you’ve absorbed during this course, while also allowing room for novel applications in your project.

Workflow

  • Week 1 of project (week of Mar 18): Pick a focus for your project.

  • Week 2 of project (week of Mar 25): Work on developing your project proposal and setting up the structure for your repository.

  • Week 3 of project (week of Apr 01): Finalize your project proposal and conduct peer review of project proposals.

  • Week 4 of project (week of Apr 08): Continue working on your project, and optionally, submit an updated version of your proposal.

  • Week 5 of project (week of Apr 15): Continue working on your project, incorporating feedback from proposals.

  • Week 6 of project (week of Apr 22): Continue working on your project.

  • Week 7 of project (week of Apr 29): Conduct peer code review.

  • Week 8 of project (week of May 06 - finals week): Present projects and turn in final write-up.

Due dates

  • Proposals for peer review: due Wed, Apr 03 at 1pm.
  • Peer review of proposals: Wed, Apr 03 in class
  • Revised proposals for instructor review: due Mon, Apr 08 at 5pm.
  • Peer review of code: Wed, May 01 in class
  • Write-up and presentation: due Mon, May 06, 1:00pm - 3:00pm (scheduled final date)

Deliverables

Proposal

Your proposal should include:

  • A one-sentence description of your high-level goal (such as the ones listed under Ideas above).

  • A one- to two-paragraph description of your goals, including your motivation. Depending on the focus of your project, the following might go here:

    • If using particular dataset(s), a brief description of each dataset including the reason why you chose the particular dataset, its dimensions, and any other relevant metadata. (Make sure to load the data and use inline code for some of this information.)

    • If answering a particular research question, the question itself and the reason why you chose this question.

  • A weekly “plan of attack” outlining your steps to complete your project, including the team member(s) assigned to each task.

    • This should be in the following form:

      | Task Name | Status | Assignee | Due | Priority | Summary |
      |           |        |          |     | Low      |         |
      |           |        |          |     | Moderate |         |
      |           |        |          |     | High     |         |
    • Note that this is a living document and should be updated at minimum once a week.

  • The final organization of your project repository. This means describing the project organization in your proposal as well as implementing that organization in your repo. Create any folders needed and place a README.md in each folder explaining what goes in there.
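If you would rather script the folder setup than create it by hand, here is a minimal sketch in Python. The folder names and descriptions are hypothetical placeholders, and `scaffold_repo` is an illustrative helper, not a required tool; substitute whatever organization your team proposes.

```python
from pathlib import Path

# Hypothetical layout: replace the names and descriptions with your own.
FOLDERS = {
    "data": "Datasets used in the project, plus a codebook.",
    "analysis": "Scripts and notebooks for the mining workflow.",
    "presentation": "Source files for the final presentation.",
}


def scaffold_repo(root: Path, folders: dict[str, str]) -> list[Path]:
    """Create each folder under root and give it a README.md describing it."""
    readmes = []
    for name, description in folders.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        # Leave an existing README alone so a teammate's edits are not lost.
        if not readme.exists():
            readme.write_text(f"# {name}\n\n{description}\n", encoding="utf-8")
        readmes.append(readme)
    return readmes

# Example call from the top level of your repo:
# scaffold_repo(Path("."), FOLDERS)
```

Running it a second time is harmless: existing folders and README files are left untouched.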

Peer review

The peer reviews will be completed during the lab sessions as indicated above. On those days teams will have access to the project repos of the teams whose work they’re reviewing. The first peer review will be of proposals and the second peer review will be of code and organization.

Reviewer tasks

Teams will develop the review together, with discussion among all team members, but only one team member will submit it as an issue on the project repo. To do so, go to the Issues tab, click on the green New issue button on the top right, and then click on the green Get started button for the issue template titled Proposal peer review or Code peer review.

This will start a new issue with a peer review form that you can fill out. Make sure to update the introductory paragraph with your team name and the names of the team members participating in the review. Then, answer each of the questions in the spaces provided underneath them. You’re expected to be thorough in your review, but this doesn’t necessarily require lengthy responses.

Remember, your goal is to help the team whose project you’re reviewing. The team will not lose points because of issues you point out, as long as they address them before I review their work. You should be critical, but respectful in your review. Also remember that you will be evaluated on the quality of your review. So that’s an additional incentive to do a good job.

Reviewee tasks

Once you receive feedback from your peers, you should address it. You should do this by directly updating your proposal or making any other updates to your repo as needed. You can make these updates all in one commit or spread them across multiple commits. Regardless, in the last commit that addresses the peer review comments, you should use a keyword in your commit message that will close the peer review issues. These words are close, closes, closed, fix, fixes, fixed, resolve, resolves, and resolved, and they need to be followed by the issue number (which you can find next to the issue title). So, your commit message can say something like “Finished updates based on peer review, fixes #1”.
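To sanity-check a commit message before you push, something like the sketch below may help. It only approximates the rule described above (a keyword, then whitespace, then # and the issue number). The helper name `issues_closed_by` is made up for illustration, and treating keywords case-insensitively is an assumption here, so rely on GitHub's own documentation for the exact matching rules.

```python
import re

# The closing keywords listed above.
CLOSING_KEYWORDS = (
    "close", "closes", "closed",
    "fix", "fixes", "fixed",
    "resolve", "resolves", "resolved",
)

# A keyword, then whitespace, then "#" followed by the issue number.
# Case-insensitive matching is an assumption of this sketch.
CLOSING_PATTERN = re.compile(
    r"\b(?:" + "|".join(CLOSING_KEYWORDS) + r")\s+#(\d+)",
    re.IGNORECASE,
)


def issues_closed_by(message: str) -> list[int]:
    """Return the issue numbers that a commit message appears to close."""
    return [int(number) for number in CLOSING_PATTERN.findall(message)]


print(issues_closed_by("Finished updates based on peer review, fixes #1"))
print(issues_closed_by("Updated the proposal after peer review"))
```

The first message yields issue 1. The second yields an empty list, which is a hint that the commit would not close anything.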

Peer review assignments

The peer review assignments for proposal peer reviews are as follows:

| reviewer            | reviewee_1          | reviewee_2          |
| Analytical Avengers | KG Competitors      | MiningMinds         |
| KG Competitors      | MiningMinds         | data detectives     |
| MiningMinds         | data detectives     | Daaku Data Singh    |
| data detectives     | Daaku Data Singh    | Analytical Avengers |
| Daaku Data Singh    | Analytical Avengers | KG Competitors      |

The peer review assignments for code peer reviews are TBD.

Write-up

You have a lot of freedom in how you structure your write-up. This also comes with responsibility. You should make sure you have a clear introduction and a clear conclusion. You should also have other interim section headings that help the reader. Your write-up should be somewhere between 1000 and 2000 words. There is no expectation that you get close to the upper limit; anywhere in that range is fine as long as you have clearly explained yourself. The limits are provided to help you, not to set stressful expectations.

Presentation

Your presentation should generally follow the same structure as your write-up. Each team will have 5 minutes for their presentation, and each team member must speak (roughly equally) during this time.

You should create your presentation in a reproducible way, e.g., using Quarto. However, you don’t have to use a package that is designed specifically for slides. If you prefer to build a dashboard or a Shiny app or a website, that’s fine too. The only rule is that it’s built reproducibly using R or Python.

Your evaluation will be based on your content, professionalism (including sticking to time), and your performance during the Q&A (question and answer). We don’t care how many slides you use to do this.

Website

Each of your projects will have a website. You can use the same workflow as for your first project or create something different. For example, if your project is building a dashboard, you might consider making your write-up a tab on that dashboard. Or if it’s building a package, you might consider making your website using the pkgdown package. Feel free to Google your way around it or ask in lab sessions, on the discussion forum, or office hours!

Repo organization

Since you have complete freedom on the focus of your project, you also have complete freedom on the organization of your repository. But… it should be organized in a logical way!

At a minimum each project should have an index.qmd. If your project is analyzing a dataset, you’ll probably want to structure your repo similar to your Project 1 repo. If you’re building a package, it should be structured like a package.

Grading

| Total                                     | 100 pts                                                 |
| Proposal¹                                 | 10 pts                                                  |
| Presentation²                             | 30 pts (25 pts from teaching team, 5 pts from audience) |
| Write-up³                                 | 30 pts                                                  |
| Reproducibility, style, and organization⁴ | 10 pts                                                  |
| Within team peer evaluation⁵              | 10 pts                                                  |
| Between team peer evaluation⁶             | 10 pts                                                  |

Some of the components are further detailed below.

Note that there will be points allocated to commits from each team member for the proposal, presentation, and write-up.

Proposal (10 points)

  • High level goal
  • Expanded description
  • Plan
  • Repo organization
  • Workflow - Peer review issues closed via commits. (1 point)
  • Teamwork - All team members must contribute to the repo via commits. (1 point)

Presentation (30 points)

Teaching team (25 points)

  • Time management: Did the team divide the time well among themselves, or did they get cut off for going over time? (2 points)

  • Professionalism: How well did the team present? Does the presentation appear to be well practiced? Did everyone get a chance to say something meaningful about the project? (2 points)

  • Teamwork: Did the team present a unified story, or did it seem like independent pieces of work patched together? (3 points)

  • Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.? (3 points)

  • Creativity / Critical Thought: Is the project carefully thought out? Does it appear that time and effort went into the planning and implementation of the project? (5 points)

  • Content: Including, but not limited to the following: (10 points)

    • Is the question well articulated in the presentation?
    • Can the question be answered with the data?
    • Do(es) the data mining method(s) answer the question?
    • Do(es) the data mining method(s) follow good practices?
    • Is/are the conclusion(s) drawn from the mining result(s) justifiable?
    • Are the limitations carefully considered and articulated?

Peers (5 points)

  • Content: Is the project well designed and is the data being used relevant to the focus of the project? (1 point)

  • Content: Did the team use appropriate mining methods and did they interpret them accurately? (1 point)

  • Creativity and Critical Thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project? (1 point)

  • Slides: Are the slides (or other presentation medium) well organized, readable, not full of text, featuring figures with legible labels, legends, etc.? (1 point)

  • Professionalism: How well did the team present? Does the presentation appear to be well practiced? Are they reading off of a script? Did everyone get a chance to say something meaningful about the project? (1 point)

Write-up (30 points)

  • Introduction: The introduction provides a clear explanation of the question and the dataset used to answer the question, including a description of all relevant variables in the dataset. (3 points)
  • Justification of approach: The chosen approach and data mining method(s) are clearly explained and justified. (3 points)
  • Code: Code is correct, easy to read, properly formatted, and properly documented. (10 points)
  • Mining Method: The data mining method(s) are appropriate, easy to read, and properly labeled. (10 points)
  • Discussion: Discussion of results is clear and correct, and it has some depth without being excessively long. (4 points)

Reproducibility, style, and organization (10 points)

  • All required files are provided. Quarto files render without issues and reproduce the necessary outputs. If building a package, the checks pass. (3 points)
  • If there’s a dataset, it’s provided in a data folder, a codebook is provided, and a local copy of the data file is used where needed. (3 points)
  • Documents are well structured and easy to follow. No extraneous materials. (2 points)
  • All issues are closed, mostly with specific commits addressing them. (2 points)

Guidelines

Please use the project repository that has been created for your team to complete your project. Everything should be done reproducibly. This means that I should be able to clone your repo and reproduce everything you’ve submitted as part of your project.

All results presented must have corresponding code. If you do calculations by hand instead of using R and then report the results from the calculations, you will not receive credit for those calculations. Any answers/results given without the corresponding R code that generated the result will not be considered. For example, if you’re reporting the number of observations in your dataset, don’t just write the number manually, use inline R code to calculate the number. All code reported in your final project document should work properly. Please do not include any extraneous code or code which produces error messages. Code which produces certain warnings and messages is acceptable, as long as you understand what the warnings mean. In such cases you can add warning: false and message: false in the relevant R chunks. Warnings that signal lifecycle changes (e.g., a function is deprecated and there’s a newer/better function out there) should generally be addressed by updating your code, not just by hiding the warning.
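If your team is working in Python, the same idea applies. Below is a minimal sketch, not a required approach: the in-memory CSV, its column names, and the helper `dataset_dimensions` are hypothetical stand-ins for your own data file and code.

```python
import csv
import io

# Hypothetical stand-in for a local data file such as one in your data folder.
SAMPLE_CSV = """transaction_id,amount,is_fraud
1,12.50,0
2,980.00,1
3,43.10,0
"""


def dataset_dimensions(handle) -> tuple[int, int]:
    """Return (number of observations, number of columns) for a CSV with a header row."""
    reader = csv.reader(handle)
    header = next(reader)            # first row holds the column names
    n_rows = sum(1 for _ in reader)  # remaining rows are observations
    return n_rows, len(header)


# With a real file you would pass open("path/to/file.csv", newline="") instead.
n_rows, n_cols = dataset_dimensions(io.StringIO(SAMPLE_CSV))
```

In the Quarto document, a sentence can then pull the values in rather than hard-coding them, for example: The dataset has `{python} n_rows` observations and `{python} n_cols` variables. This assumes a Quarto version that supports inline Python expressions, so check the Quarto documentation for your setup.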

Tips

  • I hope some of you will take the challenge to be adventurous and learn some new skills as part of this project. I’m happy to support you along the way, so don’t hesitate to ask as many questions as needed!

  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.

  • Review the grading guidelines above and ask questions if any of the expectations are unclear.

  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).

  • Set aside time to work together and apart (physically).

  • Code:

    • In your presentation, your code should be hidden (echo: false) so that your slides are neat and easy to read. However, your document should include all your code such that if I re-render your Quarto file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.

    • In your write-up your code should be visible.

  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment and team evaluations will be given at its completion - anyone judged to not have sufficiently contributed to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

  • When you’re done, review the documents on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!

Footnotes

  1. Your proposal score will also take into consideration how much you’ve improved your proposal based on peer feedback.

  2. A more detailed grading rubric for this component to be added.

  3. A more detailed grading rubric for this component to be added.

  4. Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

  5. This relates to your contribution to the teamwork and how your team members evaluate this contribution. Scores for each teammate will be different for this component of the project grade.

  6. This relates to the quality and quantity of the peer review you provide for other teams.