INFO 523 - Introduction to Data Mining

Announcements

Reading Quiz #1 is due Friday, Jan 19th, 11:59pm
Project 1 overview next week
- Teams will be announced as well

What is data mining?

One of many definitions:

“Data mining is the science of extracting useful knowledge from huge data repositories.” - ACM SIGKDD, Data Mining Curriculum: A Proposal

What is data mining?

Convergence of several fields

Statistics
Computer science (machine learning, AI)
Data science
Optimization

Why data mining?
Commercial viewpoint

Businesses collect + store tons of data

Purchases at department/grocery stores
Bank/credit card transactions
Web and social media data
Mobile and IoT

Computers are cheaper + more powerful

Competition to provide better services

Mass customization and recommendation systems
Targeted advertising
Improved logistics

Knowledge discovery in databases (KDD)

Data mining tasks

Descriptive methods

Find human-interpretable patterns that describe the data

Predictive methods

Use features to predict unknown or future values of another feature

Predictive modeling

Classification

	Budget	Duration	Channel	Target_Audience_Size	Season	Campaign_Success
0	1.170199	-1.110199	1	-0.910081	1	0
1	0.198680	1.499221	0	0.988123	1	1
2	1.695723	0.126734	0	0.222349	0	1
3	-0.531455	-0.169270	1	-2.224787	0	0
4	0.194052	-2.479343	0	1.518522	2	0

Regression

	Feature	Target
0	0.931280	50.779929
1	0.087047	-10.065270
2	-1.057711	-34.918392
3	0.314247	10.526743
4	-0.479174	-17.738377

Classification

Find a model for the class attribute as a function of the other attributes

Goal: assign new records to a class as accurately as possible.

E.g., Customer Attrition, Directed Marketing

	Budget	Duration	Channel	Target_Audience_Size	Season	Campaign_Success
0	1.170199	-1.110199	1	-0.910081	1	0
1	0.198680	1.499221	0	0.988123	1	1
2	1.695723	0.126734	0	0.222349	0	1
3	-0.531455	-0.169270	1	-2.224787	0	0
4	0.194052	-2.479343	0	1.518522	2	0
5	1.807197	1.341938	1	-0.667830	1	1
6	-0.093387	-2.407374	0	-0.480568	2	0
7	1.063941	0.864310	1	-0.957300	1	1
8	-1.433152	1.360601	1	1.384636	1	1
9	-0.937926	0.464292	1	-2.113015	0	1
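
As a minimal sketch of "assign new records to a class", the ten records above can drive a very simple classifier. The 1-nearest-neighbor rule used here is an illustrative choice, not necessarily the method used in this course. The helper names `distance` and `predict` and the `new_campaign` record are hypothetical, and treating the Channel and Season codes as plain numbers is a simplifying assumption.

```python
import math

# The ten campaign records from the table above:
# (Budget, Duration, Channel, Target_Audience_Size, Season) -> Campaign_Success
records = [
    ((1.170199, -1.110199, 1, -0.910081, 1), 0),
    ((0.198680, 1.499221, 0, 0.988123, 1), 1),
    ((1.695723, 0.126734, 0, 0.222349, 0), 1),
    ((-0.531455, -0.169270, 1, -2.224787, 0), 0),
    ((0.194052, -2.479343, 0, 1.518522, 2), 0),
    ((1.807197, 1.341938, 1, -0.667830, 1), 1),
    ((-0.093387, -2.407374, 0, -0.480568, 2), 0),
    ((1.063941, 0.864310, 1, -0.957300, 1), 1),
    ((-1.433152, 1.360601, 1, 1.384636, 1), 1),
    ((-0.937926, 0.464292, 1, -2.113015, 0), 1),
]

def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(features, training=records):
    """Return the class label of the closest training record (1-NN)."""
    _, nearest_label = min(training, key=lambda rec: distance(features, rec[0]))
    return nearest_label

# Hypothetical new campaign (made up for illustration, not from the table).
new_campaign = (1.5, 1.2, 1, -0.5, 1)
print("Predicted Campaign_Success:", predict(new_campaign))
```

With only ten records this says nothing about real accuracy; in practice the model would be evaluated on records held out from training.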

Regression

Find a model that predicts a variable (Y) from another variable (X)

Both are continuous variables (floats)

	Feature	Target
0	0.931280	50.779929
1	0.087047	-10.065270
2	-1.057711	-34.918392
3	0.314247	10.526743
4	-0.479174	-17.738377
5	0.647689	31.564596
6	-0.463418	-30.068883
7	0.542560	5.912007
8	0.611676	23.473374
9	1.003533	32.343595
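
A minimal sketch of fitting such a model to the ten rows above, assuming a straight-line relationship and ordinary least squares (the slides do not say which regression method produced the data). The function name `fit_line` is hypothetical.

```python
# The ten (Feature, Target) pairs from the table above.
xs = [0.931280, 0.087047, -1.057711, 0.314247, -0.479174,
      0.647689, -0.463418, 0.542560, 0.611676, 1.003533]
ys = [50.779929, -10.065270, -34.918392, 10.526743, -17.738377,
      31.564596, -30.068883, 5.912007, 23.473374, 32.343595]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)
print(f"Target = {slope:.2f} * Feature + {intercept:.2f}")

# Predict the Target for a hypothetical new Feature value.
new_x = 0.5
print("Prediction:", slope * new_x + intercept)
```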

Association mining

Given a set of transactions, produce rules of association

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction	zhangs_metric
0	(Diapers)	(Beer)	0.8	0.6	0.6	0.75	1.2500	0.12	1.6	1.00
1	(Beer)	(Diapers)	0.6	0.8	0.6	1.00	1.2500	0.12	inf	0.50
2	(Bread)	(Diapers)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
3	(Diapers)	(Bread)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
4	(Bread)	(Milk)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
5	(Milk)	(Bread)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
6	(Milk)	(Diapers)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
7	(Diapers)	(Milk)	0.8	0.8	0.6	0.75	0.9375	-0.04	0.8	-0.25
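
To show how the support, confidence, lift, and leverage columns are computed, here is a minimal sketch in plain Python. The five baskets in `transactions` are an assumption: the slides do not list the underlying transactions, so these were chosen to reproduce the supports in the table. The helper names `support` and `rule_metrics` are hypothetical.

```python
# Hypothetical market baskets (an assumption, not from the slides), chosen so
# that the computed metrics agree with the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Metrics for the rule antecedent -> consequent."""
    sup_both = support(set(antecedent) | set(consequent))
    sup_a = support(antecedent)
    sup_c = support(consequent)
    confidence = sup_both / sup_a           # P(consequent | antecedent)
    return {
        "support": sup_both,
        "confidence": confidence,
        "lift": confidence / sup_c,         # > 1: positive association
        "leverage": sup_both - sup_a * sup_c,
    }

metrics = rule_metrics({"Diapers"}, {"Beer"})
# Rounded, this matches row 0 of the table:
# support 0.6, confidence 0.75, lift 1.25, leverage 0.12
print({name: round(value, 4) for name, value in metrics.items()})
```

A lift below 1, as in the Bread and Milk rows, means the two items appear together slightly less often than they would if they were independent.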

Association mining

Suppose the discovered rule is: {Potato Chips, …} → {Soft drink}
Soft drink as RHS (consequent): what could boost Soft drink sales? For example, discounting Potato Chips.
Potato Chips as LHS (antecedent): which products would be affected if Potato Chips were discontinued?
Potato Chips in LHS and Soft drink in RHS: which products should be sold with Potato Chips to promote sales of Soft drinks?

Association mining goals

Goal: Anticipate the nature of repairs so that service vehicles carry the right parts, which shortens repair time.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover co-occurrence patterns.

Clustering

Group points that are similar to one another
Separate dissimilar points
Groups are not known → Unsupervised Learning
E.g., Market Segmentation, Document Types
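
A minimal sketch of grouping similar points without known labels, using k-means as one illustrative clustering algorithm (the slides do not name a specific method). The `customers` data, the function name `kmeans`, and the choice of the first `k` points as starting centroids are all assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; returns (centroids, labels).

    Uses the first k points as initial centroids, a simplistic but
    deterministic choice made only for this sketch.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical customers: (monthly spend, visits per month), already scaled.
customers = [(1.0, 1.2), (8.0, 8.5), (1.3, 0.8), (0.9, 1.1), (8.4, 7.9), (7.8, 8.2)]
centroids, labels = kmeans(customers, k=2)
print(labels)  # cluster index for each customer
```

The number of clusters `k` has to be chosen by the analyst, since no class labels exist to say how many groups there are.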

Anomaly detection

Detect significant deviations from normal behavior.
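
One simple way to flag such deviations is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is a sketch of one common technique, not necessarily the one covered in this course. The `amounts` data and the function name `mad_anomalies` are hypothetical, and the 3.5 cutoff is a conventional default rather than a value from the slides.

```python
import statistics

def mad_anomalies(values, cutoff=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `cutoff`."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against in this simple sketch
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [
        (i, v) for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > cutoff
    ]

# Hypothetical card transaction amounts with one unusual purchase.
amounts = [52.0, 48.5, 50.1, 49.7, 51.3, 47.9, 50.8, 49.2, 250.0, 50.4]
anomalies = mad_anomalies(amounts)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the latter two, which can hide the very point being searched for in a small sample.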

Other data mining tasks

Challenges of data mining

Legal, privacy, and security issues

Problem: the Internet is global, but legislation is local!

Legal, privacy, and security issues

Top Mobile App: Angry Birds is the highest-selling paid app on iPhone in the US and Europe.
Downloads: Surpassed a billion downloads globally.
Player Engagement: Users often engage for hours playing the game.
Privacy Concerns: A study by Jason Hong of Carnegie Mellon University found that out of 40 users, 38 were unaware that their location data was being stored.
Ad Targeting: The location data was used for targeting ads to the users.

Legal, privacy, and security issues

Location & Camera Access: Pokémon Go tracks location and requires camera access.
Data Collection Potential: Its popularity may lead to significant data gathering.
Privacy Policy Issues: Criticized for being deliberately vague.
User Data as Asset: User data classified as a business asset in the privacy agreement.
Data Transfer Clause: User data can be transferred if Niantic is sold.

Conclusions

Data mining is interdisciplinary

Statistics
CS (machine learning, AI)
Data science
Optimization

Data mining is a team effort

Data management
Statistics
Programming
Communication
Application domain