Data Mining in Practice:
Techniques and Practical Applications
Junling Hu
May 14, 2013
What is data mining?
• Mining patterns from data
• Is it statistics?
  • Functional form?
  • Computation speed concern?
  • Data size
  • Variable size
• Is it machine learning?
  • Big data issue
  • New methods: network mining
Examples of data mining
Frequently bought together


Movie recommendation
More examples of data mining
Keyword suggestions



Genome & disease mining
Heart monitoring
Overview of data mining
• Frequent pattern mining
• Machine Learning
  • Supervised
  • Unsupervised
• Stream mining
• Recommender system
• Graph mining
• Unstructured data
  • Text
  • Audio
  • Image and Video
• Big data technology
Frequent Pattern Mining
Diaper and Beer

?
Product assortment
Click behavior
Machine breakdown



The case of Amazon
User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}
Count frequency of co-occurrence
Efficient algorithm
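Counting co-occurrence can be sketched in a few lines of Python. This is a minimal brute-force sketch, not the efficient algorithm the slide alludes to (Apriori-style methods avoid enumerating every pair by pruning candidates whose sub-patterns are infrequent). The baskets are the five item sets listed above; the function names and the `min_support` threshold of 4 are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# The five baskets from the slide, one set of items per user.
baskets = [
    {"princess dress", "crown", "gloves", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(baskets, min_support):
    """Keep only pairs that appear in at least `min_support` baskets."""
    return {p: c for p, c in pair_counts(baskets).items() if c >= min_support}

counts = pair_counts(baskets)
frequent = frequent_pairs(baskets, min_support=4)  # threshold is an assumption
```

With this threshold, `frequent` holds the pairs that a "frequently bought together" feature might surface, such as crown with gloves.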
Machine Learning Process
Machine Learning

Supervised

Unsupervised (clustering)
Binary classification
Input features: Checking, Duration, Savings, Current Loans, Loan Purpose
Output class: Risky?
Data point: one row of the table

Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
Yes       1                 10            Yes            TV            0
Yes       2                 4             No             TV            1
No        5                 75            No             Car           0
Yes       10                66            No             Repair        1
Yes       5                 83            Yes            Car           0
Yes       1                 11            No             TV            0
Yes       4                 99            Yes            Car           0
Classification (1)

Decision tree
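A decision tree is grown by repeatedly choosing the feature test that best separates the classes. The sketch below scores two candidate first splits on the loan data from the binary classification slide, assuming its seven rows read as (Checking, Duration, Savings, Current Loans, Loan Purpose, Risky?). Gini impurity is used as the score, which is one common choice (CART uses it); the slides do not say which criterion is meant, and the field names and the two candidate tests are illustrative assumptions.

```python
FIELDS = ("checking", "duration", "savings", "current_loans", "purpose", "risky")
TABLE = [
    ("Yes", 1, 10, "Yes", "TV", 0),
    ("Yes", 2, 4, "No", "TV", 1),
    ("No", 5, 75, "No", "Car", 0),
    ("Yes", 10, 66, "No", "Repair", 1),
    ("Yes", 5, 83, "Yes", "Car", 0),
    ("Yes", 1, 11, "No", "TV", 0),
    ("Yes", 4, 99, "Yes", "Car", 0),
]
rows = [dict(zip(FIELDS, values)) for values in TABLE]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_gini(rows, test):
    """Size-weighted impurity of the two children produced by `test`."""
    left = [r["risky"] for r in rows if test(r)]
    right = [r["risky"] for r in rows if not test(r)]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Impurity before splitting, then after two hypothetical candidate tests.
root = gini([r["risky"] for r in rows])
by_loans = split_gini(rows, lambda r: r["current_loans"] == "Yes")
by_duration = split_gini(rows, lambda r: r["duration"] <= 5)
```

A tree builder would evaluate every candidate test this way, keep the one with the lowest weighted impurity, and recurse on each child until the nodes are pure or too small to split.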
Classification (2): Neural network

Perceptron

Multi-layer neural network
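The perceptron is the single-unit building block of these networks: it predicts 1 when a weighted sum of the inputs exceeds zero and nudges its weights after each mistake. Below is a minimal sketch of the classic update rule. The logical-AND training set, the learning rate, and the epoch count are assumptions chosen so the example is small and linearly separable.

```python
def predict(w, b, x):
    """Fire (1) when the weighted sum plus bias is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Learn weights and bias with the perceptron update rule."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - predict(w, b, x)  # -1, 0, or +1
            # Move the boundary toward misclassified points; no-op if correct.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Logical AND is linearly separable, so the rule converges on it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
outputs = [predict(w, b, x) for x, _ in and_data]
```

A single perceptron cannot represent functions such as XOR, which is one motivation for stacking units into a multi-layer network.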
Head pose detection
Support Vector Machine (SVM)

Search for a separating hyperplane
 Maximize margin
Perceived advantage of SVM

Transform data into higher dimension
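Why a higher dimension helps can be shown with a toy case. The sketch below is not an SVM solver: it only illustrates the lifting idea on XOR-labeled points, which no straight line separates in two dimensions. The feature map (appending the product of the two coordinates) and the hand-picked plane `w`, `b` are assumptions for illustration; an SVM would find a maximum-margin plane itself, typically through a kernel rather than an explicit map.

```python
def lift(x):
    """Map a 2-D point to 3-D by appending the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def side(w, b, x):
    """Return 1 if x is on the positive side of the hyperplane w.x + b = 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# XOR labels: not linearly separable in the original 2-D space.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# A hand-picked separating plane in the lifted 3-D space (assumption).
w, b = (1.0, 1.0, -2.0), -0.5
lifted_predictions = [side(w, b, lift(x)) for x, _ in xor_data]
```

After the lift, a single plane classifies all four points correctly.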
Applications of SVM: Spam Filter
Input Features:
• Transmission
  • IP address: 167.12.24.555
  • Sender URL: one-spam.com
• Email header
  • From: “[email protected]”
  • To: “undisclosed”
  • cc
• Email Body
  • # of paragraphs
  • # of words
• Email structure
  • # of attachments
  • # of links
Logistic regression



Advantage: Simple functional form
Can be parallelized
Large scale
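The simple functional form is a sigmoid applied to a weighted sum of the features. The sketch below fits it with plain batch gradient descent on a toy click dataset. The two binary features (`keyword_match`, `returning_user`), the labels, the learning rate, and the step count are all made-up assumptions; production systems use far more features and a parallel or distributed optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Estimated probability of the positive class (a click)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, steps=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the loss w.r.t. z
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical rows: [keyword_match, returning_user]; label 1 means "click".
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
click_probs = [predict_proba(w, b, x) for x in X]
```

Each example's gradient contribution is independent of the others, so the inner loop can be split across workers and summed, which is what makes the method easy to parallelize at large scale.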
Applications of logistic regression
• Click prediction
  • Search ranking (web pages, products)
  • Online advertising
  • Recommendation
• The model
  • Output: Click/no click
  • Input features: page content, search keyword, user information
Regression


Linear regression
Non-linear regression
Application:
• Stock price prediction
• Credit scoring
• Employment forecast
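For the linear case with one input, the least-squares line has a closed form. The sketch below computes it directly; the data points are made up (they lie exactly on y = 2x + 1) and stand in for any numeric target such as a price or a score.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up points on y = 2x + 1
slope, intercept = fit_line(xs, ys)
forecast = slope * 5.0 + intercept  # prediction for an unseen x
```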
History of Supervised learning
Semi-supervised learning

Application: Speech dialog system
Unsupervised learning: Clustering

No labeled data

Methods: K-means
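K-means alternates two steps until the clusters stop changing: assign each point to its nearest center, then move each center to the mean of its points. Below is a minimal sketch of that loop (Lloyd's algorithm). The six 2-D points and the choice of starting centers are assumptions; real implementations usually pick starting centers at random or with a seeding scheme such as k-means++.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )

def kmeans(points, centers, iterations=10):
    """Alternate the assignment and update steps a fixed number of times."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old center
                centers[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

# Made-up unlabeled points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

No labels are supplied at any point; the grouping comes only from distances between the points.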
Categories of machine learning
Applications of Clustering


Malware detection
Document clustering: Topic detection
Graphs in our life

Social network
Friend recommendation

Molecular compound
Drug discovery
Graph and its matrix representation
[Figure: graph with nodes 1 to 6]
Adjacency matrix
     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0
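The matrix can be built mechanically from an edge list. In the sketch below the eight edges are read off the nonzero entries of the adjacency matrix above (it is symmetric, so each edge is listed once); the function name is an assumption.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 matrix for an undirected graph on nodes 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        # Node labels start at 1, list indices at 0.
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1
    return A

edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]
A = adjacency_matrix(6, edges)
degrees = [sum(row) for row in A]  # row sums give each node's degree
```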
The web graph
[Figure: web graph of Page 1, Page 2, and Page 3; the edges are hyperlinks, each labeled with its anchor text]
PageRank as a steady state
Transition matrix P =
      1     2     3     4     5     6
1     0     0.5   0.25  0     0     0.5
2     0.33  0     0.25  1     0     0
3     0.33  0.5   0     0     0.33  0
4     0     0     0.25  0     0.33  0
5     0     0     0.25  0     0     0.5
6     0.33  0     0     0     0.33  0
PageRank is a probability vector $\pi$ such that $\pi = P\pi$
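One way to find the steady state is power iteration: start from any probability vector and apply the transition matrix repeatedly. The sketch below uses the matrix from this slide, read so that each column holds the outgoing probabilities of one page (the columns sum to 1 up to rounding) and the steady state satisfies pi = P pi. Treating 0.33 as a rounded 1/3 is an assumption, as are the iteration count and the absence of a damping factor, which the slide does not mention.

```python
T = 1 / 3  # the slide's 0.33 is assumed to be a rounded 1/3

# P[i][j] is the probability of moving from page j+1 to page i+1,
# so every column sums to 1.
P = [
    [0, 0.5, 0.25, 0, 0, 0.5],
    [T, 0,   0.25, 1, 0, 0],
    [T, 0.5, 0,    0, T, 0],
    [0, 0,   0.25, 0, T, 0],
    [0, 0,   0.25, 0, 0, 0.5],
    [T, 0,   0,    0, T, 0],
]

def step(P, pi):
    """One application of pi <- P pi."""
    return [sum(P[i][j] * pi[j] for j in range(len(pi))) for i in range(len(P))]

def pagerank(P, iterations=1000):
    """Power iteration from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = step(P, pi)
    return pi

pi = pagerank(P)
ranking = sorted(range(1, 7), key=lambda page: -pi[page - 1])  # best page first
```

Because each column sums to 1, every step preserves the total probability, so the result stays a probability vector.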
Discover influencers on Twitter
• The Twitter graph
  • Node
  • Link
• A PageRank approach: TwitterRank
[Figure: graph with nodes 1 to 5; an edge is labeled "Following"]
Facebook graph search

Entity graph

Natural language search: “Restaurants liked by my friends”
Recommending a game
Recommendation on a travel site
Prediction Problems
• Rating Prediction
  • Given how a user rated other items, predict the user’s rating for a given item
• Top-N Recommendation
  • Given the list of items liked by a user, recommend new items that the user might like
Explicit vs. Implicit Feedback Data
• Explicit feedback
  • Ratings and reviews
• Implicit feedback (user behavior)
  • Purchase behavior: recency, frequency, …
  • Browsing behavior: # of visits, time of visit, length of stay, clicks
Collaborative Filtering
• Hypotheses
  • User/Item Similarities
    • Similar users purchase similar items
    • Similar items are purchased by similar users
  • Matching characteristics
    • Match exists between user’s and item’s characteristics
User-User similarity
User’s movie rating
        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
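The missing rating can be estimated from users whose tastes resemble Laura's. The sketch below is one common formulation, assumed here rather than taken from the slides: Pearson correlation over co-rated movies as the similarity, and a mean-centered, similarity-weighted average as the prediction. The ratings are the ones in the table above.

```python
import math

ratings = {
    "John":  {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam":  {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = [item for item in ratings[a] if item in ratings[b]]
    xa = [ratings[a][i] for i in common]
    xb = [ratings[b][i] for i in common]
    ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
    num = sum((p - ma) * (q - mb) for p, q in zip(xa, xb))
    den = math.sqrt(sum((p - ma) ** 2 for p in xa) * sum((q - mb) ** 2 for q in xb))
    return num / den if den else 0.0

def predict(user, item):
    """Mean-centered, similarity-weighted average of other users' ratings."""
    mean_user = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(user, other)
        mean_other = sum(theirs.values()) / len(theirs)
        # A negatively correlated user's dislike counts as evidence of liking.
        num += sim * (theirs[item] - mean_other)
        den += abs(sim)
    return mean_user + num / den if den else mean_user

sim_john = pearson("Laura", "John")
sim_adam = pearson("Laura", "Adam")
laura_guess = predict("Laura", "Out of Africa")
```

Laura correlates strongly with John and negatively with Adam, so the estimate lands near John's high rating rather than Adam's low one.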
Item-item similarity
               John  Adam  Laura
Out of Africa  4     1     ?
Star Wars      4     1     4
Air Force One  5     2     5
Liar, Liar     1     5     2
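The same gap can be filled from the item side: compare the rating columns of two movies, then weight Laura's own ratings by how similar each movie is to the target. The sketch below assumes plain cosine similarity over users who rated both movies; the slides do not name a measure, and variants such as adjusted cosine subtract each user's mean first.

```python
import math

# Rows are movies, columns are users; None marks the unknown rating.
users = ["John", "Adam", "Laura"]
by_item = {
    "Out of Africa": [4, 1, None],
    "Star Wars":     [4, 1, 4],
    "Air Force One": [5, 2, 5],
    "Liar, Liar":    [1, 5, 2],
}

def cosine(a, b):
    """Cosine similarity over users who rated both items."""
    pairs = [(p, q) for p, q in zip(by_item[a], by_item[b])
             if p is not None and q is not None]
    dot = sum(p * q for p, q in pairs)
    norm_a = math.sqrt(sum(p * p for p, _ in pairs))
    norm_b = math.sqrt(sum(q * q for _, q in pairs))
    return dot / (norm_a * norm_b)

def predict(user, target):
    """Weight the user's own ratings by each item's similarity to the target."""
    u = users.index(user)
    num = den = 0.0
    for item, column in by_item.items():
        if item == target or column[u] is None:
            continue
        sim = cosine(target, item)
        num += sim * column[u]
        den += sim
    return num / den

laura_guess = predict("Laura", "Out of Africa")
```

Item similarities depend only on the rating columns, so they can be precomputed offline and reused for every user.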
Application of item-item similarity

Amazon
SVD (Singular Value Decomposition)
Latent factors
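In the usual formulation, a truncated SVD approximates the user-item rating matrix with a small number of latent factors, and a predicted rating is the inner product of a user's factor vector and an item's factor vector. The notation below is an assumption, since the slides give only the headings: $R$ is the rating matrix, $k$ the number of factors kept, and $p_u$ and $q_i$ the factor vectors of user $u$ and item $i$.

```latex
R \approx U_k \Sigma_k V_k^{\top},
\qquad
\hat{r}_{ui} = p_u^{\top} q_i
```

Here $p_u$ can be taken as row $u$ of $U_k \Sigma_k$ and $q_i$ as row $i$ of $V_k$, so that the two expressions agree entry by entry.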
Application of Latent Factor Model

GetJar
Ranking-based recommendation
Application in LinkedIn

Ranking-based model
Thanks and Contact

Co-author: Patricia Hoffman
Contact:
 [email protected]

Twitter: @junling_tech