CS5344: Big Data Analytics Technology
http://www.comp.nus.edu.sg/~tankl/cs5344
TAN Kian-Lee
Professor, School of Computing
COM1, Room 03-23
“We could have gotten started a lot earlier. We simply weren’t stepping back and
looking at how to use the data”
– Brad Smith, Intuit
1
Big Data Analytics Technology
2
What is BIG Data
Gartner Definition
• Volume
• Velocity
• Variety
More than what you can handle
• Veracity
• Value
• …
3
The Social Layer in an Instrumented Interconnected World
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
• 2+ billion people on the Web by end 2011
Twitter Tweets per Second Record Breakers of 2011
Extract Intent, Life Events, Micro Segmentation Attributes
[Figure: examples for users Pauline, Tom Sit, Tina Mu and Jo Jobs, annotated with the labels: Name, Birthday, Family; Location; Relocation; Monetizable Intent; Wishful Thinking; SPAMbots; Not Relevant - Noise]
Some Big Data Stats
Amount of Stored Data By Sector (in Petabytes, 2009)
[Bar chart: y-axis in petabytes from 0 to 1000; bar values shown: 966, 848, 715, 619, 434, 364, 269, 227; sector labels are not recoverable]
If you like analogies…
35ZB = enough data to fill a stack of DVDs reaching halfway to Mars
[Figure: Earth and Mars]
Sources:
"Big Data: The Next Frontier for Innovation, Competition and Productivity."
US Bureau of Labor Statistics | McKinsey Global Institute Analysis
1 zettabyte?
= 1 million petabytes
= 1 billion terabytes
= 1 trillion gigabytes
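The conversions above follow from each step between these units being a power of 1000. A minimal check, assuming decimal (SI) units rather than binary ones; the function name is only illustrative:

```python
# Decimal (SI) storage units in bytes; binary units (2**10 steps) are NOT assumed here.
GB = 10**9
TB = 10**12
PB = 10**15
ZB = 10**21

def units_in_zettabytes(unit: int, zettabytes: int = 1) -> int:
    """Return how many `unit`-sized chunks make up the given number of zettabytes."""
    return zettabytes * ZB // unit

print(units_in_zettabytes(PB))      # petabytes in 1 ZB
print(units_in_zettabytes(TB))      # terabytes in 1 ZB
print(units_in_zettabytes(GB))      # gigabytes in 1 ZB
print(units_in_zettabytes(PB, 35))  # petabytes in the 35 ZB mentioned earlier
```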
7
Is This Qualitatively Different?
• From the domain perspective – absolutely Yes
• From the technology perspective – could be
Yes or No
8
Why BIG Data
• Can collect cheaply, due to automation
• Can store cheaply, due to falling media prices
• Realization that data was too valuable to
delete!
9
Analytics
• Data mining
• The process of examining (large amounts of)
data of a variety of types to uncover hidden
patterns, unknown correlations and other
useful and meaningful information
– these can result in business benefits, such as more effective
marketing and increased revenue
• Many “success” stories, where useful
predictions were made with the data
10
Analytics: Small vs Big Data
Source: https://www.youtube.com/watch?v=jujE79yEu6Y&spfreload=1
11
The WRONG Picture!
12
The BIG Data Analytics Pipeline
13
Log Analytics
• Analyze the logs of the entire data center to extract global
information, find statistical correlations, and support
advanced predictive analytics
• Improve availability and make effective long-term plans
[Figure: log analytics pipeline (IBM Research – Almaden). Stages: Extraction, Import, Integration, Analysis, Interpretation. Boxes shown: Raw Logs; Parse & Extract; Parsed Logs; Generate Index; Lookup Table (ID, prefix, timestamp); Identify Sessions; Sessionized Logs; Entity Resolution (ID pairs; ID, type, timestamp); Join Session Info with Original Logs; Feature Vectors; Feature Selection; Reduced Feature Vectors; Machine Learning Training; Model; Text Analytics; Identify Seeds; Identify Content; “What if” Analysis; Validation; Alerts; Cross-Validation]
14
Data Integration and Cleaning
Garbage in, garbage out
• The quality of results relates directly to quality of the
data
• 50%-70% of analytics process effort is spent on data
integration and cleaning
• Problems include: missing values, duplicate records,
outliers, entity resolution
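The three simplest of these problems can be flagged with a few lines of code. The sketch below is only illustrative: the records, field names, and the 1900 birth-year cutoff are all hypothetical assumptions, and real entity resolution needs far more than an exact ID match.

```python
from collections import Counter
from datetime import date

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"id": "A001", "sex": "F", "born": None},
    {"id": "A002", "sex": "M", "born": date(1931, 1, 1)},
    {"id": "A002", "sex": "M", "born": date(1931, 1, 1)},
    {"id": "A003", "sex": "M", "born": date(1849, 1, 19)},
    {"id": "A004", "sex": None, "born": date(1951, 10, 31)},
]

def missing_fields(rec):
    """Names of the fields whose value is absent in this record."""
    return [key for key, value in rec.items() if value is None]

def duplicate_ids(recs):
    """IDs that occur more than once (candidates for de-duplication)."""
    counts = Counter(r["id"] for r in recs)
    return sorted(i for i, n in counts.items() if n > 1)

def implausible_births(recs, earliest_year=1900):
    """IDs whose birth year is before an assumed plausibility cutoff."""
    return [r["id"] for r in recs if r["born"] and r["born"].year < earliest_year]

print({r["id"]: missing_fields(r) for r in records if missing_fields(r)})
print(duplicate_ids(records))
print(implausible_births(records))
```

A fixed cutoff is the crudest possible outlier test; it is used here only to keep the example short.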
15
Clinical Dataset Example
0000988 | C | 4 | F |            | 08/30/1993 | 0 | F…
0001521 | C | 4 | M | 01/01/1931 | 08/10/1993 | 0 | F…
0001521 | C | 4 | M | 01/01/1931 | 08/10/1994 | 1 | F…
0002027 | C | 4 | M | 01/19/1849 | 09/17/1993 | 1 | F…
0233291 |   | 4 | M | 10/31/1951 | 08/27/1993 | 0 | F…
0233983 | C | 4 | M | 05/10/1939 | 09/06/1995 | 0 | F…
0233983 | C | 4 | M | 05/10/1939 | 09/06/1995 | 1 | F…
0234044 | C |   | F | 05/10/1929 | 09/03/1993 | 0 | F…
16
Clinical Dataset Example
Missing Values
17
Clinical Dataset Example
Outlier
18
Clinical Dataset Example
19
Data Selection
• Generate a set of training examples
– choose sampling method
– consider sample complexity
• Reduce attribute dimensionality
– remove redundant and/or correlated attributes
– combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
– group symbolic discrete values
– quantize continuous numeric values
• Transform data
– de-correlate and normalize values
– map time-series data to static representation
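Three of the steps listed above (grouping symbolic values, quantizing numeric values, and normalizing) can be sketched as follows. The function names, bin edges, and sample values are assumptions made for illustration; normalization is shown as a z-score, which is one common choice among several.

```python
from statistics import mean, pstdev

def group_symbols(values, mapping):
    """Replace fine-grained symbolic values with coarser groups (unmapped values pass through)."""
    return [mapping.get(v, v) for v in values]

def quantize(values, edges):
    """Bin index of each value: the number of bin edges the value meets or exceeds."""
    return [sum(v >= e for e in edges) for v in values]

def normalize(values):
    """Rescale to zero mean and unit population standard deviation.

    Assumes the values are not all identical (otherwise the deviation is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical attribute values.
cities = ["Jurong", "Bedok", "Penang"]
regions = {"Jurong": "Singapore", "Bedok": "Singapore", "Penang": "Malaysia"}
ages = [15, 22, 37, 64, 80]

print(group_symbols(cities, regions))
print(quantize(ages, edges=[18, 40, 65]))
print(normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```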
20
Cell Phone Dataset Example
• Time-series of calls for each of 3600 cellphone accounts
21
The BIG Data Analytics Pipeline
22
Interestingness of Patterns
• Interestingness criteria:
– easily understood by humans
– valid on new or test data with some degree of certainty
– potentially useful
– novel, or validates some hypothesis that a user seeks to
confirm
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s beliefs in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Visualization
One Picture is Worth 1000 Words!
24
Data Analytics: What Kind of Data?
• Relational databases
• Transactional databases
• XML databases
• Spatial databases
• Temporal databases
• Text databases and multimedia databases
• Graph databases
25
Technology
• Tools for data mining/analytics
• Technologies like MapReduce/Hadoop, and
NoSQL databases
• Emphasizes
– Scalability of number of features and instances
– stress on algorithms and architectures
– automation for handling large, heterogeneous data
[Figure: Data Analytics shown together with Statistics, Database Technology, and Machine Learning]
26
An Example
27
An Example – Rule 1
28
An Example – Rule 2
29
How reliable are these rules?
• For any given train, how confident are you
that the answer is correct?
• Do we have enough data to construct a
reliable rule? How many data points are
enough?
30
How did you devise your rules?
• Did you…
– Look for characteristics in one set but missing in
the second set?
– Examine several potential rules?
– Consider simple rules first?
– Reject potential rules that didn’t perform well?
31
This is data analytics …
The process of…
• Deciding how to describe the data and task
(Task Specification and Data Representation)
• Identifying a rule
(Search and Knowledge Representation)
• Estimating confidence (Evaluation Function)
• Applying the rule (Inference Technique)
32
Major Data Analytics
• Association Rule Mining
– e.g. If a customer buys beer, he/she will most likely buy
diapers
• Classification/Prediction
– Is this a spam email?
– Will this customer spend a lot with my company?
• Clustering
– e.g. Help me to group the customers in my database
into three groups according to their ages, incomes and
expenses.
33
Association rule mining
Find frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
Examples
Rule form: “Body → Head [support, confidence]”.
buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
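The two bracketed numbers can be computed directly from transactions. The sketch below assumes the common convention that support is the fraction of transactions containing both body and head, and that confidence is that support divided by the support of the body alone. The baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, body, head):
    """Estimate of P(head | body); assumes the body occurs in at least one transaction."""
    both = set(body) | set(head)
    return support(transactions, both) / support(transactions, body)

# Hypothetical market baskets, for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "milk"},
    {"bread"},
]

s = support(baskets, {"diapers", "beer"})
c = confidence(baskets, {"diapers"}, {"beer"})
print(f"diapers -> beer [support={s:.0%}, confidence={c:.0%}]")
```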
34
Association: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine
what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato chips in consequent =>
Can be used to see what products should be sold with
Bagels to promote sale of Potato chips!
35
Association: Application 2
• Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently
many customers.
– Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
– A classic rule --
• If a customer buys diapers and milk, then he is very
likely to buy beer.
• So, don’t be surprised if you find six-packs stacked next
to diapers!
36
Classification
Builds a model from the training set (model
construction) and uses it to classify new data
(model usage).
Examples
Rule form: “if Conditions then Class” [Confidence].
if (age > 20) and (loan = no) then risk = low (78%)
if (loan = yes) then risk = high (90%)
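Model usage for the two example rules amounts to checking their conditions against a new record. A minimal sketch, in which the function name and the choice to return `None` for uncovered cases are assumptions:

```python
def classify_risk(age, has_loan):
    """Apply the two example rules; return (risk, confidence) or None if no rule fires."""
    if has_loan:
        # if (loan = yes) then risk = high (90%)
        return ("high", 0.90)
    if age > 20:
        # if (age > 20) and (loan = no) then risk = low (78%)
        return ("low", 0.78)
    # Neither rule covers someone aged 20 or under with no loan.
    return None

print(classify_risk(35, has_loan=False))
print(classify_risk(35, has_loan=True))
print(classify_risk(18, has_loan=False))
```

The last call shows that a small rule set need not cover every case; a real classifier would need a default class or further rules.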
37
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
38
Classification: Application 2
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to a
competitor.
• Approach:
• Use detailed records of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time of day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
39
Classification: Training Dataset
This follows an example from Quinlan’s ID3

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
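ID3 chooses the attribute to split on by information gain, i.e. the reduction in class entropy after the split. The sketch below computes it for the 14 rows of this training set; the variable and function names are illustrative, not taken from the slides.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]

# (outlook, temperature, humidity, windy, class) for the 14 training rows.
ROWS = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"),
    ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"),
    ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"),
    ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"),
    ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"),
    ("rain", "mild", "high", "true", "N"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Class entropy minus the weighted entropy after splitting on column `col`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("split on:", best)
```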
40
Classification: Decision Tree Model
outlook?
  sunny → Humidity?
    high → N
    normal → P
  overcast → P
  rain → wind?
    strong → N
    weak → P
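Using the tree to classify a new record means following one path from the root to a leaf. A minimal sketch, assuming that the tree's "strong" wind branch corresponds to Windy = true in the training table:

```python
def classify(outlook, humidity, windy):
    """Follow the decision tree from the root; `windy=True` is taken as the 'strong' branch."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"
    if outlook == "overcast":
        return "P"
    if outlook == "rain":
        return "N" if windy else "P"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(classify("sunny", "high", windy=False))
print(classify("overcast", "high", windy=True))
print(classify("rain", "normal", windy=True))
```

Temperature is not an argument because the tree never tests it.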
41
Clustering
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one
another.
• Similarity Measures:
– Euclidean Distance if
attributes are continuous
– Other problem-specific measures
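For continuous attributes, one familiar way to find such clusters is k-means, which alternates between assigning points to the nearest centroid by Euclidean distance and moving each centroid to the mean of its points. The slide does not prescribe an algorithm; k-means, the sample points, and the starting centroids below are all assumptions for illustration.

```python
from math import dist  # Euclidean distance between two points

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points."""
    moved = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            moved.append(old)  # an empty cluster keeps its previous centroid
            continue
        moved.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
    return moved

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds; returns (labels, centroids)."""
    labels = assign(points, centroids)
    for _ in range(iterations):
        centroids = update(points, labels, centroids)
        labels = assign(points, centroids)
    return labels, centroids

# Hypothetical 2-D data points and starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```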
42
Clustering: Application
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
43
Applications for BIG Data Analytics
44
45
Acquiring Better Customers
Source: https://www.youtube.com/watch?v=BfoJgoItd4M
46
Improving Customer Experience
Source: https://www.youtube.com/watch?v=BfoJgoItd4M
47
Analytics Can Help
• Credit card companies
– Who should I offer my credit cards to?
– How do I decide on the credit limit for each customer?
– How do I fix the interest rate?
– How do I identify fraud?
– How do I predict bankruptcy?
• Retailers
– How do I stock the products in order to maximize my
profitability?
– How do I market to my customers to maximize my appeal?
– How do I price new and existing products in my store?
– How do I design promotion strategies to maximize
customer benefit?
48
Analytics Can Help
• Telecom
– How do I manage credit limits for my postpaid
connections?
– How do I sell more value-added services?
• Hotels
– How do I manage room pricing and occupancy
rates to maximize revenues?
– How do I estimate the lifetime value of a
customer?
49
When Analytics Does Not Work?
50
Other Aspects of BIG Data
• Bigger Data are not always Better Data
• “Big” will evolve/change
• Not all Data are equivalent
• Just because it is accessible doesn’t make it
ethical
• Bonferroni’s principle: (roughly) if you look in
more places for interesting patterns than your
amount of data will support, you are bound to
find crap (a pattern that “fits”)
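A rough way to apply Bonferroni's principle is to estimate how many patterns would "fit" purely by chance before looking. The sketch below does that arithmetic under a strong simplifying assumption, namely that each candidate pattern fits random data independently with the same probability. The numbers are hypothetical.

```python
def expected_chance_hits(num_candidates, p_chance):
    """Expected number of candidate patterns that fit purely by chance."""
    return num_candidates * p_chance

def prob_at_least_one(num_candidates, p_chance):
    """Probability that at least one candidate fits by chance (independence assumed)."""
    return 1 - (1 - p_chance) ** num_candidates

# Hypothetical search: 1000 candidate patterns, each with a 1% chance of a spurious fit.
print(expected_chance_hits(1000, 0.01))
print(prob_at_least_one(1000, 0.01))
```

If the number of patterns actually found is close to the expected number of chance hits, most of them are probably spurious.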
51
Summary
52
53
Is this a SPAM?
54
Clustering of S&P 500
• Observe stock movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the
events described by them frequently happen together on
the same day.
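The slide does not say how "frequently happen together" is quantified. One possible choice, assumed here, is the Jaccard coefficient: the days on which both events occurred divided by the days on which either occurred. The tickers and day numbers are hypothetical.

```python
def similarity(days_a, days_b):
    """Jaccard coefficient of two sets of days; 0.0 if neither event ever occurred."""
    a, b = set(days_a), set(days_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical events mapped to the trading days on which they occurred.
events = {
    "AAA-UP": {1, 2, 3, 5, 8},
    "BBB-UP": {1, 2, 3, 5, 9},
    "CCC-DOWN": {4, 6, 7},
}

print(similarity(events["AAA-UP"], events["BBB-UP"]))
print(similarity(events["AAA-UP"], events["CCC-DOWN"]))
```

A clustering algorithm that accepts an arbitrary similarity measure could then group events that tend to co-occur.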
55