ICS 278: Data Mining
Lecture 2: Measurement and Data
Data Mining Lectures
Lecture 2: Data Measurement
Padhraic Smyth, UC Irvine
Today’s lecture
• Questions on homework?
• Office hours tomorrow: 9:30 to 11
• Outline of today’s lecture:
– From lecture 1: various tasks in data mining
– Chapter 2: Measurement and Data
• Types of measurement
• Distance measures
• Multidimensional scaling
• Discussion of class projects
Slides from Lecture 1……
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Exploratory Data Analysis
• Getting an overall sense of the data set
– Computing summary statistics:
• Number of distinct values, max, min, mean, median, variance, skewness, ...
• Visualization is widely used
– 1d histograms
– 2d scatter plots
– Higher-dimensional methods
• Useful for data checking
– E.g., finding that a variable is always integer valued or positive
– Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
– You should always “look” at your data before applying any data mining algorithms
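A minimal sketch of the summary statistics and data checks listed above. The `summarize` helper and the `values` list are illustrative assumptions, not part of the lecture; skewness is computed here as the third standardized moment.

```python
import statistics

def summarize(values):
    """Return basic summary statistics and simple data checks for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n      # population variance
    sd = var ** 0.5
    # Skewness as the third standardized moment (0 for symmetric data).
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3) if sd > 0 else 0.0
    return {
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": var,
        "skewness": skew,
        # Data checks: is the variable always integer valued / always positive?
        "all_integer": all(float(v).is_integer() for v in values),
        "all_positive": all(v > 0 for v in values),
    }

values = [1, 2, 2, 3, 3, 3, 4, 20]   # hypothetical data with one large value
stats = summarize(values)
print(stats)
```

A large positive skewness, as in this made-up example, is the kind of property worth noticing before choosing a modeling method.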
Example of Exploratory Data Analysis
(Pima Indians data, scatter plot matrix)
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Descriptive Modeling
• Goal is to build a “descriptive” model
– e.g., a model that could simulate the data if needed
– models the underlying process
• Examples:
– Density estimation:
• estimate the joint distribution P(x1, …, xp)
– Cluster analysis:
• Find natural groups in the data
– Dependency models among the p variables
• Learning a Bayesian network for the data
Example of Descriptive Modeling
[Figure: scatter plot “Anemia Patients and Controls”; x-axis: Red Blood Cell Volume, y-axis: Red Blood Cell Hemoglobin Concentration; points labeled Control Group and Anemia Group]
Example of Descriptive Modeling
[Figure: two panels of “Anemia Patients and Controls” (Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume) showing the Control Group and the Anemia Group; one panel is titled “EM Iteration 25”]
Learning User Navigation Patterns from Web Logs
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: one sequence of numeric codes per user (User 1, User 2, User 3, User 4, User 5, …), derived from the log entries above]
Clusters of Probabilistic State Machines
Cadez, Heckerman, et al, 2003
[Figure: three state machines, labeled Cluster 1, Cluster 2, and Cluster 3, over states such as A, B, C, and E]
Motivation: capture heterogeneity of Web surfing behavior
WebCanvas algorithm and software - currently in new SQLServer
Another Example of Descriptive Modeling
• Learning Directed Graphical Models (aka Bayes Nets)
– goal: learn directed relationships among p variables
– techniques: directed (causal) graphs
– challenge: distinguishing between correlation and causation
– example: Do yellow fingers cause lung cancer?
hidden cause: smoking
[Figure: graph with nodes smoking, yellow fingers, and cancer; the link from yellow fingers to cancer is marked “?”]
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Predictive Modeling
• Predict one variable Y given a set of other variables X
– Here X could be a p-dimensional vector
– Classification: Y is categorical
– Regression: Y is real-valued
• In effect this is function approximation, learning the relationship between Y and X
• Many, many algorithms for predictive modeling in statistics and machine learning
• Often the emphasis is on predictive accuracy, less emphasis on understanding the model
Predictive Modeling: Fraud Detection
• Credit card fraud detection
– Credit card losses in the US are over $1 billion per year
– Roughly 1 in 50k transactions is fraudulent
• Approach
– For each transaction estimate p(fraudulent | transaction)
– Model is built on historical data of known fraud/non-fraud
– High-probability transactions are investigated by fraud police
• Example:
– Fair-Isaac/HNC’s fraud detection software, based on neural networks, led to reported fraud decreases of 30 to 50%
– http://www.fairisaac.com/fairisaac
• Issues
– Significant feature engineering/preprocessing
– False alarm rate vs. missed detection: what is the tradeoff?
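The false alarm vs. missed detection tradeoff can be made concrete by thresholding the estimated p(fraudulent | transaction). The sketch below is illustrative only: the `alarm_tradeoff` helper, the scores, and the labels are made up, not taken from any real fraud system.

```python
def alarm_tradeoff(scores, labels, threshold):
    """Count false alarms and missed detections at one threshold.

    scores: estimated p(fraudulent | transaction) for each transaction
    labels: True where the transaction was actually fraudulent
    """
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    missed = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return false_alarms, missed

# Hypothetical scores and ground truth, for illustration only.
scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.70, 0.90, 0.95]
labels = [False, False, False, True, False, True, True, True]

for t in (0.3, 0.5, 0.8):
    fa, miss = alarm_tradeoff(scores, labels, t)
    print(f"threshold={t}: false alarms={fa}, missed detections={miss}")
```

Raising the threshold sends fewer transactions to investigators but lets more fraud through; where to set it depends on the relative cost of each kind of error.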
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc.)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc.
– Many, many applications of this nature
Predictive Modeling: Telephone Call Modeling
• Background
– AT&T has about 100 million customers
– It logs 200 million calls per day, 40 attributes each
– 250 million unique telephone numbers
– Which are business and which are residential?
• Approach (Pregibon and Cortes, AT&T, 1997)
– Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)
– Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6 GB RAM, terabyte disk farm)
• Status:
– running daily at AT&T
– HTML interface used by AT&T marketing
From C. Cortes and D. Pregibon,
Giga-mining, in Proceedings of the
ACM SIGKDD Conference, 1997
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Structure: Models and Patterns
• Model = abstract representation of a process
– e.g., very simple linear model structure: Y = aX + b
– a and b are parameters determined from the data
– Y = aX + b is the model structure
– Y = 0.9X + 0.3 is a particular model
“All models are wrong, some are useful” (G.E. Box)
• Pattern represents “local structure” in a data set
– E.g., if X > x then Y > y with probability p
– or a pattern might be a small cluster of outliers in multidimensional space
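To illustrate the distinction, the sketch below fixes the parameters of the structure Y = aX + b to obtain one particular model, and separately estimates the probability p of a local pattern by counting. The helper names and the data are hypothetical, chosen only for this example.

```python
def make_linear_model(a, b):
    """Fix the parameters of the structure Y = aX + b to get one particular model."""
    return lambda x: a * x + b

def pattern_probability(xs, ys, x0, y0):
    """Estimate p in the pattern 'if X > x0 then Y > y0 with probability p'."""
    hits = [y for x, y in zip(xs, ys) if x > x0]
    if not hits:
        return None   # the pattern's condition never holds in this data
    return sum(1 for y in hits if y > y0) / len(hits)

model = make_linear_model(0.9, 0.3)   # the particular model Y = 0.9X + 0.3
xs = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical inputs
ys = [0.2, 1.3, 2.0, 3.1, 3.8]        # hypothetical observations
p = pattern_probability(xs, ys, x0=1.5, y0=3.0)
print(model(1.0), p)
```

The model makes a statement about every X, while the pattern only describes the region where X > 1.5.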
Pattern Discovery
• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally
• Given market basket data we might discover that
– If customers buy wine and bread then they buy cheese with probability 0.9
– These are known as “association rules”
• Given multivariate data on astronomical objects
– We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
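The probability attached to a rule such as “wine and bread, then cheese” can be estimated by simple counting over baskets. This is a minimal sketch with made-up baskets; `rule_confidence` is a hypothetical helper name, and the resulting value is not the 0.9 quoted above.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket) by counting."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [b for b in baskets if antecedent <= b]
    if not matching:
        return None   # the rule's condition never occurs
    return sum(1 for b in matching if consequent <= b) / len(matching)

# Hypothetical market-basket data: each basket is a set of items.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]
conf = rule_confidence(baskets, {"wine", "bread"}, {"cheese"})
print(conf)
```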
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABD
CBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBB
BBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDAD
CBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABC
ABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBC
ABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADB
DBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDA
BAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCC
DBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
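One kind of local pattern a symbol sequence like this can hide is a long substring that occurs more than once. The sketch below searches for the longest repeated substring in the sequence above. It is only an illustration of pattern discovery, not necessarily the pattern the slide intended to highlight, and `longest_repeated_substring` is a hypothetical helper.

```python
def longest_repeated_substring(text):
    """Return the longest substring that occurs at least twice (overlaps allowed)."""
    def repeated_of_length(k):
        seen = set()
        for i in range(len(text) - k + 1):
            piece = text[i:i + k]
            if piece in seen:
                return piece
            seen.add(piece)
        return None

    best = ""
    lo, hi = 1, len(text) - 1
    # If a substring of length k repeats, so does one of length k - 1,
    # so binary search on the length is valid.
    while lo <= hi:
        mid = (lo + hi) // 2
        found = repeated_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best

sequence = "".join([
    "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABD",
    "CBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBB",
    "BBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDAD",
    "CBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABC",
    "ABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBC",
    "ABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADB",
    "DBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDA",
    "BAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCC",
    "DBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB",
])
pattern = longest_repeated_substring(sequence)
first = sequence.find(pattern)
second = sequence.find(pattern, first + 1)
print(len(pattern), first, second, pattern)
```

A repeat far longer than chance would produce in a four-letter alphabet is a local structure that a global summary of symbol frequencies would miss.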
Example of Pattern Discovery
• IBM “Advanced Scout” System
– Bhandari et al. (1997)
– Every NBA basketball game is annotated,
• e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
• This creates a huge untapped database of information
– IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”
– IBM claimed around 1998 that all NBA teams except 1 were using this software…… the “other team” was Chicago.
Components of Data Mining Algorithms
• Representation
– Determining the nature and structure of the representation to be used
• Score function
– Quantifying and comparing how well different representations fit the data
• Search/Optimization method
– Choosing an algorithmic process to optimize the score function
• Data Management
– Deciding what principles of data management are required to implement the algorithms efficiently
What’s in a Data Mining Algorithm?
Task
Representation
Score Function
Search/Optimization
Data Management
Models, Parameters
An Example: Linear Regression
Task: Regression
Representation: Y = Weighted linear sum of X’s
Score Function: Least-squares
Search/Optimization: Gaussian elimination
Data Management: None specified
Models, Parameters: Regression coefficients
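The components above can be sketched in a few lines: the least-squares score leads to the normal equations, which are then solved by Gaussian elimination. This is a minimal illustration under stated assumptions: the function names and the data are hypothetical, and production code would normally call a numerical library instead.

```python
def gaussian_solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def fit_least_squares(X, y):
    """Minimize the sum of squared errors by solving the normal equations X'X w = X'y."""
    rows = [[1.0] + list(x) for x in X]                 # prepend an intercept column
    d = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return gaussian_solve(XtX, Xty)

# Hypothetical data generated exactly from y = 0.3 + 0.9 * x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.3, 1.2, 2.1, 3.0]
intercept, slope = fit_least_squares(X, y)
print(intercept, slope)
```

The returned list is the “Models, Parameters” row of the table: one regression coefficient per input, plus the intercept.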
An Example: Decision Trees (C4.5 or CART)
Task: Classification
Representation: Hierarchy of axis-parallel linear class boundaries
Score Function: Cross-validated accuracy
Search/Optimization: Greedy Search
Data Management: None specified
Models, Parameters: Decision tree classifier
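A single greedy step of tree building can be sketched as a search over axis-parallel splits. This is not C4.5 or CART themselves: as a simplifying assumption it scores each split by training misclassifications rather than the cross-validated accuracy listed above, and the helper name and data are hypothetical.

```python
def best_axis_parallel_split(X, y):
    """Greedily pick the single (feature, threshold) split with the fewest errors.

    Each side of the split predicts its majority class.
    Returns (errors, feature_index, threshold).
    """
    def majority_errors(labels):
        if not labels:
            return 0
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        return len(labels) - max(counts.values())

    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            errors = majority_errors(left) + majority_errors(right)
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best

# Hypothetical two-feature data; the classes separate on feature 1.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.5], [1.5, 3.0], [2.5, 3.5], [3.5, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
errors, feature, threshold = best_axis_parallel_split(X, y)
print(errors, feature, threshold)
```

A full tree learner would apply the same step recursively to each side of the chosen split, which is what makes the boundaries a hierarchy.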
An Example: Hierarchical Clustering
Task: Clustering
Representation: Tree of clusters
Score Function: Various
Search/Optimization: Greedy search
Data Management: None specified
Models, Parameters: Dendrogram
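As one concrete instance of the greedy search, the sketch below repeatedly merges the two closest clusters and records each merge, which is the information a dendrogram displays. Single-link distance on 1-D points is an assumption made for brevity, and the function name and data are hypothetical.

```python
def single_link_merges(points):
    """Agglomerative clustering of 1-D points with single-link distance.

    Greedily merges the two closest clusters until one remains and returns
    the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [1.0, 1.5, 5.0, 5.25, 9.0]   # hypothetical 1-D data
history = single_link_merges(points)
for a, b, d in history:
    print(a, b, d)
```

Other linkage rules (complete link, average link) change only the distance line, which is one reason the score function is listed as “Various”.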
An Example: Association Rules
Task: Pattern Discovery
Representation: Rules: if A and B then C with prob p
Score Function: No explicit score
Search/Optimization: Systematic search
Data Management: Multiple linear scans
Models, Parameters: Set of Rules
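The “systematic search” with “multiple linear scans” can be sketched as a level-wise search for frequent itemsets, making one pass over the data per itemset size. This is an Apriori-style illustration under stated assumptions, not the specific algorithm from the lecture; the function name, the baskets, and the `min_count` value are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_count):
    """Level-wise search for itemsets that appear in at least min_count baskets.

    Each level makes one linear scan over the baskets, so finding itemsets
    up to size k takes about k scans.
    """
    items = sorted({item for b in baskets for item in b})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    while candidates:
        counts = {c: 0 for c in candidates}
        for basket in baskets:                  # one linear scan per level
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Build next-level candidates by joining frequent sets of this level,
        # keeping only those whose every subset one item smaller is frequent.
        size = len(candidates[0]) + 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

# Hypothetical baskets over items A, B, C, D.
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "D"}]
freq = frequent_itemsets(baskets, min_count=2)
# Probability attached to the rule "if A and B then C".
p = freq[frozenset("ABC")] / freq[frozenset("AB")]
print(p)
```

Rules are then read off the frequent itemsets by dividing counts, as in the last line.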
Next Lecture
• Chapter 3
– Exploratory data analysis and visualization