Data Mining
LECTURE # 01
Introduction to Data Mining
Motivation: “Necessity is the Mother
of Invention”
• Data Explosion Problem
1. Automated data collection tools (e.g. web, sensor networks) and mature
database technology lead to tremendous amounts of data stored in databases,
data warehouses and other information repositories.
2. Enterprises currently face a data explosion problem.
• Electronic Information an Important Asset for Business
Decisions
1. With the growth of electronic information, enterprises began to realize that
the accumulated information can be an important asset in their business
decisions.
2. There is potential business intelligence hidden in the large volume of data.
3. This intelligence can be the secret weapon on which the success of a business
may depend.
Extracting Business Intelligence
(Solution)
1. It is not a simple matter to discover business intelligence
from a mountain of accumulated data.
2. What is required are techniques that allow the enterprise to
extract the most valuable information.
3. The field of Data Mining provides such techniques.
4. These techniques can find novel (previously unknown) patterns that
may assist an enterprise in understanding the business
better and in forecasting.
Data Mining vs SQL, EIS, and OLAP
• SQL. SQL is a query language, difficult for business people
to use
• EIS = Executive Information Systems. EIS
provide graphical interfaces that give executives a preprogrammed
(and therefore limited) selection of reports,
automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down,
therefore giving access to a vast array of analyses.
However, it requires manual navigation through scores of
reports, and users must notice interesting patterns
themselves.
• Data Mining picks out interesting patterns. The user
can then use visualization tools to investigate further.
An Example of OLAP Analysis and its
Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs,
e.g. walking stick sales by city.
[Figure, Step 1: Walking Stick Sales by City (Karachi, Lahore, Islamabad)]
• Step 2: Noticing that Islamabad has high sales,
you decide to investigate further.
• (Before OLAP, you would have had to write a very
complex SQL query instead of simply clicking to
drill down.)
• It seems that old people are responsible for most
walking stick sales.
You confirm this by viewing a chart of age
distributions by city.
[Figure, Step 2: Walking Stick Sales in Islamabad by Age (Less than 20, 20 to 60, Older than 60)]
[Figure: Age Distribution by City (Younger than 20, 20 to 60, Older than 60; Karachi, Lahore, Islamabad)]
• But imagine if you had to do this
manual investigation for all of the
10,000 products in your range!
Here, OLAP gives way to Data Mining.
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to
decisions.
[Diagram: Rules + Data → Expert System → Decisions]
• Data Mining = Data-Driven Induction
Bottom-up: From data about past decisions to
discovered rules (general rules induced from the data).
[Diagram: Data (including past decisions) → Data Mining → Rules]
Difference between Machine Learning and
Data Mining
• Machine Learning techniques are designed to deal with a limited
amount of artificial intelligence data, whereas Data Mining
techniques deal with large amounts of database data.
• Data Mining (Knowledge Discovery in Databases)
– Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large
databases.
• What is not Data Mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
Data Mining (Example)
• Random Guessing vs. Potential Knowledge
– Suppose we have to Forecast the Probability of Rain in Islamabad city
for any particular day.
– Without any Prior Knowledge the probability of rain would be 50%
(pure random guess).
– If we have a lot of weather data, we can extract potential
rules using Data Mining, which can then forecast the chance of rain
better than random guessing.
• Example: The Rule
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is 66.6%
chance of rain.
Temperature  Humidity  Windy  Rain
hot          high      false  No
hot          high      true   Yes
hot          high      false  Yes
mild         high      false  No
cool         normal    false  No
cool         normal    true   Yes
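The 66.6% figure in the rule can be checked directly against the six weather records. The following is a minimal sketch, assuming the records are held as plain tuples; the function name `rule_confidence` is illustrative and not part of the lecture.

```python
# Weather records from the slide: (temperature, humidity, windy, rain).
records = [
    ("hot", "high", False, "No"),
    ("hot", "high", True, "Yes"),
    ("hot", "high", False, "Yes"),
    ("mild", "high", False, "No"),
    ("cool", "normal", False, "No"),
    ("cool", "normal", True, "Yes"),
]


def rule_confidence(rows, temperature, humidity, outcome="Yes"):
    """Fraction of rows matching the condition whose Rain value equals outcome."""
    matching = [r for r in rows if r[0] == temperature and r[1] == humidity]
    if not matching:
        return None  # the rule's condition never occurs in the data
    hits = sum(1 for r in matching if r[3] == outcome)
    return hits / len(matching)


# Share of rainy days over all records, ignoring every attribute.
baseline = sum(1 for r in records if r[3] == "Yes") / len(records)
confidence = rule_confidence(records, "hot", "high")

print(f"Overall rain frequency: {baseline:.1%}")
print(f"Rain given hot and high humidity: {confidence:.1%}")
```

Two of the three hot, high-humidity days had rain, so the rule's confidence is 2/3, which the slide rounds down to 66.6%.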
The Data Mining Process
• Step 0: Determine Business Objective
- e.g. Forecasting the probability of rain
- Must have relevant prior knowledge and goals of application.
• Step 1: Prepare Data
- Handling noisy and missing values (Data Cleaning).
- Data Transformation (Normalization/Discretization).
- Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
- Classification, Clustering, Association Rules
• Step 3: Choosing The Mining Algorithm
- Selection of correct algorithm depending upon the quality of data.
- Selection of correct algorithm depending upon the density of data.
• Step 4: Data Mining
- Search for patterns of interest. A typical data mining algorithm can
mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
- Visualization/Representation of interesting patterns, etc.
Data Mining: A KDD Process
– Data mining: the core of the
knowledge discovery
process.
[Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining: On What Kind of Data?
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
– Time-series data and temporal data
– Text databases
– Multimedia databases
– Data Stream (Sensor Networks Data)
– WWW
Data Mining Functionalities (1)
• Data Preprocessing
– Handling Missing and Noisy Data (Data Cleaning).
– Techniques we will cover.
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Data Binning for Noisy Data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
(? = missing value)
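The first technique in the list, imputation with the mean, median or mode, can be sketched with Python's standard `statistics` module. This is only an illustration: the helper name `impute` and the two small columns are hypothetical, and missing entries are assumed to be marked as `None`.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries marked as None.
income = [125, 100, None, 120, 95]        # numeric, in thousands
refund = ["Yes", None, "No", None, "No"]  # categorical

print(impute(income, "mean"))    # mean of the four known incomes fills the gap
print(impute(income, "median"))  # median of the four known incomes fills the gap
print(impute(refund, "mode"))    # most frequent known category fills the gaps
```

Mean and median only make sense for numeric attributes, while the mode also works for categorical ones such as Refund.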
Data Mining Functionalities (1)
• Data Preprocessing
– Data Transformation (Discretization and Normalization).
– With the help of data transformation, rules become more general and
compact.
– General and compact rules can increase the accuracy of classification.
Age  Age (discretized)
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young
Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
Before discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08
then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09
then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10
then Buy_Computer = No.
After discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child
then Buy_Computer = No.
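The discretization shown on this slide, together with a simple normalization, can be sketched as follows. The age ranges are the ones given above; the function names and the choice of min-max scaling for the normalization step are assumptions made for illustration.

```python
# Age ranges taken from the slide, as (label, low, high) with inclusive bounds.
BINS = [("Child", 0, 20), ("Young", 21, 47), ("Old", 48, 120)]


def discretize_age(age):
    """Map a numeric age to its categorical label."""
    for label, low, high in BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside every bin")


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]
print(list(zip(ages, labels)))

# Ages 8, 9 and 10 all map to the same label, which is why the three
# age-specific rules can be replaced by a single rule on Age = Child.
print({discretize_age(a) for a in (8, 9, 10)})

print(min_max_normalize(ages))
```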
Data Mining Functionalities (1)
• Data Preprocessing
– Attribute Selection/Feature Selection
• Selection of those attributes which are more relevant to the data mining
task.
• Advantage 1: Decreases the processing time of the mining task.
• Advantage 2: Generalizes the rules.
Example
• Suppose our mining goal is to find which countries have more Cheat cases,
and at which Taxable Income.
• Then obviously the Date attribute will not be an important factor
in our mining task.
Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/2002  No      Australia  120K            Yes
26/02/2002  No      NZL        95K             Yes
Data Mining Functionalities (1)
• Data Preprocessing
– Principal Component Analysis
– Wrapper Based
– Filter Based
Data Mining Functionalities (2)
• Association Rule Mining
• In the Association Rule Mining framework we have to find all the
rules in a transactional/relational dataset whose support
(frequency) is at least some minimum support (min_sup)
threshold (provided by the user).
• For example, with min_sup = 50%:
Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
5000            Butter, Ice cream, Cake
Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
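The support counts in this example can be reproduced by brute-force enumeration. The sketch below is not Apriori or FP-Growth as covered later in the course; it simply tries every candidate itemset level by level, and the function name `frequent_itemsets` is an assumption. The threshold is treated as "at least min_sup", so that {Egg}, with support 2 out of 4, qualifies at 50%.

```python
from itertools import combinations

# Transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}


def frequent_itemsets(transactions, min_sup):
    """Return {itemset: support count} for itemsets with relative support >= min_sup."""
    baskets = list(transactions.values())
    min_count = min_sup * len(baskets)
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Support count = number of transactions containing every item.
            count = sum(1 for b in baskets if set(candidate) <= b)
            if count >= min_count:
                result[frozenset(candidate)] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so no larger one can exist
    return result


result = frequent_itemsets(transactions, min_sup=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Besides the five itemsets listed above, the enumeration also reports {Bread, Egg} and {Butter, Egg}, each with support 2, since they meet the same threshold.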
Data Mining Functionalities (2)
• Association Rule Mining
• Topics we will cover
– Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bitvector).
– Fault-Tolerant/Approximate Frequent Itemset Mining.
– N-Most Interesting Frequent Itemset Mining.
– Closed and Maximal Frequent Itemset Mining.
– Incremental Frequent Itemset Mining.
– Sequential Patterns.
Data Mining Functionalities (2)
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or
concepts for future prediction.
– Example: Classify rainy/un-rainy cities based on Temperature,
Humidity and Windy attributes.
– The previous business decisions (class labels) must be known (Supervised
Learning).
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes
Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.
Prediction of unknown record
City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?
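One very simple way to induce rules like the one above from labeled records is a majority vote per (Temperature, Humidity) combination. The sketch below is an assumption made for illustration, not the classification algorithm used in the course; `learn_rules` and `predict` are hypothetical names, and Windy is ignored on purpose to mirror the rule on the slide.

```python
from collections import Counter, defaultdict

# Labeled training records: (city, temperature, humidity, windy, rain).
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]


def learn_rules(rows):
    """Map each (temperature, humidity) pair to its majority Rain label."""
    votes = defaultdict(Counter)
    for _city, temp, hum, _windy, rain in rows:
        votes[(temp, hum)][rain] += 1
    return {cond: counts.most_common(1)[0][0] for cond, counts in votes.items()}


def predict(rules, temp, hum, default="Unknown"):
    """Look up the label for a condition; unseen conditions get the default."""
    return rules.get((temp, hum), default)


rules = learn_rules(training)

# The two unlabeled records from the slide.
unknown = [("Muree", "hot", "high", False), ("Sibi", "mild", "low", True)]
for city, temp, hum, _windy in unknown:
    print(city, predict(rules, temp, hum))
```

Under these assumptions Muree matches the hot/high rule and is predicted as Yes, while Sibi matches the mild/low pattern seen for Multan and is predicted as No.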
Data Mining Functionalities (2)
• Cluster Analysis
– Group data to form new classes from unlabeled data.
– Business decisions are unknown (also called Unsupervised Learning).
– Example: Group cities based on Temperature,
Humidity and Windy attributes.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?
3 clusters
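Grouping unlabeled categorical records can be illustrated with a single-pass "leader" clustering over Hamming distance (the number of attributes on which two records differ). The slide does not say which algorithm or grouping it has in mind, so the method, the threshold of 1 and the function names below are all assumptions.

```python
# Unlabeled records: (city, (temperature, humidity, windy)).
records = [
    ("Lahore", ("hot", "low", False)),
    ("Islamabad", ("hot", "high", True)),
    ("Islamabad", ("hot", "high", False)),
    ("Multan", ("mild", "low", False)),
    ("Karachi", ("cool", "normal", False)),
    ("Rawalpindi", ("hot", "high", True)),
]


def hamming(a, b):
    """Number of attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))


def leader_clustering(rows, max_distance):
    """Assign each record to the first cluster whose leader is close enough."""
    clusters = []  # list of (leader_attributes, member_cities)
    for city, attrs in rows:
        for leader, members in clusters:
            if hamming(attrs, leader) <= max_distance:
                members.append(city)
                break
        else:
            # No existing leader is close enough: this record starts a new cluster.
            clusters.append((attrs, [city]))
    return [members for _leader, members in clusters]


clusters = leader_clustering(records, max_distance=1)
for members in clusters:
    print(members)
```

With this record order and threshold the sketch happens to produce three groups. The result depends on both choices, so it should not be read as the grouping intended by the slide.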
Data Mining Functionalities (3)
• Outlier Analysis
– Outlier: A data object that does not comply with the general behavior
of the data.
– It can be considered noise or an exception, but it is quite useful in fraud
detection and rare event analysis.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?
2 outliers
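For numeric data, one common way to flag objects that "do not comply with the general behavior" is a z-score test. The sketch below uses hypothetical transaction amounts in the spirit of the fraud detection use case; the threshold of 2.5 and the function name are assumptions, and the slide's own table is categorical, so this is not the method behind its "2 outliers" annotation.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts: nine ordinary values and one extreme one.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 250]
print(zscore_outliers(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so in practice the threshold needs tuning, or a more robust statistic such as the median may be preferable.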
Are All the “Discovered” Patterns
Interesting?
• A data mining system/query may generate thousands of
patterns, and not all of them are interesting.
– Suggested approach: query-based, constraint-based
mining
• Interestingness Measures: A pattern is interesting if
it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm
Can We Find All and Only Interesting
Patterns?
• Find all the interesting patterns: Completeness
– Can a data mining system find all the interesting patterns?
– Remember that many problems in Data Mining are computationally hard (NP-complete or NP-hard).
– No single method gives a globally best solution for every problem.
• Search for only interesting patterns: Optimization
– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns: constraint-based mining (give
threshold factors in mining).
Reading Assignment
• Book Chapter
– Chapter 1 of “Data Mining: Concepts and Techniques” by
Jiawei Han and Micheline Kamber.
Data Mining ------- Where?
• Some Nice Resources
– ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.
– Knowledge Discovery Nuggets www.kdnuggets.com.
– IEEE Transactions on Knowledge and Data Engineering –
http://www.computer.org/tkde/.
– IEEE Transactions on Pattern Analysis and Machine Intelligence –
http://www.computer.org/tpami/.
– Data Mining and Knowledge Discovery - Publisher: Springer
Science+Business Media B.V., Formerly Kluwer Academic
Publishers B.V. http://www.kluweronline.com/issn/13845810/.
– Current and previous offerings of Data Mining courses at
Stanford, CMU, MIT and Helsinki.
Text and Reference Material
• The course will be mainly based on research
literature; the following texts may, however, be
consulted:
– Jiawei Han and Micheline Kamber. “Data Mining: Concepts and
Techniques”.
– David Hand, Heikki Mannila and Padhraic Smyth. “Principles of
Data Mining”. Pub. Prentice Hall of India, 2004.
– Sushmita Mitra and Tinku Acharya. “Data Mining: Multimedia,
Soft Computing and Bioinformatics”. Pub. Wiley and Sons Inc.,
2003.
– Usama M. Fayyad et al. “Advances in Knowledge Discovery and
Data Mining”. The MIT Press, 1996.