Introduction to Data Mining
Dr. Sushil Kulkarni
Jai Hind College
([email protected])
Introduction to Data Mining
Road Map
– Introduction to database
– A Problem and A Solution
– What Is Data Mining?
– Goal of Data Mining
– What is (not) Data Mining?
– Convergence of 3 key Technologies
– Data mining Functions
– Kinds of Data Mining Problems
What is a Database?
A database is any organized collection of data.
Examples
• Co-workers
• Patient information
• Airline reservation system
Data vs. information
• What is data?
– Data is unprocessed information.
• What is information?
– Information is data that have been organized and communicated in a coherent and meaningful manner.
– Data is converted into information, and information is converted into knowledge.
– Knowledge: information evaluated and organized so that it can be used purposefully.
Why do we need a database?
• Keep records of our:
– Clients
– Staff
– Volunteers
• Keep a record of activities and interventions
• Keep sales records
• Develop reports
• Perform research
Purpose of Database system
Is to transform
Data → Information → Knowledge → Action
Database
• Database: Shared collection of logically related
data (and a description of this data), designed to
meet the information needs of an organization.
• Database management System: A software
system that enables users to define, create, and
maintain the database and that provides
controlled access to this database.
Who does it, and how?
• A Database Management System (DBMS) does this job.
• Using software tools: Access, FileMaker, Lotus Notes, Oracle, SQL Server, …
• It includes tools to add, modify or delete data from the database, ask questions (or queries) about the data stored in the database and produce reports summarizing selected contents.
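The DBMS tasks listed above (add, modify, delete, query, report) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the slides do not use SQLite, and the table name, columns and rows are hypothetical.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Add data.
conn.executemany(
    "INSERT INTO clients (name, city) VALUES (?, ?)",
    [("Aditi", "Mumbai"), ("Rahul", "Pune"), ("Meera", "Mumbai")],
)

# Modify data.
conn.execute("UPDATE clients SET city = ? WHERE name = ?", ("Mumbai", "Rahul"))

# Delete data.
conn.execute("DELETE FROM clients WHERE name = ?", ("Meera",))

# Ask a question (query) about the stored data.
mumbai_clients = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM clients WHERE city = ? ORDER BY name", ("Mumbai",)
    )
]

# Produce a small report summarizing selected contents.
report = conn.execute(
    "SELECT city, COUNT(*) FROM clients GROUP BY city ORDER BY city"
).fetchall()

print(mumbai_clients)
print(report)
```

Every question here is precise and the answer is a subset or summary of the stored rows, which is the kind of processing a DBMS is built for.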
Let’s jump to Data Mining
• With this background, we will now see what data mining is.
A Problem …
• You are a marketing manager of a brokerage company
– Problem: Churn is too high
> Turnover is 40% (after the six-month introductory period ends)
– Customers receive incentives (average cost: ₹160) when an account is opened
– Giving new incentives to everyone who might leave is very expensive (as well as wasteful)
– Bringing back a customer after they leave is both difficult and costly
A Solution …
– One month before the introductory period ends, predict which customers will leave
– If you want to keep a customer who is predicted to churn, offer them something based on their predicted value
> The ones that are not predicted to churn need no attention
– If you don’t want to keep the customer, do nothing
– How can you predict future behavior?
> Tarot Cards
> Magic 8 Ball
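The decision rule in this solution can be written down as a small function. This is a sketch under stated assumptions: the function name, its parameters and the 10% offer rate are hypothetical, and the churn prediction itself is taken as an input, not computed.

```python
def retention_action(predicted_to_churn, want_to_keep, predicted_value, offer_rate=0.10):
    """Return the incentive amount to offer a customer (0.0 means do nothing).

    offer_rate is a made-up fraction of the customer's predicted value.
    """
    if not predicted_to_churn:
        return 0.0  # not predicted to churn: needs no attention
    if not want_to_keep:
        return 0.0  # predicted to churn, but not worth keeping: do nothing
    return offer_rate * predicted_value  # offer scaled by predicted value

print(retention_action(False, True, 5000.0))
print(retention_action(True, False, 5000.0))
print(retention_action(True, True, 5000.0))
```

The hard part, producing `predicted_to_churn` and `predicted_value`, is what data mining is for.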
KDD Process
• Knowledge discovery in databases (KDD) is a
multi step process of finding useful information and
patterns in data
• Data Mining is the use of algorithms to extract
information and patterns derived by the KDD
process.
• Many texts treat KDD and Data Mining as the
same process, but it is also possible to think of
Data Mining as the discovery part of KDD.
Steps of KDD Process
1. Selection
Data Extraction - Obtaining data from heterogeneous data sources: databases, data warehouses, the World Wide Web or other information repositories.
2. Preprocessing
Data Cleaning - Incomplete, noisy, inconsistent data to be cleaned. Missing data may be ignored or predicted, erroneous data may be deleted or corrected.
3. Transformation
Data Integration - Combines data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, reduced.
4. Data mining
Apply algorithms to transformed data and extract patterns.
5. Pattern Interpretation/evaluation
Pattern Evaluation - Evaluate the interestingness of resulting patterns or apply interestingness measures to filter out discovered patterns.
Knowledge presentation - Present the mined knowledge; visualization techniques can be used.
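The five steps can be sketched end to end on a toy data set. Everything in this example is an assumption made for illustration: the records, the cleaning rules, the 0.8 threshold and the trivial "mining" step stand in for real sources and algorithms.

```python
from collections import Counter

# Hypothetical raw records from two sources (names and values are made up).
source_a = [{"customer": "a1", "age": "34", "spend": "1200"},
            {"customer": "a2", "age": "", "spend": "300"}]      # missing age
source_b = [{"customer": "b1", "age": "51", "spend": "900"},
            {"customer": "b2", "age": "29", "spend": "-50"}]    # erroneous spend

# 1. Selection: obtain data from the heterogeneous sources.
selected = source_a + source_b

# 2. Preprocessing: ignore records with missing or erroneous values.
cleaned = [r for r in selected if r["age"] != "" and int(r["spend"]) >= 0]

# 3. Transformation: encode in a common numeric format, normalize spend to [0, 1].
max_spend = max(int(r["spend"]) for r in cleaned)
transformed = [{"customer": r["customer"], "age": int(r["age"]),
                "spend": int(r["spend"]) / max_spend} for r in cleaned]

# 4. Data mining: a deliberately trivial "algorithm" that labels spend level.
patterns = Counter("high" if r["spend"] >= 0.8 else "low" for r in transformed)

# 5. Interpretation/evaluation: keep only patterns seen at least min_count times.
min_count = 1
interesting = {label: n for label, n in patterns.items() if n >= min_count}
print(interesting)
```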
What Is Data Mining?
Some Definitions
• “The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data” (Piatetsky-Shapiro)
• "...the automated or convenient extraction of patterns
representing knowledge implicitly stored or captured in large
databases, data warehouses, the Web, ... or data streams." (Han,
pg xxi)
• “...the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful...” (Witten, pg 5)
• “...finding hidden information in a database.” (Dunham, pg 3)
• “...the process of employing one or more computer learning
techniques to automatically analyse and extract knowledge from
data contained within a database.” (Roiger, pg 4)
Why Data Mining?
• That all sounds ... complicated. Why should I learn about
Data Mining?
• What's wrong with just a relational database? Why would I
want to go through these extra [complicated] steps?
• Isn't it expensive? It sounds like it takes a lot of skill,
programming, computational time and storage space.
• Where's the benefit?
• Data Mining isn't just a cute academic exercise, it has very
profitable real world uses. Practically all large companies and
many governments perform data mining as part of their
planning and analysis.
Goal of Data Mining
– Simplification and automation of the overall statistical process, from data source(s) to model application
– Changed over the years
> Statistician replace data to a model
> Many different data mining algorithms / tools available
> Statistical expertise required to build intelligence into the software
Data Mining is …
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more common in certain locations of Mumbai (Kulkarni, Shah, Iyer…)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
DB vs. DM Processing
Database (DB)
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data Mining (DM)
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database
Convergence of 3 key Technologies
1. Increasing Computing Power
– Moore’s law: computing power doubles roughly every 18 months
– Powerful workstations became common
– Cost effective servers (SMPs) provide parallel processing to the mass market
– Interesting tradeoff:
> Small number of large analyses vs. large number of small analyses
1. The Data Explosion
• The rate of data creation is accelerating each year. In
2003, UC Berkeley estimated that the previous year
generated 5 exabytes of data, of which 92% was
stored on electronically accessible media.
Mega < Giga < Tera < Peta < Exa ... All the data in all the books in the US Library of Congress is ~136 Terabytes, so 2002 produced the equivalent of about 37,000 new Libraries of Congress.
• VLBI Telescopes produce 16 Gigabytes of data every
second.
• Google searches 18 billion+ accessible web pages.
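The Library of Congress comparison is a one-line division. The sketch below assumes decimal (SI) units, which the slide does not specify.

```python
# Decimal (SI) units are assumed: 1 terabyte = 10**12 bytes, 1 exabyte = 10**18 bytes.
TERABYTE = 10**12
EXABYTE = 10**18

data_created_2002 = 5 * EXABYTE        # UC Berkeley estimate quoted on the slide
library_of_congress = 136 * TERABYTE   # all books in the US Library of Congress

libraries = data_created_2002 / library_of_congress
print(round(libraries))  # on the order of 37,000
```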
1. The Data Explosion Implications
• As the amount of data increases, the proportion of
information decreases.
• As more and more data is generated automatically, we
need to find automatic solutions to turn those stored
raw results into information.
• Companies need to turn stored data into profit ...
Otherwise why are they storing it?
2. Improved Data Collection and Management
– Data Collection → Access → Navigation → Mining
– The more data the better (usually)
3. Statistical & Machine Learning Algorithms
— Techniques have often been waiting for computing
technology to catch up
— Statisticians already doing “manual data mining”
— Good machine learning is just the intelligent
application of statistical processes
— A lot of data mining research focused on tweaking
existing techniques to get small percentage gains
3. Data/Information/Knowledge/Wisdom
• For example, a data mining application may tell
you that there is a correlation between buying
music magazines and beer, but it doesn't tell
you how to use that knowledge. Should you
put the two close together to reinforce the
tendency, or should you put them far apart as
people will buy them anyway and thus stay in
the store longer?
• Data mining can help managers plan strategies for a company; it does not give them the strategies.
Data mining Functions
• All Data Mining functions can be thought of as
attempting to find a model to fit the data.
• Each function needs criteria to prefer one model over another.
• Each function needs a technique to compare the data.
• Two types of model:
– Predictive models predict unknown values based
on known data
– Descriptive models identify patterns in data
Predictive Model
— A “black box” that makes predictions about
the future based on information from the
past and present
— Large number of inputs usually available
Kinds of Data Mining problems
Database
– Find all credit applicants with Aditi as first name
– Identify customers who have purchased
more than ₹ 10,000 in the last month
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)
Kinds of Data Mining problems
• Classification
• Clustering
• Association Rules
Classification
Classification Model
Definition of Classification Problem
Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f: D→C where each ti is assigned to one class.
Example: Credit Card
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125 Cr          No
2    No      Married         100 Cr          No
3    No      Single          70 Cr           No
4    Yes     Married         120 Cr          No
5    No      Divorced        95 Cr           Yes
6    No      Married         60 Cr           No
7    Yes     Divorced        220 Cr          No
8    No      Single          85 Cr           Yes
9    No      Married         75 Cr           No
10   No      Single          90 Cr           Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75 Cr           ?
Yes     Married         50 Cr           ?
No      Married         150 Cr          ?
Yes     Divorced        90 Cr           ?
No      Single          40 Cr           ?
No      Married         80 Cr           ?

Training Set → Learn Classifier → Model (applied to the Test Set)
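The mapping f: D→C can be made concrete on the credit card example. The rules inside `f` below are hand-written for illustration and are an assumption, not the model from the slide; they are simply one mapping that agrees with every labelled training tuple. A real classifier would learn such rules from the training set.

```python
# Training set from the slide: (refund, marital_status, taxable_income, cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set from the slide; the class is unknown ("?").
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def f(refund, marital_status, taxable_income):
    """Map one tuple to a class in C = {"Yes", "No"} (hand-written rules, an assumption)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if taxable_income >= 80 else "No"

# Check the mapping against the labelled training tuples.
correct = sum(f(r, m, i) == cheat for r, m, i, cheat in training_set)
print(f"{correct}/{len(training_set)} training tuples classified correctly")

# Assign a class to every test tuple.
predictions = [f(*t) for t in test_set]
print(predictions)
```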
Another Example ...
• To which group does the target object belong?
Target Object
oopps
Group 1: Delia
Group 2: Roses
(Experiment reported on in Cognitive Science, 2002)
Resemblance
• People classify things by finding other
items that are similar which have
already been classified.
• For example: Is a new species a bird?
Does it have the same attributes as lots
of other birds? If so, then it's probably a
bird too.
• A combination of rote memorization and the notion of
'resembles'.
• Although kiwis can't fly like most other birds, they resemble
birds more than they resemble other types of animals.
• So the problem is to find which instances most closely
resemble the instance to be classified.
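Classifying by resemblance can be sketched as a k-nearest-neighbour vote. The attributes, the example animals and the choice of k=3 are all hypothetical; the distance simply counts differing attributes.

```python
from collections import Counter

# Hypothetical attributes: (has_feathers, lays_eggs, can_fly, has_fur).
classified = [
    ((1, 1, 1, 0), "bird"),    # sparrow
    ((1, 1, 1, 0), "bird"),    # eagle
    ((1, 1, 0, 0), "bird"),    # penguin
    ((0, 0, 1, 1), "mammal"),  # bat
    ((0, 0, 0, 1), "mammal"),  # dog
    ((0, 0, 0, 1), "mammal"),  # cat
]

def distance(a, b):
    """Number of attributes on which two instances differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))

def classify(instance, k=3):
    """Majority class among the k already-classified instances that most resemble it."""
    nearest = sorted(classified, key=lambda item: distance(instance, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

kiwi = (1, 1, 0, 0)  # feathers, lays eggs, cannot fly, no fur
print(classify(kiwi))
```

Even though the kiwi cannot fly, its three closest matches are all birds, so it is classified as a bird.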
Few More Examples
• Loan companies can “give you results in minutes” by classifying you into a good credit risk or a bad risk, based on your personal information and a large supply of previous, similar customers.
• Cell phone companies can classify customers into those likely to leave, and hence need enticement, and those that are likely to stay regardless.
• The data generated by airplane engines can be used to determine when it needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).
Clustering
• Classification is supervised learning: the supervision comes from labeling the instances with the class.
• Clustering is unsupervised learning: there are no predefined class labels, no training set.
• So our clustering algorithm needs to assign a cluster to each instance such that objects in the same cluster are more similar to one another than to objects in other clusters.
Clustering
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
• The goal is to find the most 'natural' groupings of the instances.
- Within a cluster: Maximize similarity between instances.
- Between clusters: Minimize similarity between instances.
Intra-cluster distances are minimized; inter-cluster distances are maximized.
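The two quantities can be computed directly for a given grouping. The points and their cluster assignment below are hypothetical, and average pairwise Euclidean distance is only one of several reasonable ways to measure them.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    1: [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    2: [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Average distance between pairs of points inside the same cluster.
intra = mean(
    dist(p, q)
    for points in clusters.values()
    for p, q in combinations(points, 2)
)

# Average distance between pairs of points in different clusters.
inter = mean(dist(p, q) for p in clusters[1] for q in clusters[2])

print(intra < inter)  # a good grouping has small intra and large inter distances
```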
Clustering
• For example, we might have the following data:
• Where the axes are two dimensions and shape is a
third, nominal attribute.
Clustering
• A clustering algorithm might find three clusters:
• Even though there are some squares and circles mixed
together.
Outliers
(Figure: Cluster 1, Cluster 2, and outliers)
What is a natural grouping among these objects?
Clustering is subjective
(Figure: the same objects grouped as Tatkare’s Family and School Employees, or as Females and Males)
What is Similarity?
“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)
Similarity is hard to define, but… “We know it when we see it”
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Clustering Problem
• Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D→{1,…,k} where each ti is assigned to one cluster Kj, 1<=j<=k.
• A Cluster, Kj, contains precisely those tuples mapped to it.
• Unlike the classification problem, clusters are not known a priori.
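One common way to build such a mapping is k-means, which the slides do not name; it is used here only as an example technique. The points are hypothetical, and seeding the clusters with the first k tuples is a simplifying assumption.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Return, for each point, the cluster number (1..k) it is assigned to."""
    centroids = list(points[:k])  # assumption: first k points seed the clusters
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: map each tuple to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return [a + 1 for a in assignment]  # clusters numbered 1..k as on the slide

# Hypothetical database D of 2-D tuples.
D = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
f = k_means(D, k=2)
print(f)
```

No class labels are supplied; the two groups emerge only from the distances between tuples.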
Applications
• Marketing: Discover consumer groups based on their purchasing habits
• City Planning: Identify groups of buildings by type, value, location
• Image Processing: Identify clusters of similar images (e.g. horses)
• Biological: Discover groups of plants/animals with similar properties
Applications
• Given:
– A source of textual documents
– Similarity measure
• e.g., how many words are common in these documents
• Find:
– Several clusters of documents that are relevant to each other
(Figure: a documents source and a similarity measure feed a clustering system, which outputs clusters of documents)
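The example similarity measure, how many words two documents have in common, takes only a few lines. The documents below are hypothetical, and splitting on whitespace is a simplification of real text processing.

```python
def common_words(doc_a, doc_b):
    """Similarity measure: number of distinct words the two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

# Hypothetical documents.
docs = {
    "d1": "amazon rainforest river wildlife",
    "d2": "rainforest wildlife conservation",
    "d3": "amazon online shopping books",
}

print(common_words(docs["d1"], docs["d2"]))  # shares "rainforest" and "wildlife"
print(common_words(docs["d2"], docs["d3"]))  # shares no words
```

A clustering system would use such pairwise scores to put d1 and d2 in one cluster and d3 in another.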
Association Rules
• A common application is market basket analysis, which determines
(1) which items are frequently sold together at a supermarket
(2) how to arrange items on shelves and which items should be promoted together
Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule Discovery
Market basket:
Rule form: “Body → Head [support, confidence]”.
buys(X, 'beer') → buys(X, 'snacks') [1%, 60%]
(a) Of the customers X who purchased 'beer', 60% also purchased 'snacks'
(b) 1% of all transactions contain the items 'beer' and 'snacks' together
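Support and confidence can be computed for the two rules discovered in the five-transaction example. This is a minimal sketch: the function names are illustrative, and it only scores given rules; it does not search for them.

```python
# The five transactions from the association rule discovery example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """Fraction of transactions containing body that also contain head."""
    return support(body | head) / support(body)

for body, head in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(body), "-->", sorted(head),
          f"support={support(body | head):.2f}",
          f"confidence={confidence(body, head):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, giving support 3/5 and confidence 3/4.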
A Weka is a strong brown bird which is native to New Zealand and grows to be about the same size as a chicken. The Weka was once fairly common on the North and South Islands of New Zealand but over the years has heavily declined on the North Island due to major damage to its habitat.
• Three graphical user interfaces
– “The Explorer” (exploratory data analysis)
– “The Experimenter” (experimental environment)
– “The KnowledgeFlow” (new process model inspired interface)
WEKA is available at http://www.cs.waikato.ac.nz/ml/weka
References
• Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005
• Dunham, Margaret H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003
References: Yahoo Group
• ‘dbmsnotes’ http://tech.groups.yahoo.com/group/dbmsnotes/