Data Mining -1-
Dr. Engin YILDIZTEPE

Course Web Page
• http://kisi.deu.edu.tr/engin.yildiztepe/
• Lecture Slides
• Announcements
• Assignments
Reference Books
• Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
• Larose, Daniel T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
• Alpaydın, E. (2010). Introduction to Machine Learning. Second Ed. London: MIT Press.

SYLLABUS
• Data Mining – Introduction
• Databases
• Data Warehouses
• OLAP
• Data mining process
• Data mining tasks
• Clustering
• Classification
• Association Rules
• Evaluation
Grading
• Midterm examination 40%
• Final examination 50%
• Homework 10%

Introduction
• "We are drowning in information but starved for knowledge." (John Naisbitt)
• Moore's Law: In 1965, Intel Corporation cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double approximately every two years (often quoted as 18 months).
• Experts on ants estimate that there are 10^16 to 10^17 ants on earth. In the year 1997, there was one transistor per ant.
Computer History – What is the idea?
• Abacus (BC 2600)
• Calculating Clock – Wilhelm Schickard (1623)
• Pascaline – Blaise Pascal (1642)
• Leibniz wheel – Gottfried Leibniz (1672)
• Analytical Engine – Charles Babbage (1837)
• Mark I – Howard H. Aiken (1944)
• ENIAC – John Mauchly and J. Presper Eckert (1947)
• EDVAC – J. Mauchly, J. P. Eckert and John von Neumann (1951)
• MANIAC – John von Neumann (1952)
• First Microprocessor – Intel 4004 (1971)
• APPLE – Steve Wozniak and Steve Jobs (1976)
• IBM Personal Computer (1981)

• The aim of data mining is to extract hidden knowledge from data sets.
What is Data Mining?
• "Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules." [1]
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." [2]
• "Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases." [3]

[1] Berry, Michael J.A. and Linoff, Gordon, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., 1997.
[2] Hand, David, Mannila, Heikki, and Smyth, Padhraic, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
[3] Cabena, Peter, Hadjinian, Pablo, Stadler, Rolf, Verhees, Jaap, and Zanasi, Alessandro, Discovering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998.
KDD Process
Figure 1. An Overview of the Steps That Compose the Knowledge Discovery in Databases Process.
Fayyad, U., et al., "From Data Mining to Knowledge Discovery in Databases", 1996.

Why Data Mining?
• The explosive growth in data collection
• The storing of the data in data warehouses, so that the entire enterprise has access to a reliable, current database
• The availability of increased access to data from Web navigation and intranets
• The competitive pressure to increase market share in a globalized economy
• The development of commercial data mining software
• The tremendous growth in computing power and storage capacity

Database
• A relational database is a collection of tables.
• Tables consist of a set of columns.
• Tables store a large set of tuples (records or rows).
• A database system consists of a collection of interrelated data.
• Relational data can be accessed by queries written in a relational query language (SQL).
Data Warehouse
• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and which usually resides at a single site.
• Data warehouses are constructed via a process of data cleansing, data transformation, data integration, data loading, and periodic data refreshing.
• Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

What exactly is a data warehouse?
• A data warehouse refers to a database that is maintained separately from an organization's operational databases.
• A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide.

Operational Database – Transactional Database
• An operational database consists of the data used to run the day-to-day operations of the business.
• An operational database contains enterprise data which are up to date and modifiable.
• The operational database is the source of the data warehouse.
• A transactional database consists of transactions. Each record in a transactional database captures a transaction, such as a customer's purchase.
Data Mart
• Data marts are a subset of data warehouse data.
• A data mart is a department subset of a data warehouse.
• It focuses on selected subjects, and thus its scope is department-wide.

OLAP
• By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for On-Line Analytical Processing, or OLAP.
• OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints.
• Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization.
• Traditional query and report tools describe what is in a database.
• OLAP goes further; it is used to answer why certain things are true.
• The user forms a hypothesis about a relationship and verifies it with a series of queries against the data.

Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
OLAP vs. Data Mining
• The OLAP analyst generates a series of hypothetical patterns and relationships and uses queries against the database to verify or disprove them.
• OLAP analysis is essentially a deductive process.
• But what happens when the number of variables being analyzed is in the dozens or even hundreds? It becomes much more difficult and time-consuming to find a good hypothesis.
• Data mining is different from OLAP because rather than verify hypothetical patterns, it uses the data itself to uncover such patterns.
• Data mining is essentially an inductive process.
• For example, suppose an analyst who wanted to identify the risk factors for loan default were to use a data mining tool.
• Where can data mining and OLAP complement each other? OLAP is also complementary in the early stages of the KDD process: it can help you explore your data, for instance by focusing attention on important variables, identifying exceptions, or finding interactions.

Data Mining Tasks
• The most common data mining tasks are as follows:
  • Description
  • Estimation
  • Prediction
  • Classification
  • Clustering
  • Association
CROSS-INDUSTRY STANDARD PROCESS: CRISP-DM
The data mining process must be reliable and repeatable.
1. Business understanding phase
2. Data understanding phase
3. Data preparation phase
4. Modeling phase
5. Evaluation phase
6. Deployment phase

1. Business understanding phase
• Understand the project objectives and requirements.
• Define the data mining problem.
• Prepare a strategy for achieving these objectives.

2. Data understanding phase
• Initial data collection.
• Exploratory data analysis.
• Identification of data quality problems.

3. Data preparation phase
• Prepare the final data set.
• Select the records and variables you want to analyze.
• Perform transformations on certain variables.
• Clean the raw data.

4. Modeling phase
• Select and apply appropriate modeling techniques.
• Calibrate parameters to optimize results.
• Several different techniques may be used for the same problem.
• If necessary, loop back to the data preparation phase.

5. Evaluation phase
• Evaluate the one or more models for quality and effectiveness.
• Determine whether the model in fact achieves the objectives set.
• Come to a decision regarding use of the data mining results.

6. Deployment phase
• Make use of the models created.

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Preprocessing
• Much of the raw data contained in databases is unpreprocessed: incomplete, noisy, and inconsistent.
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes or names
• For example, the databases may contain:
  • Fields that are obsolete or redundant
  • Missing values
  • Outliers
  • Data in a form not suitable for data mining models
  • Values not consistent with policy or common sense
• Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include:
  • table, record and attribute selection
  • transformation
  • cleaning of data

Tasks in Data Preprocessing
• Clean data
  • To fill in missing values
  • To smooth out noise and identify or remove outliers
  • To correct inconsistencies
• Integrate data
  • To combine data from multiple sources
• Data transformation
  • The production of derived attributes
  • Format transformations
  • Normalization (scaling to a specific range)
  • Aggregation
• Data reduction
  • Obtains a reduced representation in volume but produces the same or similar analytical results
  • Data discretization: of particular importance, especially for numerical data
  • Data aggregation, dimensionality reduction, data compression, generalization

Forms of data preprocessing
(Figure: forms of data preprocessing)
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning routines attempt to:
  • fill in missing values,
  • smooth out noise while identifying outliers,
  • correct inconsistencies in the data.
• Other data problems which require data cleaning:
  • duplicate records
  • incomplete data

Data Cleaning – missing data
• Data is not available: many tuples have no recorded value for several attributes, such as customer income in sales data. How can you go about filling in the missing values for this attribute? Let's look at the following methods.
• Ignore the tuple: usually done when the class label is missing.
• Fill in the missing value manually (tedious + infeasible).
• Use a global constant to fill in the missing value (a new class?).
• Use the attribute mean to fill in the missing value.
• Use the attribute mean for all samples belonging to the same class as the given tuple (see the pandas sketch below).
• Use the most probable value to fill in the missing value: inference-based methods such as regression or decision trees.

Exercise – kdnuggets.com
• Datasets – UCI Machine Learning Repository
• Adult data set
• Predict whether income exceeds $50K/yr based on census data
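Below is a minimal pandas sketch of two of the filling strategies listed above (the attribute mean and the class-conditional attribute mean). The column names and values are invented for illustration and are not from the course material.

```python
import numpy as np
import pandas as pd

# Invented example: "income" has missing values, "occupation" plays the role of the class.
df = pd.DataFrame({
    "occupation": ["sales", "sales", "tech", "tech", "tech"],
    "income":     [50000, np.nan, 70000, np.nan, 65000],
})

# Use the attribute (column) mean to fill in the missing value.
df["income_filled_mean"] = df["income"].fillna(df["income"].mean())

# Use the attribute mean of all samples belonging to the same class as the given tuple.
df["income_filled_class_mean"] = df["income"].fillna(
    df.groupby("occupation")["income"].transform("mean")
)

print(df)
```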
Smooth Out Noise
• Noise is a random error or variance in a measured variable.
• Data smoothing techniques:
  • Binning
  • Clustering
  • Combined computer and human inspection
  • Regression

How to Handle Noisy Data?
• Binning method:
  • first sort the data and partition it into (equi-depth) bins
  • then smooth by bin means, bin medians, or bin boundaries
  • also used for discretization
• Clustering:
  • detect and remove outliers
• Semi-automated method: combined computer and human inspection
  • detect suspicious values and check them manually
• Regression:
  • smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size.
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A) / N.
  • The most straightforward approach.
  • But outliers may dominate the presentation, and skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples (except possibly the last one).
  • Good data scaling, good handling of skewed data.
  • Managing categorical attributes can be tricky.
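As a quick illustration of the two schemes (my own sketch, not from the slides), pandas' cut() produces equal-width intervals over the data range, while qcut() produces equal-depth intervals:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=4)   # 4 intervals of equal width over the data range
equal_depth = pd.qcut(prices, q=4)     # 4 intervals holding roughly equal numbers of values

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```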
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34

Example – equi-width binning
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Take bin width = (40 − 0) / 4 = 10

Bin #  Bin elements                     Bin boundaries
1      {4, 8, 9}                        [0, 10)
2      {15}                             [10, 20)
3      {21, 21, 24, 25, 26, 28, 29}     [20, 30)
4      {34}                             [30, 40)

• Smoothing by bin means:
  Bin 1: 7, 7, 7
  Bin 2: 15
  Bin 3: 25, 25, 25, 25, 25, 25, 25
  Bin 4: 34
• Smoothing by bin boundaries:
  Bin 1: 4, 9, 9
  Bin 2: 15
  Bin 3: 21, 21, 21, 21, 29, 29, 29
  Bin 4: 34
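The short Python sketch below simply re-implements the equi-width smoothing-by-means example above (bin width 10 over [0, 40)); it is an illustration of the slide's numbers, not course code.

```python
# Equi-width binning and smoothing by bin means for the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, low = 10, 0

# Assign each value to a bin index based on [0,10), [10,20), [20,30), [30,40)
bins = {}
for p in prices:
    idx = (p - low) // width
    bins.setdefault(idx, []).append(p)

# Smooth by replacing every value with its bin mean (rounded, as on the slide)
for idx in sorted(bins):
    mean = round(sum(bins[idx]) / len(bins[idx]))
    print(f"Bin {idx + 1}: {[mean] * len(bins[idx])}")
# Expected output: Bin 1 -> [7, 7, 7], Bin 2 -> [15], Bin 3 -> [25] * 7, Bin 4 -> [34]
```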
Cluster Analysis
(Figure)

Regression
(Figure: a fitted line y = x + 1, with a point X1, its observed value Y1, and its fitted value Y1′)
• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Data Integration
• Data integration combines data from multiple sources into a coherent store.
• Schema integration: integrate metadata from different sources.
• Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#.
• Detecting and resolving data value conflicts:
  • for the same real world entity, attribute values from different sources are different
  • possible reasons: different representations, different scales (e.g., metric vs. British units), different currencies

Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases:
  • The same attribute may have different names in different databases.
  • One attribute may be a "derived" attribute in another table, e.g., annual revenue.
• Redundant data may be detected by correlation analysis (see the numpy sketch below):

  r_{A,B} = Σ(A − Ā)(B − B̄) / ((n − 1) · σ_A · σ_B)

• Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Data Transformation
• Smoothing: remove noise from data (binning, clustering, regression).
• Aggregation: summarization, data cube construction.
• Generalization: low level data are replaced by higher level concepts through the use of concept hierarchies. Ex: the street attribute can be generalized to city; age → child, young, middle aged, senior.
• Normalization: values scaled to fall within a small, specified range:
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones.
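A minimal numpy sketch of this correlation check, using invented revenue figures in which one attribute is derived from the other:

```python
import numpy as np

annual_revenue  = np.array([120.0, 95.0, 180.0, 60.0, 150.0])   # invented values
monthly_revenue = annual_revenue / 12.0                          # a "derived", redundant attribute

def correlation(a, b):
    # r_{A,B} = Σ(A − Ā)(B − B̄) / ((n − 1) σ_A σ_B), using sample standard deviations
    n = len(a)
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(correlation(annual_revenue, monthly_revenue))   # ≈ 1.0, flagging the redundancy
```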
Data Transformation: Normalization
• Particularly useful for classification (neural networks, distance measurements, nearest-neighbor classification, etc.); see the numpy sketch below.
• Min-max normalization:

  x' = ((x − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

  • min_A and max_A are the minimum and maximum values of attribute A.
  • Min-max normalization maps a value x to x' in the range [new_min_A, new_max_A].
• Z-score normalization: useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization:

  x' = (x − mean_A) / stand_dev_A

• Normalization by decimal scaling:

  v' = v / 10^j,  where j is the smallest integer such that max(|v'|) < 1

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Data Reduction Strategies
• A data warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.
• Data reduction obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.
• Data reduction strategies:
  • Data cube aggregation
  • Dimensionality reduction
  • Numerosity reduction
  • Discretization and concept hierarchy generation
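The three normalization methods above can be sketched in a few lines of numpy; the attribute values below are made up for illustration.

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score normalization
x_zscore = (x - x.mean()) / x.std(ddof=1)

# Decimal scaling: smallest integer j such that max(|v'|) < 1
j = 0
while np.abs(x / 10 ** j).max() >= 1:
    j += 1
x_decimal = x / 10 ** j

print(x_minmax, x_zscore, x_decimal, sep="\n")
```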
Data Cube Aggregation
• The lowest level of a data cube:
  • the aggregated data for an individual entity of interest
• Multiple levels of aggregation in data cubes:
  • further reduce the size of the data to deal with
• Reference appropriate levels:
  • use the smallest representation which is enough to solve the task

Dimensionality Reduction
• Problem: feature selection (i.e., attribute subset selection):
  • Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features.
  • Irrelevant, weakly relevant, or redundant features are detected and removed.
  • Nice side-effect: reduces the number of attributes in the discovered patterns (which are now easier to understand).
• Solution: heuristic methods (due to the exponential number of choices), usually greedy:
  • step-wise forward selection
  • step-wise backward elimination
  • combining forward selection and backward elimination
  • decision-tree induction

Heuristic Feature Selection Methods
• There are 2^d possible sub-features of d features.
• Several heuristic feature selection methods:
  • Best single features under the feature independence assumption: choose by significance tests.
  • Step-wise feature selection: the best single feature is picked first, then the next best feature conditional on the first, ... (see the sketch below).
  • Step-wise feature elimination: repeatedly eliminate the worst feature.
  • Combined feature selection and elimination.
  • Optimal branch and bound: use feature elimination and backtracking.

Numerosity Reduction
• Parametric methods:
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
  • E.g., log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces.
• Non-parametric methods:
  • Do not assume models.
  • Major families: histograms, clustering, sampling.
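The following is a sketch of greedy step-wise forward selection under my own assumptions (scikit-learn, a synthetic data set, and cross-validated accuracy as the score); it is not course code. At each step the feature that most improves the score is added, and the loop stops when no remaining feature helps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Score each candidate feature when added to the already-selected set
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # stop when no feature improves the score
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```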
Regression and Log-Linear Models
• Linear regression: data are modeled to fit a straight line.
  • Often uses the least-squares method to fit the line.
• Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector (predictor variables).
• Log-linear model: approximates discrete multidimensional joint probability distributions.

Regression Analysis and Log-Linear Models
• Linear regression: Y = α + β X (see the sketch below)
  • The two parameters, α and β, specify the line and are estimated from the data at hand,
  • by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into the above.
• Log-linear models:
  • The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  • Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd

Histograms
• Approximate data distributions.
• Divide the data into buckets and store the average (sum) for each bucket.
• A bucket represents an attribute-value/frequency pair.
• Can be constructed optimally in one dimension using dynamic programming.
• Related to quantization problems.
(Figure: equal-width histogram of price values from 10,000 to 90,000, with counts up to 40)

Clustering
• Partition the data set into clusters, and store the cluster representation only.
• Quality of clusters is measured by their diameter (max distance between any two objects in the cluster) or centroid distance (avg. distance of each cluster object from its centroid).
• Can be very effective if the data is clustered, but not if the data is "smeared".
• Can have hierarchical clustering (possibly stored in multi-dimensional index tree structures: B+-tree, R-tree, quad-tree, etc.).
• There are many choices of clustering definitions and clustering algorithms (further details later).
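A tiny least-squares sketch for Y = α + β X on invented points (my own illustration of the estimation step described above):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares estimates of the slope β and intercept α
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

print(f"Y ≈ {alpha:.2f} + {beta:.2f} X")   # the fitted straight line
```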
Sampling
• Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data.
• Cost of sampling: proportional to the size of the sample; increases linearly with the number of dimensions.
• Choose a representative subset of the data:
  • Simple random sampling may have very poor performance in the presence of skew.
• Develop adaptive sampling methods:
  • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
• Sampling may not reduce database I/Os (page at a time).
• Sampling is a natural choice for progressive refinement of a reduced data set.
(Figure: raw data reduced by simple random sampling and by a cluster/stratified sample)
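A minimal stratified-sampling sketch in pandas; the class distribution below is invented to mimic skewed data, with each class sampled in proportion to its share of the whole.

```python
import pandas as pd

df = pd.DataFrame({
    "cls":   ["A"] * 90 + ["B"] * 10,          # skewed: 90% class A, 10% class B
    "value": range(100),
})

frac = 0.2   # draw a 20% sample
stratified = (df.groupby("cls", group_keys=False)
                .apply(lambda g: g.sample(frac=frac, random_state=0)))

print(stratified["cls"].value_counts())   # ~18 A and ~2 B, preserving the proportions
```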
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Discretization/Quantization
• Three types of attributes:
  • Nominal — values from an unordered set
  • Ordinal — values from an ordered set
  • Continuous — real numbers
• Discretization/Quantization: divide the range of a continuous attribute into intervals.
  • Some classification algorithms only accept categorical attributes.
  • Reduce data size by discretization.
  • Prepare for further analysis.

Discretization and Concept Hierarchy
• Discretization
  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
  • Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

Discretization and concept hierarchy generation for numeric data
• Hierarchical and recursive decomposition using:
  • Binning (data smoothing)
  • Histogram analysis (numerosity reduction)
  • Clustering analysis (numerosity reduction)
  • Entropy-based discretization

Concept hierarchy generation without data semantics – specification of a set of attributes
• A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy (limitations?).

  country              15 distinct values
  province_or_state    65 distinct values
  city                 3,567 distinct values
  street               674,339 distinct values
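A toy sketch of this automatic ordering (with invented location data, not the counts from the slide): count the distinct values per attribute and order the attributes from fewest to most.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["TR", "TR", "TR", "US", "US", "US"],
    "city":    ["Izmir", "Izmir", "Ankara", "Boston", "Boston", "Austin"],
    "street":  ["St 1", "St 2", "St 3", "St 4", "St 5", "St 6"],
})

# Fewest distinct values -> highest level; most distinct -> lowest level of the hierarchy
hierarchy = df.nunique().sort_values().index.tolist()
print(" > ".join(hierarchy))   # country > city > street
```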
Summary
• Data preparation is a big issue for both warehousing and mining.
• Data preparation includes:
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
• A lot of methods have been developed, but this is still an active area of research.