Chapter Three
Data, pre-processing and
exploration
Data Mining Techniques and Applications, 1st edition
Hongbo Du
ISBN 978-1-84480-891-5 © 2010 Cengage Learning
Chapter Overview
• Data, data types and operations
• Properties of various data sets
• Data source and data warehouse
• Issues of data quality
• Data pre-processing operations
• Data summary and visualisation
• Online analytic processing (OLAP)
• Data exploration and visualisation in Weka
Data, Data Types and Operations
• Data object and attributes
– Data object or instance: individual independent
recording of a real life object/event.
– Characterised by its recorded values on a fixed set of
features or attributes
– Feature or attribute: a specific property or
characteristic of the data object.
– Measurement: assigning a valid value to an attribute
according to an appropriate measurement scale.
– Collection: collecting measurement results or
recorded values
Data, Data Types and Operations
• Data object and attributes (cont’d)
– An example
123, “John Smith”, “03/02/1990”, 20, “male”, 1.82, 78
• 123: ID number (collected)
• “John Smith”: Name (collected)
• “03/02/1990”: Birthday (collected)
• 20: Age (calculated)
• “male”: Gender (collected)
• 1.82: Body height (measured)
• 78: Body weight (measured)
Data, Data Types and Operations
• Data object and attributes (cont’d)
– Measurement and measurement errors
• Precision: the closeness of measurements to one another,
represented by the standard deviation of the measurements,
e.g. repeated measure of body temperature
• Bias: a systematic deviation of measurements from the
quantity being measured, known only when an external
reference is available, e.g. bias in a weighing instrument
• Accuracy: the closeness of the measure to the true value,
indicated by the number of significant digits used in the
measurement, e.g. measure of money: pound vs. penny
– Collection errors
• Incorrect data recording at the point of entry, e.g. “Hongpo
Do” recorded for “Hongbo Du”
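The definitions of precision and bias above can be illustrated with a small sketch. The readings and the reference value of 50.0 are made-up numbers chosen for illustration, not data from the chapter.

```python
import statistics

# Hypothetical repeated readings of a reference weight assumed to be 50.0 kg.
true_value = 50.0
readings = [50.4, 50.6, 50.5, 50.3, 50.7]

# Precision: closeness of the readings to one another,
# expressed here as the sample standard deviation.
precision = statistics.stdev(readings)

# Bias: systematic offset of the mean reading from the external reference.
bias = statistics.mean(readings) - true_value

print(f"precision (std dev) = {precision:.3f}")
print(f"bias                = {bias:+.3f}")
```

In this sketch the instrument is fairly precise (small spread) but biased (readings sit about half a unit above the reference), which shows that the two properties are independent.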
Data, Data Types and Operations
• Attribute domain types and operations
– Categorical/Qualitative types
• Nominal, e.g. Gender (M, F)
– A set of names: no concept of order nor difference
– Operators applicable: =, ≠
– 1:1 transformation permissible, e.g. ID: 11 → e901
• Ordinal, e.g. Grade (A, B, C, D, E)
– A set of names: with order but no concept of difference
– Operators applicable: =, ≠, <, >, ≤, ≥
– Order-preserving transformation permitted,
e.g. Grade: A → First, B → Second, C → Third, D →
Pass, E → BarePass.
Data, Data Types and Operations
• Attribute domain types and operations
– Numeric/Quantitative types
• Interval, e.g. Temperature in °C
– A set of numeric values: both order and difference exist
– Operators applicable: =, ≠, <, >, ≤, ≥, +, −
– e.g. temperature (°F and °C), calendar year
– Transformation new = a*old + b permitted, e.g. °F → °C
• Ratio, e.g. Length
– A set of numeric values: order, difference and ratio
– The set has an absolute zero
– Operators applicable: =, ≠, <, >, ≤, ≥, +, −, ×, ÷
– Transformation new = a*old permitted, e.g. metre → feet
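A brief sketch of why the permitted transformations differ between interval and ratio attributes. The sample temperatures and lengths are illustrative values only; the conversion constants are the standard ones.

```python
def fahrenheit_to_celsius(f):
    # Interval scale: new = a*old + b, with a = 5/9 and b = -160/9.
    return (f - 32.0) * 5.0 / 9.0

def metres_to_feet(m):
    # Ratio scale: new = a*old (one foot is 0.3048 metres).
    return m / 0.3048

f1, f2, f3 = 50.0, 68.0, 104.0
c1, c2, c3 = (fahrenheit_to_celsius(f) for f in (f1, f2, f3))

# Ratios of differences survive the interval transformation ...
diff_ratio_f = (f3 - f2) / (f2 - f1)
diff_ratio_c = (c3 - c2) / (c2 - c1)

# ... but ratios of the values themselves do not, because the zero is arbitrary.
value_ratio_f = f3 / f1
value_ratio_c = c3 / c1

# For a ratio attribute, ratios of values are preserved by new = a*old.
m1, m2 = 2.0, 6.0
length_ratio_m = m2 / m1
length_ratio_ft = metres_to_feet(m2) / metres_to_feet(m1)

print(diff_ratio_f, diff_ratio_c)
print(value_ratio_f, value_ratio_c)
print(length_ratio_m, length_ratio_ft)
```

This is why statements such as "twice as hot" are not meaningful for temperatures in °C or °F, while "three times as long" is meaningful for lengths.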
Data Sets
• Various forms
– Table of records
• Relational table
• Join of relational tables
• Numerical spreadsheet (data matrix)
• Boolean strings (document-term matrix)
– Ordered data
• Time series and temporal sequence
• Data sequence
• Spatial data
– Graph-based data
– Non record-based data
Data Sets
• Various forms (illustrated)
Relational Table
Age Group    Own Car  Income Band  Class
young        yes      low          risky
young        no       low          risky
middle aged  yes      middle       risky
middle aged  no       high         safe
middle aged  yes      low          risky
young        yes      high         risky
middle aged  no       low          safe
retired      yes      middle       safe
retired      no       middle       safe
retired      yes      high         safe
Transaction Database
TID  Items
100  apple, beer, newspaper
200  apple, beef, beer, newspaper, potato
300  beef, potato
400  beef, noodles
500  beef, potato
Data Sequence
GGTTCCGCCTTCAGCC
CCGCGCCCGCAGGG…
[Figures: Data Matrix; Web Structure (pages connected by links); Spatial Data]
Data Sets
• Properties
– Type: file structure, e.g. ARFF for Weka, DAT for See5
– Size: measured in terms of the total number of
records or total number of bytes, e.g. small (MB), medium
(GB) and large (TB)
– Dimensionality: number of attributes
– Sparsity:
• Values are skewed to some extreme or sub-ranges
• Asymmetric values (some are more important than others)
– Resolution
• Right level of data details
• Related to the intended purpose
Data Sets
• Properties (example insurance data set)
Type: ARFF
Size: 14722 records
Dimensionality: 7
Asymmetric: Y/N
Resolution: detailed
Skewed?
Data Source and Data Warehouse
• Sources of data
– Local data source available
– Local operational systems from different departments
– Third-party external data source
– Enterprise/Organisational data warehouse
• An organisational database for decision making
• A central data repository separate from operational systems
• Enforcing organisation-wide data consistency and integration
• Providing data details as well as data summarisation
• Providing data values as well as meta-data
• Equipped with data analysis and reporting tools
• As a data source for data mining
Data Source and Data Warehouse
• Star schema for data warehouse
– Central fact table
– Dimension tables
– Limited use of join operations
Fact table: Supply(s#, p#, pj#, qty)
Dimension tables:
Part(p#, pname, weight, colour)
Project(pj#, jname, status, date)
Supplier(s#, sname, city, status)
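The star schema above can be sketched with an in-memory SQLite database. This is only an illustration: the column names `s_no`, `p_no` and `pj_no` stand in for s#, p# and pj#, and all rows are invented. Each dimension needs only one join to the central fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (column names s_no/p_no/pj_no are stand-ins for s#, p#, pj#)
CREATE TABLE part     (p_no TEXT PRIMARY KEY, pname TEXT, weight REAL, colour TEXT);
CREATE TABLE project  (pj_no TEXT PRIMARY KEY, jname TEXT, status TEXT, date TEXT);
CREATE TABLE supplier (s_no TEXT PRIMARY KEY, sname TEXT, city TEXT, status INTEGER);
-- Central fact table referencing every dimension
CREATE TABLE supply   (s_no TEXT, p_no TEXT, pj_no TEXT, qty INTEGER);

-- Hypothetical rows for illustration only
INSERT INTO part     VALUES ('p1', 'bolt', 0.10, 'red'), ('p2', 'nut', 0.05, 'blue');
INSERT INTO project  VALUES ('j1', 'bridge', 'open', '2009-01-01');
INSERT INTO supplier VALUES ('s1', 'Acme', 'London', 20), ('s2', 'Bolt Co', 'Paris', 10);
INSERT INTO supply   VALUES ('s1', 'p1', 'j1', 100), ('s1', 'p2', 'j1', 50), ('s2', 'p1', 'j1', 30);
""")

# Total quantity supplied per supplier city and part colour:
# one join per dimension, all through the fact table.
rows = conn.execute("""
    SELECT su.city, p.colour, SUM(f.qty)
    FROM supply AS f
    JOIN supplier AS su ON su.s_no = f.s_no
    JOIN part     AS p  ON p.p_no  = f.p_no
    GROUP BY su.city, p.colour
    ORDER BY su.city, p.colour
""").fetchall()

for row in rows:
    print(row)
```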
Issues of Data Quality
• Main quality indicators
– Accuracy: data recorded with sufficient precision and
little bias
– Correctness: data recorded without errors or spurious
objects
– Completeness: whether any parts of data records are missing
– Consistency: compliance with established rules and
constraints
– Redundancy: unnecessary duplicates
Using the indicators to quantify quality of a data set
Improving quality if possible
Issues of Data Quality
• Some examples
– Accuracy & correctness with the road accident reports in
Exercise 1.3(c).
– Completeness with the UK family expenditure surveys in
Exercise 1.3(a).
– Incompleteness introduced by data integration using outer
join operation
– Consistency in questionnaires, e.g. eating fruit & veg.
Q1: “give the fruit&veg portion consumed yesterday”: 2
Q2: “give the fruit&veg portion consumed today”: 3
Q3: “do you eat more today than yesterday?” No.
– Redundancy in a local company’s database of 40,000
records about 15,000 client companies.
Issues of Data Quality
• Why is quality important?
– “Garbage in, garbage out!”
– Total data quality control requires a cultural change
(compare with total product quality control)
– For data mining, tackling the quality issue at the data
source cannot always be expected
• By cleaning the data as much as possible
• By developing and using more tolerant mining solutions
– Data quality is relevant to the intended purpose of data
mining, e.g. Do spelling errors in student names really matter
when only the increase/decrease of student numbers in particular
subject areas over the years is of interest?
Data Pre-processing
• Overview
– Purpose: for speedy, cost-effective and high quality
outcomes of data mining
– Pre-processing tasks (not all are independent from
each other)
• Data aggregation
• Data sampling
• Dimension reduction
• Feature selection
• Feature creation
• Discretisation/binarisation
• Variable transformation
• Dealing with missing values
Data Pre-processing
• Data aggregation
– What: to summarise low
level data details to higher
level data abstraction
– Why: to reduce the time of
mining, to rescale data
values, and to discover
more stable patterns
– How:
• By generalisation using a
given concept hierarchy
• By applying aggregate
functions (e.g. count, sum,
average)
• Dropping some attributes
Transaction records:
TID    Date        Item     Store       Price  Clubcard#  ……
……     ……          ……       ……          ……     ……         ……
32144  06/06/2006  milk     Buckingham  1.99   1111       ……
11122  04/04/2006  watch    Buckingham  25.99  1011       ……
11122  04/04/2006  battery  Buckingham  3.99   1011       ……
11123  04/04/2006  beer     Buckingham  9.99   1022       ……
22244  04/04/2006  beer     MK          6.99   1022       ……
22244  04/04/2006  nappies  MK          10.89  1022       ……
23311  05/04/2006  beer     MK          6.99   1011       ……
……     ……          ……       ……          ……     ……         ……
Aggregated by date and store:
Date        Store       AveragePrice  ……
……          ……          ……            ……
06/06/2006  Buckingham  1.99          ……
04/04/2006  Buckingham  13.32         ……
04/04/2006  MK          8.94          ……
05/04/2006  MK          6.99          ……
……          ……          ……            ……
Aggregated by Clubcard#:
Clubcard#  TotalPrice  ……
……         ……          ……
1111       1.99        ……
1011       36.97       ……
1022       27.87       ……
……         ……          ……
Number of Items: ……, 1, 3, 2, ……
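A minimal sketch of aggregation by applying aggregate functions (count, sum, average), using the seven transaction rows shown on the slide and grouping them by date and store. The grouping key and the rounding to two decimals are choices made for this illustration.

```python
from collections import defaultdict

# (TID, date, item, store, price, clubcard) rows as listed on the slide.
transactions = [
    (32144, "06/06/2006", "milk",    "Buckingham",  1.99, 1111),
    (11122, "04/04/2006", "watch",   "Buckingham", 25.99, 1011),
    (11122, "04/04/2006", "battery", "Buckingham",  3.99, 1011),
    (11123, "04/04/2006", "beer",    "Buckingham",  9.99, 1022),
    (22244, "04/04/2006", "beer",    "MK",          6.99, 1022),
    (22244, "04/04/2006", "nappies", "MK",         10.89, 1022),
    (23311, "05/04/2006", "beer",    "MK",          6.99, 1011),
]

# Group prices by (date, store); item-level detail is dropped.
groups = defaultdict(list)
for tid, date, item, store, price, card in transactions:
    groups[(date, store)].append(price)

# Apply the aggregate functions to each group.
summary = {
    key: {
        "count": len(prices),
        "average": round(sum(prices) / len(prices), 2),
        "total": round(sum(prices), 2),
    }
    for key, prices in groups.items()
}

for key, stats in summary.items():
    print(key, stats)
```

Grouping by Clubcard# instead of (date, store) only requires changing the key used when filling `groups`.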
Data Pre-processing
• Data sampling
– What: selecting a subset of the
given data set
– Why: to make it possible to use
sophisticated mining algorithms
within a time limit.
– Caution: the sample must be
representative of the original data
set
– How:
• Random sampling
• Stratified sampling
• Progressive sampling
• With or without replacement
[Figure: Data population → Sampling method → Selected subset]
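The sampling options above can be sketched with the standard `random` module. The population, the stratum key and the 10% fraction are hypothetical; progressive sampling (growing the sample until results stabilise) is not shown.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 60 "young" and 40 "retired" records.
population = [("young", i) for i in range(60)] + [("retired", i) for i in range(40)]

# Simple random sampling without replacement: no record is picked twice.
without_replacement = random.sample(population, k=10)

# Simple random sampling with replacement: a record may be picked again.
with_replacement = random.choices(population, k=10)

def stratified_sample(records, key, fraction):
    """Sample the same fraction from every stratum so that each group
    stays represented in proportion to its size."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample

stratified = stratified_sample(population, key=lambda r: r[0], fraction=0.1)
print(len(without_replacement), len(with_replacement), len(stratified))
```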
Data Pre-processing
• Feature selection
– What: reducing dimensionality by
selecting a subset of attributes
– Purposes:
• To remove/reduce redundant features
• To remove irrelevant features with no
useful information for the mining task
– How:
• Manually with common sense and
domain knowledge
• Letting the mining solution select
suitable features (the embedded
approach)
• Filter and wrapper approaches
[Figure: attributes → Subset selection → One subset → evaluation → Stopping criterion (Not ok: back to subset selection; ok: Selected subset) → Validate with Mining task]
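One possible filter-style approach to removing redundant features is sketched below: a feature is dropped when it is almost perfectly correlated with a feature already kept. The 0.95 threshold, the function names and the sample columns are assumptions for illustration, not the method prescribed by the chapter.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists
    (assumes neither list is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with any
    feature that has already been kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: height is recorded twice, in centimetres and in inches.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "weight_kg": [70, 52, 88, 60, 75],
}

selected = filter_redundant(columns)
print(selected)
```

A filter like this runs independently of the mining algorithm, whereas a wrapper approach would evaluate each candidate subset by running the mining task itself.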
Data Pre-processing
• Data dimension reduction
– What: reduce redundancy implied among attributes
e.g. are all 9600 dimensions for a 120x80 pixel image
necessary?
– Curse of dimensionality: as dimensionality increases
• Data become more diverse, and any patterns become
less significant and more peculiar.
• The processing time may increase substantially.
– Why: to reduce redundancy and effects of the curse
– How:
• Linear algebra techniques
– Principal component analysis (PCA)
– Independent component analysis (ICA)
– Singular value decomposition (SVD)
• Feature selection (as described before)
Data Pre-processing
• Feature creation
– What: to create a new set of features from the original
features
– Purpose: in the new feature space, meaningful and
relevant patterns can be extracted more easily. The
number of features may be reduced.
– How:
• Using feature extraction methods to extract new features from the
existing ones, e.g. extracting colour, texture and shape from image
of pixel values
• Mapping data to a new space, e.g. wavelet transformation of pixel
values of images to a frequency domain
• Constructing new features from the existing ones using domain
knowledge, e.g. using transaction dates to construct a new feature
customer tenure that indicates the loyalty of the customer to the
company
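The customer tenure example can be sketched as follows. The customer IDs, transaction dates and the reference date are hypothetical, and tenure is taken here to mean the number of days since the customer's first recorded transaction, which is one plausible reading of the slide.

```python
from datetime import date

# Hypothetical (customer, transaction date) pairs.
transactions = [
    ("c1", date(2005, 1, 10)),
    ("c1", date(2006, 4, 4)),
    ("c2", date(2006, 3, 1)),
    ("c2", date(2006, 4, 4)),
]
reference_date = date(2006, 6, 6)  # assumed "today" for the calculation

# Find the earliest transaction date per customer.
first_seen = {}
for customer, day in transactions:
    if customer not in first_seen or day < first_seen[customer]:
        first_seen[customer] = day

# New feature: tenure in days since the first transaction.
tenure_days = {c: (reference_date - d).days for c, d in first_seen.items()}
print(tenure_days)
```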
Data Pre-processing
• Data discretisation
– What: to convert continuous
attribute values to discrete
categorical values
– The purposes:
• Requirement for some data mining
solutions
• Better data mining results (not
always)
– How:
1. Deciding how many categories to
have and where split points should
be
2. Mapping values to categories
[Figure: a value range divided by split points t1, t2, t3, t4. Step 1: determine the number and locations of the split points. Step 2: map values within each sub-range to a category label.]
Data Pre-processing
• Data discretisation (cont’d)
– Discretisation methods:
• Unsupervised: without concern to the outcome of a specific
attribute, normally used for clustering and association rule
mining
e.g. equal width, equal depth, clustering
• Supervised: with respect to the outcome of the class attribute,
normally used for classification
– Simple methods: sorting according to the class attribute, and
then discretising the attribute values for each class.
– Sophisticated methods: the discretisation of the attribute values
purifies the outcome of the class, e.g. using entropy to measure
the degree of purity, and deciding the split points recursively,
similar to decision tree induction
– Merging methods: merging small intervals into larger ones until
a stopping criterion is met
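The two simplest unsupervised methods named above, equal width and equal depth, can be sketched as below. The function names and the sample ages are illustrative assumptions; the sketch assumes the values are not all identical, and the equal-depth version may place tied values in different bins.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width and
    return the bin index (0..k-1) of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall on the upper boundary, so clamp it
    # into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Assign bin indices so that each bin holds (about) the same
    number of values, based on the rank of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

ages = [18, 19, 20, 21, 22, 25, 30, 45, 70]
width_labels = equal_width_bins(ages, 3)
depth_labels = equal_depth_bins(ages, 3)
print(width_labels)
print(depth_labels)
```

With skewed data such as these ages, equal width puts most values in the first bin, while equal depth spreads them evenly across the bins.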
Data Pre-processing
• Data binarisation
– What: to convert discrete categorical values to binary
Boolean attribute values
– The purpose: the same as for discretisation
– How:
• Convert m categorical values to values in [0, m-1]
• Convert each to a binary number of n bits, where n = ⌈log2 m⌉
• Alternatively, use m asymmetric binary variables, one for each
of the m values
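Both binarisation schemes can be sketched in a few lines. The function names are hypothetical, and the categories are numbered in sorted order, which is an arbitrary choice made for this illustration.

```python
def encode_binary(values):
    """Map each of m categories to an integer in [0, m-1] and write it
    as a binary number of n = ceil(log2 m) bits."""
    categories = sorted(set(values))
    # (m - 1).bit_length() equals ceil(log2 m) for m >= 2.
    n_bits = max(1, (len(categories) - 1).bit_length())
    return {
        c: [int(bit) for bit in format(i, f"0{n_bits}b")]
        for i, c in enumerate(categories)
    }

def encode_one_hot(values):
    """Use m binary variables, exactly one of which is 1 per category."""
    categories = sorted(set(values))
    return {
        c: [1 if j == i else 0 for j in range(len(categories))]
        for i, c in enumerate(categories)
    }

grades = ["A", "B", "C", "D", "E"]
binary_codes = encode_binary(grades)
one_hot_codes = encode_one_hot(grades)
print(binary_codes)
print(one_hot_codes)
```

The compact n-bit code uses fewer attributes, but the m-variable form avoids implying spurious relationships between categories that happen to share bits.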
Data Pre-processing
• Variable transformation
– What: transform all values of an attribute to other
values
– The purposes:
• Remove the effect of the outlier values
• Make the result data visualisation more interpretable
• Make the values more comparable
– How:
• Transformation using function
e.g. log(x)
• Standardisation/normalisation
e.g. division-by-range
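A short sketch of the transformations mentioned above. "Division-by-range" is interpreted here as min-max scaling to [0, 1], which is an assumption about the slide's intent; z-score standardisation is included as another common option. The income values are made up.

```python
import math
import statistics

def log_transform(values):
    # Compresses large values; only defined for positive numbers.
    return [math.log(v) for v in values]

def divide_by_range(values):
    # Min-max scaling: subtract the minimum, then divide by the range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # Standardisation: zero mean and unit (sample) standard deviation.
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

incomes = [10, 100, 1000, 10000]  # hypothetical, with one extreme value
logged = log_transform(incomes)
scaled = divide_by_range(incomes)
standardised = z_score(incomes)
print(logged)
print(scaled)
print(standardised)
```

After the log transformation the values are evenly spaced, so the extreme value no longer dominates.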
Data Pre-processing
• Handling missing values
– What: to treat attributes with null values
– The purposes:
• Improve data quality
• Better mining results
– How:
• Elimination (may not always be possible)
• Using sensible default, e.g. Spending Amount is set to 0
• By data imputation
– Average, median, or mode of the whole data population
– Average, median or mode of the nearest neighbours
• Postponing the handling and making the mining methods
adaptive to missing values
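Imputation with the average, median or mode of the whole population can be sketched as follows. Missing values are represented by `None`, and the function name and sample data are assumptions for illustration; nearest-neighbour imputation is not shown.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

spending = [20.0, None, 35.0, 20.0, None, 45.0]  # hypothetical amounts
by_mean = impute(spending, "mean")
by_median = impute(spending, "median")
by_mode = impute(spending, "mode")
print(by_mean)
print(by_median)
print(by_mode)
```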
Data Exploration
• Exploring data before mining
– Knowing data is essential for successful data mining
– Purposes:
• Better understanding of the characteristics of data
• Better decision over data pre-processing tasks
• Even being able to discover some hidden patterns
– Categories of data exploration techniques
• Summary statistics: using a small set of descriptors to
describe the characteristics of a large data set
• Data visualisation: using graphical or tabular forms to reveal
hidden data patterns
• Online Analytic Processing (OLAP)
– Data exploration and exploratory data analysis (EDA)
Data Exploration
• Summary statistics
– Frequency and mode for categorical attributes:
• Frequency of value
• Mode: the most frequently occurring value
– Percentiles for ordinal or continuous attributes:
• Given an attribute x and an integer p (0 ≤ p ≤ 100), the
percentile x_p is a value of x such that p% of the observed values of x
are less than x_p.
– Mean and median for continuous attributes:
• Mean and median
• Median is a better indication of “average” when data
distribution is skewed or outliers are present
– Trimmed mean and median (after trimming top and
bottom p%)
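A sketch of percentiles and the trimmed mean. The percentile uses the nearest-rank convention, which is one of several common definitions and differs slightly from the strict "less than" wording above. The salary values are made up and include one outlier.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

def trimmed_mean(values, p):
    """Mean after discarding the top and bottom p% of the sorted values."""
    ordered = sorted(values)
    k = len(ordered) * p // 100  # number of values trimmed at each end
    return statistics.mean(ordered[k:len(ordered) - k])

salaries = [20, 22, 23, 25, 26, 27, 28, 30, 31, 250]  # one outlier
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
trimmed = trimmed_mean(salaries, 10)
p50 = percentile(salaries, 50)
print(mean_salary, median_salary, trimmed, p50)
```

The outlier pulls the mean far above the typical salary, while the median and the trimmed mean stay close to the bulk of the data.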
Data Exploration
• Summary statistics (cont’d)
– Measures of spread:
• Range: $\mathrm{range}(x) = \max(x) - \min(x)$
• Variance ($\sigma^2$): $\sigma^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$
• Standard Deviation ($\sigma$): $\sigma = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2}$
• Absolute average deviation (AAD): $\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$
– Multivariate summary statistics
• Mean vector: $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$
• Matrix of covariance: $\mathrm{covariance}(x, y) = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$
• Correlation: $\mathrm{correlation}(x, y) = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$
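The measures of spread and the covariance and correlation formulas can be written directly in code. This is a plain transcription of the formulas for illustration, using m − 1 in the variance and covariance and m in the AAD; the sample values are made up.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def value_range(xs):
    return max(xs) - min(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    return math.sqrt(variance(xs))

def aad(xs):
    # Absolute average deviation: mean absolute distance from the mean.
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

x = [2, 4, 6, 8]
y = [1, 3, 5, 11]
print(value_range(x), variance(x), std_dev(x), aad(x))
print(covariance(x, y), correlation(x, y))
```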
Data Exploration
• Data visualisation
– Rationale: human eyes are good at spotting patterns,
particularly visual patterns.
– Major ways of visualising data
• Tabular form
• Graphical form
• Points and links
– Visual representation must be related to the data types
of the attributes
– Visualising data as well as all its implicit relationships
– The visualisation must be comprehensible
– The visualisation of data must tell the truth
Data Exploration
• Data visualisation techniques
[Figures: Stem & Leaf Plot; Parallel Dimension Chart; Pie Chart; Bar Chart; Scatter Plot; Star Dimension Chart]
Data Exploration
• Online analytic processing (OLAP)
– Interactive reporting tool
– Treating a data set as a multidimensional hypercube
– Fast operation and fast result delivery
– A typical OLAP query:
“For each product, find its market share in its category today
minus its market share in its category in 1994”
– Result of the OLAP query:
Products       Market Share Today  Market Share in 1994  Difference
Dell 17"       17%                 10%                   7%
HP 15"         83%                 90%                   -7%
Intel MotherB  56%                 93%                   -37%
…              …                   …                     …
Data Exploration
• OLAP: Multidimensional hypercube
Branch Name    Customer Name  Month  Year
Buckingham     Helen Miles    April  2000
Buckingham     Mary Laughton  April  1999
……             ……             ….     ….
Milton Keynes  Alen Young     Feb    2000
Milton Keynes  Susan Young    April  2000
……             ……             ….     ….
Northampton    Frank Sinatra  April  1998
…………           ….             ….     ….
[Figure: data cube with dimensions Year (1998, 1999, 2000), Branch (Buckingham, Milton Keynes, Northampton) and Month (Jan, Feb, March, …, Dec). A highlighted cell labelled March, Milton Keynes, 1999 holds: Total Customer = 5 and the Customer Names.]
Data Exploration
• OLAP: Hierarchies
– Month-to-season hierarchy:
winter: January, February, March
spring: April, May, June
summer: July, August, September
autumn: October, November, December
[Figure: a cube over Year (1998, 1999, 2000), Branch (Northampton, Milton Keynes, Buckingham) and Month (Jan, Feb, March, …, Dec), and the same cube with the Month dimension replaced by Season (winter, spring, summer, autumn).]
Data Exploration
• OLAP: Operations
– Pivoting
• Selecting attributes to define the cube
• Visually rotating the cube to show a face
– Slicing and dicing
• Selecting a part of a cube
• Visually slicing a segment of a cube along a dimension
– Rolling-up
• Moving up along a hierarchy
– Drilling-down
• Moving down along a hierarchy
– Performing aggregate functions while rolling-up or
drilling-down
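The operations above can be sketched on a tiny cube held in a `Counter`. The five customer records are taken from the hypercube slide (with month names spelled out); the month-to-season mapping and the function names are assumptions made for this illustration.

```python
from collections import Counter

# Assumed month-to-season hierarchy.
SEASON_OF = {
    "January": "winter", "February": "winter", "March": "winter",
    "April": "spring", "May": "spring", "June": "spring",
    "July": "summer", "August": "summer", "September": "summer",
    "October": "autumn", "November": "autumn", "December": "autumn",
}

# (branch, customer, month, year) records.
records = [
    ("Buckingham", "Helen Miles", "April", 2000),
    ("Buckingham", "Mary Laughton", "April", 1999),
    ("Milton Keynes", "Alen Young", "February", 2000),
    ("Milton Keynes", "Susan Young", "April", 2000),
    ("Northampton", "Frank Sinatra", "April", 1998),
]

# The cube: number of customers per (branch, month, year) cell.
cube = Counter((branch, month, year) for branch, _, month, year in records)

def roll_up_to_season(cube):
    """Rolling-up: move from month to season, summing the counts."""
    rolled = Counter()
    for (branch, month, year), n in cube.items():
        rolled[(branch, SEASON_OF[month], year)] += n
    return rolled

def slice_year(cube, year):
    """Slicing: fix one value of the Year dimension."""
    return {(b, m): n for (b, m, y), n in cube.items() if y == year}

by_season = roll_up_to_season(cube)
year_2000 = slice_year(cube, 2000)
print(by_season)
print(year_2000)
```

Drilling-down is the reverse of `roll_up_to_season`: it requires going back to the more detailed cube, which is why OLAP tools keep the detailed data available.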
Data Exploration in Weka Explorer
• ARFF file format
[Figure: an annotated ARFF file]
– Schema section: data set name; numeric attribute names and types;
categorical attribute name and values
– Data section: one data record per line; values separated by “,”;
“?” represents unknown.
Data Exploration in Weka Explorer
• Glance of an opened data set
Summary
statistics
Visualisation of
value distribution
Data Exploration in Weka Explorer
• Visualisation in Weka (limited)
Data Exploration in Weka Explorer
• Filters for pre-processing
– Many filters
– Supervised/unsupervised
– Attribute/instance
– Choose followed by parameter setting in command line
Chapter Summary
• The domain types determine the validity of operations
applied.
• Transformation from one domain to another must preserve
the domain characteristics.
• Data sets can be of various forms and from different
sources.
• Data warehouse serves as a data source for data mining.
• Data quality is relevant to the intended application purpose.
• Data pre-processing operations are essential for good
mining.
• Knowing the data is important for good data mining.
• Understanding of data is achieved via exploring,
summarising and visualising data.
• OLAP serves as a data exploration and summarisation tool.
References
Read Chapter 3 of Data Mining Techniques and
Applications
Useful further references
• Tan, P-N., Steinbach, M. and Kumar, V. (2006),
Introduction to Data Mining, Addison-Wesley, Chapters 2
and 3