Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by Tan, Steinbach, Kumar
What is Data?

• A collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature

• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
Types of Attributes

• There are different types of attributes
  – Nominal
      Examples: ID numbers, eye color, zip codes
  – Ordinal
      Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval
      Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
      Examples: monetary quantities, electrical current
Attribute Type | Description | Examples | Operations

Nominal   The values of a nominal attribute are just different names, i.e.,
          nominal attributes provide only enough information to distinguish
          one object from another. (=, ≠)
          Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
          Operations: mode, entropy, correlation, χ² test

Ordinal   The values of an ordinal attribute provide enough information to
          order objects. (<, >)
          Examples: hardness of minerals, {good, better, best}, grades, street numbers
          Operations: median, percentiles, rank correlation

Interval  For interval attributes, the differences between values are
          meaningful, i.e., a unit of measurement exists. (+, −)
          Examples: calendar dates, temperature in Celsius or Fahrenheit
          Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio     For ratio variables, both differences and ratios are meaningful. (*, /)
          Examples: monetary quantities, electrical current
          Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes

• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – In practice, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables
Types of data sets

• Record
  – Data Matrix
  – Document Data
  – Transaction Data

• Graph
  – World Wide Web
  – Molecular Structures

• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
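As a minimal sketch (assuming Python with NumPy, which the notes themselves do not use), the four 2-D points reused in the distance examples later in this chapter form a 4 by 2 data matrix:

import numpy as np

# m = 4 objects (rows), n = 2 numeric attributes (columns):
# the points p1..p4 from the Euclidean distance example below.
data_matrix = np.array([
    [0.0, 2.0],   # p1
    [2.0, 0.0],   # p2
    [3.0, 1.0],   # p3
    [5.0, 1.0],   # p4
])
m, n = data_matrix.shape  # m = 4, n = 2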
Document Data

• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1   3     0      5     0     2      6     0    2     0        2
  Document 2   0     7      0     2     1      0     0    3     0        0
  Document 3   0     1      0     0     1      2     2    0     3        0
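A hedged sketch of how such term vectors could be built in plain Python; the naive whitespace tokenization is an assumption for illustration, not part of the notes:

from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(document):
    # Count occurrences of each vocabulary term in the document.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

print(term_vector("game score game team lost game"))
# [1, 0, 0, 0, 1, 3, 0, 1, 0, 0]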
Transaction Data

• A special type of record data, where
  – each record (transaction) involves a set of items.
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
Graph Data

• Examples: generic graph and HTML links

  [Figure: a generic graph of numbered, linked nodes]

  <a href="papers/papers.html#bbbb">
  Data Mining </a>
  <li>
  <a href="papers/papers.html#aaaa">
  Graph Partitioning </a>
  <li>
  <a href="papers/papers.html#aaaa">
  Parallel Solution of Sparse Linear System of Equations </a>
  <li>
  <a href="papers/papers.html#ffff">
  N-Body Computation and Dense Linear System Solvers
Chemical Data

• Benzene molecule: C6H6

  [Figure: structure of the benzene molecule]
Ordered Data

• Sequences of transactions

  [Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data

• Genomic sequence data
Ordered Data

• Spatio-temporal data

  [Figure: average monthly temperature of land and ocean]
Data Preprocessing
Why Preprocess the Data?

• Measures of data quality:
  – Accuracy: correct or wrong, accurate or not
  – Completeness: not recorded, unavailable, …
  – Consistency: some entries modified but others not, …
  – Believability: how much can the data be trusted?
  – Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing

• Data cleaning
  – Fill in missing values, smooth noisy data

• Data reduction
  – Sampling
  – Data compression

• Data transformation and data discretization
  – Normalization
Data Cleaning

• Data in the real world is dirty: there is lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      e.g., Occupation = " " (missing data)
  – noisy: containing noise, errors, or outliers
      e.g., Salary = "−10" (an error)
  – inconsistent: containing discrepancies in codes or names, e.g.,
      Age = "42", Birthday = "03/07/2010"
      Was rating "1, 2, 3", now rating "A, B, C"
      discrepancy between duplicate records
  – intentional (e.g., disguised missing data)
      Jan. 1 as everyone's birthday?
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably

• Fill in the missing value manually: tedious + infeasible?

• Fill it in automatically with
  – a global constant: e.g., "unknown", a new class?!
  – the attribute mean
  – the attribute mean for all samples belonging to the same class: smarter (see the sketch below)
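A minimal pandas sketch of the automatic fill-in strategies; the column names "cls" and "income" are hypothetical, not from the notes:

import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],
    "income": [125.0, None, 70.0, None, 95.0],
})

# Global constant: flag missing values with a single "unknown" code.
filled_const = df["income"].fillna(-1.0)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean over samples of the same class (smarter).
filled_class = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))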
Noisy Data

• Noise: random error or variance in a measured variable

• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
How to Handle Noisy Data?

• Binning
  – first sort the data and partition it into (equal-frequency) bins
  – then one can smooth by bin means, smooth by bin boundaries, etc.
Binning Methods for Data Smoothing

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
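A short Python sketch that reproduces the example above; rounding bin means to integers is an assumption made to match the slide:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(values, n_bins):
    # Split sorted values into n_bins bins of equal size.
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

bins = equal_frequency_bins(prices, 3)

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]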
Data Reduction: Sampling

• Sampling: obtaining a small sample s to represent the whole data set N

• Key principle: choose a representative subset of the data
  – Simple random sampling may have very poor performance in the presence of skew
  – Adaptive sampling methods, e.g., stratified sampling, can help
Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – Once an object is selected, it is removed from the population

• Sampling with replacement
  – A selected object is not removed from the population

• Stratified sampling
  – Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  – Used in conjunction with skewed data (see the sketch below)
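A minimal sketch of the three schemes using Python's standard library; the population and strata here are hypothetical:

import random

population = list(range(100))

# Simple random sampling without replacement:
# an object, once selected, cannot be drawn again.
without_repl = random.sample(population, 10)

# Simple random sampling with replacement:
# a selected object stays in the population and may repeat.
with_repl = random.choices(population, k=10)

# Stratified sampling: draw the same fraction from each partition,
# so small strata are not lost the way they can be under skew.
strata = {"common": list(range(90)), "rare": list(range(90, 100))}
fraction = 0.2
stratified = [x for group in strata.values()
              for x in random.sample(group, max(1, round(len(group) * fraction)))]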
Sampling: With or without Replacement

  [Figure: raw data and the samples drawn with and without replacement]
Sampling: Cluster or Stratified Sampling

  [Figure: raw data and a cluster/stratified sample]
Sample Size

  [Figure: the same data set at 8000 points, 2000 points, and 500 points]
Data Reduction: Data Compression

• String compression
  – There are extensive theories and well-tuned algorithms
  – Typically lossless, but only limited manipulation is possible without expansion

• Audio/video compression
  – Typically lossy compression, with progressive refinement
  – Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Data Compression

  [Figure: original data reduced to compressed data and recovered losslessly, vs. original data approximated by lossy compression]
Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

• Methods
  – Smoothing: remove noise from data
  – Normalization: scale values to fall within a smaller, specified range
      min-max normalization
      z-score normalization
      normalization by decimal scaling
Normalization

• Min-max normalization: to [new_min_A, new_max_A]

      v' = (v − min_A) / (max_A − min_A) · (new_max_A − new_min_A) + new_min_A

  – Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) · (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

      v' = (v − μ_A) / σ_A

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

      v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
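The three rules, restated as a short runnable sketch that reproduces the worked income example:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # Z-score normalization.
    return (v - mu) / sigma

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1.
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
print(decimal_scaling(986, 3))                    # 0.986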
Similarity and Dissimilarity

• Similarity
  – Numerical measure of how alike two data objects are
  – Higher when objects are more alike
  – Often falls in the range [0, 1]

• Dissimilarity
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies

• Proximity refers to a similarity or dissimilarity
Euclidean Distance

• Euclidean distance:

      dist = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Euclidean Distance

  [Figure: the four points plotted in the plane]

  point  x  y
  p1     0  2
  p2     2  0
  p3     3  1
  p4     5  1

  Distance Matrix:

        p1     p2     p3     p4
  p1    0      2.828  3.162  5.099
  p2    2.828  0      1.414  3.162
  p3    3.162  1.414  0      2
  p4    5.099  3.162  2      0
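A minimal sketch that reproduces entries of the distance matrix above:

import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

print(round(euclidean(points["p1"], points["p2"]), 3))  # 2.828
print(round(euclidean(points["p2"], points["p3"]), 3))  # 1.414
print(round(euclidean(points["p1"], points["p4"]), 3))  # 5.099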
Minkowski Distance

• The Minkowski distance is a generalization of the Euclidean distance:

      dist = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}

  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance
  – A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

• r = 2. Euclidean distance
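The same generalization as a one-function sketch; r = 1 and r = 2 reproduce the Manhattan and Euclidean matrices on the next slide:

def minkowski(p, q, r):
    # (sum over k of |p_k - q_k|^r) ^ (1/r)
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, r=1))            # 4.0   (city block / L1)
print(round(minkowski(p1, p2, r=2), 3))  # 2.828 (Euclidean / L2)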
Minkowski Distance

  point  x  y
  p1     0  2
  p2     2  0
  p3     3  1
  p4     5  1

  Manhattan (L1) Distance Matrix:

        p1  p2  p3  p4
  p1    0   4   4   6
  p2    4   0   2   4
  p3    4   2   0   2
  p4    6   4   2   0

  Euclidean (L2) Distance Matrix:

        p1     p2     p3     p4
  p1    0      2.828  3.162  5.099
  p2    2.828  0      1.414  3.162
  p3    3.162  1.414  0      2
  p4    5.099  3.162  2      0
Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well-known properties:

  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)

  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a metric.
Similarity Between Binary Vectors

• A common situation is that objects, i and j, have only binary attributes
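The transcript loses the body of this slide and the example that followed it. As a hedged reconstruction, the two standard measures for this situation, the Simple Matching Coefficient (SMC) and the Jaccard coefficient, can be sketched as:

def smc(i, j):
    # Simple Matching Coefficient: (1-1 and 0-0 matches) / all attributes.
    matches = sum(a == b for a, b in zip(i, j))
    return matches / len(i)

def jaccard(i, j):
    # Jaccard coefficient: 1-1 matches / attributes that are not both 0.
    m11 = sum(a == 1 and b == 1 for a, b in zip(i, j))
    m00 = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return m11 / (len(i) - m00)

i = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
j = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(i, j))      # 0.7
print(jaccard(i, j))  # 0.0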
Cosine Similarity

• If d1 and d2 are two document vectors, then

      cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:

      d1 = 3 2 0 5 0 0 0 2 0 0
      d2 = 1 0 0 0 0 0 0 1 0 2

      d1 · d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
      ||d1|| = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = 42^0.5 = 6.481
      ||d2|| = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = 6^0.5 = 2.449
      cos(d1, d2) = 5 / (6.481 · 2.449) ≈ 0.315
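The same computation as a short sketch, reproducing the example:

import math

def cosine(d1, d2):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))
    len2 = math.sqrt(sum(b * b for b in d2))
    return dot / (len1 * len2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315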
Correlation

• Correlation measures the linear relationship between objects
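A minimal sketch of Pearson's correlation coefficient, the usual measure of linear relationship; the sample vectors are hypothetical:

import math

def pearson(x, y):
    # Sample covariance divided by the product of sample standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0  (perfect positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0 (perfect negative)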
Visually Evaluating Correlation

  [Figure: scatter plots showing correlation values ranging from −1 to 1]