CSci 8980: Data Mining (Fall 2002)
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
© Vipin Kumar
CSci 8980 Fall 2002
What is Data?

Objects and the attributes of objects
– Attribute: variable, field, characteristic,
feature, or observation
– Object: record, point, case, sample, entity, or
item
– Objects have attributes.
– Attributes describe objects

A data set is a collection of data objects.
Types of Attributes

There are different types of attributes
– Nominal: Values are just labels.

Examples: ID numbers, eye color, zip codes
– Ordinal: The values can be ordered.
Examples: street numbers, rankings (e.g., taste of potato chips
on a scale from 1-10), grades

– Interval: Differences are meaningful
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.

– Ratio: Ratios are meaningful
Examples: temperature in Kelvin, length, time, counts
Measurement of Length

The way you measure an attribute may not match
the attribute's properties.
[Figure: line segments A to E, each labeled on two measurement scales: A (5, 1), B (7, 2), C (8, 3), D (10, 4), E (15, 5).]
Properties

The type of an attribute depends on which of
the following properties it has.
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

Length has all these properties
Attribute Type (with Description, Examples, Operations)

Nominal
– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test

Ordinal
– Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
– Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
– Description: For ratio variables, both differences and ratios are meaningful. (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
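The operations listed for each attribute type can be tried out with Python's standard `statistics` module. This is only a minimal sketch: the sample values below are hypothetical and are not taken from the lecture.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
grades = [1, 3, 2, 5, 4, 3, 3]                            # ordinal (rank codes)
temps_celsius = [20.0, 22.0, 19.0, 23.0]                  # interval
lengths_m = [1.0, 2.0, 4.0, 8.0]                          # ratio

# Nominal: only equality is meaningful, so the mode is an appropriate summary.
most_common_color = statistics.mode(eye_color)

# Ordinal: order is meaningful, so the median can be used.
median_grade = statistics.median(grades)

# Interval: differences are meaningful, so mean and standard deviation apply.
mean_temp = statistics.mean(temps_celsius)
temp_spread = statistics.stdev(temps_celsius)

# Ratio: ratios are meaningful, so the geometric mean applies as well.
geo_mean_length = statistics.geometric_mean(lengths_m)

print(most_common_color, median_grade, mean_temp, temp_spread, geo_mean_length)
```

Each operation also remains valid for the types further down the list, e.g., a median can be computed for interval and ratio attributes too.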
Attribute Level (with Transformation, Comments)

Nominal
– Transformation: Any permutation of values
– Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
– Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
– Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
– Transformation: new_value = a * old_value + b where a and b are constants
– Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
– Transformation: new_value = a * old_value
– Comments: Length can be measured in meters or feet.
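The permissible transformations can be checked numerically. The sketch below assumes the usual Celsius-to-Fahrenheit formula and an approximate meters-to-feet factor; the function names and sample values are illustrative only.

```python
def c_to_f(celsius):
    """Interval-level transformation: new_value = a * old_value + b."""
    return 9.0 / 5.0 * celsius + 32.0


def m_to_ft(meters):
    """Ratio-level transformation: new_value = a * old_value (approximate factor)."""
    return 3.28084 * meters


# Ordinal: any monotonic relabeling keeps the ordering intact.
ranks = {"good": 1, "better": 2, "best": 3}
relabeled = {"good": 0.5, "better": 1, "best": 10}
same_order = sorted(ranks, key=ranks.get) == sorted(relabeled, key=relabeled.get)

# Interval: ratios of differences survive the change of scale,
# but ratios of the raw values do not.
c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = c_to_f(c1), c_to_f(c2), c_to_f(c3)
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (f3 - f2) / (f2 - f1)
value_ratio_c = c2 / c1
value_ratio_f = f2 / f1

# Ratio: ratios of the raw values survive the change of unit.
length_ratio_m = 2.0 / 1.0
length_ratio_ft = m_to_ft(2.0) / m_to_ft(1.0)

print(same_order, diff_ratio_c, diff_ratio_f, value_ratio_c, value_ratio_f)
print(length_ratio_m, length_ratio_ft)
```

The two difference ratios agree and the two length ratios agree, while the ratio of the raw temperatures changes, which is why "twice as hot" is not meaningful in Celsius or Fahrenheit.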
Types of data sets


Many different types
Common Types
– Record
– Graph
– Ordered

Two Important Attributes
– Dimensionality
– Sparsity
Discrete and Continuous Attributes

Discrete
– A discrete attribute has only a finite or countably infinite set of values,
e.g., zip codes, counts, or the set of words in a collection of documents.
Discrete attributes are often represented as integer variables. Note that
binary attributes are a special case of discrete attributes and assume
only two values, e.g., true/false, yes/no, male/female. Binary attributes
are often represented as Boolean variables, or as integer variables that
take on the values 0 or 1.

Continuous
– A continuous attribute is one whose values are real numbers, e.g.,
temperature, height, or weight. (Practically, real values can only be
measured and represented to a finite number of digits.) Continuous
attributes are typically represented as floating-point variables.
Record Data

Much of the original data mining work, and much current work,
focuses on record data, i.e., data that consists
of a collection of records (data objects), each of which consists of a
fixed set of data fields (attributes).
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
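One possible way to hold record data in code is a list of records that all share the same fixed fields. The sketch below uses the first three rows of the table; the class name `TaxRecord` and its field names are assumptions made for illustration.

```python
from typing import NamedTuple


class TaxRecord(NamedTuple):
    """One data object with a fixed set of fields (hypothetical field names)."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income_k: int  # taxable income in thousands
    cheat: bool


# First three rows of the table above.
records = [
    TaxRecord(1, True, "Single", 125, False),
    TaxRecord(2, False, "Married", 100, False),
    TaxRecord(3, False, "Single", 70, False),
]

# Because every record has the same fields, attribute-based queries are simple.
single_tids = [r.tid for r in records if r.marital_status == "Single"]
total_income_k = sum(r.taxable_income_k for r in records)

print(single_tids, total_income_k)
```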
Data Matrix

If the data objects in a collection of data all have the same fixed set of
numeric attributes, then the data objects can be thought of as points
(vectors) in a multi-dimensional space, where each dimension represents
a distinct attribute describing the object. Thus, a set of data objects can
be interpreted as an m by n matrix, where there are m rows, one for
each object, and n columns, one for each attribute.
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
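As a small sketch of the data-matrix view, the two objects from the table can be stored as rows of a nested list and treated as points in a five-dimensional space. The column identifiers are assumptions, and Euclidean distance is just one possible way to compare the points.

```python
import math

# Hypothetical identifiers for the five columns in the table above.
columns = ["proj_x_load", "proj_y_load", "distance", "load", "thickness"]

# m rows (one per object) by n columns (one per attribute).
data = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data)     # number of objects
n = len(data[0])  # number of attributes

# Each row is a point in n-dimensional space, so distances are well defined.
euclidean = math.dist(data[0], data[1])

# A single attribute is one column of the matrix.
thickness = [row[columns.index("thickness")] for row in data]

print(m, n, euclidean, thickness)
```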
Document Data

Each document becomes a 'term' vector, where each term is a
component (attribute) of the vector, and where the value of each
component of the vector is the number of times the corresponding
term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
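A term vector like the rows above can be built by counting words. The following is a minimal sketch that assumes whitespace tokenization and lowercasing; the sample sentence is made up, and only the vocabulary comes from the table.

```python
from collections import Counter

# Vocabulary taken from the column headers of the table above.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocab=vocabulary):
    """Return the number of occurrences of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, so absent terms map to 0.
    return [counts[term] for term in vocab]


# Hypothetical document text.
doc = "team play play score game game team win"
vector = term_vector(doc)
print(vector)  # [2, 0, 2, 0, 1, 2, 1, 0, 0, 0]
```

Real text would typically need punctuation stripping and stemming before counting, which this sketch leaves out.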
Transaction Data

Transaction data is a special type of record data, where each record
(transaction) involves a set of items. For example, consider a grocery
store. The set of products purchased by a customer during one shopping
trip constitutes a transaction, while the individual products that were
purchased are the items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
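Since each transaction is a set of items, one natural representation is a mapping from TID to a Python set. The sketch below uses the five transactions from the table; the variable names are illustrative.

```python
from collections import Counter

# TID -> set of items, copied from the table above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions in which each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Transactions that contain every item of a given item set.
wanted = {"Diaper", "Milk"}
matching_tids = sorted(tid for tid, items in transactions.items() if wanted <= items)

print(item_counts["Milk"], matching_tids)  # 4 [3, 4, 5]
```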
Graph Data

Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
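HTML links can be turned into graph edges by collecting the `href` targets of a page. The sketch below uses `html.parser.HTMLParser` from the Python standard library on the first two links of the snippet; the page name `index.html` and the class name `LinkCollector` are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical name for the page that contains the links.
page = "index.html"
html = """<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
"""

collector = LinkCollector()
collector.feed(html)

# Each link becomes a directed edge from the page to its target.
edges = [(page, target) for target in collector.links]
print(edges)
```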
Chemical Data

Benzene Molecule: C6H6.
Ordered Data

Sequences of transactions
Ordered Data

Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data: Spatio-Temporal Data
Research Goals:

Find global climate patterns of interest to Earth Scientists.

A key interest is finding connections between the ocean and the land.

Global snapshots of values for a number of variables on land surfaces or water.

Monthly over a range of 10 to 50 years.

[Figure: Ocean and Land Temperature (Jan 1982). Variables: NPP, Pressure, Precipitation, SST. Axes: Latitude, Longitude, Time. Annotations: grid cell, zone.]
Data Quality

How can we detect problems with the data?

What can we do about these problems?

We need to know what kinds of problems are possible, i.e.,
what sorts of situations correspond to poor data quality. The
following are some well-known problems:
– noise and outliers
– missing values
– duplicate data
– inconsistent values
Missing Values
Eliminate Data Objects: A simple and effective strategy is to eliminate those records with missing
values. A related strategy is to eliminate attributes which have missing values.
Estimate Missing Values: Sometimes the data set is such that missing data can be reliably
estimated. For example, consider a time series that changes in a reasonably smooth fashion, but
has a few, widely scattered missing values. In such cases, the missing values can be estimated
(interpolated) by using the remaining values. As another example, consider a data set that has
many similar data points. In this situation, a nearest neighbor approach can be used to estimate
the missing value. More specifically, the attribute values of the points closest to the point with
the missing value are used to estimate the missing value. If the attribute is continuous, then the
average attribute value of the nearest neighbors is used, while if the attribute is categorical, then
the most commonly occurring attribute value can be taken.
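The two estimation approaches described above can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are marked with `None`, interpolation is linear, neighbors are found by Euclidean distance over the known numeric attributes, and the function names are hypothetical.

```python
import math
from collections import Counter


def interpolate_missing(series):
    """Linearly interpolate None entries that lie between two known values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # leading or trailing None entries are left untouched


def knn_impute(points, target, missing_index, k=2, categorical=False):
    """Estimate target[missing_index] from the k nearest complete points."""
    others = [i for i in range(len(target)) if i != missing_index]

    def distance(point):
        return math.sqrt(sum((point[i] - target[i]) ** 2 for i in others))

    neighbors = sorted(points, key=distance)[:k]
    values = [p[missing_index] for p in neighbors]
    if categorical:
        # Categorical attribute: take the most commonly occurring value.
        return Counter(values).most_common(1)[0][0]
    # Continuous attribute: take the average of the neighbors' values.
    return sum(values) / len(values)


smooth = interpolate_missing([1.0, None, 3.0, None, None, 6.0])

numeric_points = [(1.0, 1.0, 10.0), (1.1, 0.9, 12.0), (5.0, 5.0, 50.0)]
estimate = knn_impute(numeric_points, (1.0, 1.0, None), missing_index=2, k=2)

labeled_points = [(1.0, 1.0, "a"), (1.1, 0.9, "a"), (0.9, 1.0, "b"), (5.0, 5.0, "b")]
label = knn_impute(labeled_points, (1.0, 1.0, None), missing_index=2, k=3,
                   categorical=True)

print(smooth, estimate, label)
```

The choice of `k` and of the distance measure is left open by the text; the values used here are arbitrary.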
Ignore the Missing Value During Analysis: Many data mining approaches can be modified to
operate by ignoring missing values. For example, suppose that objects are being clustered and
the similarity between pairs of data objects needs to be calculated. If one or both objects of a pair
have missing values for some attributes, then the similarity can be calculated by using only the
other attributes. It is true that the similarity will only be approximate, but unless the number of
attributes is small and/or the number of missing values is high, this degree of inaccuracy may not
matter much. Likewise, many classification schemes can handle missing values relatively
straightforwardly.
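Comparing two objects using only the attributes present in both might look like the sketch below. It uses Euclidean distance as a stand-in for the similarity measure, which the text does not specify, and it assumes missing values are marked with `None`.

```python
import math


def distance_ignoring_missing(x, y):
    """Euclidean distance over the attributes that are present in both objects."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        return None  # no attribute is known for both objects
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))


x = (1.0, None, 3.0)
y = (4.0, 5.0, 7.0)
d = distance_ignoring_missing(x, y)
print(d)  # 5.0
```

The result is not rescaled for the number of attributes actually used, so distances computed over different attribute subsets are only roughly comparable, in line with the caveat that the similarity is approximate.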