Download Attribute Types

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data mining wikipedia , lookup

World Values Survey wikipedia , lookup

Time series wikipedia , lookup

Transcript
10/26/2015
Getting to Know Your Data
• Data Objects and Attribute Types
CENG 464
Introduction to Data Mining
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary
What’s an attribute?
•Each instance/Object/record/point/
case/sample/entity is described by a fixed
predefined set of features, its “attributes”
• An attribute/variable/ field/
characteristic/ feature is a property or
characteristic of an object
•
Examples: eye color of a person,
temperature, etc.
Objects
•Possible attribute types (“levels of
measurement”):
•
•
Attribute Types
Attributes
Data: Collection of data objects and their
attributes
Qualitative: Nominal, ordinal,
Quantitative (numeric): interval and ratio
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
1
10/26/2015
Attribute Types
Nominal quantities
•
Values are distinct symbols
•
•
•
•
Ordinal quantities
•
•
•
Values themselves serve only as labels or names
Nominal comes from the Latin word for name
Categorical, binary
•
•
•
•
Values: “engineer” <“chief” <“manager”
Attribute satisfaction:
•
•
•
values: “small” < “medium” < “large”
Attribute rank:
•
•
Values: “hot” > “mild” > “cool”
Attribute size:
•
Values: “sunny”,”overcast”, and “rainy”
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed
Operators:
= ≠ mode (central tendency)
Impose order on values, ranking
But: no distance between values defined
Example:
• Attribute “temperature” in weather data
•
Example: attribute “outlook” from weather data
•
•
•
•
Attribute Types
Values: 0: dissatisfied < 1: neutral < 2: very satisfied
Note: addition and subtraction don’t make sense
Example rule:
temperature < hot c play = yes
Operations: < > >= <= mode and median
Distinction between nominal and ordinal not always clear (e.g.
attribute “outlook”)
2
10/26/2015
Interval quantities (Numeric)
Ratio quantities (Numeric)
•
•
•
•
•
•
Interval quantities are not only ordered but measured in fixed and equal units
Have order
Difference of two values makes sense
Can compare and quantify
Sum or product doesn’t make sense
•
•
•
•
•
•
Zero point is not defined!
Example : attribute “temperature” expressed in degrees Fahrenheit
Example: 0 degree Celsius does not mean “no temperature”
Cant say one temperature is multiple of another
Example : attribute “year”
Operators: difference, mean, mode
•
•
Ratio quantities are ones for which the measurement
scheme defines a zero point
Can say value is multiple of another
Ratio quantities are treated as real numbers
•
Example: attribute “distance”
•
•
•
All mathematical operations are allowed
Distance between an object and itself is zero
Example : count of students, years of experience,
weight, height
Attribute types: Summary
Why specify attribute types?
• Nominal, e.g. eye color=brown, blue, …
• Q: Why Machine Learning algorithms need to know about attribute
type?
• A: To be able to make right comparisons and learn correct concepts,
e.g.
• only equality tests
• important special case: boolean (True/False)
• Ordinal, e.g. grade=k,1,2,..,12
• Outlook > “sunny” does not make sense, while
• Temperature > “cool” or
• Humidity > 70 does
• Additional uses of attribute type: check for valid values, deal with
missing, etc.
3
10/26/2015
Nominal vs. ordinal
Discrete and Continuous Attributes
•
• Discrete Attribute
Attribute “age” nominal
•
•
•
•
If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft
If age = old and astigmatic = no
and tear production rate = normal
then recommendation = soft
Has only a finite or countable infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
•
•
•
•
•
Attribute “age” ordinal
(e.g. “young” < “adult” < “old”)
If age  adult and astigmatic = no
and tear production rate = normal
then recommendation = soft
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Que
Que
•
What type of variable is ud_req (number of user data requests as part of a
criminal investigation)?
(a) numerical, continuous
(b) numerical, discrete: The number of data requests is a counted
variable, that can only
take on positive whole numbers. This falls into the definition of
discrete
numerical variables.
• What type of variable is ud_req (number of user data requests as part of a
criminal investigation)?
(c) Categorical (d) categorical, ordinal
•
What type of variable is hdi (human Development Index, combining
indicators of life expectancy, educational attainment, and income, with levels
(a) numerical, continuous
(b) numerical, discrete
very high, high, medium, and low human development)?
(c) Categorical
(d) categorical, ordinal
(a) numerical, continuous
• What type of variable is hdi (human Development Index, combining
indicators of life expectancy, educational attainment, and income, with levels
very high, high, medium, and low human development)?
(a) numerical, continuous
(b) numerical, discrete
(c) Categorical
(d) categorical, ordinal
(b) numerical, discrete (c) categorical
categorical, ordinal:There is an inherent ordering to the
levels of this categorical
(d)
variable (from very high to low), and hence this is an ordinal
categorical variable
4
10/26/2015
Data Matrix
Record Data
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multidimensional space, where each dimension represents an
attribute
• Data consist of a collection of records, each of which consists of a
fixed set of attributes
• Such data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
10.23
5.27
15.22
2.7
1.2
9
No
Married
75K
No
12.65
6.25
16.22
2.2
1.1
10
No
Single
90K
Yes
60K
Q: what is a sparse data set?
Projection
of x Load
Projection
of y load
Distance
Load
Thickness
10
17
18
Document Data
Transaction Data
• Each document becomes a `term' vector,
• each term is a component (attribute) of the vector,
• Term can be n-grams, phrases, etc.
• the value of each component is the number of times the
corresponding term occurs in the document.
• A special type of record data, where
• each record (transaction) has a set of items.
• For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitute a transaction, while the
individual products that were purchased are the items.
• Set based
Q: what is a sparse data set?
season
timeout
wi
n
lost
score
game
pla
y
ball
team
coach
19
Document 1
3
0
5
0
2
6
0
2
0
2
Document 2
0
7
0
2
1
0
0
3
0
0
Document 3
0
1
0
0
1
2
2
0
3
0
TID
Items
1
Bread, Coke, Milk
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
20
5
10/26/2015
Graph Data
Ordered Data-temporal data
• Examples: Directed graph and URL Links
2
1
5
2
• Sequences of transactions
Items/Events
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
5
An element of the
sequence
21
22
Ordered Data
Data Quality
• Genomic sequence data-no time stamps
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
23
• Examples of data quality problems:
• Noise and outliers
• missing values
• duplicated data
24
6
10/26/2015
Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Are they noise points, or meaningful outliers?
• Handling missing values
•
•
•
•
•
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)
Missing as meaningful…
25
26
Data Preprocessing
Aggregation
• Aggregation and Noise Removal
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation and transformation
• Discretization
• Combining two or more attributes (or objects) into a single attribute
(or object)
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc
• De-noise: more “stable” data
• Q: How much % of the data mining process is data preprocessing?
27
• Aggregated data tends to have less variability
28
7
10/26/2015
Sampling
Aggregation
Variation of Precipitation in Australia
• Sampling is the main technique employed for data
selection.
• It is often used for both the preliminary investigation of the
data and the final data analysis.
• Reasons:
• too expensive or time consuming to obtain or to process the
data.
29
Standard Deviation of
Average Monthly
Precipitation
Standard Deviation of
Average Yearly Precipitation
30
Curse of Dimensionality
Dimensionality Reduction
• When dimensionality increases, data
becomes increasingly sparse in the
space that it occupies
• Purpose:
• Definitions of density and distance
between points, which is critical for
clustering and outlier detection, become
less meaningful
• Thus, harder and harder to classify the
data!
• Techniques (supervised and unsupervised methods)
•
•
•
•
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
• Principle Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
• Randomly generate 500 points
• Compute difference between max and min
distance between any pair of points
31
32
8
10/26/2015
Dimensionality Reduction: PCA
Question
• Goal is to find a projection that captures the largest amount of
variation in data
• What is the difference between sampling and dimensionality
reduction?
• Thining vs. shortening of data
x2
e
x1
33
34
Discretization
Simple Discretization Methods: Binning
• Three types of attributes:
• Equal-width (distance) partitioning:
• Nominal — values from an unordered set
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B –A)/N.
• Example: attribute “outlook” from weather data
• Values: “sunny”,”overcast”, and “rainy”
• Ordinal — values from an ordered set
• The most straightforward
• But outliers may dominate presentation: Skewed data is not handled well.
• Example: attribute “temperature” in weather data
• Values: “hot” > “mild” > “cool”
• Equal-depth (frequency) partitioning:
• Continuous — real numbers
• It divides the range into N intervals, each containing approximately
same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
• Discretization:
•
•
•
•
35
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Supervised (entropy) vs. Unsupervised (binning)
36
9
10/26/2015
Getting to Know Your Data
Transforming Ordinal to Boolean
• Simple transformation allows to code ordinal attribute with n values using n-1 boolean
attributes
• Data Objects and Attribute Types
• Example: attribute “temperature”
• Basic Statistical Descriptions of Data
Temperature
Temperature > cold
Temperature > medium
Cold
False
False
Medium
True
False
Hot
True
True
Original data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
Transformed data
• Why? Not introducing distance concept between different colors: “Red” vs. “Blue” vs.
“Green”.
• Summary
37
Graphic Displays of Basic Statistical Descriptions
Mining Data Descriptive Characteristics
•
Motivation
•
•
Data dispersion characteristics
•
•
To better understand the data: central tendency, variation and
spread
median, max, min, quantiles, outliers, variance, etc.
•
Boxplot: graphic display of five-number summary
•
Histogram: x-axis are values, y-axis repres. frequencies
•
Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of
data are  xi
Numerical dimensions correspond to sorted intervals
•
Data dispersion: analyzed with multiple granularities of precision
•
Boxplot or quantile analysis on sorted intervals
•
Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against
the corresponding quantiles of another
•
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
10
10/26/2015
visualizing
numerical data
scatterplots
• scatterplots for paired data
• other visualizations for describing distributions of numerical variables
Scatter plot
Evaluating Relationships
• Provides a first look at bivariate data to see clusters of points,
outliers, etc
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
11
10/26/2015
Positively and Negatively Correlated Data
UnCorrelated Data
• The left half fragment is positively
correlated
• The right half is negative correlated
Histogram Analysis
histogram
• provides a view of the data
density
• especially useful for
describing the shape of the
distribution
• Histogram: Graph display of
tabulated frequencies, shown as
bars
• It shows what proportion of cases
fall into each of several categories
• The categories are usually
specified as non-overlapping
intervals of some variable. The
categories (bars) must be adjacent
40
35
30
25
20
15
10
5
0
10000
30000
50000
70000
90000
12
10/26/2015
Boxplot Analysis
Histograms Often Tell More than Boxplots
• Five-number summary of a distribution

• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR

• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum

• Outliers: points beyond a specified outlier
threshold, plotted individually
skewness
The two histograms
shown in the left may
have the same boxplot
representation
The same values for:
min, Q1, median, Q3,
max
But they have rather
different data
distributions
modality
• distributions are skewed to the side of the long tail
13
10/26/2015
histogram & bin width
dotplot
• The chosen bin width can alter the story the histogram is telling.
• useful when individual values are of interest
• can get busy as the sample size increases
Measures of center
Measuring the Central Tendency
•
•
•
1 n
x   xi
n i 1
Mean (algebraic measure) (sample vs. population):
•
Weighted arithmetic mean:
•
Trimmed mean: chopping extreme values
•
Sensitive to outliers
x

N
n
x
 wi xi
• Mode (most frequent observation)
i 1
n
w
i 1
i
Example: 9 students exam scores: 75, 69, 88, 93,95, 54, 87,88,27
Median: A holistic measure- expensive
•
Middle value if odd number of values, or average of the middle two values
otherwise
•
Estimated by interpolation (for grouped data):
Mode
median  L1  (
•
Value that occurs most frequently in the data
•
Unimodal, bimodal, trimodal
•
Empirical formula:
• Mean- arithmetic average
• Median: midpont of the distribution (50 percentile)
n / 2  ( f )l
f median
Mean=sum/9=75.1
Mode=88
Median= 27 54 69 75 87 88 88 93 95  ??
)c
Example: 9 students exam scores: mode?
27 54 69 75 87 90 90 93 95 100 mode=???
mean  mode  3  (mean  median )
14
10/26/2015
Skewness and measures of center
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively
and negatively skewed data
mode<median and mode>median
Unimodal frequency
curve with perfect
symmetric distribution
Question:
Which of the following is true? (Hint: Sketching the distributions
might be useful.)
(a) In a symmetric distribution, more than 50% of the data are below and less than 50% are
above the mean.
(b) In a left skewed distribution, roughly 50% of the data are below and 50% are above the
mean.
negatively skewed data
(c) In a right skewed distribution, less than 50% of the data are below the mean.
(d) In a left skewed distribution, less than 50% of the data are below the mean.
positively skewed data
October 26, 2015
Data Mining: Concepts and Techniques
Measures of spread
• range (max-min)
• Quartiles
• Variance
• Standard deviation
• Inter-quartile range
57
Quartiles
• Quantile: points taken at regular intervals of
data distribution dividing it into equally
consecutive sets
• kth q-quantile: value of x such that at most k/q data values are
less than x and at most (q-k)/q of values >x when 0<k<q
e.g. 2-quartile divides data set in to 2??
4-quartile: 3 data points splitting data distribution into 4 equal parts
each part represents ¼ of data dist.quartiles
100-quartiles: percentiles-divide data into 100 equal sized sets
Q1: 25th percentile- 1st quartile
Q2: median- 2nd quartile
Q3: 75th percentile- 3rd quartile
15
10/26/2015
variance
• roughly the average squared
deviation from the mean
interquartile range
• Example: Given that the average life
expectancy is 70.5, and there are 201
countries in the dataset:
• variance=??
• Std=?
Box-plots
• range of the middle 50% of the data, distance
between the first quartile (25th percentile) and
third quartile (75th percentile)
The Quartiles divide the data into divisions of 25%,
• IQR=Q1-Q3
• Quartile 3 (Q3) can be called the 75th percentile
• IQR=Q1-Q3
• Quartile 2 (Q2) can be called the 50th percentile
Measuring the Dispersion of Data
•
• range of the middle 50% of the data, distance
between the first quartile (25th percentile) and
third quartile (75th percentile)
• Quartile 1 (Q1) can be called the 25th percentile
Quartiles, outliers and boxplots
Five number summary:
•
•
•
•
•
Median
Q1
Q3
Minimum
Maximum
•
•
Quartiles: Q1 (25th percentile), Q3 (75th percentile): Ex: 4-quantiles means 3 points
splits data into 4 equal parts, each representing ¼ of the distribution
•
Inter-quartile range: IQR = Q3 – Q1
•
Five number summary: min, Q1, Median, Q3, max
•
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot
outlier individually
•
Outlier: usually, a value higher/lower than 1.5 x IQR from Q1 and Q3
Variance and standard deviation (sample: s, population: σ)
•
s2 
Variance: (algebraic, scalable computation)
1 n
1 n 2 1 n 2
xi  ( xi ) ]
 ( xi  x )2  n  1[
n  1 i 1
n i 1
i 1
•
1 n
1
 2   ( xi   ) 2 
N i 1
N
Standard deviation s (or σ) is the square root of variance s2 (or σ2)
n
x
i 1
i
2
 2
16
10/26/2015
robust statistics
transformations
• we define robust statistics as
measures on which extreme
observations have little effect
• a transformation is a rescaling of the
data using a function
WHY?
• to see the data structure differently
• to reduce skew assist in modeling
• Log transformation:
• to straighten a nonlinear relationship in
a scatterplot
Exploring categorical variables
• Frequency table
• Bar plot
bar plots vs histograms
• barplots for categorical variables,
histograms for numerical variables
• x-axis on a histogram is a number
line, and the
• ordering of the bars are not
interchangeable
17
10/26/2015
contingency table
Bar plot
• useful for visualizing conditional
frequency distributions
• compare relative frequencies to
explore the relationship between
the variables
• Mosaic plot?
18