DATA MINING:
EXPLORING DATA
Instructor: Dr. Chun Yu
School of Statistics
Jiangxi University of Finance and Economics
Fall 2016
What is data exploration?
• A preliminary exploration of the data to better understand
its characteristics.
• In our discussion of data exploration, we focus on
• Summary statistics
• Visualization
• Online Analytical Processing (OLAP)
Summary Statistics
• Summary statistics are numbers that summarize
properties of the data
• Summarized properties include frequency, location and spread
• Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through
the data
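The single-pass claim can be illustrated with a running update of the mean and variance. This is a minimal sketch using Welford's method, which is not named in the slides; the function name `running_summary` is hypothetical.

```python
import math

def running_summary(values):
    """Return (count, mean, sample variance) from one pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Sample variance divides by n - 1; it is undefined for fewer than 2 values.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance

n, mean, variance = running_summary([1, 2, 3, 4, 5])
print(n, mean, variance, math.sqrt(variance))
```

Each value is read once and then discarded, so the data never has to be stored or revisited. The median is a counterexample: it needs the sorted data.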
Frequency
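• The frequency of an attribute value is the fraction of objects that
have that value (count of the value divided by the total number of objects)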
Mode
• The mode of an attribute is the most frequent attribute
value
Class        Size   Frequency
Freshmen     200    200/600 = 0.33
Sophomore    160    160/600 = 0.27
Junior       130    130/600 = 0.22
Senior       110    110/600 = 0.18
Total        600    1.00
• The mode of the class attribute is freshmen, with a frequency of 0.33
• The notions of frequency and mode are typically used
with categorical data
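The frequency and mode computation can be sketched in a few lines. The class sizes are the ones from the table; the variable names are illustrative only.

```python
from collections import Counter

class_sizes = {"Freshmen": 200, "Sophomore": 160, "Junior": 130, "Senior": 110}

# Expand the table into one categorical value per student.
values = [name for name, size in class_sizes.items() for _ in range(size)]

counts = Counter(values)
total = sum(counts.values())
frequencies = {name: count / total for name, count in counts.items()}

# The mode is the most frequent attribute value.
mode, mode_count = counts.most_common(1)[0]

for name, freq in frequencies.items():
    print(f"{name:10s} {freq:.2f}")
print("mode:", mode)
```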
Measures of Location: Mean and Median
• The mean and the median are the most common
measures of the location of a set of points.
Median
• A sample median is the middle sorted observation. That
is, we want a value such that half of the data is below it
and half above it.
• How to calculate the median?
• Step 1: Sort the data from smallest to largest.
• Step 2:
• If n is odd, pick the middle observation.
• If n is even, average the two middle observations.
Mean and Median
• Example 1: Computation of the median with an odd
number of data points.
• The data: 7, 11, 7, 14, 13
• The data put in order: 7, 7, 11, 13, 14
• Median: 11
• Mean: (7 + 7 + 11 + 13 + 14)/5 = 10.4
Mean and Median
• Example 2: Computation of the median with an even
number of data points.
• The data: 7, 11, 7, 14, 13, 15
• The data put in order: 7, 7, 11, 13, 14, 15
• Median: 12, the average of 11 and 13
• Mean: (7 + 7 + 11 + 13 + 14 + 15)/6 = 11.17
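The two-step median rule and both examples can be checked with a short sketch. The function names `median` and `mean` are chosen here for illustration.

```python
def median(values):
    """Middle sorted observation, or the average of the two middle ones."""
    ordered = sorted(values)          # Step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Step 2, n odd: pick the middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average the two middle

def mean(values):
    return sum(values) / len(values)

example_1 = [7, 11, 7, 14, 13]
example_2 = [7, 11, 7, 14, 13, 15]
print(median(example_1), mean(example_1))
print(median(example_2), round(mean(example_2), 2))
```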
Effect of Outlier on Mean and Median
• Begin with data 7, 7, 11, 13, 14 as in Example 1.
• What happens to the mean and median when the
largest value is changed from 14 to 140?
• Change affects the mean but not the median.
• Median is still 11 but mean is 35.6.
• The mean “chases after” extreme observations.
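The outlier effect is easy to reproduce with Python's standard `statistics` module, using the data from Example 1 and the modified data with 140.

```python
from statistics import mean, median

original = [7, 7, 11, 13, 14]
with_outlier = [7, 7, 11, 13, 140]   # largest value changed from 14 to 140

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(f"{label:13s} mean = {mean(data):.1f}  median = {median(data)}")
```

Only the mean moves, because the median depends on the rank order of the values and not on how far the largest value sits from the rest.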
Mean and Median
• When the data are symmetric, the median and mean will be about the
same.
• When the data are skewed right, the mean is greater than the
median. (Ex: Income)
• When the data are skewed left, the mean is less than the median.
(Ex: Exam scores.)
Percentiles
• For continuous data, the notion of a percentile is more
useful.
Given an ordinal or continuous attribute x and a number p
between 0 and 100, the pth percentile is a value x_p of x
such that p% of the observed values of x are less than x_p.
• For instance, the 50th percentile is the value x_50 such
that 50% of all values of x are less than x_50. The median
is the 50th percentile.
Calculating the pth Percentile
Data: X1, X2, X3, …, Xn
• 1. Sort the data from smallest to largest.
• 2. Compute the index:
i = (p/100)*n
• 3. If i is:
• (a) an integer, average the ith and (i+1)th observations in the
ordered data. This average is the pth percentile.
• (b) not an integer, round up to the next integer. The observation
at that position in the ordered data is the pth percentile.
Calculating the pth Percentile
• Example
• Sorted heights (cm): 165, 165, 167, 168, 170, 172, 173, 175,
180, 190. What is the 50th percentile?
• Compute the index: i = (50/100)*10 = 5
• The 50th percentile is the average of 5th and 6th
observations
• The 50th percentile is: (170+172)/2 = 171
Quartiles
• Divide the data into four groups
• Q1 = first quartile = 25th percentile
• Q2 = second quartile = 50th percentile = median
• Q3 = third quartile = 75th percentile
• In the previous example,
• What is the first quartile?
• What is the median (second quartile)?
• What is the third quartile?
Quartiles
• 1. First quartile, Q1
i = (25/100)*10 = 2.5
i is not an integer, so round up to 3: Q1 is the 3rd observation, Q1 = 167
• 2. Median, Q2 = 171
• 3. Third quartile, Q3
i = (75/100)*10 = 7.5
i is not an integer, so round up to 8: Q3 is the 8th observation, Q3 = 175
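The percentile rule and the quartiles of the height data can be checked with a small sketch. It follows the index rule from the slides, which is one of several conventions in use, so library functions may give slightly different values. The function name `percentile` is hypothetical, and p is assumed to be a whole number strictly between 0 and 100.

```python
def percentile(values, p):
    """pth percentile by the index rule i = (p/100)*n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    # divmod keeps the arithmetic in integers: i = q + r/100.
    q, r = divmod(p * len(ordered), 100)
    if r == 0:
        # i is an integer: average the ith and (i+1)th observations (1-based).
        return (ordered[q - 1] + ordered[q]) / 2
    # i is not an integer: round up to position q + 1 (1-based), index q (0-based).
    return ordered[q]

heights = [165, 165, 167, 168, 170, 172, 173, 175, 180, 190]
q1, q2, q3 = (percentile(heights, p) for p in (25, 50, 75))
print(q1, q2, q3)
```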
Measures of Spread: Range and Variance
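• Range: the difference between the largest and smallest values,
range(x) = max(x) - min(x)
• Sample variance: s^2 = [(x1 - mean)^2 + ... + (xn - mean)^2] / (n - 1)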
Standard Deviation
• Data: 1, 2, 3, 4, 5
xi    xi - mean    (xi - mean)^2
1     -2           4
2     -1           1
3      0           0
4      1           1
5      2           4
Sum of squared deviations: 4+1+0+1+4 = 10
Variance: 10/4 = 2.5
Standard deviation: Sqrt(2.5) = 1.58
• What is the standard deviation of 6,7,8,9,10?
• What is the standard deviation of -1, -2, -3, -4, -5?
• What is the standard deviation of 5, 10, 15, 20, 25?
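The three questions can be explored with a sketch of the sample standard deviation, computed with the n - 1 divisor used in the worked example. The function name `sample_std` is illustrative.

```python
import math

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)
    return math.sqrt(variance)

datasets = {
    "1..5": [1, 2, 3, 4, 5],
    "shifted by 5": [6, 7, 8, 9, 10],
    "negated": [-1, -2, -3, -4, -5],
    "scaled by 5": [5, 10, 15, 20, 25],
}
for label, data in datasets.items():
    print(f"{label:13s} s = {sample_std(data):.2f}")
```

Shifting or negating the data leaves the spread unchanged, while multiplying every value by a constant multiplies the standard deviation by the absolute value of that constant.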
Visualization
Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and
the relationships among data items or attributes can be
analyzed or reported.
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
• Humans have a well-developed ability to analyze large amounts of
information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
Visualization Techniques: Histograms
• Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of
objects in each bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
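The binning step behind a histogram can be sketched without a plotting library. The measurements below are made up for illustration (they are not the Iris petal widths from the figure), and equal-width bins are an assumption.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; bins would have zero width")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value would fall just past the last bin; clamp it in.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, for illustration only.
widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

for n_bins in (3, 6):
    print(f"{n_bins} bins:")
    for k, count in enumerate(histogram_counts(widths, n_bins)):
        print(f"  bin {k}: " + "#" * count)
```

Running it with different values of `n_bins` shows how the apparent shape of the same data changes with the number of bins.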
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
• What does this tell us?
Stem and Leaf Plot
• Each number is broken into a stem and a leaf such that
the last digit is the leaf and the leading digits are the stem
• Place the stems in increasing order to the left of a vertical
line
• To the right of the vertical line, place the leaves in
ascending order
• Exam scores for n = 30 students:
41 46 47 58 59 67 68 70 70 70 74 75 77 77 78
80 81 82 82 83 84 84 85 85 86 87 92 94 96 97
Stem and Leaf Plot
4| 1 6 7
5| 8 9
6| 7 8
7| 0 0 0 4 5 7 7 8
8| 0 1 2 2 3 4 4 5 5 6 7
9| 2 4 6 7
• Advantages:
• Shows the actual data values
• Shows the rank order of the data
• Shows the shape of the data set
• Easy and quick to do by hand for small data sets
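For larger data sets the same display can be built programmatically. This is a minimal sketch that assumes non-negative integers and takes the last digit as the leaf; the function name `stem_and_leaf` is hypothetical. The data are the 30 exam scores listed above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (last digit is the leaf)."""
    stems = defaultdict(list)
    for x in sorted(values):              # sorting puts leaves in ascending order
        stems[x // 10].append(x % 10)     # stem = leading digits, leaf = last digit
    return [
        f"{stem}| " + " ".join(str(leaf) for leaf in leaves)
        for stem, leaves in sorted(stems.items())
    ]

scores = [41, 46, 47, 58, 59, 67, 68, 70, 70, 70, 74, 75, 77, 77, 78,
          80, 81, 82, 82, 83, 84, 84, 85, 85, 86, 87, 92, 94, 96, 97]

for line in stem_and_leaf(scores):
    print(line)
```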
Visualization Techniques: Box Plots
• Box Plots
• Invented by J. Tukey
• Another way of displaying the distribution of data
• The following figure shows the basic parts of a box plot
[Figure: box plot labeled, from top to bottom: outlier, 90th percentile,
75th percentile, 50th percentile, 25th percentile, 10th percentile]
Example of Box Plots
• Box plots can be used to compare attributes
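The numbers a box plot draws can be computed directly. The sketch below assumes the whiskers sit at the 10th and 90th percentiles and that points beyond them are drawn as outliers; other box plot conventions exist. It uses the exam scores from the stem-and-leaf example and the index rule for percentiles, with a hypothetical helper `percentile`.

```python
scores = [41, 46, 47, 58, 59, 67, 68, 70, 70, 70, 74, 75, 77, 77, 78,
          80, 81, 82, 82, 83, 84, 84, 85, 85, 86, 87, 92, 94, 96, 97]

def percentile(ordered, p):
    """pth percentile (whole-number p, 0 < p < 100) of already-sorted data."""
    q, r = divmod(p * len(ordered), 100)
    if r == 0:
        return (ordered[q - 1] + ordered[q]) / 2   # average ith and (i+1)th
    return ordered[q]                              # round up to position q + 1

ordered = sorted(scores)
summary = {p: percentile(ordered, p) for p in (10, 25, 50, 75, 90)}

# Points outside the whiskers would be plotted individually.
outliers = [x for x in ordered if x < summary[10] or x > summary[90]]

for p, value in summary.items():
    print(f"{p}th percentile: {value}")
print("outside the whiskers:", outliers)
```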
Pie Chart
• Present the percent frequency distribution
• Draw a circle
• Divide the circle into pieces that correspond to the percent
frequency distribution
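Dividing the circle amounts to giving each category a share of the 360 degrees equal to its relative frequency. A small sketch, using the class sizes from the frequency example earlier in the slides:

```python
class_sizes = {"Freshmen": 200, "Sophomore": 160, "Junior": 130, "Senior": 110}
total = sum(class_sizes.values())

# Each slice gets the fraction size/total of the full 360-degree circle.
angles = {name: size * 360 / total for name, size in class_sizes.items()}

for name, angle in angles.items():
    percent = 100 * class_sizes[name] / total
    print(f"{name:10s} {percent:5.1f}%  {angle:6.1f} degrees")
```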
Visualization Techniques: Scatter Plots
• Scatter plots
• Attribute values determine the position
• Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
• Often additional attributes can be displayed by using the size,
shape, and color of the markers that represent the objects
• Arrays of scatter plots are useful because they can compactly summarize
the relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
OLAP
• On-Line Analytical Processing (OLAP) was proposed by
E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP
uses a multidimensional array representation.
• Such representations of data previously existed in statistics and
other fields
• There are a number of data analysis and data exploration
operations that are easier with such a data
representation.
Data Warehouses and Example
• A data warehouse is usually modeled by a
multidimensional data structure, called a data cube
• Example: A data cube for a company
• The cube has three dimensions:
• address (with city values Chicago, New York, Toronto,
Vancouver)
• time (with quarter values Q1, Q2, Q3, Q4)
• item (with item type values home entertainment,
computer, phone, security)
A multidimensional data cube
A multidimensional data cube: drill down
and roll up
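A data cube and the roll-up operation can be sketched with a dictionary keyed by (city, quarter, item). The dimension values come from the example above, but the sales figures are made up for illustration and the function name `roll_up` is hypothetical.

```python
from collections import defaultdict
from itertools import product

cities = ["Chicago", "New York", "Toronto", "Vancouver"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
items = ["home entertainment", "computer", "phone", "security"]

# One cell per (city, quarter, item). The values are invented sales figures.
cube = {cell: 100 + k for k, cell in enumerate(product(cities, quarters, items))}

def roll_up(cube, keep):
    """Sum the cube over every dimension whose position is not in keep."""
    totals = defaultdict(int)
    for cell, value in cube.items():
        totals[tuple(cell[d] for d in keep)] += value
    return dict(totals)

by_city_item = roll_up(cube, keep=(0, 2))   # roll up the time dimension
by_city = roll_up(cube, keep=(0,))          # roll up time and item

print(by_city)
print(by_city_item[("Toronto", "phone")])
```

Drill down is the reverse direction: moving from `by_city` back to `by_city_item`, or to the full cube, exposes the finer-grained cells behind each total.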
Thank you!