DATA MINING: EXPLORING DATA
Instructor: Dr. Chun Yu
School of Statistics
Jiangxi University of Finance and Economics
Fall 2016

What is data exploration?
• A preliminary exploration of the data to better understand its characteristics.
• In our discussion of data exploration, we focus on
  • Summary statistics
  • Visualization
  • Online Analytical Processing (OLAP)

Summary Statistics
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  • Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data

Frequency
• Mode
  • The mode of an attribute is the most frequent attribute value

  Class       Size   Frequency
  Freshmen    200    200/600 = 0.33
  Sophomore   160    160/600 = 0.27
  Junior      130    130/600 = 0.22
  Senior      110    110/600 = 0.18
  Total       600    1.00

• The mode of the class attribute is freshmen, with a frequency of 0.33
• The notions of frequency and mode are typically used with categorical data

Measures of Location: Mean and Median
• The mean and the median are the most common measures of the location of a set of points.

Median
• A sample median is the middle sorted observation. That is, we want a value such that half of the data is below it and half above it.
• How to calculate the median?
  • Step 1: Sort the data from smallest to largest.
  • Step 2:
    • If n is odd, pick the middle observation.
    • If n is even, average the two middle observations.

Mean and Median
• Example 1: Computation of the median with an odd number of data points.
  • The data: 7, 11, 7, 14, 13
  • The data put in order: 7, 7, 11, 13, 14
  • Median: 11
  • Mean: (7 + 7 + 11 + 13 + 14)/5 = 10.4

Mean and Median
• Example 2: Computation of the median with an even number of data points.
  • The data: 7, 11, 7, 14, 13, 15
  • The data put in order: 7, 7, 11, 13, 14, 15
  • Median: 12, the average of 11 and 13
  • Mean: (7 + 7 + 11 + 13 + 14 + 15)/6 = 11.17

Effect of Outlier on Mean and Median
• Begin with data 7, 7, 11, 13, 14 as in Example 1.
• What happens to the mean and median when the largest value is changed from 14 to 140?
• Change affects the mean but not the median.
• Median is still 11 but mean is 35.6.
• The mean “chases after” extreme observations.

Mean and Median
• When the data are symmetric, the median and mean will be about the same.
• When the data are skewed right, the mean is greater than the median. (Ex: Income)
• When the data are skewed left, the mean is less than the median. (Ex: Exam scores)

Percentiles
• For continuous data, the notion of a percentile is more useful. Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p.
• For instance, the 50th percentile is the value x_50 such that 50% of all values of x are less than x_50. The median is the 50th percentile.

Calculating the pth Percentile
Data: X1, X2, X3, …, Xn
• 1. Sort the data from smallest to largest.
• 2. Compute the index: i = (p/100)*n
• 3. If i is:
  • (a) an integer, find the ith observation in the ordered data and the (i+1)th observation. The average of these two is the pth percentile.
  • (b) not an integer, round UP to the next largest integer. This observation in the ordered data is the pth percentile.

Calculating the pth Percentile
• Example
• Sorted heights (cm): 165, 165, 167, 168, 170, 172, 173, 175, 180, 190. What is the 50th percentile?
• Compute the index: i = (50/100)*10 = 5
• The 50th percentile is the average of the 5th and 6th observations
• The 50th percentile is: (170+172)/2 = 171

Quartiles
• Divide the data into four groups
• Q1 = first quartile = 25th percentile
• Q2 = second quartile = 50th percentile = median
• Q3 = third quartile = 75th percentile
• In the previous example,
  • What is the first quartile?
  • What is the median (second quartile)?
  • What is the third quartile?

Quartiles
• 1. First quartile, Q1: i = (25/100)*10 = 2.5, so Q1 is the third observation, that is, Q1 = 167
• 2. Median, Q2 = 171
• 3. Third quartile, Q3: i = (75/100)*10 = 7.5, so Q3 is the 8th observation, that is, Q3 = 175

Measures of Spread: Range and Variance
• Standard Deviation
• Data: 1, 2, 3, 4, 5 (mean = 3)

  x_i   x_i - mean   (x_i - mean)^2
  1     -2           4
  2     -1           1
  3      0           0
  4      1           1
  5      2           4

  Sum of squared deviations: 4+1+0+1+4 = 10
  Variance: 10/4 = 2.5
  Standard deviation: Sqrt(2.5) = 1.58

• What is the standard deviation of 6, 7, 8, 9, 10?
• What is the standard deviation of -1, -2, -3, -4, -5?
• What is the standard deviation of 5, 10, 15, 20, 25?

Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  • Humans have a well developed ability to analyze large amounts of information that is presented visually
  • Can detect general patterns and trends
  • Can detect outliers and unusual patterns

Visualization Techniques: Histograms
• Histogram
  • Usually shows the distribution of values of a single variable
  • Divide the values into bins and show a bar plot of the number of objects in each bin.
  • The height of each bar indicates the number of objects
  • Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)

Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
  • What does this tell us?

Stem and Leaf Plot
• Each number is broken into a stem and a leaf such that the last digit is the leaf and all leading digits are the stem
• Place the stems in increasing order to the left of a vertical line
• To the right of the vertical line, place the leaves in ascending order
• Exam scores for n = 30 students:
  41 46 47 58 59 67 68 70 70 70 74 75 77 77 78 80 81 82 82 83 84 84 85 85 86 87 92 94 96 97

Stem and Leaf Plot
  4| 1 6 7
  5| 8 9
  6| 7 8
  7| 0 0 0 4 5 7 7 8
  8| 0 1 2 2 3 4 4 5 5 6 7
  9| 2 4 6 7
• Advantages:
  • Shows the actual data values
  • Shows the rank order of the data
  • Shows the shape of the data set
  • Easy and quick to do by hand for small data sets

Visualization Techniques: Box Plots
• Box Plots
  • Invented by J. Tukey
  • Another way of displaying the distribution of data
  • The following figure shows the basic parts of a box plot
[Figure: box plot, labeled from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]

Example of Box Plots
• Box plots can be used to compare attributes

Pie Chart
• Present the percent frequency distribution
  • Draw a circle
  • Divide the circle into pieces that correspond to the percent frequency distribution

Visualization Techniques: Scatter Plots
• Scatter plots
  • Attribute values determine the position
  • Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  • Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  • Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
    • See example on the next slide

Scatter Plot Array of Iris Attributes

OLAP
• On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP uses a multidimensional array representation.
  • Such representations of data previously existed in statistics and other fields
• There are a number of data analysis and data exploration operations that are easier with such a data representation.

Data Warehouses and Example
• A data warehouse is usually modeled by a multidimensional data structure, called a data cube
• Example: A data cube for a company
  • The cube has three dimensions:
    • address (with city values Chicago, New York, Toronto, Vancouver)
    • time (with quarter values Q1, Q2, Q3, Q4)
    • item (with item type values home entertainment, computer, phone, security)

A multidimensional data cube

A multidimensional data cube: drill down and roll up

Thank you!
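
Appendix: summary statistics in code
The median, percentile and standard deviation procedures from the slides can be sketched in Python as follows. This is a minimal illustration, not part of the original lecture: the function names (median, percentile, sample_std) are hypothetical, and percentile follows the slides' index rule i = (p/100)*n for 0 < p < 100. Library routines often use different interpolation rules and may return slightly different values.

```python
import math


def median(values):
    """Middle observation of the sorted data (average of the two middle ones if n is even)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


def percentile(values, p):
    """pth percentile using the slides' rule; assumes 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    i = p * n / 100  # index from step 2 (1-based position)
    if i.is_integer():
        i = int(i)
        # Case (a): average the ith and (i+1)th observations (1-based).
        return (data[i - 1] + data[i]) / 2
    # Case (b): round up and take that observation (1-based).
    return data[math.ceil(i) - 1]


def sample_std(values):
    """Sample standard deviation: divide the sum of squared deviations by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


heights = [165, 165, 167, 168, 170, 172, 173, 175, 180, 190]
print(median([7, 11, 7, 14, 13]))            # 11
print(median([7, 11, 7, 14, 13, 15]))        # 12.0
print(percentile(heights, 25))               # 167
print(percentile(heights, 50))               # 171.0
print(percentile(heights, 75))               # 175
print(round(sample_std([1, 2, 3, 4, 5]), 2))  # 1.58
```

The printed values match the worked examples on the median, quartile and standard deviation slides.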