ICS 278: Data Mining
Lecture 3: Exploratory Data Analysis and Visualization
Data Mining Lectures Lecture 3: EDA and Visualization Padhraic Smyth, UC Irvine

Lecture 3
• Finish up material from Lecture 2
• Homework due this Thursday
• Discuss projects in some detail
• Exploratory Data Analysis and Visualization
  – Reading: Chapter 3 in the text

Exploratory Data Analysis (EDA)
• get a general sense of the data
• interactive and visual
  – (cleverly/creatively) exploit human visual power to see patterns
    • 3 to 5 dimensions (e.g. spatial, color, time, sound)
  – e.g. plot raw data/statistics, reduce dimensions as needed
• data-driven (model-free)
• especially useful in early stages of data mining
  – detect outliers (e.g. assess data quality)
  – test assumptions (e.g. normal distributions?)
  – identify useful raw data & transforms (e.g. log(x))
• http://www.itl.nist.gov/div898/handbook/eda/eda.htm

Summary Statistics
• not visual
• sample statistics of data X
  – mean: $\mu = \sum_i X_i / n$ (minimizes $\sum_i (X_i - \mu)^2$)
  – mode: most common value in X
  – median: X = sort(X), median = $X_{n/2}$ (half below, half above)
  – quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
    • interquartile range: value(Q3) - value(Q1)
    • range: max(X) - min(X) = $X_n - X_1$
  – variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
  – skewness: $\sum_i (X_i - \mu)^3 / [(\sum_i (X_i - \mu)^2)^{3/2}]$
    • zero if symmetric; right-skewed more common (e.g. us … Gates)
  – number of distinct values for a variable (see unique.m in MATLAB)
  – Note: all of these are estimates based on the sample at hand; they may be different from the “true” values (e.g., median age in US).
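As a rough illustration of the sample statistics on the Summary Statistics slide, the Python sketch below computes them for a small made-up data set. The function name summary_statistics, the toy data, and the use of rank ceil(p·n) for the median and quartiles are assumptions made here, not part of the lecture. The skewness follows the slide's formula, which differs from the usual moment-based coefficient by a factor of sqrt(n).

```python
import math
from collections import Counter


def summary_statistics(values):
    """Compute the slide's sample statistics for a list of numbers."""
    x = sorted(values)
    n = len(x)
    mean = sum(x) / n
    sum_sq = sum((v - mean) ** 2 for v in x)   # sum of squared deviations
    sum_cu = sum((v - mean) ** 3 for v in x)   # sum of cubed deviations

    def order_statistic(fraction):
        # Assumption: pick the element at 1-based rank ceil(fraction * n).
        # Other quantile conventions interpolate between neighbours.
        rank = max(1, math.ceil(fraction * n))
        return x[rank - 1]

    q1, q3 = order_statistic(0.25), order_statistic(0.75)
    return {
        "mean": mean,
        "mode": Counter(x).most_common(1)[0][0],
        "median": order_statistic(0.5),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": x[-1] - x[0],
        "variance": sum_sq / n,
        # Slide's form of skewness; zero for symmetric data.
        "skewness": sum_cu / sum_sq ** 1.5 if sum_sq > 0 else 0.0,
        "distinct": len(set(x)),
    }


sample = [2, 2, 3, 4, 5, 6, 20]   # made-up, right-skewed toy data
stats = summary_statistics(sample)
for name, value in stats.items():
    print(f"{name:>9}: {value}")
```

In this toy sample the single large value pulls the mean above the median and gives a positive skewness, which is the right-skewed pattern the slide mentions.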
Exploratory Data Analysis
Tools for Displaying Single Variables

Histogram
• Most common form: split the data range into equal-sized bins. Then for each bin, count the number of points from the data set that fall into the bin.
  – Vertical axis: Frequency (i.e., counts for each bin)
  – Horizontal axis: Response variable
• The histogram graphically shows the following:
  1. center (i.e., the location) of the data;
  2. spread (i.e., the scale) of the data;
  3. skewness of the data;
  4. presence of outliers; and
  5. presence of multiple modes in the data.
• These features can provide useful information about the proper distributional model for the data.

Issues with Histograms
• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• example
• Can smooth histogram using a variety of techniques
  – E.g., kernel density estimation (pages 59-61 in text)
• Histograms effectively only work with 1 variable at a time
  – Difficult to extend to 2 dimensions, not possible for >2
  – So histograms tell us nothing about the relationships among variables

Histogram Example
Classical bell-shaped, symmetric histogram with most of the frequency counts bunched in the middle and with the counts dying off out in the tails. From a physical science/engineering point of view, the Normal/Gaussian distribution often occurs in nature (due in part to the central limit theorem).
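The equal-width binning described on the Histogram slide can be sketched in a few lines of Python. The helper name histogram_counts and the ten-point data set are made up for illustration. The example also hints at the sensitivity noted under Issues with Histograms: the same small data set looks quite different with 4 bins and with 8.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; bin width would be zero")
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a bin of its own.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up small data set
for num_bins in (4, 8):
    counts = histogram_counts(data, num_bins)
    print(f"{num_bins} bins: {counts}")
    for count in counts:
        print("  " + "#" * count)
```

With 4 bins the value 9 sits next to the main mass; with 8 bins it is separated by empty bins and reads more clearly as a possible outlier.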
ZipCode Data: Population
[Figure: histograms of ZipCode population with K = 50 bins, K = 500 bins, and K = 50 bins restricted to values below 5000]

ZipCode Data: Population
• MATLAB code:
    X = zipcode_data(:,2);     % second column from zipcode array
    histogram(X, 50)           % histogram of X with 50 bins
    histogram(X, 500)          % 500 bins
    index = X < 5000;          % identify X values lower than 5000
    histogram(X(index), 50)    % now plot just these X values

Histogram Detecting Outlier (Missing Data)
blood pressure = 0 ?

Right Skewness Example: Credit Card Usage
Similarly right-skewed are power law distributions ($P_i \sim 1/i^a$, where $a \ge 1$), e.g. for $a = 1$ we have “Zipf’s law” for word frequencies in text.

Box (and Whisker) Plots: Pima Indians Data
• box spans Q1 to Q3 and contains the middle 50% of the data
• line inside the box: Q2 (median)
• whiskers extend up to 1.5 x (Q3-Q1) (or shorter, if no data that far above Q3)
• plots all data outside whiskers
• groups shown: healthy, diabetic

Time Series Example 1
annual fees introduced in UK (many users cut back to 1 credit card)

Time Series Example 2
• steady growth trend
• New Year bumps
• summer peaks
• summer bifurcations in air travel (favor early/late)

Time-Series Example 3
mean weight vs mean age for 10k control group
Scotland experiment: “milk in kid diet → better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)
Would expect smooth weight growth plot. Visually reveals unexpected pattern (steps), not apparent from raw data table.
Possible explanations:
• Grow less early in year than later?
• No steps in height plots; so why height uniformly, weight spurts?
• Kids weighed in clothes: summer garb lighter than winter?

Exploratory Data Analysis
Tools for Displaying Pairs of Variables

A simple data set
Data
    X       Y
    10.00   8.04
    8.00    6.95
    13.00   7.58
    9.00    8.81
    11.00   8.33
    14.00   9.96
    6.00    7.24
    4.00    4.26
    12.00   10.84
    7.00    4.82
    5.00    5.68
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Summary Statistics
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Residual standard deviation = 1.237
    Correlation = 0.816

3 more data sets
    X2      Y2      X3      Y3      X4      Y4
    10.00   9.14    10.00   7.46    8.00    6.58
    8.00    8.14    8.00    6.77    8.00    5.76
    13.00   8.74    13.00   12.74   8.00    7.71
    9.00    8.77    9.00    7.11    8.00    8.84
    11.00   9.26    11.00   7.81    8.00    8.47
    14.00   8.10    14.00   8.84    8.00    7.04
    6.00    6.13    6.00    6.08    8.00    5.25
    4.00    3.10    4.00    5.39    19.00   12.50
    12.00   9.13    12.00   8.15    8.00    5.56
    7.00    7.26    7.00    6.42    8.00    7.91
    5.00    4.74    5.00    5.73    8.00    6.89

Summary Statistics
Summary statistics of Data Set 2, Data Set 3, and Data Set 4 (the same values are reported for each):
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Residual standard deviation = 1.237
    Correlation = 0.816

Graphs reveal the mystery!

Displaying high-dimensional data
• multiple bivariate graphs
  – scatter plot matrix
  – trellis plot
• Icon plots
  – star graph
  – Chernoff’s faces
• Parallel coordinates

2D: Scatter Plots
• standard tool for displaying relationship between two variables
• A scatter plot is a plot of the values of Y versus the corresponding values of X:
  – Vertical axis: variable Y, usually the response variable
  – Horizontal axis: variable X, a variable we suspect may be related
• Scatter plots can provide answers to the following questions:
  1. Are variables X and Y related?
  2. Are variables X and Y linearly related?
  3. Are variables X and Y non-linearly related?
  4. Does the variation in Y change depending on X?
  5. Are there outliers?
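The claim that the four Anscombe data sets earlier in this lecture share essentially the same summary statistics can be checked numerically. The Python sketch below uses the values from the data tables; the helper name fit_summary is made up, and the residual standard deviation is assumed to use n - 2 degrees of freedom, which is consistent with the 1.237 reported on the slides.

```python
import math


def fit_summary(xs, ys):
    """Means, least-squares line, residual SD and correlation for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        # Assumption: divide by n - 2 (two fitted parameters).
        "residual_sd": math.sqrt(sse / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]      # shared X for sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {k: fit_summary(xs, ys) for k, (xs, ys) in quartet.items()}
for k, s in summaries.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
```

The printed rows agree to about two decimal places even though scatter plots of the four sets look nothing alike, which is why the summary numbers should be read alongside a plot.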
Scatter Plot: No relationship

Scatter Plot: Linear relationship

Scatter Plot: Quadratic relationship

Scatter Plot: Homoscedastic
Variation of Y does not depend on X

Scatter Plot: Heteroscedastic
Variation in Y differs depending on the value of X

2D Scatter Plots
• standard tool to display relation between 2 variables
  – e.g. y-axis = response, x-axis = suspected indicator
  – e.g. credit card repayment: low-low, high-high
• useful to answer:
  – x,y related?
    • no
    • linearly
    • nonlinearly
  – variance(y) depend on x?
  – outliers present?
• MATLAB:
  – plot(X(1,:),X(2,:),'.');

[Figure: scatter plot with axes MEDIAN HOUSEHOLD INCOME and MEDIAN PERCAPITA INCOME]

Problems with Scatter Plots of Large Data
• 96,000 bank loan applicants
• appears: later apps older; reality: downward slope (more apps, more variance)
• scatter plot degrades into black smudge ...
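One common way around the "black smudge" problem is to count points per grid cell and display the counts instead of the individual points, which is also the kind of density information a contour plot summarizes. The Python sketch below is only an illustration under stated assumptions: the data are synthetic (not the bank loan data), and the helper name grid_counts, the grid range, and the character shading are made up here.

```python
import random


def grid_counts(xs, ys, bins, low, high):
    """Count points per cell on a bins x bins grid covering [low, high] on both axes."""
    width = (high - low) / bins
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        # Clamp so points outside [low, high] are counted in the edge cells.
        col = min(max(int((x - low) / width), 0), bins - 1)
        row = min(max(int((y - low) / width), 0), bins - 1)
        counts[row][col] += 1
    return counts


random.seed(0)
# Synthetic stand-in for a large data set: 20,000 correlated points.
xs = [random.gauss(0, 1) for _ in range(20000)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

counts = grid_counts(xs, ys, bins=16, low=-4.0, high=4.0)
peak = max(max(row) for row in counts)
shades = " .:-=+*#%@"   # darker character = more points in the cell
for row in reversed(counts):   # print the largest y values at the top
    print("".join(shades[c * len(shades) // (peak + 1)] for c in row))
```

Because each cell shows a count rather than overprinted dots, dense regions stay distinguishable from sparse ones however many points are plotted.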
Contour Plots Can Help
• recall: same 96,000 bank loan apps as before
• shows that the change in variance(y) with x is indeed due to horizontal skew in density
• plot annotations: unimodal, skewed

Problems with Scatter Plots of Large Data
• # weeks credit card buys gas vs groceries (10,000 customers)
• actual correlation (0.48) higher than appears (overprinting)
• also demands explanation

Exploratory Data Analysis
Tools for Displaying Pairs of Variables

Scatter Plot Matrix

Trellis Plot
[Figure: trellis plot with panels labelled Older, Younger, Male, Female]

Star Plots: Using Icons to Encode Information
• Each star represents a single observation. Star plots are used to examine the relative values for a single data point.
• The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
• Variables in the example:
    1 Price
    2 Mileage (MPG)
    3 1978 Repair Record (1 = Worst, 5 = Best)
    4 1977 Repair Record (1 = Worst, 5 = Best)
    5 Headroom
    6 Rear Seat Room
    7 Trunk Space
    8 Weight
    9 Length
• Useful for small data sets with up to 10 or so variables
• Limitations?
  – Small data sets, small dimensions
  – Ordering of variables may affect perception

Chernoff’s Faces
• described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
• Chernoff faces applet
• more icon plots
• Limitations:
  – Similar to star plots

Parallel Coordinates
(epileptic seizure data again)
• each line: 1 (of n) cases (this case is a “brushed” one, with a darker line, to stand out from the n-1 other cases)
• axes: dimensions (possibly all d of them!), often (re)ordered to better distinguish among interesting subsets of n total cases
• interactive “brushing” is useful for seeing such distinctions

“Grand Tour”
• scatter plot matrix only multi-bivariate
• can achieve richer multivariate visualization by:
  – rotate direction of projection over all d (not just pick two)
  – user control over spin
  – random projection (“Grand Tour”)
• e.g. XGOBI visualization package (available on the Web)

Summary
• EDA and Visualization
  – Can be very useful for
    • data checking
    • getting a general sense of individual or pairs of variables
  – But…
    • do not necessarily reveal structure in high dimensions
• Reading: Chapter 3