CENG 464 Introduction to Data Mining
10/26/2015

Getting to Know Your Data
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

What’s an attribute?
• Data: a collection of data objects and their attributes
• Each instance/object/record/point/case/sample/entity is described by a fixed, predefined set of features, its “attributes”
• An attribute/variable/field/characteristic/feature is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Possible attribute types (“levels of measurement”):
  • Qualitative: nominal, ordinal
  • Quantitative (numeric): interval, ratio

Example: objects (rows) and attributes (columns)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute Types: Nominal quantities
• Values are distinct symbols
• Values themselves serve only as labels or names
• “Nominal” comes from the Latin word for name
• Categorical; binary is a special case
• Example: attribute “outlook” from weather data
  • Values: “sunny”, “overcast”, and “rainy”
• No relation is implied among nominal values (no ordering or distance measure)
• Only equality tests can be performed
• Operators: =, ≠; central tendency: mode

Attribute Types: Ordinal quantities
• Impose order on values (ranking)
• But: no distance between values is defined
• Example: attribute “temperature” in weather data
  • Values: “hot” > “mild” > “cool”
• Attribute size:
  • Values: “small” < “medium” < “large”
• Attribute rank:
  • Values: “engineer” < “chief” < “manager”
• Attribute satisfaction:
  • Values: 0: dissatisfied < 1: neutral < 2: very satisfied
• Note: addition and subtraction don’t make sense
• Example rule: temperature < hot ⇒ play = yes
• Operations: <, >, >=, <=; central tendency: mode and median
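The operations listed above for nominal and ordinal attributes can be sketched in a few lines of Python. This is only an illustration: the sample values are made up, the helper names `mode` and `ordinal_median` are hypothetical, and the ordering cool < mild < hot is the one assumed from the temperature example.

```python
from collections import Counter

# Hypothetical sample values for the two weather attributes.
outlook = ["sunny", "rainy", "sunny", "overcast", "sunny"]            # nominal
temperature = ["hot", "cool", "mild", "mild", "hot", "cool", "mild"]  # ordinal

# Assumed ordering for the ordinal attribute: cool < mild < hot.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]


def mode(values):
    """Most frequent value; only equality tests are needed, so it works for nominal data."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, order):
    """Middle value after sorting by rank; uses the ordering but no distances."""
    rank = {label: position for position, label in enumerate(order)}
    ranked = sorted(values, key=rank.__getitem__)
    # Lower median: pick an existing label instead of averaging two labels.
    return ranked[(len(ranked) - 1) // 2]


print(mode(outlook))                                   # sunny
print(ordinal_median(temperature, TEMPERATURE_ORDER))  # mild
```

A median of `outlook` is deliberately not computed: without an ordering there is no "middle" value, which is why only the mode is listed for nominal attributes.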
Note: the distinction between nominal and ordinal is not always clear (e.g. attribute “outlook”).

Interval quantities (numeric)
• Interval quantities are not only ordered but also measured in fixed and equal units
• Have order
• Difference of two values makes sense
• Can compare and quantify differences
• Sum or product doesn’t make sense
• Zero point is not defined!
• Example: attribute “temperature” expressed in degrees Fahrenheit
• Example: 0 degrees Celsius does not mean “no temperature”
• Can’t say one temperature is a multiple of another
• Example: attribute “year”
• Operators: difference, mean, mode

Ratio quantities (numeric)
• Ratio quantities are ones for which the measurement scheme defines a zero point
• Can say one value is a multiple of another
• Ratio quantities are treated as real numbers
• All mathematical operations are allowed
• Example: attribute “distance”
  • Distance between an object and itself is zero
• Examples: count of students, years of experience, weight, height

Attribute types: Summary
• Nominal, e.g. eye color = brown, blue, …
  • only equality tests
  • important special case: boolean (True/False)
• Ordinal, e.g. grade = k, 1, 2, ..., 12

Why specify attribute types?
• Q: Why do machine learning algorithms need to know about attribute type?
• A: To be able to make the right comparisons and learn correct concepts, e.g.
  • Outlook > “sunny” does not make sense, while
  • Temperature > “cool” or
  • Humidity > 70 does
• Additional uses of attribute type: check for valid values, deal with missing values, etc.

Nominal vs. ordinal
• Attribute “age” nominal:
  • If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
  • If age = old and astigmatic = no and tear production rate = normal then recommendation = soft

Discrete and Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables.
  • Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight.
  • Practically, real values can only be measured and represented using a finite number of digits.
  • Continuous attributes are typically represented as floating-point variables.

Nominal vs. ordinal
• Attribute “age” ordinal (e.g. “young” < “adult” < “old”):
  • If age ≤ adult and astigmatic = no and tear production rate = normal then recommendation = soft

Question
• What type of variable is ud_req (number of user data requests as part of a criminal investigation)?
  (a) numerical, continuous
  (b) numerical, discrete
  (c) categorical
  (d) categorical, ordinal
  Answer: (b) numerical, discrete. The number of data requests is a counted variable that can only take on non-negative whole numbers. This falls into the definition of discrete numerical variables.

Question
• What type of variable is hdi (Human Development Index, combining indicators of life expectancy, educational attainment, and income, with levels very high, high, medium, and low human development)?
  (a) numerical, continuous
  (b) numerical, discrete
  (c) categorical
  (d) categorical, ordinal
  Answer: (d) categorical, ordinal. There is an inherent ordering to the levels of this categorical variable (from very high to low), and hence this is an ordinal categorical variable.

Record Data
• Data consist of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents an attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Transaction Data
• A special type of record data, where each record (transaction) has a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
• Set based

Document Data
• Each document becomes a “term” vector:
  • each term is a component (attribute) of the vector,
  • a term can be an n-gram, a phrase, etc.,
  • the value of each component is the number of times the corresponding term occurs in the document.
• Q: What is a sparse data set?
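The term-vector representation described above can be sketched as follows. The two toy documents, the vocabulary, and the helper name `term_vector` are all hypothetical, and splitting on whitespace is a simplifying assumption (real pipelines usually normalise and tokenise more carefully).

```python
from collections import Counter

# Hypothetical vocabulary and toy documents, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]
documents = [
    "team coach play ball score game team",
    "coach timeout timeout season",
]


def term_vector(text, vocab):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(text.split())  # assumption: whitespace tokenisation
    return [counts[term] for term in vocab]


vectors = [term_vector(doc, vocabulary) for doc in documents]

# Fraction of zero entries; high values are what "sparse" usually refers to.
zero_fraction = sum(v == 0 for vec in vectors for v in vec) / (
    len(vectors) * len(vocabulary)
)

for vec in vectors:
    print(vec)
print(zero_fraction)
```

With a realistic vocabulary of many thousands of terms, most components of each vector are zero, so such data are typically stored by listing only the non-zero entries.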
Document data example: term-frequency vectors

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Transaction data example

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data
• Examples: directed graph and URL links
[Figure: directed graph with weighted edges]

  <a href="papers/papers.html#bbbb"> Data Mining </a>
  <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
  <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
  <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers

Ordered Data: temporal data
• Sequences of transactions
[Figure: a sequence of transactions; labels: "Items/Events", "An element of the sequence"]

Ordered Data
• Genomic sequence data (no time stamps)

  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicated data

Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
• Are they noise points, or meaningful outliers?

Missing Values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
  • Missing as meaningful…

Data Preprocessing
• Aggregation and noise removal
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation and transformation
• Discretization
• Q: What percentage of the data mining process is data preprocessing?

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • De-noise: more “stable” data
    • Aggregated data tends to have less variability

Aggregation: Variation of Precipitation in Australia
[Figure, two panels: "Standard Deviation of Average Monthly Precipitation" and "Standard Deviation of Average Yearly Precipitation"]

Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Reason: it is too expensive or time consuming to obtain or to process all of the data.

Curse of Dimensionality (the motivation for Dimensionality Reduction)
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Thus, it becomes harder and harder to classify the data!
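A small experiment makes the loss of distance contrast concrete. The sketch below is one possible setup, not a prescribed one: it assumes points drawn uniformly from the unit hypercube, Euclidean distance, and a contrast measure of (max − min) / min over all pairwise distances; the function name `relative_contrast` and the fixed seed are illustrative choices.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


if __name__ == "__main__":
    # As the number of dimensions grows, the nearest and farthest pairs
    # become relatively closer to each other, so the contrast shrinks.
    for dims in (2, 10, 50):
        print(dims, round(relative_contrast(200, dims), 2))
```

A shrinking contrast means that "near" and "far" neighbours are hard to tell apart, which is the practical problem for distance-based clustering and outlier detection.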
Dimensionality Reduction
• Purpose:
  • Avoid curse of dimensionality
  • Reduce amount of time and memory required by data mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or reduce noise
• Techniques (supervised and unsupervised methods)
  • Principal Component Analysis
  • Singular Value Decomposition
  • Others: supervised and non-linear techniques

Curse of Dimensionality: illustration
• Randomly generate 500 points
• Compute difference between max and min distance between any pair of points

Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data
[Figure: data in the (x1, x2) plane with projection direction e]

Question
• What is the difference between sampling and dimensionality reduction?
• Thinning vs. shortening of data

Discretization
• Three types of attributes:
  • Nominal: values from an unordered set
    • Example: attribute “outlook” from weather data
    • Values: “sunny”, “overcast”, and “rainy”
  • Ordinal: values from an ordered set
    • Example: attribute “temperature” in weather data
    • Values: “hot” > “mild” > “cool”
  • Continuous: real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes.
  • Reduce data size by discretization
  • Supervised (entropy) vs. unsupervised (binning)

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • It divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
  • The most straightforward
  • But outliers may dominate presentation: skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  • It divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky.

Transforming Ordinal to Boolean
• A simple transformation allows coding an ordinal attribute with n values using n-1 boolean attributes
• Example: attribute “temperature”

Original data   Transformed data
Temperature     Temperature > cold   Temperature > medium
Cold            False                False
Medium          True                 False
Hot             True                 True

• Why? To avoid introducing a notion of distance between values, e.g. between different colors: “Red” vs. “Blue” vs. “Green”.

Mining Data Descriptive Characteristics
• Motivation
  • To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  • Data dispersion: analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals

Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of the five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Visualizing numerical data
• scatterplots for paired data
• other visualizations for describing distributions of numerical variables

Scatter plot: evaluating relationships
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Uncorrelated Data
[Figure: scatter plots of uncorrelated data]

Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The
right half is negatively correlated

Histogram
• provides a view of the data density
• especially useful for describing the shape of the distribution

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; y-axis frequency from 0 to 40, x-axis values from 10000 to 90000]

Boxplot Analysis
• Five-number summary of a distribution
  • Minimum, Q1, Median, Q3, Maximum
• Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extended to Minimum and Maximum
  • Outliers: points beyond a specified outlier threshold, plotted individually

Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
  • The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
[Figure: two histograms with the same five-number summary]

Skewness
• distributions are skewed to the side of the long tail

Modality
[Figure: distributions with different numbers of modes]

Histogram & bin width
• The chosen bin width can alter the story the histogram is telling.

Dotplot
• useful when individual values are of interest
• can get busy as the sample size increases

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  • Trimmed mean: chopping extreme values
• Median: a holistic measure (expensive to compute)
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below it, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is its width
• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

Measures of center
• Mean: arithmetic average
  • Sensitive to outliers
• Median: midpoint of the distribution (50th percentile)
• Mode: most frequent observation
• Example: exam scores of 9 students: 75, 69, 88, 93, 95, 54, 87, 88, 27
  • Mean = sum/9 = 75.1
  • Mode = 88
  • Median = ?? (sorted: 27 54 69 75 87 88 88 93 95)
• Example: exam scores 27 54 69 75 87 90 90 93 95 100: mode = ???

Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
[Figure, three panels: unimodal frequency curve with perfectly symmetric distribution; positively skewed data (mode < median); negatively skewed data (mode > median)]
(Data Mining: Concepts and Techniques, October 26, 2015)

Skewness and measures of center
Question: Which of the following is true? (Hint: sketching the distributions might be useful.)
(a) In a symmetric distribution, more than 50% of the data are below and less than 50% are above the mean.
(b) In a left skewed distribution, roughly 50% of the data are below and 50% are above the mean.
(c) In a right skewed distribution, less than 50% of the data are below the mean.
(d) In a left skewed distribution, less than 50% of the data are below the mean.

Measures of spread
• Range (max - min)
• Quartiles
• Variance
• Standard deviation
• Inter-quartile range

Quartiles
• Quantile: points taken at regular intervals of a data distribution, dividing it into consecutive sets of equal size
• kth q-quantile: the value x such that at most k/q of the data values are less than x and at most (q-k)/q of the values are greater than x, where 0 < k < q
• e.g. the 2-quantile divides the data set into 2 ??
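The nine exam scores from the measures-of-center example can be checked with Python's standard `statistics` module. This is a sketch for verification only; note that `statistics.quantiles` uses one particular interpolation convention (its default "exclusive" method), and other textbooks or tools may report slightly different quartiles for the same data.

```python
import statistics

# The nine exam scores from the example above.
scores = [75, 69, 88, 93, 95, 54, 87, 88, 27]

mean = statistics.mean(scores)      # arithmetic average, sum / 9
median = statistics.median(scores)  # middle value of the sorted scores
mode = statistics.mode(scores)      # most frequent score

# Three cut points that split the sorted data into four parts (4-quantiles).
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(round(mean, 1))  # 75.1
print(median)          # 87
print(mode)            # 88
print(q1, q2, q3)      # q2 coincides with the median
```

The mean (75.1) lies well below the median (87) because the single low score of 27 pulls it down, which illustrates the sensitivity of the mean to outliers.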
• 4-quantiles (quartiles): 3 data points splitting the data distribution into 4 equal parts; each part represents ¼ of the data distribution
• 100-quantiles (percentiles): divide the data into 100 equal-sized sets
• Q1: 25th percentile, 1st quartile
• Q2: median, 2nd quartile (50th percentile)
• Q3: 75th percentile, 3rd quartile

Variance
• roughly the average squared deviation from the mean
• Example: given that the average life expectancy is 70.5, and there are 201 countries in the dataset:
  • variance = ??
  • std = ?

Interquartile range
• range of the middle 50% of the data: the distance between the first quartile (25th percentile) and the third quartile (75th percentile)
• IQR = Q3 - Q1

Box-plots
• The quartiles divide the data into divisions of 25%
  • Quartile 1 (Q1) can be called the 25th percentile
  • Quartile 2 (Q2) can be called the 50th percentile
  • Quartile 3 (Q3) can be called the 75th percentile
• Five number summary:
  • Minimum
  • Q1
  • Median
  • Q3
  • Maximum

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile). Ex: 4-quantiles means 3 points split the data into 4 equal parts, each representing ¼ of the distribution
  • Inter-quartile range: IQR = Q3 - Q1
  • Five number summary: min, Q1, median, Q3, max
  • Boxplot: ends of the box are the quartiles, median is marked, whiskers are added, and outliers are plotted individually
  • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  • Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  • Standard deviation s (or σ) is the square root of the variance $s^2$ (or $\sigma^2$)

Robust statistics
• we define robust statistics as measures on which extreme observations have little effect

Transformations
• a transformation is a
rescaling of the data using a function
• Why?
  • to see the data structure differently
  • to reduce skew and assist in modeling
  • to straighten a nonlinear relationship in a scatterplot
• Example: log transformation

Exploring categorical variables
• Frequency table
• Bar plot

Bar plots vs. histograms
• bar plots for categorical variables, histograms for numerical variables
• the x-axis on a histogram is a number line, and the ordering of the bars is not interchangeable

Contingency table

Bar plot
• useful for visualizing conditional frequency distributions
• compare relative frequencies to explore the relationship between the variables
• Mosaic plot?
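A contingency table and the conditional (row-relative) frequencies mentioned above can be sketched with plain dictionaries. The (Refund, Cheat) pairs are copied from the ten-row record table earlier in the slides; the helper names `contingency_table` and `row_relative_frequencies` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# (Refund, Cheat) pairs for Tid 1..10 of the record-data table.
records = [
    ("Yes", "No"), ("No", "No"), ("No", "No"), ("Yes", "No"), ("No", "Yes"),
    ("No", "No"), ("Yes", "No"), ("No", "Yes"), ("No", "No"), ("No", "Yes"),
]


def contingency_table(pairs):
    """Count co-occurrences: table[row_value][column_value] -> frequency."""
    table = {}
    for row_value, column_value in pairs:
        table.setdefault(row_value, Counter())[column_value] += 1
    return table


def row_relative_frequencies(table):
    """Turn each row into proportions: distribution of the column variable given the row value."""
    return {
        row_value: {
            column_value: count / sum(counts.values())
            for column_value, count in counts.items()
        }
        for row_value, counts in table.items()
    }


table = contingency_table(records)
conditional = row_relative_frequencies(table)

print(table)        # raw counts per (Refund, Cheat) combination
print(conditional)  # proportion of each Cheat value within each Refund group
```

Comparing the rows of `conditional` is the numeric counterpart of comparing bar segments: here all records with Refund = Yes have Cheat = No, while 3 of the 7 records with Refund = No have Cheat = Yes.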