
CSci 8980: Data Mining (Fall 2002)

Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar

© Vipin Kumar CSci 8980 Fall 2002

What is Data?

• Objects and the attributes of objects
  – Attribute: variable, field, characteristic, feature, or observation
  – Object: record, point, case, sample, entity, or item
  – Objects have attributes.
  – Attributes describe objects.
• A data set is a collection of data objects.

Types of Attributes

• There are different types of attributes:
  – Nominal: Values are just labels.
    Examples: ID numbers, eye color, zip codes
  – Ordinal: The values can be ordered.
    Examples: street numbers, rankings (e.g., taste of potato chips on a scale from 1-10), grades
  – Interval: Differences are meaningful.
    Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: Ratios are meaningful.
    Examples: temperature in Kelvin, length, time, counts

Measurement of Length

• The way you measure an attribute may not match the attribute's properties.

[Figure: objects A to E, each labeled with two measured values. Labels recovered from the figure: A (5, 1), B (7, 2), C (8, 3), D (10, 4), E (15, 5)]

Properties

• The type of an attribute depends on which of the following properties it has:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + -
  – Multiplication: * /
• Length has all these properties.

Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Attribute Level: Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Attribute Level: Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Level: Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Attribute Level: Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.

Types of data sets

• Many different types
• Common types
  – Record
  – Graph
  – Ordered
• Two important attributes
  – Dimensionality
  – Sparsity

Discrete and Continuous Attributes

• Discrete
  – A discrete attribute has only a finite or countably infinite set of values, e.g., zip codes, counts, or the set of words in a collection of documents. Discrete attributes are often represented as integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female. They are often represented as Boolean variables, or as integer variables that take on the values 0 or 1.
• Continuous
  – A continuous attribute is one whose values are real numbers, e.g., temperature, height, or weight. (In practice, real values can only be measured and represented with a finite number of digits.) Continuous attributes are typically represented as floating-point variables.

Record Data

• Much of the original data mining work, and much current work, focuses on record data, i.e., data that consists of a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix

• If the data objects in a collection all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multi-dimensional space, where each dimension represents a distinct attribute describing the object. Thus, a set of data objects can be interpreted as an m by n matrix, with m rows, one for each object, and n columns, one for each attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data

• Each document becomes a 'term' vector, where each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
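The term-vector representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `term_vector`, the lowercase whitespace tokenization, and the sample sentence are hypothetical and do not come from the lecture.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`.

    Assumption: tokens are lowercase words separated by whitespace.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "The coach told the team to play the game and play hard"
vector = term_vector(doc, vocabulary)
print(vector)  # [1, 1, 2, 0, 0, 1]
```

Each position in the resulting list corresponds to one term of the vocabulary, so documents encoded against the same vocabulary form the rows of a data matrix.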
Document    team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

Transaction Data

• Transaction data is a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data

• Generic graph and HTML links

[Figure: generic graph with numbered nodes]

    <a href="papers/papers.html#bbbb"> Data Mining </a>
    <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
    <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
    <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers </a>

Chemical Data

• Benzene Molecule: C6H6

Ordered Data

• Sequences of transactions

Ordered Data

• Genomic sequence data

    GGTTCCGCCTTCAGCCCCGCGCC
    CGCAGGGCCCGCCCCGCGCCGTC
    GAGAAGGGCCCGCCTGGCGGGCG
    GGGGGAGGCGGGGCCGCCCGAGC
    CCAACCGAGTCCGACCAGGTGCC
    CCCTCTGCTCGGCCTAGACCTGA
    GCTCATTAGGCGGCAGCGGACAG
    GCCAAGTAGAACACGCGAAGCGC
    TGGGCTGCCTGCTGCGACCAGGG

Ordered Data: Spatio-Temporal Data

[Figure: Ocean and Land Temperature (Jan 1982)]

Research Goals:
• Find global climate patterns of interest to Earth scientists. A key interest is finding connections between the ocean and the land.

[Figure: grid cells and zones indexed by Latitude, Longitude, and Time, with the variables NPP, Pressure, Precipitation, and SST]

• Global snapshots of values for a number of variables on land surfaces or water.
• Monthly over a range of 10 to 50 years.
Data Quality

• How can we detect problems with the data?
• What can we do about these problems?
• We need to know what kinds of problems are possible, i.e., what sorts of situations correspond to poor data quality. The following are some well-known problems:
  – noise and outliers
  – missing values
  – duplicate data
  – inconsistent values

Missing Values

Eliminate Data Objects
A simple and effective strategy is to eliminate those records with missing values. A related strategy is to eliminate attributes that have missing values.

Estimate Missing Values
Sometimes the data set is such that missing data can be reliably estimated. For example, consider a time series that changes in a reasonably smooth fashion but has a few widely scattered missing values. In such cases, the missing values can be estimated (interpolated) from the remaining values. As another example, consider a data set that has many similar data points. In this situation, a nearest neighbor approach can be used: the attribute values of the points closest to the point with the missing value are used to estimate it. If the attribute is continuous, the average attribute value of the nearest neighbors is used; if the attribute is categorical, the most commonly occurring attribute value can be taken.

Ignore the Missing Value During Analysis
Many data mining approaches can be modified to operate by ignoring missing values. For example, suppose that objects are being clustered and the similarity between pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only the other attributes. The similarity will then only be approximate, but unless the number of attributes is small or the number of missing values is high, this inaccuracy may not matter much. Likewise, many classification schemes can handle missing values in a relatively straightforward way.
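The nearest neighbor estimate and the "use only the other attributes" idea above can be sketched together in Python. This is a minimal illustration, not the lecture's own method. It assumes that missing values are stored as `None`, that the attributes used for distance are numeric, and that distance is Euclidean. The function names `partial_distance` and `estimate_missing` and the sample data are hypothetical.

```python
import math
from collections import Counter

def partial_distance(a, b, ignore=()):
    """Euclidean distance over the attributes present (not None) in both
    objects, skipping any attribute index listed in `ignore`."""
    shared = [(x, y) for i, (x, y) in enumerate(zip(a, b))
              if i not in ignore and x is not None and y is not None]
    if not shared:
        return math.inf  # no common attributes to compare
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def estimate_missing(records, row, col, k=2, categorical=False):
    """Estimate records[row][col] from the k nearest records that have a
    value for `col`. Assumes at least one such record exists."""
    target = records[row]
    candidates = [r for i, r in enumerate(records)
                  if i != row and r[col] is not None]
    candidates.sort(key=lambda r: partial_distance(target, r, ignore=(col,)))
    values = [r[col] for r in candidates[:k]]
    if categorical:
        # Most commonly occurring value among the nearest neighbors.
        return Counter(values).most_common(1)[0][0]
    # Average value of the nearest neighbors.
    return sum(values) / len(values)

# Hypothetical data: the last attribute of the last record is missing.
records = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 40.0],
    [1.05, 2.05, None],
]
print(estimate_missing(records, row=3, col=2, k=2))  # 11.0

labeled = [[1.0, "red"], [1.2, "red"], [9.0, "blue"], [1.1, None]]
print(estimate_missing(labeled, row=3, col=1, k=3, categorical=True))  # red
```

Because `partial_distance` compares objects on fewer attributes when values are missing, distances computed over different numbers of attributes are not directly comparable. A fuller implementation might rescale by the number of shared attributes.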
