Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
EECS 647: Introduction to Database Systems Instructor: Luke Huan Spring 2007 Administrative Final project is due May 9th I will not accept late hand-in for the final project Start early you need to take quick action if you haven’t divided the tasks yet For those in the implementation phase (this is the right phase that you should be in), thinking about testing plans Class review is April 30th. Provide your feedbacks If you like the course, recommend it to other students 5/22/2017 Luke Huan Univ. of Kansas 2 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data What is classification? Predict the value of unseen data What is clustering Grouping similar objects into groups 5/22/2017 Luke Huan Univ. of Kansas 3 Today’s Topic Continue the introduction to data mining Sequential patterns Regression Knowing the nature of your data Discover association in your data 5/22/2017 Luke Huan Univ. of Kansas 4 Sequential Pattern Discovery: Definition Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. (A B) (C) (D E) Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints. (A B) <= xg (C) (D E) >ng <= ws <= ms 5/22/2017 Luke Huan Univ. of Kansas 5 Sequential Pattern Discovery: Examples In telecommunications alarm logs, (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm) In point-of-sale transaction sequences, Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk) Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket) 5/22/2017 Luke Huan Univ. of Kansas 6 Regression Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics, neural network fields. Examples: 5/22/2017 Predicting sales amounts of new product based on advetising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices. Luke Huan Univ. of Kansas 7 Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day 5/22/2017 Luke Huan Univ. of Kansas 8 Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data 5/22/2017 Luke Huan Univ. of Kansas 9 Knowing the Nature of Your Data Data types: nominal, ordinal, interval, ratio. Data quality Data preprocessing 5/22/2017 Luke Huan Univ. of Kansas 10 What is Data? Collection of data objects and their attributes An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. Attribute is also known as variable, field, Objects characteristic, or feature A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance 5/22/2017 Luke Huan Univ. of Kansas Attributes Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K 6 No Married 7 Yes Divorced 220K 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K Yes No No 10 11 Attribute Values Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers But properties of attribute values can be different 5/22/2017 ID has no limit but age has a maximum and minimum value Luke Huan Univ. of Kansas 12 Types of Attributes There are different types of attributes Nominal Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval Examples: ID numbers, eye color, zip codes Examples: calendar dates, temperatures in Celsius or Fahrenheit. Ratio 5/22/2017 Examples: temperature in Kelvin, length, time, counts Luke Huan Univ. of Kansas 13 Properties of Attribute Values The type of an attribute depends on which of the following properties it possesses: 5/22/2017 Distinctness: Order: Addition: Multiplication: = < > + */ Nominal attribute: distinctness Ordinal attribute: distinctness & order Interval attribute: distinctness, order & addition Ratio attribute: all 4 properties Luke Huan Univ. of Kansas 14 Properties of Attribute Values Attribute Type Description Examples Nominal The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ) zip codes, employee ID numbers, eye color, sex: {male, female} Ordinal The values of an ordinal attribute provide enough information to order objects. (<, >) hardness of minerals, {good, better, best}, grades, street numbers Interval For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit For ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current Ratio 5/22/2017 Luke Huan Univ. of Kansas 15 Discrete and Continuous Attributes Discrete Attribute Has only a finite or countablely infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents Often represented as integer variables. Note: binary attributes are a special case of discrete attributes Continuous Attribute 5/22/2017 Has real numbers as attribute values Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables. Luke Huan Univ. of Kansas 16 Structured vs Unstructured Data Structured Data Semi-structured data Data in a relational database Graphs, trees, sequencs Un-structured data Image, text 5/22/2017 Luke Huan Univ. of Kansas 17 Important Characteristics Data Dimensionality Curse of Dimensionality Sparsity Only presence counts Resolution 5/22/2017 Patterns depend on the scale Luke Huan Univ. of Kansas 18 Record Data Data that consists of a collection of records, each of which consists of a fixed set of attributes Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K 10 5/22/2017 Luke Huan Univ. of Kansas 19 Data Matrix If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute Projection of x Load 5/22/2017 Projection of y load Distance Load Thickness 10.23 5.27 15.22 2.7 1.2 12.65 6.25 16.22 2.2 1.1 Luke Huan Univ. of Kansas 20 Document Data Each document becomes a `term' vector, team coach pla y ball score game wi n lost timeout season 5/22/2017 each term is a component (attribute) of the vector, the value of each component is the number of times the corresponding term occurs in the document. Document 1 3 0 5 0 2 6 0 2 0 2 Document 2 0 7 0 2 1 0 0 3 0 0 Document 3 0 1 0 0 1 2 2 0 3 0 Luke Huan Univ. of Kansas 21 Transaction Data A special type of record data, where 5/22/2017 each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. TID Items 1 Bread, Coke, Milk 2 3 4 5 Beer, Bread Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Diaper, Milk Luke Huan Univ. of Kansas 22 Data Quality What kinds of data quality problems? How can we detect problems with the data? What can we do about these problems? Examples of data quality problems: 5/22/2017 Noise and outliers missing and duplicated data Luke Huan Univ. of Kansas 23 Noise Noise refers to modification of original values Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen Two Sine Waves 5/22/2017 Two Sine Waves + Noise Luke Huan Univ. of Kansas 24 Mapping Data to a New Space Fourier transform Wavelet transform Two Sine Waves 5/22/2017 Two Sine Waves + Noise Luke Huan Univ. of Kansas Frequency 25 Outliers Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set One person’s outlier can be another one’s treasure!! 5/22/2017 Luke Huan Univ. of Kansas 26 Missing Values Reasons for missing values Information is not collected (e.g., people decline to give their age and weight) Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Handling missing values 5/22/2017 Eliminate Data Objects Estimate Missing Values Ignore the Missing Value During Analysis Replace with all possible values (weighted by their probabilities) Luke Huan Univ. of Kansas 27 Duplicate Data Data set may include data objects that are duplicates, or almost duplicates of one another Examples: Major issue when merging data from heterogeous sources Same person with multiple email addresses Data cleaning 5/22/2017 Process of dealing with duplicate data issues Luke Huan Univ. of Kansas 28 EDA: Exploratory Data Analysis Histogram Box plot Scatter plot Correlation 5/22/2017 Luke Huan Univ. of Kansas 29 Visualization Techniques: Histograms Histogram Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin. The height of each bar indicates the number of objects Shape of histogram depends on the number of bins Example: Petal Width (10 and 20 bins, respectively) 5/22/2017 Luke Huan Univ. of Kansas 30 Two-Dimensional Histograms Show the joint distribution of the values of two attributes Example: petal width and petal length 5/22/2017 What does this tell us? Luke Huan Univ. of Kansas 31 Visualization Techniques: Box Plots Box Plots Invented by J. Tukey Another way of displaying the distribution of data Following figure shows the basic part of a box plot outlier 10th percentile 75th percentile 50th percentile 25th percentile 10th percentile 5/22/2017 Luke Huan Univ. of Kansas 32 Example of Box Plots Box plots can be used to compare attributes 5/22/2017 Luke Huan Univ. of Kansas 33 Scatter Plot Array of Iris Attributes 5/22/2017 Luke Huan Univ. of Kansas 34 Correlation Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, p and q, and then take their dot product pk ( pk mean( p)) / std ( p) qk (qk mean(q)) / std (q) correlation( p, q) p q 5/22/2017 Luke Huan Univ. of Kansas 35 Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1. 5/22/2017 Luke Huan Univ. of Kansas 36 Discover Association Rules Apriori Algorithm 5/22/2017 Luke Huan Univ. of Kansas 37 Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke 5/22/2017 {Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk}, Implication means co-occurrence, not causality! Luke Huan Univ. of Kansas 38 Definition: Frequent Itemset Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count () Frequency of occurrence of an itemset E.g. ({Milk, Bread,Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold 5/22/2017 Luke Huan Univ. of Kansas TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke 39 Definition: Association Rule Association Rule An implication expression of the form X Y, where X and Y are itemsets Example: {Milk, Diaper} {Beer} TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Rule Evaluation Metrics Support (s) Example: Fraction of transactions that {Milk , Diaper } Beer contain both X and Y Confidence (c) (Milk, Diaper, Beer ) 2 s 0.4 Measures how often items in Y |T| 5 appear in transactions that (Milk, Diaper, Beer ) 2 contain X c 0.67 (Milk, Diaper ) 5/22/2017 Luke Huan Univ. of Kansas 3 40 Mining Association Rules TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Example of Rules: {Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5) Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements 5/22/2017 Luke Huan Univ. of Kansas 41 An Exercise Transaction-id Items bought 100 f, a, c, d, g, I, m, p 200 a, b, c, f, l,m, o 300 b, f, h, j, o 400 b, c, k, s, p 500 a, f, c, e, l, p, m, n The support value of pattern Transaction database TDB {acm} is Sup(acm)=3 The support of pattern {ac} is Sup(ac)=3 Given min_sup=3, acm is The confidence of the rule: {ac} => {m} is 5/22/2017 Frequent 100% Luke Huan Univ. of Kansas 42 Mining Association Rules Two-step approach: 1. Frequent Itemset Generation – 2. Rule Generation – Generate all itemsets whose support minsup Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive 5/22/2017 Luke Huan Univ. of Kansas 43 Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE ABCDE 5/22/2017 Luke Huan Univ. of Kansas BCDE Given d items, there are 2d possible candidate itemsets 44 Apriori Algorithm A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994) Data base D TID Items 10 a, c, d 20 b, c, e 30 40 1-candidates Freq 1-itemsets 2-candidates Itemset Sup Itemset Sup Itemset a 2 a 2 ab b 3 b 3 ac a, b, c, e c 3 c 3 ae b, e d 1 e 3 bc e 3 Scan D Min_sup=2 3-candidates Freq 2-itemsets Counting Itemset Itemset Sup Itemset Sup bce ac 2 ab 1 bc 2 ac 2 be 3 ae 1 ce 2 bc 2 be 3 ce 2 Scan D Freq 3-itemsets 5/22/2017 be Itemset Sup bce 2 Luke Huan Univ. of Kansas ce Scan D 45 Summary Nature of the data Data types: SSN Grade Temperature (degree) Length Nominal Ordinal Interval Ratio Data Quality Noise Outlier Missing/duplicated data 5/22/2017 Luke Huan Univ. of Kansas 46 Summary Common tools for exploratory data analysis Histogram Box plot Scatter plot Correlation Association Each rule: L => R has two parts: L, the left hand item set and R the right hand item set Each rule is measured by two parameters: 5/22/2017 Support Confidence Luke Huan Univ. of Kansas 47