Data Mining CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)

Review: Information Retrieval
– Similarity measures
– Evaluation metrics: precision and recall
– Question answering
– Web search engines: an application of IR, related to web mining

Data Mining Techniques Outline
Goal: provide an overview of basic data mining techniques.
– Statistical
» Point estimation
» Models based on summarization
» Bayes theorem
» Hypothesis testing
» Regression and correlation
– Similarity measures

Point Estimation
A point estimate estimates a population parameter. It may be made by calculating the parameter for a sample, and may be used to predict a value for missing data. Example:
– R contains 100 employees
– 99 have salary information
– The mean salary of these is $50,000
– Use $50,000 as the value of the remaining employee's salary. Is this a good idea?

Estimation Error
Bias: the difference between the expected value and the actual value, $\mathrm{Bias} = E[\hat\theta] - \theta$.
Mean squared error (MSE): the expected value of the squared difference between the estimate and the actual value, $\mathrm{MSE} = E[(\hat\theta - \theta)^2]$.
Root mean squared error: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$.

Jackknife Estimate
A jackknife estimate of a parameter is obtained by repeatedly omitting one value (or group of values) from the set of observed values. Named to describe a "handy and useful tool." Used to reduce bias.
Property: the jackknife estimator lowers the bias from order $1/n$ to order $1/n^2$.

Jackknife Estimate: Definition
– Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
– Let $\hat\theta_{(j)}$ be the estimate obtained by ignoring the jth group.
– Let $\bar\theta = \frac{1}{g}\sum_{j=1}^{g}\hat\theta_{(j)}$ be the average of the $\hat\theta_{(j)}$.
– The jackknife estimator is $\hat\theta_Q = g\hat\theta - (g-1)\bar\theta$, where $\hat\theta$ is the estimator for the parameter $\theta$ computed from the full sample.

Jackknife Estimator: Example 1
Estimate of the mean for X = {x1, x2, x3}, with n = 3, g = 3, m = 1:
$\hat\theta = \bar{x} = (x_1 + x_2 + x_3)/3$
$\hat\theta_{(1)} = (x_2 + x_3)/2$, $\hat\theta_{(2)} = (x_1 + x_3)/2$, $\hat\theta_{(3)} = (x_1 + x_2)/2$
$\bar\theta = (\hat\theta_{(1)} + \hat\theta_{(2)} + \hat\theta_{(3)})/3 = (x_1 + x_2 + x_3)/3$
$\hat\theta_Q = g\hat\theta - (g-1)\bar\theta = 3\hat\theta - 2\bar\theta = (x_1 + x_2 + x_3)/3$
In this case the jackknife estimator is the same as the usual estimator.

Jackknife Estimator: Example 2
Estimate of the variance for X = {1, 4, 4}, with n = 3, g = 3, m = 1, using the biased estimator $\hat\sigma^2 = \sum_i (x_i - \bar{x})^2 / n$:
$\hat\theta = \hat\sigma^2 = ((1-3)^2 + (4-3)^2 + (4-3)^2)/3 = 2$
$\hat\theta_{(1)} = ((4-4)^2 + (4-4)^2)/2 = 0$, $\hat\theta_{(2)} = 2.25$, $\hat\theta_{(3)} = 2.25$
$\bar\theta = (0 + 2.25 + 2.25)/3 = 1.5$
$\hat\theta_Q = g\hat\theta - (g-1)\bar\theta = 3(2) - 2(1.5) = 3$
In this case the jackknife estimator differs from the usual estimator (see the jackknife sketch at the end of this part).

Jackknife Estimator: Example 2 (cont'd)
In general, applying the jackknife technique to the biased estimator $\hat\sigma^2 = \sum_i (x_i - \bar{x})^2 / n$ yields $s^2 = \sum_i (x_i - \bar{x})^2 / (n-1)$, which is known to be unbiased for $\sigma^2$.

Maximum Likelihood Estimate (MLE)
Obtain the parameter estimates that maximize the probability that the sample data occurs under the specific model. The joint probability of observing the sample data is formed by multiplying the individual probabilities; this is the likelihood function:
$L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$
Maximize L.

MLE Example
A coin is tossed five times: {H, H, H, H, T}. Assuming a fair coin with H and T equally likely, the likelihood of this sequence is $(1/2)^5 = 0.03125$. However, if the probability of an H is 0.8, then the likelihood is $(0.8)^4(0.2) = 0.08192$.

MLE Example (cont'd)
General likelihood formula, with $x_i = 1$ for H and $x_i = 0$ for T:
$L(p \mid x_1, \ldots, x_n) = p^{\sum x_i}(1-p)^{n - \sum x_i}$
Maximizing over p gives the estimate $\hat{p} = \sum x_i / n = 4/5 = 0.8$ (see the MLE sketch at the end of this part).

Expectation-Maximization (EM)
Solves estimation problems with incomplete data:
– Obtain initial estimates for the parameters.
– Iteratively use the estimates to fill in the missing data, re-estimate, and continue until convergence (see the EM sketch at the end of this part).

EM Example / EM Algorithm
(These two slides were a worked figure and an algorithm listing that did not survive this transcript.)
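To make the jackknife procedure concrete, here is a minimal Python sketch (ours, not the slides'; the function names are our own) that reproduces Example 2 for the m = 1, g = n case:

```python
def jackknife(estimator, xs):
    """theta_Q = g*theta_hat - (g-1)*theta_bar, with m = 1 and g = n."""
    g = len(xs)
    theta_hat = estimator(xs)              # estimate from the full sample
    loo = [estimator(xs[:j] + xs[j + 1:])  # estimate with the jth value omitted
           for j in range(g)]
    theta_bar = sum(loo) / g               # average of the leave-one-out estimates
    return g * theta_hat - (g - 1) * theta_bar

def biased_variance(xs):
    """The biased estimator sum((x - mean)^2) / n from Example 2."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

print(biased_variance([1, 4, 4]))             # 2.0, the usual (biased) estimate
print(jackknife(biased_variance, [1, 4, 4]))  # 3.0, matching Example 2
```

The jackknifed value 3.0 equals the unbiased $s^2 = 6/2 = 3$ from the cont'd slide.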
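Similarly, a small sketch (again ours) checks the coin-toss likelihoods and recovers the MLE $\hat{p} = 0.8$ by a grid search:

```python
tosses = [1, 1, 1, 1, 0]  # {H, H, H, H, T}, encoded as 1 = H, 0 = T

def likelihood(p, xs):
    """L(p) = p^(#heads) * (1 - p)^(#tails)."""
    heads = sum(xs)
    return p ** heads * (1 - p) ** (len(xs) - heads)

print(likelihood(0.5, tosses))  # 0.03125, the fair-coin likelihood
print(likelihood(0.8, tosses))  # 0.08192, higher: p = 0.8 fits better

# A coarse grid search confirms the closed-form MLE, p_hat = 4/5:
p_hat = max((i / 100 for i in range(101)), key=lambda p: likelihood(p, tosses))
print(p_hat)                    # 0.8
```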
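The EM example itself did not survive, so as a stand-in here is a minimal sketch assuming the classic setting of estimating a mean when some observations are missing; the data values are made up and echo the salary example from the point-estimation slide:

```python
def em_mean(observed, n_missing, mu=0.0, tol=1e-9):
    """EM for the mean of a sample with n_missing unobserved values.
    E-step: impute each missing value with the current estimate mu.
    M-step: re-estimate mu from the observed plus imputed values."""
    n = len(observed) + n_missing
    while True:
        new_mu = (sum(observed) + n_missing * mu) / n
        if abs(new_mu - mu) < tol:  # converged
            return new_mu
        mu = new_mu

# One missing salary among four employees; EM converges to the observed mean.
print(em_mean([45000, 50000, 55000], n_missing=1))  # ~50000.0
```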
Models Based on Summarization
Basic concepts provide an abstraction and summarization of the data as a whole:
– Statistical concepts: mean, variance, median, mode, etc.
Visualization displays the structure of the data graphically:
– Line graphs, pie charts, histograms, scatter plots, hierarchical graphs

Scatter Diagram
(This slide was an example scatter plot; the figure did not survive.)

Bayes Theorem
Posterior probability: P(h1 | xi). Prior probability: P(h1).
Bayes theorem assigns probabilities to hypotheses given a data value:
$P(h_j \mid x_i) = \frac{P(x_i \mid h_j)\, P(h_j)}{P(x_i)}$

Bayes Theorem Example
Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
Assign twelve data values, one for each combination of credit and income:

Credit \ Income    1     2     3     4
Excellent          x1    x2    x3    x4
Good               x5    x6    x7    x8
Bad                x9    x10   x11   x12

From the training data: P(h1) = 60%, P(h2) = 20%, P(h3) = 10%, P(h4) = 10%.

Bayes Example (cont'd)
Training data:

ID   Income   Credit      Class   xi
1    4        Excellent   h1      x4
2    3        Good        h1      x7
3    2        Excellent   h1      x2
4    3        Good        h1      x7
5    4        Good        h1      x8
6    2        Excellent   h1      x2
7    3        Bad         h2      x11
8    2        Bad         h2      x10
9    3        Bad         h3      x11
10   1        Bad         h4      x9

Bayes Example (cont'd)
Calculate P(xi | hj) and P(xi) from the table. For example: P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6, and P(xi|h1) = 0 for all other xi.
To predict the class for x4:
– Calculate P(hj | x4) for all hj and place x4 in the class with the largest value.
– For example: P(h1|x4) = P(x4|h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1, so x4 goes in class h1 (see the Bayes sketch at the end of this part).

Hypothesis Testing
Find a model to explain behavior by creating and then testing a hypothesis about the data; this is the exact opposite of the usual DM approach.
– H0: the null hypothesis, the hypothesis to be tested.
– H1: the alternative hypothesis.

Chi-Square Test
One technique for performing hypothesis testing. Used to test the association between two observed variables and to determine whether a set of observed values is statistically different from the expected ones. The chi-squared statistic is defined as
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where O is an observed value and E is the expected value based on the hypothesis.

Chi-Square Test Example
Given the average scores of five schools, determine whether the differences are statistically significant.
– O = {50, 93, 67, 78, 87}, E = 75
– $\chi^2 = 15.55$, and therefore significant.
Examining a chi-squared significance table: with 4 degrees of freedom and a 5% significance level, the critical value is 9.488. Since 15.55 > 9.488, the variation between the schools' scores and the expected value cannot be attributed to pure chance (see the chi-square sketch at the end of this part).

Regression
Predict future values based on past values by fitting a set of points to a curve. Linear regression assumes that a linear relationship exists:
$y = c_0 + c_1 x_1 + \cdots + c_n x_n$
– n input variables, called regressors or predictors
– one output variable, called the response
– n + 1 constants, chosen during the modeling process to match the input examples

Linear Regression with One Input Value
$y = c_0 + c_1 x_1$ (see the regression sketch at the end of this part).

Correlation
Examine the degree to which the values of two variables behave similarly, using the correlation coefficient r:
– 1 = perfect correlation
– -1 = perfect but opposite correlation
– 0 = no correlation
$r = \frac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_i (x_i - \bar{X})^2}\sqrt{\sum_i (y_i - \bar{Y})^2}}$
where $\bar{X}$, $\bar{Y}$ are the means of X and Y respectively.
– Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1). r = ?
– Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10). r = ?
(The correlation sketch at the end of this part computes both.)
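A minimal sketch (ours, not the slides') that reproduces the x4 prediction by estimating every probability as a count over the ten training records above:

```python
from collections import Counter

# (class, cell) pairs from the ten training records
training = [("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"),
            ("h1", "x8"), ("h1", "x2"), ("h2", "x11"), ("h2", "x10"),
            ("h3", "x11"), ("h4", "x9")]

class_counts = Counter(h for h, _ in training)  # P(hj) numerators
cell_counts = Counter(x for _, x in training)   # P(xi) numerators
joint_counts = Counter(training)                # P(xi and hj) numerators
n = len(training)

def posterior(h, x):
    """P(h|x) = P(x|h) P(h) / P(x), all estimated from the training data."""
    p_x_given_h = joint_counts[(h, x)] / class_counts[h]
    p_h = class_counts[h] / n
    p_x = cell_counts[x] / n
    return p_x_given_h * p_h / p_x

best = max(class_counts, key=lambda h: posterior(h, "x4"))
print(best, posterior("h1", "x4"))  # h1 1.0, matching the slide
```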
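Likewise for the school-scores example; the statistic is computed directly from its definition, and the critical value 9.488 is taken from the slide's chi-squared table rather than computed:

```python
observed = [50, 93, 67, 78, 87]
expected = 75  # hypothesized common average for every school

# chi^2 = sum of (O - E)^2 / E over the observations
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))  # 15.55

# Critical value for 4 degrees of freedom at the 5% significance level,
# read from a chi-squared table as on the slide:
critical = 9.488
print(chi2 > critical)  # True -> reject H0: the differences are significant
```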
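The one-input regression slide survives only as a title, so this sketch fits $y = c_0 + c_1 x$ by ordinary least squares on made-up points (the data are our assumption, not the slides'):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    c0 = my - c1 * mx
    return c0, c1

# Hypothetical past values; the fit recovers y = 1 + 2x exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
c0, c1 = fit_line(xs, ys)
print(c0, c1)       # 1.0 2.0
print(c0 + c1 * 6)  # predict a future value: 13.0
```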
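Finally, a sketch of the correlation coefficient that answers the two r = ? questions:

```python
def pearson_r(xs, ys):
    """r = sum((x - X̄)(y - Ȳ)) / sqrt(sum((x - X̄)^2) * sum((y - Ȳ)^2))."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

print(pearson_r([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]))   # -1.0: perfect but opposite
print(pearson_r([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  #  1.0: perfect correlation
```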
Similarity Measures
Determine the similarity between two objects. Alternatively, a distance measure measures how unlike or dissimilar objects are. (The list of similarity characteristics and the measure formulas on these slides did not survive this transcript.)

Distance Measures
Measure dissimilarity between objects (see the sketch below for two common examples).

Next Lecture: Data Mining Techniques (II)
– Decision trees, neural networks, and genetic algorithms
Reading assignment: Chapter 3
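Since the slides' specific formulas were lost, here is a sketch of two widely used choices, cosine similarity and Euclidean distance; this selection is our assumption, not necessarily what the original slides showed:

```python
import math

def cosine_similarity(a, b):
    """sim(a, b) = (a . b) / (|a| * |b|); 1 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """dist(a, b) = sqrt(sum((a_i - b_i)^2)); 0 means identical objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 3.0, 5.0], [2.0, 4.0, 6.0]
print(cosine_similarity(a, b))   # ~0.994: nearly identical direction
print(euclidean_distance(a, b))  # ~1.732: small dissimilarity
```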