CSE4412 & CSE6412 3.0 Data Mining
Instructor: Nick Cercone – 3050 LAS – [email protected]
Tuesdays, Thursdays 1:00-2:20 – Lassonde 3033
Fall Semester, 2014
_____________________________________________________________________________________________
THE SMALL ASSIGNMENT (answers)
_____________________________________________________________________________________________

1. [Inductive Sets] Intuitively, the natural numbers begin with the number 1, and then there is 2, then 3, then 4, and so on. Does this process of “starting with 1” and “adding 1 repeatedly” result in all the natural numbers? We will use the concept of an inductive set to explore this idea.

Definition. A set T that is a subset of Z is an inductive set provided that for each integer k, if k ∈ T, then k + 1 ∈ T.

Consider what it means to say that a subset T of the integers Z is not an inductive set. Suppose that T is an inductive subset of the integers. Which of the following statements are true, which are false, and for which ones is it not possible to tell?

a) 1 ∈ T and 5 ∈ T. It is not possible to tell if 1 ∈ T and 5 ∈ T.
b) If 1 ∈ T, then 5 ∈ T. True.
c) If 5 ∉ T, then 2 ∉ T. True. The contrapositive is, “If 2 ∈ T, then 5 ∈ T,” which is true.
d) For each integer k, if k ∈ T, then k + 7 ∈ T. True.
e) For each integer k, k ∉ T or k + 1 ∈ T. True, since “k ∉ T or k + 1 ∈ T” is logically equivalent to “If k ∈ T, then k + 1 ∈ T.”
f) There exists an integer k such that k ∈ T and k + 1 ∉ T. False. If k ∈ T, then k + 1 ∈ T.
g) For each integer k, if k + 1 ∈ T, then k ∈ T. It is not possible to tell if this is true. It is the converse of the conditional statement, “For each integer k, if k ∈ T, then k + 1 ∈ T.”
h) For each integer k, if k + 1 ∉ T, then k ∉ T. True. This is the contrapositive of the conditional statement, “For each integer k, if k ∈ T, then k + 1 ∈ T.”
i) Prove the following by mathematical induction: For each natural number n,

$$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}.$$

We will use a proof by mathematical induction. For each natural number n, we let P(n) be

$$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}.$$

We first prove that P(1) is true. Notice that

$$\frac{1(1+1)(2 \cdot 1+1)}{6} = 1.$$

This shows that

$$1^2 = \frac{1(1+1)(2 \cdot 1+1)}{6},$$

which proves that P(1) is true.

For the inductive step, we prove that for each k ∈ N, if P(k) is true, then P(k+1) is true. So let k be a natural number and assume that P(k) is true. That is, assume that

$$1^2 + 2^2 + \cdots + k^2 = \frac{k(k+1)(2k+1)}{6}. \tag{1}$$

The goal now is to prove that P(k+1) is true. That is, it must be proved that

$$1^2 + 2^2 + \cdots + k^2 + (k+1)^2 = \frac{(k+1)[(k+1)+1][2(k+1)+1]}{6} = \frac{(k+1)(k+2)(2k+3)}{6}. \tag{2}$$

To do this, we add $(k+1)^2$ to both sides of equation (1) and algebraically rewrite the right side of the resulting equation. This gives

$$
\begin{aligned}
1^2 + 2^2 + \cdots + k^2 + (k+1)^2 &= \frac{k(k+1)(2k+1)}{6} + (k+1)^2 \\
&= \frac{k(k+1)(2k+1) + 6(k+1)^2}{6} \\
&= \frac{(k+1)[k(2k+1) + 6(k+1)]}{6} \\
&= \frac{(k+1)(2k^2 + 7k + 6)}{6} \\
&= \frac{(k+1)(k+2)(2k+3)}{6}.
\end{aligned}
$$

Comparing this result to equation (2), we see that if P(k) is true, then P(k+1) is true. Hence, the inductive step has been established, and by the Principle of Mathematical Induction, we have proved that for each natural number n,

$$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}.$$

This proof shows a standard way to write an induction proof. When writing a proof by mathematical induction, we should follow the guideline that we always keep the reader informed. This means that at the beginning of the proof, we should state that a proof by induction will be used. We should then clearly define the open sentence P(n) that will be used in the proof.

2. [Logic & Sets] Determine the elements of the set $A = \{x \mid x^2 = 11x - 30 \text{ or } 4 - x > 0\}$ when the universal set U is:

a) the set of real numbers: A consists of all real numbers less than 4, together with the numbers 5 and 6.
b) the set of rational numbers: A consists of all rational numbers less than 4, together with the numbers 5 and 6.
c) the set of integers: A consists of all integers less than 4, together with the numbers 5 and 6.
d) the set of positive integers: A = {1, 2, 3, 5, 6} (include 0 if you consider this positive).
e) the set of negative integers: A = U.
f) the set of odd integers: A consists of all odd integers less than or equal to 5.
g) the set of even integers: A consists of all even integers less than or equal to 6, except 4.
h) the set of integers greater than 10: A = ∅.

2a. For each of the following, determine the relative complement A – B.

a) A = {1, 4, 7, 10}, B = {1, 2, 5}: A – B = {4, 7, 10}
b) A = {1, 2, 5}, B = {1, 4, 7, 10}: A – B = {2, 5}
c) A = {a, e, i}, B = {e, a, i}: A – B = ∅
d) A = {a, ), 17}, B = {)}: A – B = {a, 17}
e) A = {1, 5, 6, a}, B = ∅: A – B = A
f) A = {2, 4}, B = {1, 2, 4, 7}: A – B = ∅
g) A = ∅, B = {a, b, 7}: A – B = ∅

2b. Determine all subsets of the set A = {0, a, #, 2}.

∅, {0}, {a}, {#}, {2}, {0, a}, {0, #}, {0, 2}, {a, #}, {a, 2}, {#, 2}, {0, a, #}, {0, a, 2}, {0, #, 2}, {a, #, 2}, A

3. [Probability] (a) The psychologist Tversky and his colleagues say that about four out of five people will answer (a) to the following question:

A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital 15 babies are born each day. Although the overall proportion of boys is about 50 percent, the actual proportion at either hospital may be more or less than 50 percent on any day. At the end of a year, which hospital will have the greater number of days on which more than 60 percent of the babies born were boys?
(a) the large hospital
(b) the small hospital
(c) neither; the number of days will be about the same.

Assume that the probability that a baby is a boy is .5 (actual estimates make this more like .513). Decide, by simulation (in any language), what the right answer is to the question.
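One way to carry out the simulation asked for above is sketched here in Python. This is a minimal sketch, not a prescribed solution: the function names, the default of 365 days, the integer `threshold_percent` parameter, and the averaging over several simulated years are all assumptions made for illustration. An exact binomial calculation is included as a cross-check on the simulated counts.

```python
import random
from math import comb


def days_over_threshold(births_per_day, days=365, p_boy=0.5,
                        threshold_percent=60, rng=None):
    """Count simulated days on which more than threshold_percent of births are boys."""
    rng = rng or random.Random()
    count = 0
    for _ in range(days):
        boys = sum(rng.random() < p_boy for _ in range(births_per_day))
        # Integer comparison avoids floating-point trouble at exactly 60 percent.
        if 100 * boys > threshold_percent * births_per_day:
            count += 1
    return count


def average_days(births_per_day, years=100, seed=0):
    """Average the yearly count over several simulated years to smooth out noise."""
    rng = random.Random(seed)
    total = sum(days_over_threshold(births_per_day, rng=rng) for _ in range(years))
    return total / years


def exact_daily_probability(births_per_day, p_boy=0.5, threshold_percent=60):
    """Exact binomial probability that more than threshold_percent of births are boys."""
    n = births_per_day
    return sum(
        comb(n, b) * p_boy ** b * (1 - p_boy) ** (n - b)
        for b in range(n + 1)
        if 100 * b > threshold_percent * n
    )


if __name__ == "__main__":
    for name, births in (("large", 45), ("small", 15)):
        simulated = average_days(births, years=20)
        expected = 365 * exact_daily_probability(births)
        print(f"{name} hospital: simulated {simulated:.1f} days, exact {expected:.1f} days")
```

A single simulated year is noisy, which is why the sketch averages over several years; the comparison between the two hospitals is the quantity of interest.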
Can you suggest why so many people go wrong?

Your simulation should result in about 25 days in a year having more than 60 percent boys in the large hospital and about 55 days in a year having more than 60 percent boys in the small hospital.

(b) Tversky and Kahneman asked a group of subjects to carry out the following. They are told that:

Linda is 31, single, outspoken, and very bright. She majored in philosophy in college. As a student, she was deeply concerned with racial discrimination and other social issues, and participated in anti-nuclear demonstrations.

The subjects are then asked to rank the likelihood of various alternatives, such as:
(1) Linda is active in the feminist movement.
(2) Linda is a bank teller.
(3) Linda is a bank teller and active in the feminist movement.

Tversky and Kahneman found that between 85 and 90 percent of the subjects rated alternative (1) most likely, but alternative (3) more likely than alternative (2). Is it? They call this phenomenon the conjunction fallacy, and note that it appears to be unaffected by prior training in probability or statistics. Is this phenomenon a fallacy? If so, why? Can you give a possible explanation for the subjects' choices?

They call it a fallacy because if the subjects are thinking about probabilities they should realize that P(Linda is bank teller and in feminist movement) ≤ P(Linda is bank teller). One explanation is that the subjects are not thinking about probability as a measure of likelihood.

(c) The following is a variation on the Linda problem. The registrar is carrying John and Mary's registration cards and drops them in a puddle. When he picks them up he cannot read the names, but on the first card he picked up he can make out Mathematics 23 and Government 35, and on the second card he can make out only Mathematics 23. He asks you if you can help him decide which card belongs to Mary. You know that Mary likes government but does not like mathematics.
You know nothing about John and assume that he is just a typical student. From this you estimate:

P(Mary takes Government 35) = .5
P(Mary takes Mathematics 23) = .1
P(John takes Government 35) = .3
P(John takes Mathematics 23) = .2

Assume that their choices for courses are independent events. Show that the card with Mathematics 23 and Government 35 showing is more likely to be Mary's than John's. The conjunction fallacy referred to in the Linda problem would be to assume that the event “Mary takes Mathematics 23 and Government 35” is more likely than the event “Mary takes Mathematics 23.” Why are we not making this fallacy here?

We assume that John and Mary sign up for two courses. Their cards are dropped, one of the cards gets stepped on, and only one course can be read on this card. Call card I the card that was not stepped on and on which the registrar can read Government 35 and Mathematics 23; call card II the card that was stepped on and on which he can just read Mathematics 23. There are four possibilities for these two cards. They are:

Card I            Card II             Prob.    Cond. Prob.
Mary(gov,math)    John(gov,math)      .0015    .224
Mary(gov,math)    John(other,math)    .0025    .373
John(gov,math)    Mary(gov,math)      .0015    .224
John(gov,math)    Mary(other,math)    .0012    .179

In the third column we have written the probability that each case will occur. For example, for the first one we compute the probability that the students will take the appropriate courses, .5 × .1 × .3 × .2 = .0030, and then we multiply by 1/2, the probability that it was John's card that was stepped on. Now to get the conditional probabilities we must renormalize these probabilities so that they add up to one. In this way we obtain the results in the last column. From this we see that the probability that card I is Mary's is .597 and that card I is John's is .403, so it is more likely that the card on which the registrar sees Mathematics 23 and Government 35 is Mary's.

4. [Logic] Business Trip: Business was bad (and getting worse) for the Fastanloose Finance Company, so four of their best regional salespersons were sent out on business trips to look for new opportunities. Where is each salesperson's base, what was his or her destination, and at which hotel did he or she stay?

   1. The salesperson who stayed at the Pitts hotel in Buffalo isn't based in Baltimore.
   2. The salesperson based in Nashville went to Milwaukee but didn't stay at Dumpster's hotel.
   3. Dick is based in Charleston. He didn't travel to Buffalo or Dallas, nor did he stay at Dedlegg's hotel or Dumpster's.
   4. Sharon isn't based in Nashville and neither she nor Tom went to Dallas.

[Solving grid: the salespersons (Dick, Harry, Sharon, Tom) against base (Baltimore, Charleston, Nashville, Richmond), destination (Buffalo, Dallas, Milwaukee, Saint Louis), and hotel (Dedlegg's, Dumpster's, Slummer's, The Pitts).]

Salesperson    Base          Destination    Hotel
Dick           Charleston    Saint Louis    Slummer's
Harry          Baltimore     Dallas         Dumpster's
Sharon         Richmond      Buffalo        The Pitts
Tom            Nashville     Milwaukee      Dedlegg's

The salesperson based in Nashville went to Milwaukee (clue 2). Dick is based in Charleston (clue 3) and didn't go to Buffalo or Dallas, so he went to Saint Louis. The Pitts is in Buffalo (clue 1). Dick didn't stay at Dedlegg's or Dumpster's (clue 3), so he stayed at Slummer's. The salesperson who went to Dallas isn't Sharon or Tom (clue 4), so it was Harry. The one based in Nashville who went to Milwaukee (clue 2) isn't Sharon (clue 4), so it is Tom. He didn't stay at Dumpster's (clue 2), so he stayed at Dedlegg's. By elimination Sharon went to Buffalo. She isn't based in Baltimore (clue 1), so she is based in Richmond. Thus Harry is based in Baltimore and stayed at Dumpster's.

5. [Information Retrieval] (a) What is the bag-of-words representation of the sentence “to be or not to be”?
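The representation asked for here can be worked out mechanically. The following is a minimal Python sketch; the helper names `bag_of_words` and `euclidean_distance` are assumptions made for illustration, and splitting on whitespace is a deliberately naive tokenizer. The distance helper treats any word missing from one bag as having count zero, which is what comparing a sentence with a one-word query such as “be” requires.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Map each distinct word in `text` to the number of times it occurs."""
    return Counter(text.lower().split())


def euclidean_distance(bag_a, bag_b):
    """Euclidean distance between two bags; absent words count as zero."""
    vocabulary = set(bag_a) | set(bag_b)
    return sqrt(sum((bag_a[word] - bag_b[word]) ** 2 for word in vocabulary))


if __name__ == "__main__":
    sentence = bag_of_words("to be or not to be")
    query = bag_of_words("be")
    print(dict(sentence))
    print(euclidean_distance(sentence, query))
```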
A vector with one component for each word in our dictionary, all of them zero except for the following:

be    or    not    to
2     1     1      2

This is the form given by the R call table(c("to", "be", "or", "not", "to", "be")).

(b) Suppose we search for the above sentence via the keyword “be”. What is the bag-of-words representation for this query, and what is the Euclidean distance from the sentence?

A vector whose only non-zero component is that for “be”, where the count is 1. The Euclidean distance is

$$\sqrt{(2-1)^2 + (1-0)^2 + (1-0)^2 + (2-0)^2} = \sqrt{7} \approx 2.65.$$

(c) Describe how weighting words by inverse-document-frequency (IDF) should help when making a Web query for “The Principles of Data Mining”.

It keeps us from wasting time on words like “the” and “of”, and emphasizes the less common, more informative words “principles”, “data” and “mining”; something titled “Data Mining Principles” is a good match.

(d) Describe a simple text search that could not be carried out effectively using a bag-of-words representation (no matter what distance measure is used). “Simple” means no high-level understanding of English is required.

There are many, but a search for the exact phrase “to be or not to be” is impossible, because a bag of words discards word order.

6. [k-means Clustering] (a) Explain what the k-means clustering algorithm is. Do not write code, but give a precise verbal description which someone could turn into code.

(i) Start with a set of n vectors x1, x2, ..., xn.
(ii) Assign each vector at random to one of the k clusters.
(iii) For each cluster, compute the mean of the vectors belonging to that cluster.
(iv) For each vector, assign it to the cluster whose mean is closest to it. (Do not recompute the means while these assignments are being made.)
(v) If any vectors have changed their cluster assignments, go back to step (iii); if not, stop.

(b) Can k-means ever give results that contain more or less than k clusters?

No. It can never give more clusters, since at every stage every point is assigned to one of k clusters.
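The procedure in part (a) can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the function names, the `seed` and `max_iterations` parameters, and in particular the re-seeding of a cluster that happens to be empty are assumptions added for illustration and are not part of the verbal description. Every label it produces lies in the range 0 to k − 1, which is the observation that rules out more than k clusters.

```python
import random


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, seed=None, max_iterations=100):
    """Run k-means on a list of vectors; returns (labels, means)."""
    rng = random.Random(seed)
    # Step (ii): assign each vector at random to one of the k clusters.
    labels = [rng.randrange(k) for _ in vectors]
    means = []
    for _ in range(max_iterations):
        # Step (iii): compute the mean of each cluster.
        means = []
        for cluster in range(k):
            members = [v for v, label in zip(vectors, labels) if label == cluster]
            if members:
                means.append(mean(members))
            else:
                # Assumption (not in the description): an empty cluster is
                # re-seeded at a randomly chosen data point.
                means.append(list(rng.choice(vectors)))
        # Step (iv): reassign every vector using the means fixed above.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, means[c]))
            for v in vectors
        ]
        # Step (v): stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, means


if __name__ == "__main__":
    points = [(0.0,), (1.0,), (100.0,), (101.0,)]
    print(k_means(points, k=2, seed=0))
```

The means are held fixed during step (iv) by building the complete list `new_labels` before anything is recomputed, which mirrors the parenthetical remark in the description.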
To give fewer than k clusters, there would have to be a cluster which got no points at one of the re-assignment stages. This means that its centre would be farther from every point than one of the other cluster centres. But since the centre lies in between the points currently assigned to the cluster, that is not possible. (This argument is a heuristic rather than a proof: in unusual configurations a cluster can lose all of its points, which is why practical implementations check for empty clusters.)

(c) Explain what the sum-of-squares is for k-means.

For each cluster, it is the sum of the squared distances of points in that cluster to their centre, summed over clusters. Writing $C_i$ for the points in cluster i, and $m_i$ for the mean of cluster i, the sum-of-squares is

$$\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2.$$

(d) The following diagrams show the results of clustering the same data with k-means, with k running from 2 to 6; also a plot of the sum-of-squares versus k. How many clusters would you guess this data has, and why? Does it matter whether the plot is an average over many runs of the algorithm?

A reasonable guess here is 4; the sum of squares goes up dramatically with fewer clusters than that, but adding more than 4 clusters does little to lower the sum of squares. Visually, k = 4 gives us four compact, well separated clusters with fairly clear divisions between them, which is not true of either more or fewer clusters. In fact, the data were generated as a mixture from four different Gaussians, centred at (-1, -1), (-1, 1), (1, -1), (1, 1), all with covariance matrices [matrix missing].

7. [Data Mining] (a) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user

(b) How is a data warehouse different from a database? How are they similar?

Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support, whereas a database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases where the schema of one database may not agree with the schema of another. A database system supports ad-hoc query and on-line transaction processing.

Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.

(c) What is the difference between discrimination and classification? Between characterization and clustering? Between classification and regression? For each of these pairs of tasks, how are they similar?
Discrimination differs from classification in that the former refers to a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes, while the latter is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Discrimination and classification are similar in that they both deal with the analysis of class data objects.

Characterization differs from clustering in that the former refers to a summarization of the general characteristics or features of a target class of data, while the latter deals with the analysis of data objects without consulting a known class label. This pair of tasks is similar in that they both deal with grouping together objects or data that are related or have high similarity in comparison to one another.