CSE4412 & CSE6412 3.0 Data Mining
Instructor: Nick Cercone – 3050 LAS – [email protected]
Tuesdays, Thursdays 1:00-2:20 – Lassonde 3033
Fall Semester, 2014
_____________________________________________________________________________________________
THE SMALL ASSIGNMENT (answers)
_____________________________________________________________________________________________
1. [Inductive Sets] Intuitively, the natural numbers begin with the number 1, and then there is 2, then 3,
then 4, and so on. Does this process of “starting with 1” and “adding 1 repeatedly” result in all the
natural numbers? We will use the concept of an inductive set to explore this idea. Definition. A set T
that is a subset of Z is an inductive set provided that for each integer k, if k ∈ T , then k + 1 ∈ T .
Consider what it means to say that a subset T of the integers Z is not an inductive set.
Suppose that T is an inductive subset of the integers. Which of the following statements are true,
which are false, and for which ones is it not possible to tell?
a) 1 ∈ T and 5 ∈ T.
It is not possible to tell if 1 ∈ T and 5 ∈ T
b) If 1 ∈ T, then 5 ∈ T.
True
c) If 5 ∉ T, then 2 ∉ T.
True. The contrapositive is, “If 2 ∈ T , then 5 ∈ T ,” which is true.
d) For each integer k, if k ∈ T, then k + 7 ∈ T.
True
e) For each integer k, k ∉ T or k + 1 ∈ T
True, since “k ∉ T or k + 1 ∈ T” is logically equivalent to “If k ∈ T, then k + 1 ∈ T.”
f) There exists an integer k such that k ∈ T and k + 1 ∉ T.
False. This is the negation of the defining condition: if k ∈ T, then k + 1 ∈ T.
g) For each integer k, if k + 1 ∈ T, then k ∈ T.
It is not possible to tell if this is true. It is the converse of the conditional
statement, “For each integer k, if k ∈ T , then k + 1 ∈ T”
h) For each integer k, if k + 1 ∉ T, then k ∉ T.
True. This is the contrapositive of the conditional statement, “For each integer k,
if k ∈ T, then k + 1 ∈ T”
i) Prove the following by mathematical induction: For each natural number n,
$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$
We will use a proof by mathematical induction. For each natural number n, we let P(n) be
$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$
We first prove that P(1) is true. Notice that $\frac{1(1+1)(2 \cdot 1+1)}{6} = 1$. This shows that
$1^2 = \frac{1(1+1)(2 \cdot 1+1)}{6}$
which proves that P(1) is true.
For the inductive step, we prove that for each k ∈ N, if P(k) is true, then P(k+1) is true. So let k be a
natural number and assume that P(k) is true. That is, assume that
$1^2 + 2^2 + \cdots + k^2 = \frac{k(k+1)(2k+1)}{6}$    (1)
The goal now is to prove that P(k+1) is true. That is, it must be proved that
$1^2 + 2^2 + \cdots + k^2 + (k+1)^2 = \frac{(k+1)[(k+1)+1][2(k+1)+1]}{6} = \frac{(k+1)(k+2)(2k+3)}{6}$    (2)
To do this, we add $(k+1)^2$ to both sides of equation (1) and algebraically rewrite the right side of the
resulting equation. This gives
$1^2 + 2^2 + \cdots + k^2 + (k+1)^2 = \frac{k(k+1)(2k+1)}{6} + (k+1)^2$
$= \frac{k(k+1)(2k+1) + 6(k+1)^2}{6}$
$= \frac{(k+1)[k(2k+1) + 6(k+1)]}{6}$
$= \frac{(k+1)(2k^2 + 7k + 6)}{6}$
$= \frac{(k+1)(k+2)(2k+3)}{6}$
Comparing this result to equation (2), we see that if P(k) is true, then P(k+1) is true. Hence, the
inductive step has been established, and by the Principle of Mathematical Induction, we have proved
that for each natural number n,
$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$
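A numerical spot check does not replace the induction proof, but it can catch algebra slips. The short Python sketch below compares both sides of the identity for the first 200 natural numbers and also checks the factoring used in the inductive step. The function names and the range of 200 are choices made here for illustration only.

```python
def sum_of_squares(n):
    """Left side: 1^2 + 2^2 + ... + n^2, added term by term."""
    return sum(i * i for i in range(1, n + 1))


def closed_form(n):
    """Right side: n(n+1)(2n+1)/6 (the product is always divisible by 6)."""
    return n * (n + 1) * (2 * n + 1) // 6


# Compare both sides for n = 1, ..., 200.
assert all(sum_of_squares(n) == closed_form(n) for n in range(1, 201))

# Factoring used in the inductive step: k(2k+1) + 6(k+1) = (k+2)(2k+3).
assert all(
    k * (2 * k + 1) + 6 * (k + 1) == (k + 2) * (2 * k + 3)
    for k in range(1, 201)
)

print(sum_of_squares(3), closed_form(3))  # 14 14
```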
This proof shows a standard way to write an induction proof. When writing a proof by mathematical
induction, we should follow the guideline that we always keep the reader informed. This means that
at the beginning of the proof, we should state that a proof by induction will be used. We should then
clearly define the open sentence P(n) that will be used in the proof.
2. [Logic & Sets] Determine the elements of the set $A = \{x \mid x^2 = 11x - 30 \text{ or } 4 - x > 0\}$ when the
universal set U is:
a) the set of real numbers,
A consists of all real numbers less than 4, together with the numbers 5 and 6.
b) the set of rational numbers,
A consists of all rational numbers less than 4, together with the numbers 5 and 6.
c) the set of integers,
A consists of all integers less than 4, together with the numbers 5 and 6.
d) the set of positive integers,
A = {1,2,3,5,6} (include 0 if you consider this positive)
e) the set of negative integers,
A=U
f) the set of odd integers,
A consists of all odd integers less than or equal to 5.
g) the set of even integers,
A consists of all even integers less than or equal to 6, except 4.
h) the set of integers greater than 10,
A = ∅
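The integer cases above can be spot-checked on a finite window of integers. This is only a sketch: the window from -20 to 20 and the helper name `in_A` are assumptions made here, and a finite check says nothing about the real or rational universes.

```python
def in_A(x):
    """Membership condition of A: x^2 = 11x - 30 or 4 - x > 0."""
    return x * x == 11 * x - 30 or 4 - x > 0


window = range(-20, 21)  # hypothetical finite slice of the integers

positive = [x for x in window if x > 0 and in_A(x)]
negatives_all_in = all(in_A(x) for x in window if x < 0)
odd = [x for x in window if x % 2 != 0 and in_A(x)]
even = [x for x in window if x % 2 == 0 and in_A(x)]
greater_than_10 = [x for x in window if x > 10 and in_A(x)]

print(positive)          # [1, 2, 3, 5, 6]
print(negatives_all_in)  # True
print(max(odd))          # 5
print(even[-3:])         # [0, 2, 6]
print(greater_than_10)   # []
```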
2a. For each of the following, determine the relative complement A – B.
a) A = {1, 4, 7, 10}, B = {1, 2, 5}: A – B = {4, 7, 10}
b) A = {1, 2, 5}, B = {1, 4, 7, 10}: A – B = {2, 5}
c) A = {a, e, i}, B = {e, a, i}: A – B = ∅
d) A = {a, ), 17}, B = {)}: A – B = {a, 17}
e) A = {1, 5, 6, a}, B = ∅: A – B = A
f) A = {2, 4}, B = {1, 2, 4, 7}: A – B = ∅
g) A = ∅, B = {a, b, 7}: A – B = ∅
2b. Determine all subsets of the set A = {0, a, #, 2}
∅, {0}, {a}, {#}, {2}, {0, a}, {0, #}, {0, 2}, {a, #}, {a, 2}, {#, 2}, {0, a, #}, {0, a, 2}, {0, #, 2}, {a, #, 2}, A
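Relative complements and subset listings like those in 2a and 2b can be checked with Python's built-in sets. The sketch below is illustrative only: the helper names are made up here, and the symbolic elements are represented as strings.

```python
from itertools import combinations


def relative_complement(a, b):
    """A - B: the elements of A that are not in B."""
    return a - b


def all_subsets(s):
    """Every subset of s, from the empty set up to s itself."""
    items = list(s)
    return [
        set(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


print(sorted(relative_complement({1, 4, 7, 10}, {1, 2, 5})))  # [4, 7, 10]
print(relative_complement({2, 4}, {1, 2, 4, 7}) == set())     # True

subsets = all_subsets({0, "a", "#", 2})
print(len(subsets))  # 16
```

A set with n elements has 2^n subsets, which is why the four-element set in 2b yields 16.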
3. [Probability] (a) The psychologist Tversky and his colleagues say that about four out of five people will
answer (a) to the following question:
A certain town is served by two hospitals. In the larger hospital about 45 babies are born
each day, and in the smaller hospital 15 babies are born each day. Although the overall
proportion of boys is about 50 percent, the actual proportion at either hospital may be more
or less than 50 percent on any day.
At the end of a year, which hospital will have the greater number of days on which more than 60
percent of the babies born were boys?
(a) the large hospital
(b) the small hospital
(c) neither; the number of days will be about the same.
Assume that the probability that a baby is a boy is .5 (actual estimates make this more like .513).
Decide, by simulation (in any language), what the right answer is to the question. Can you suggest
why so many people go wrong?
Your simulation should result in about 25 days in a year having more than 60 percent boys in the
large hospital and about 55 days in a year having more than 60 percent boys in the small hospital.
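One possible simulation is sketched below in Python. It assumes each birth is an independent fair coin flip (p = .5, as the question states) and a 365-day year. The function names, the seed, and the exact binomial cross-check are additions made here, not part of the original answer.

```python
import random
from math import comb


def days_over_60_percent(births_per_day, n_days, rng, p_boy=0.5):
    """Count simulated days on which strictly more than 60% of births are boys."""
    count = 0
    for _ in range(n_days):
        boys = sum(rng.random() < p_boy for _ in range(births_per_day))
        # Integer comparison (boys / births > 6 / 10) avoids float rounding.
        if 10 * boys > 6 * births_per_day:
            count += 1
    return count


def exact_tail(births_per_day):
    """Exact P(more than 60% boys) for a fair coin, as a cross-check."""
    n = births_per_day
    favourable = sum(comb(n, k) for k in range(n + 1) if 10 * k > 6 * n)
    return favourable / 2 ** n


rng = random.Random(2014)  # arbitrary seed so that runs are repeatable
print("large hospital:", days_over_60_percent(45, 365, rng), "days")
print("small hospital:", days_over_60_percent(15, 365, rng), "days")
print("expected days per year:", 365 * exact_tail(45), 365 * exact_tail(15))
```

Counts from a single simulated year vary from run to run. The smaller hospital shows more such days because proportions computed from smaller samples vary more around 50 percent.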
(b) Tversky and Kahneman asked a group of subjects to carry out the following. They are told that:
Linda is 31, single, outspoken, and very bright. She majored in philosophy in college. As a
student, she was deeply concerned with racial discrimination and other social issues, and
participated in anti-nuclear demonstrations.
The subjects are then asked to rank the likelihood of various alternatives, such as:
(1) Linda is active in the feminist movement.
(2) Linda is a bank teller.
(3) Linda is a bank teller and active in the feminist movement.
CSE4412 & CSE6412 – Data Mining
Page 3
Tversky and Kahneman found that between 85 and 90 percent of the subjects rated alternative (1)
most likely, but alternative (3) more likely than alternative (2). Is it? They call this phenomenon the
conjunction fallacy, and note that it appears to be unaffected by prior training in probability or
statistics. Is this phenomenon a fallacy? If so, why? Can you give a possible explanation for the
subjects' choices?
They call it a fallacy because if the subjects are thinking about probabilities they should realize that
P(Linda is bank teller and in feminist movement) ≤ P(Linda is bank teller).
One explanation is that the subjects are not thinking about probability as a measure of likelihood.
(c) The following is a variation on the Linda problem. The registrar is carrying John and Mary's
registration cards and drops them in a puddle. When he picks them up he cannot read the names but
on the first card he picked up he can make out Mathematics 23 and Government 35, and on the
second card he can make out only Mathematics 23. He asks you if you can help him decide which
card belongs to Mary. You know that Mary likes government but does not like mathematics. You
know nothing about John and assume that he is just a typical student. From this you estimate:
P (Mary takes Government 35) = .5
P (Mary takes Mathematics 23) = .1
P (John takes Government 35) = .3
P (John takes Mathematics 23) = .2
Assume that their choices for courses are independent events. Show that the card with Mathematics
23 and Government 35 showing is more likely to be Mary's than John's. The conjunction fallacy
referred to in the Linda problem would be to assume that the event “Mary takes Mathematics 23 and
Government 35” is more likely than the event “Mary takes Mathematics 23.” Why are we not making
this fallacy here?
We assume that John and Mary sign up for two courses. Their cards are dropped, one of the cards
gets stepped on, and only one course can be read on this card. Call card I the card that was not
stepped on and on which the registrar can read government 35 and mathematics 23; call card II the
card that was stepped on and on which he can just read mathematics 23. There are four possibilities
for these two cards. They are:
Card I            Card II             Prob.    Cond. Prob.
Mary(gov,math)    John(gov,math)      .0015    .224
Mary(gov,math)    John(other,math)    .0025    .373
John(gov,math)    Mary(gov,math)      .0015    .224
John(gov,math)    Mary(other,math)    .0012    .179
In the third column we have written the probability that each case will occur. For example, for the first
one we compute the probability that the students will take the appropriate courses: .5 × .1 × .3 × .2 = .0030
and then we multiply by 1/2, the probability that it was John’s card that was stepped on. Now to get
the conditional probabilities we must renormalize these probabilities so that they add up to one. In this
way we obtain the results in the last column. From this we see that the probability that card I is Mary’s
is .597 and that card I is John’s is .403, so it is more likely that the card on which the registrar
sees Mathematics 23 and Government 35 is Mary’s.
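The renormalization step can be reproduced in a few lines of Python. The sketch below takes the four values in the "Prob." column as given and only rescales them. The variable and function names are illustrative.

```python
# Unnormalized probabilities copied from the "Prob." column of the table.
cases = [
    ("Mary", "John(gov,math)", 0.0015),
    ("Mary", "John(other,math)", 0.0025),
    ("John", "Mary(gov,math)", 0.0015),
    ("John", "Mary(other,math)", 0.0012),
]


def renormalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


cond = renormalize([prob for _, _, prob in cases])

# Add up the conditional probabilities by the owner of card I.
p_card1_mary = sum(c for (owner, _, _), c in zip(cases, cond) if owner == "Mary")
p_card1_john = sum(c for (owner, _, _), c in zip(cases, cond) if owner == "John")

print([round(c, 3) for c in cond])                     # [0.224, 0.373, 0.224, 0.179]
print(round(p_card1_mary, 3), round(p_card1_john, 3))  # 0.597 0.403
```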
4. [Logic] Business Trip: Business was bad (and getting worse) for the Fastanloose Finance Company,
so four of their best regional sales persons were sent out on business trips to look for new
opportunities. Where is each salesperson’s base, what was his or her destination, and at which hotel
did he or she stay?
1. The salesperson who stayed at the Pitts hotel in Buffalo isn’t based in Baltimore.
2. The salesperson based in Nashville went to Milwaukee but didn’t stay at Dumpster’s hotel.
3. Dick is based in Charleston. He didn’t travel to Buffalo or Dallas, nor did he stay at Dedlegg’s
hotel or Dumpster’s.
4. Sharon isn’t based in Nashville and neither she nor Tom went to Dallas.
[Solution grid: salespersons (Dick, Harry, Sharon, Tom) against bases (Baltimore, Charleston, Nashville,
Richmond), destinations (Buffalo, Dallas, Milwaukee, Saint Louis) and hotels (Dedlegg’s, Dumpster’s,
Slummer’s, The Pitts)]
Salesperson    Base          Destination    Hotel
Dick           Charleston    Saint Louis    Slummer’s
Harry          Baltimore     Dallas         Dumpster’s
Sharon         Richmond      Buffalo        The Pitts
Tom            Nashville     Milwaukee      Dedlegg’s
The salesperson based in Nashville went to Milwaukee (clue 2). Dick is based in Charleston (clue 3) and
didn’t go to Buffalo or Dallas, so Saint Louis. The Pitts is in Buffalo (1). Dick didn’t stay at Dedlegg’s or
Dumpster’s (3), so Slummer’s. The salesperson who went to Dallas isn’t Sharon or Tom (4), so Harry.
The one based in Nashville who went to Milwaukee (2) isn’t Sharon (4), so Tom. He didn’t stay at
Dumpster’s (2), so Dedlegg’s. By elimination Sharon went to Buffalo. She isn’t based at Baltimore (1), so
Richmond. Thus Harry is based in Baltimore and stayed at Dumpster’s.
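The deduction above can be cross-checked by brute force. The Python sketch below tries every assignment and keeps those that satisfy the four clues. It assumes, as is usual for this kind of puzzle, that each base, destination and hotel belongs to exactly one salesperson. The function names are made up here.

```python
from itertools import permutations

PEOPLE = ("Dick", "Harry", "Sharon", "Tom")
BASES = ("Baltimore", "Charleston", "Nashville", "Richmond")
DESTINATIONS = ("Buffalo", "Dallas", "Milwaukee", "Saint Louis")
HOTELS = ("Dedlegg's", "Dumpster's", "Slummer's", "The Pitts")


def satisfies_clues(base, dest, hotel):
    """Check the four clues; each argument maps a person to a value."""
    for p in PEOPLE:
        # Clue 1: The Pitts is in Buffalo, and its guest isn't from Baltimore.
        if (hotel[p] == "The Pitts") != (dest[p] == "Buffalo"):
            return False
        if hotel[p] == "The Pitts" and base[p] == "Baltimore":
            return False
        # Clue 2: the Nashville salesperson went to Milwaukee, not Dumpster's.
        if (base[p] == "Nashville") != (dest[p] == "Milwaukee"):
            return False
        if base[p] == "Nashville" and hotel[p] == "Dumpster's":
            return False
    # Clue 3: Dick's base, excluded destinations and excluded hotels.
    if base["Dick"] != "Charleston":
        return False
    if dest["Dick"] in ("Buffalo", "Dallas"):
        return False
    if hotel["Dick"] in ("Dedlegg's", "Dumpster's"):
        return False
    # Clue 4: Sharon isn't from Nashville; neither she nor Tom went to Dallas.
    if base["Sharon"] == "Nashville":
        return False
    if dest["Sharon"] == "Dallas" or dest["Tom"] == "Dallas":
        return False
    return True


def solve():
    """Return every assignment consistent with the clues."""
    solutions = []
    for b in permutations(BASES):
        for d in permutations(DESTINATIONS):
            for h in permutations(HOTELS):
                base = dict(zip(PEOPLE, b))
                dest = dict(zip(PEOPLE, d))
                hotel = dict(zip(PEOPLE, h))
                if satisfies_clues(base, dest, hotel):
                    solutions.append(
                        {p: (base[p], dest[p], hotel[p]) for p in PEOPLE}
                    )
    return solutions


for solution in solve():
    for person, row in solution.items():
        print(person, *row)
```

With only 24 × 24 × 24 = 13,824 candidate assignments, exhaustive search is cheap here, so no cleverer pruning is needed.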
5. [Information Retrieval]
(a) What is the bag-of-words representation of the sentence “to be or not to be”?
A vector with one component for each word in our dictionary, all of them zero except for the
following:
be    or    not    to
2     1     1      2
This is the form as given by the R call
table(c("to", "be", "or", "not", "to", "be"))
(b)
Suppose we search for the above sentence via the keyword “be”. What is the bag-of-words
representation for this query, and what is the Euclidean distance from the sentence?
A vector whose only non-zero component is that for “be” where the count is 1. The Euclidean
distance is
$\sqrt{(2-1)^2 + (1-0)^2 + (1-0)^2 + (2-0)^2} = \sqrt{7} \approx 2.65$
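The counts and the distance can be reproduced with a short Python sketch. Splitting on whitespace and lower-casing are simplifying assumptions made here, and the function names are illustrative.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


def euclidean_distance(bag_a, bag_b):
    """Euclidean distance between two bags over their combined vocabulary."""
    vocabulary = set(bag_a) | set(bag_b)
    # A Counter returns 0 for words it has not seen.
    return sqrt(sum((bag_a[w] - bag_b[w]) ** 2 for w in vocabulary))


sentence = bag_of_words("to be or not to be")
query = bag_of_words("be")

print(sorted(sentence.items()))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
print(round(euclidean_distance(sentence, query), 3))  # 2.646
```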
(c)
Describe how weighting words by inverse-document-frequency (IDF) should help when making a
Web query for “The Principles of Data Mining”.
It keeps us from wasting time on very common words like “the” and “of”, and emphasizes the less-common,
more-informative words “principles”, “data” and “mining”; something titled “Data Mining Principles” is a
good match.
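To make the effect concrete, the sketch below computes IDF weights on a tiny corpus. The answer above does not give a formula, so the common definition idf(w) = log(N / df(w)) is assumed here, where N is the number of documents and df(w) the number of documents containing w. The four-document corpus is invented for illustration.

```python
from math import log


def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df counts documents containing it."""
    n_docs = len(documents)
    doc_words = [set(doc.lower().split()) for doc in documents]
    vocabulary = set().union(*doc_words)
    return {
        word: log(n_docs / sum(word in words for words in doc_words))
        for word in vocabulary
    }


# Hypothetical corpus, made up for this illustration.
corpus = [
    "the principles of data mining",
    "the history of the web",
    "the art of cooking",
    "mining the data of the past",
]

idf = inverse_document_frequency(corpus)
print(idf["the"], idf["of"])        # 0.0 0.0
print(round(idf["principles"], 3))  # 1.386
print(round(idf["data"], 3))        # 0.693
```

Words that occur in every document get weight zero, so they no longer influence the match, while rarer words dominate it.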
(d)
Describe a simple text search that could not be carried out effectively using a bag-of-words
representation (no matter what distance measure is used). “Simple” means no high-level
understanding of English is required.
There are many, but a search for the exact phrase “to be or not to be” is impossible, because a
bag-of-words representation discards word order.
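A small Python sketch shows why: two texts with the same words in a different order have identical bags, so no distance measure on the bags can tell them apart. The scrambled sentence is a made-up example.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


original = "to be or not to be"
scrambled = "be not to or be to"  # same words, different order

print(bag_of_words(original) == bag_of_words(scrambled))  # True
print(original == scrambled)                              # False
```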
6. [k-means Clustering]
(a) Explain what the k-means clustering algorithm is. Do not write code but give a precise verbal
description which someone could turn into code.
(i) Start with a set of n vectors x1, x2, ..., xn.
(ii) Assign each vector at random to one of the k clusters.
(iii) For each cluster, compute the mean of the vectors belonging to that cluster.
(iv) For each vector, assign it to the cluster whose mean is closest to it. (Do not recompute
the means while these assignments are being made.)
(v) If any vectors have changed their cluster assignments, go back to step (iii); if not, stop.
(b) Can k-means ever give results that contain more or less than k clusters?
No. It can never give more clusters, since at every stage every point is assigned to one of k
clusters. To give fewer than k clusters, we would need there to be a cluster which got no points at
one of the re-assignment stages. This means that its centre would be further apart from every
point than one of the other cluster centres. But since the centre lies in between the points
currently assigned to the cluster, that is not possible.
(c)
Explain what the sum-of-squares is for k-means.
For each cluster, it is the sum of the squared distances of points in that cluster to their centre,
summed over clusters. Writing C_i for the points in cluster i, and m_i for the mean of cluster i,
$SS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$
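The verbal description in (a) and the sum-of-squares in (c) can be sketched together in plain Python. This is one possible reading of the steps, not a reference implementation. The shuffled round-robin initial assignment, the guard for a cluster without members, and all function names are assumptions of this sketch.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cluster_means(points, labels, k):
    """Step (iii): mean of each cluster; None for a cluster with no members."""
    means = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            means.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            means.append(None)
    return means


def k_means(points, k, rng):
    """Steps (i)-(v): returns the final labels and cluster means."""
    # Step (ii): random initial assignment. Shuffling a round-robin labelling
    # is an assumption of this sketch; it keeps every cluster non-empty at first.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    while True:
        means = cluster_means(points, labels, k)
        # Defensive guard: a cluster with no members has no mean and is skipped.
        candidates = [c for c in range(k) if means[c] is not None]
        # Step (iv): reassign all points against the fixed means from step (iii).
        new_labels = [
            min(candidates, key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        # Step (v): stop once no assignment changes.
        if new_labels == labels:
            return labels, means
        labels = new_labels


def sum_of_squares(points, labels, means):
    """Squared distance of every point to its own cluster mean, summed."""
    return sum(squared_distance(p, means[label]) for p, label in zip(points, labels))


# Two well-separated toy groups, invented for this example.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = k_means(points, 2, random.Random(1))
print(labels)
print(round(sum_of_squares(points, labels, means), 3))  # 2.667
```

Running `k_means` for several values of k and recording `sum_of_squares` each time gives the kind of sum-of-squares versus k curve that part (d) asks about.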
(d)
The following diagrams show the results of clustering the same data with k means, with k running
from 2 to 6; also a plot of the sum-of-squares versus k. How many clusters would you guess this
data has, and why? Does it matter whether the plot is an average over many runs of the
algorithm?
A reasonable guess here is 4; the sum of squares goes up dramatically with fewer clusters than that, but
adding more than 4 clusters does little to lower the sum of squares. Visually, k = 4 gives us four compact,
well separated clusters with fairly clear divisions between them, which is not true of either more
or fewer clusters.
In fact, the data were generated as a mixture from four different Gaussians, centred at (-1, -1), (-1, 1), (1, -1), (1, 1), all with covariance matrices
7. [Data Mining]
(a) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.
The steps involved in data mining when viewed as a process of knowledge discovery are as
follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate
for mining
• Data mining, an essential process where intelligent and efficient methods are applied in
order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing
knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques
are used to present the mined knowledge to the user
(b) How is a data warehouse different from a database? How are they similar?
Differences between a data warehouse and a database: A data warehouse is a repository of
information collected from multiple sources, over a history of time, stored under a unified schema,
and used for data analysis and decision support, whereas a database is a collection of
interrelated data that represents the current status of the stored data. There could be multiple
heterogeneous databases where the schema of one database may not agree with the schema of
another. A database system supports ad-hoc query and on-line transaction processing.
Similarities between a data warehouse and a database: Both are repositories of information,
storing huge amounts of persistent data.
(c) What is the difference between discrimination and classification? Between characterization
and clustering? Between classification and regression? For each of these pairs of tasks, how are
they similar?
Discrimination differs from classification in that the former refers to a comparison of the general
features of target class data objects with the general features of objects from one or a set of
contrasting classes, while the latter is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts for the purpose of being able to use the model
to predict the class of objects whose class label is unknown. Discrimination and classification are
similar in that they both deal with the analysis of class data objects.
Characterization differs from clustering in that the former refers to a summarization of the
general characteristics or features of a target class of data while the latter deals with the analysis
of data objects without consulting a known class label. This pair of tasks is similar in that they
both deal with grouping together objects or data that are related or have high similarity in
comparison to one another.