Probability and Statistics for Data Mining
COMP5318
Question 1
Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35
• Question: Suppose you randomly select a credit card holder
and the person has defaulted on their credit card. What is
the probability that the person selected is a ‘Female’?
Probability
• Probability is the mathematical language to
understand uncertainty.
• We need to make decisions in the presence of
uncertainty which is ever present.
• Example: The Earth is warming, a phenomenon
known as Global Warming (GW). Is
modern human activity the cause of GW?
– Physics driven approach
– Data driven approach
Experiments and Observation
• When an experiment is carried out we observe
the outcome – which is often uncertain.
– If not uncertain then why carry out the experiment?
• We look into a random shopping basket. Does it
contain a packet of “Tofu”?
• We toss a coin. Does it land on “Heads”?
• We ask a question: “Is it raining in Broome, WA,
right now?”
Building Blocks of Probability
• The space of all possible outcomes is
called the sample space.
– Non-trivial to decide.
• Single Coin Toss. The space is {H,T}.
• Shopping Basket. The space of all
possible combinations of all items sold in
the store.
• Shopping Basket: {Tofu, Not-Tofu}.
Events
• Events are subsets of the sample space.
Events are often defined in familiar terms.
• In the shopping basket scenario
– A vegetarian shopping basket is an event.
– all possible vegetarian item combinations.
• Throw of a die. The event we are looking
for could be: Even Number = {2,4,6},
where the sample space = {1,2,3,4,5,6}
Events
• Let G be the set of all galaxies.
Characterize each galaxy by three numbers
– d: distance from earth
– a: major axis
– b: minor axis
• Elliptic Galaxies (EG)
– EG ={(a,b,d) | a/b > 1.5}
• Distant Spiral Galaxies (DSG)
– DSG ={(a,b,d) | a/b <= 1.5 and d > 10}
Events
• Let G be the set of all genes. Each gene
can be “on” or “off”. Let E correspond to
the event: all genes which are “on” when
the skin cells are “starved”.
Events are Sets
• At the most basic level events are sets.
Therefore we can carry out set union,
difference and intersection on events.
• For example:
– E1: shopping baskets which contain Tofu
– E2: shopping baskets which contain Milk
– E1 U E2: shopping baskets which contain
either Tofu or Milk
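A minimal sketch of treating events as sets, using Python's built-in set operations. The four baskets below are hypothetical toy data, not taken from the slides; each event is represented as the set of basket ids that satisfy it.

```python
# Hypothetical toy data: basket id -> set of items in that basket.
baskets = {
    1: {"Tofu", "Bread"},
    2: {"Milk", "Eggs"},
    3: {"Tofu", "Milk"},
    4: {"Bread"},
}

# An event is a subset of the sample space (here, a set of basket ids).
E1 = {b for b, items in baskets.items() if "Tofu" in items}  # contains Tofu
E2 = {b for b, items in baskets.items() if "Milk" in items}  # contains Milk

union = E1 | E2         # baskets with Tofu or Milk (or both)
intersection = E1 & E2  # baskets with both Tofu and Milk
difference = E1 - E2    # baskets with Tofu but no Milk

print(union, intersection, difference)
```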
Probability
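• Let S be the space of all possible
elementary outcomes. Let F = Power(S)
be the power set of S. Then the probability
P is a function
P : F → [0,1]
that satisfies the following properties (axioms):
– P(A) ≥ 0 for every event A in F
– P(S) = 1
– If A1, A2, ... are pairwise disjoint events, then
P(A1 U A2 U ...) = P(A1) + P(A2) + ...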
Interpretation of Probability
• Physical or Ontological: Long term frequency
– 50% chance that a coin will land on heads.
– 20% of all Woolworth shopping baskets are
vegetarian.
– 22% of all Woolworth shopping baskets in
Northbridge plaza are vegetarian.
• Epistemological : Degree of Belief
– 20% chance that my neighbours are watering their
lawn on “dry” days.
– 99% chance that the green immovable object outside
my house is a Tree.
– 90% chance that Australia will win the cricket world
cup.
Consequences of Axioms
Example
• Two coin tosses. Let H1 be the event that
a heads occurs on toss 1 and H2 a heads
on toss 2. All outcomes are equally likely.
• Sample space = {HH, HT, TH, TT}
– H1 = {HH, HT}
– H2 = {HH,TH}
– P(H1 U H2) = ½ + ½ - ¼ = 3/4
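The calculation above can be checked by enumerating the sample space. This is a small sketch assuming equally likely outcomes; the helper name `prob` is introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space of two coin tosses: {"HH", "HT", "TH", "TT"}.
sample_space = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

H1 = {o for o in sample_space if o[0] == "H"}  # heads on toss 1
H2 = {o for o in sample_space if o[1] == "H"}  # heads on toss 2

lhs = prob(H1 | H2)                             # direct count
rhs = prob(H1) + prob(H2) - prob(H1 & H2)       # 1/2 + 1/2 - 1/4

print(lhs, rhs)
```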
Example
• Two events A and B are independent if
– P(A ∩ B) = P(A)P(B)
• P(A∩B) is also written as P(AB) and P(A,B).
• If A and B are disjoint events with
P(A) > 0 and P(B) > 0, then A and B cannot
be independent
– P(A ∩ B) = 0, yet P(A)P(B) > 0
• Except for this case, you cannot determine
independence by looking at a Venn diagram
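As an illustration, the definition can be tested numerically on the two-coin sample space. This sketch assumes equally likely outcomes; the events A and B and the helper `independent` are hypothetical names chosen for the example.

```python
from fractions import Fraction

sample_space = {"HH", "HT", "TH", "TT"}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """Check the definition P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

H1 = {"HH", "HT"}  # heads on toss 1
H2 = {"HH", "TH"}  # heads on toss 2
A = {"HH"}         # both heads
B = {"TT"}         # both tails, disjoint from A

h1_h2_independent = independent(H1, H2)  # 1/4 == 1/2 * 1/2
a_b_independent = independent(A, B)      # 0 != 1/4 * 1/4

print(h1_h2_independent, a_b_independent)
```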
Question
• A shopping basket can either be kosher or
not. The probability that it will be kosher is
3/4. Examine 10 baskets at a checkout
counter. What is the probability that there
will be at least one kosher basket?
Answer
• Let E be the event “At least one kosher
basket.” Let NKi be the event that the i-th
basket is non-kosher.
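One way to finish the calculation, sketched under the assumption that the 10 baskets are independent (the question does not state this explicitly): E is the complement of "all 10 baskets are non-kosher", so P(E) = 1 - P(NK1 ∩ ... ∩ NK10) = 1 - (1/4)^10.

```python
from fractions import Fraction

p_kosher = Fraction(3, 4)  # from the question
n_baskets = 10

# Assumption: baskets are independent, so the probability that
# all of them are non-kosher is a product of equal factors.
p_none_kosher = (1 - p_kosher) ** n_baskets

# Complement rule: "at least one" = not "none".
p_at_least_one = 1 - p_none_kosher

print(p_at_least_one, float(p_at_least_one))
```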
Independence
Example
• For an Online Book Seller (OBS) the
conversion rate is 1/100, i.e., on average one
in every 100 visitors ends up making a purchase.
What is the probability that at least one
purchase will be made in 10 consecutive
visits (by distinct customers)?
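A possible calculation, assuming the 10 visits are independent and each converts with probability 1/100. The function name `at_least_one` is introduced here for illustration. Note that the result is slightly below the naive guess of 10 × 1/100.

```python
def at_least_one(p, n):
    """P(at least one success in n independent trials with success prob p)."""
    return 1 - (1 - p) ** n

conversion_rate = 1 / 100
visits = 10

p_purchase = at_least_one(conversion_rate, visits)  # 1 - 0.99**10
naive_guess = visits * conversion_rate              # ignores overlap of events

print(round(p_purchase, 4), naive_guess)
```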
Example
• Two people take turns to sink a basketball. P1
succeeds with probability 1/3 and P2 with ¼.
What is the probability that P1 succeeds before
P2?
• Requires clever setting up of the events.
– Let E be the event that P1 succeeds before P2.
– Let Ai be the event that P1 succeeds before P2 on the
ith trial.
– Ai ∩ Aj = Ø for i ≠ j, and E = A1 U A2 U A3 U ... (the union of Ai over i = 1, 2, ..., ∞)
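A sketch of how the decomposition into disjoint events Ai leads to a number. It assumes P1 shoots first in every round and that all shots are independent; neither is stated on the slide. Under those assumptions Ai means both players miss in the first i - 1 rounds and P1 scores in round i, so P(Ai) = (2/3 · 3/4)^(i-1) · 1/3, and P(E) is a geometric series.

```python
from fractions import Fraction

p1 = Fraction(1, 3)  # P1's success probability per shot
p2 = Fraction(1, 4)  # P2's success probability per shot

# Assumption: P1 shoots first; a "round" is one shot by each player.
both_miss = (1 - p1) * (1 - p2)  # 2/3 * 3/4 = 1/2

def p_A(i):
    """P(A_i): nobody scores in rounds 1..i-1, then P1 scores in round i."""
    return both_miss ** (i - 1) * p1

# The A_i are disjoint, so P(E) is the sum of P(A_i).
partial_sum = sum(p_A(i) for i in range(1, 51))  # first 50 terms
closed_form = p1 / (1 - both_miss)               # geometric series limit

print(closed_form, float(partial_sum))
```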
Conditional Probability
• Very Important Concept
• P(A|B) is “fraction of occurrences of B in
which A also occurs”
– P(A|B) = P(A ∩ B)/P(B); P(B) > 0
• For a fixed B, P(.|B) is a probability
– Therefore if A1 and A2 are disjoint then
– P(A1 U A2 |B) = P(A1|B) + P(A2|B)
• Note, P(A|B U C) =/= P(A|B) + P(A|C)
• Also P(A|B) =/= P(B|A)
Standard Example
        D        Dc
+       0.009    0.099
-       0.001    0.891
D is disease (Dc is its complement)
+/-: test positive or negative
P(+ | D) = P(+ ∩ D) / P(D) = 0.009 / (0.009 + 0.001) = 0.9
P(- | Dc) = P(- ∩ Dc) / P(Dc) = 0.891 / (0.891 + 0.099) = 0.9
Suppose a test is positive. What is
the probability of disease?
P(D | +) = P(+ ∩ D) / P(+) = 0.009 / (0.009 + 0.099) ≈ 0.08
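The three conditional probabilities in this example can be recomputed from the joint table. This is a minimal sketch; the dictionary layout and the helper name `marginal` are choices made for the illustration.

```python
# Joint probabilities from the table: (test result, disease status) -> P.
joint = {
    ("+", "D"): 0.009,
    ("+", "Dc"): 0.099,
    ("-", "D"): 0.001,
    ("-", "Dc"): 0.891,
}

def marginal(index, value):
    """Sum the joint table over the other variable."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_pos_given_d = joint[("+", "D")] / marginal(1, "D")     # P(+ | D)
p_neg_given_dc = joint[("-", "Dc")] / marginal(1, "Dc")  # P(- | Dc)
p_d_given_pos = joint[("+", "D")] / marginal(0, "+")     # P(D | +)

print(round(p_pos_given_d, 3), round(p_neg_given_dc, 3), round(p_d_given_pos, 3))
```

Even though the test is right 90% of the time in both directions, a positive result implies disease only about 8% of the time, because the disease itself is rare (P(D) = 0.01).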
Standard Data Mining Example
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Suppose the data above closely resembles the behaviour of the population
at large.
What is the chance that those who buy a Diaper will also buy Beer?
P(Beer | Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75
Is Diaper an Event?
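One reading, sketched below, is that "Diaper" stands for the event "the transaction contains a Diaper", i.e. a subset of the five transactions. Under that reading the conditional probability can be computed by counting. The helper `prob` is a hypothetical name for this example.

```python
from fractions import Fraction

# The five transactions from the table: TID -> set of items.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def prob(*items):
    """Fraction of transactions that contain all of the given items."""
    wanted = set(items)
    hits = sum(1 for basket in transactions.values() if wanted <= basket)
    return Fraction(hits, len(transactions))

p_diaper = prob("Diaper")                    # 4/5
p_diaper_and_beer = prob("Diaper", "Beer")   # 3/5
p_beer_given_diaper = p_diaper_and_beer / p_diaper

print(p_beer_given_diaper)
```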
Conditional Independence
• If A and B are independent then
P(A|B)=P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability.
Bayes Theorem
Question 1
Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35
• Question: Suppose you randomly select a credit card holder
and the person has defaulted on their credit card. What is
the probability that the person selected is a ‘Female’?
Answer to Question 1
P(G = F | D = Y)
= P(D = Y | G = F) P(G = F) / [P(D = Y | G = F) P(G = F) + P(D = Y | G = M) P(G = M)]
= (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60)
= 0.14 / 0.47
≈ 0.30
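The same answer can be reproduced in two steps: the law of total probability gives the overall default rate, and Bayes' theorem then gives the posterior for each gender. This is a small sketch; the dictionary names are chosen for the illustration.

```python
# From the Question 1 table.
prior = {"M": 0.60, "F": 0.40}          # P(G = g)
default_given = {"M": 0.55, "F": 0.35}  # P(D = Y | G = g)

# Law of total probability: P(D = Y) = sum over g of P(D = Y | G = g) P(G = g).
p_default = sum(default_given[g] * prior[g] for g in prior)

# Bayes' theorem: P(G = g | D = Y) = P(D = Y | G = g) P(G = g) / P(D = Y).
posterior = {g: default_given[g] * prior[g] / p_default for g in prior}

print(round(p_default, 2), round(posterior["F"], 2))
```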
But what does G=F and D=Y mean? We have not even formally defined them.