Spatial Statistics and Spatial Knowledge Discovery
First law of geography [Tobler]: Everything is related to everything, but
nearby things are more related than distant things.
Drowning in Data yet Starving for Knowledge [Naisbitt -Rogers]
Lecture 3: More Basic Statistics with R
Pat Browne
Population & Sample
• Statistics often involves selecting a
random (or representative) subset of a
population called a sample.
Degrees of Freedom (df)
• Suppose five numbers must have a given mean. We have total freedom in
selecting the first four numbers, but no choice in selecting the fifth,
because it is fixed by the mean. We therefore have four degrees of freedom
when selecting five numbers. In general we have (n-1) DOF if we estimate
the mean from a sample of size n.
• DOF is the sample size, n, minus the number of
parameters, p, estimated from the data.
Recall Permutations & Combinations
• P(n,r) = n! / (n-r)!
• Permutations (sequences) of a, b, and c taken 2 at a time:
3!/1! = 6, namely <ab>,<ba>,<ac>,<ca>,<bc>,<cb>
• C(n,r) = n! / (r! (n-r)!)
• Combinations (sets) of a, b, and c taken 2 at a time:
3!/(2! 1!) = 3, namely {a,b},{a,c},{b,c}
• ab is a distinct permutation from ba, but they are
the same combination.
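The counts above can be checked in R. This is a minimal sketch: `factorial`, `choose`, and `combn` are base R functions, while the helper name `n_perm` is made up here for illustration.

```r
# Number of r-permutations of n items: n! / (n-r)!
# (n_perm is a hypothetical helper name, not a base R function)
n_perm <- function(n, r) factorial(n) / factorial(n - r)

n_perm(3, 2)   # number of ordered pairs from a, b, c
choose(3, 2)   # number of unordered pairs from a, b, c

items  <- c("a", "b", "c")
combos <- combn(items, 2)   # one combination per column

# Each combination {x,y} yields two permutations, <xy> and <yx>
perms <- c(paste0(combos[1, ], combos[2, ]),
           paste0(combos[2, ], combos[1, ]))
perms
```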
Probability Calculations
• Conditional probability
• P(A|B) = P(A ∩ B)/P(B) (probability of A, given B)
• Test for independence
• P(A ∩ B) = P(A)P(B)
• Calculation of union
• P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
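As an illustration of these three rules, here is a small sketch on one roll of a fair die. The events A (an even number) and B (a number greater than 3) and the helper `prob` are assumptions chosen for this example, not part of the lecture.

```r
die <- 1:6
A <- c(2, 4, 6)   # hypothetical event: an even number
B <- c(4, 5, 6)   # hypothetical event: a number greater than 3

# Probability of an event in an equiprobable sample space
prob <- function(event) length(event) / length(die)

p_A_and_B   <- prob(intersect(A, B))        # P(A and B)
p_A_given_B <- p_A_and_B / prob(B)          # conditional probability
independent <- isTRUE(all.equal(p_A_and_B, prob(A) * prob(B)))
p_A_or_B    <- prob(A) + prob(B) - p_A_and_B  # union rule
```

Here A and B are not independent, because P(A and B) = 1/3 while P(A)P(B) = 1/4.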
Frequency Table
• One way of organizing raw data is to use a
frequency table (or frequency distribution),
which shows the number of times that an
individual item occurs or the number of
items that fall within a given range or
interval.
Frequency Table
[Figure: table and bar chart of frequency against number of tenants]
Histogram with class interval
[Figure: frequency table and histogram of temperature, with class boundaries (TempRange) from 70 to 110 in steps of 5]
Random variables and probability
distributions.
• Suppose you toss a coin two times. There are
four possible outcomes: HH, HT, TH, and TT.
Let the variable X represent the number of
heads that result from this experiment. The
variable X can take on the values 0, 1, or 2. In
this example, X is a random variable, because
its value is determined by the outcome of a
statistical experiment.
Random variables and probability
distributions.
• A probability distribution is a table (or an
equation) that links each outcome of a
statistical experiment with its probability of
occurrence. The table below, which associates
each outcome (the number of heads) with its
probability, is an example of a probability
distribution.
Number of heads   0      1      2
Probability       0.25   0.50   0.25
Mean
• The arithmetic mean is the sum of the
values in a data set divided by the number
of elements in that data set.
$\bar{x} = \frac{\sum x_i}{n}$
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$, where f denotes frequency
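The two forms of the mean give the same answer, which can be sketched in R. The values and frequencies below are hypothetical, chosen only to illustrate the formulas.

```r
# Hypothetical data: three distinct values and how often each occurs
values <- c(1, 2, 3)
freq   <- c(2, 5, 3)

raw <- rep(values, times = freq)             # the 10 raw observations

mean_raw  <- sum(raw) / length(raw)          # sum(x_i) / n
mean_freq <- sum(freq * values) / sum(freq)  # sum(f_i x_i) / sum(f_i)
```

The built-in `mean(raw)` and `weighted.mean(values, freq)` compute the same two quantities.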
Variance & Standard Deviation
• List A: 12,10,9,9,10
• List B: 7,10,14,11,8
• The mean (x̄) of both A and B is 10, but the values
of A are more closely clustered around the
mean than those in B (there is greater
dispersion or spread in B). We use the
standard deviation to measure this spread.
Variance & Standard Deviation
• The variance is never negative and is zero only
when all values are equal.
variance = $\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2}{n} = \frac{\sum (x_i-\bar{x})^2}{n}$
Alternatively
variance = $\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n} - \bar{x}^2 = \frac{\sum x_i^2}{n} - \bar{x}^2$
standard deviation = $\sqrt{\text{variance}}$
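A short sketch of both variance formulas applied to lists A and B from the earlier slide. The helper names `pop_var` and `pop_var_alt` are made up for this example. Note that R's built-in `var()` divides by n-1 rather than n, so it does not reproduce these values directly.

```r
A <- c(12, 10, 9, 9, 10)
B <- c(7, 10, 14, 11, 8)

# Variance with divisor n, as defined on the slide (hypothetical helper names)
pop_var     <- function(x) sum((x - mean(x))^2) / length(x)
pop_var_alt <- function(x) sum(x^2) / length(x) - mean(x)^2

var_A <- pop_var(A)
var_B <- pop_var(B)
sd_A  <- sqrt(var_A)
sd_B  <- sqrt(var_B)

# var() uses divisor n-1; rescale it to compare with the slide's definition
var_A_from_builtin <- var(A) * (length(A) - 1) / length(A)
```

List B has the larger variance (6 against 1.2), matching the observation that its values are more spread out.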
Variance of a frequency distribution
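variance = $\frac{f_1(x_1-\bar{x})^2 + f_2(x_2-\bar{x})^2 + \dots + f_t(x_t-\bar{x})^2}{f_1 + f_2 + \dots + f_t} = \frac{\sum f_i (x_i-\bar{x})^2}{\sum f_i}$
Alternatively
variance = $\frac{f_1 x_1^2 + f_2 x_2^2 + \dots + f_t x_t^2}{f_1 + f_2 + \dots + f_t} - \bar{x}^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2$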
Median
• The median is the middle value. If the
elements are sorted, the median is:
• Median = valueAt[(n+1)/2] for odd n
• Median = average(valueAt[n/2], valueAt[n/2+1]) for even n
Mode
• The mode is the class or class value which
occurs most frequently. We can have
bimodal or multimodal collections of data.
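The median rule and the mode can be sketched in R as follows. The function names `median_by_position` and `modes` and the two sample lists are assumptions for illustration; base R already provides `median()`, but has no built-in function for the statistical mode.

```r
# Median by the positional rule (hypothetical helper name)
median_by_position <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the middle value
  } else {
    mean(c(s[n / 2], s[n / 2 + 1]))   # even n: average of the two middle values
  }
}

# All values that occur most frequently (more than one if multimodal)
modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

odd_list  <- c(7, 10, 14, 11, 8)
even_list <- c(2, 3, 3, 5, 5, 9)   # bimodal: 3 and 5 each occur twice

median_by_position(odd_list)
median_by_position(even_list)
modes(even_list)
```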
Trials with 2 possible outcomes.
• Outcome = success or failure
• Let p be the probability of success, then q=1-p is
the probability of failure.
• Often we are interested in the number of
successes without considering their order.
• The probability of exactly k successes in n
repeated trials is:
• $b(k,n,p) = \binom{n}{k} p^k q^{n-k}$
Bernoulli Trials: Example
• John hits target: p=1/4
• John fires 6 times, n=6
• What is the probability John hits the target at least
once?
Probability that John does not hit target (no success, all failures):
$P(0) = \binom{6}{0}\left(\frac{1}{4}\right)^0\left(\frac{3}{4}\right)^6 = \frac{729}{4096}$
Probability that John hits target at least once:
$P(X > 0) = 1 - \frac{729}{4096} \approx 0.82$
There is only 1 way to pick 0 from 6.
0 to the power 0 is undefined, anything else to the power of zero is 1.
Bernoulli Trials: Example
• Probability that Mary hits target: p=1/4,
• Mary fires 6 times, n=6,:
• What is the probability Mary hits the target more
than 4 times?
$P(5) + P(6) = \binom{6}{5}\left(\frac{1}{4}\right)^5\left(\frac{3}{4}\right)^1 + \left(\frac{1}{4}\right)^6 = \frac{19}{4096} \approx 0.0046$
This could be written in R: 6*((1/4)^5)*((3/4)^1)+(1/4)^6
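Both Bernoulli-trial examples can also be computed with R's binomial functions `dbinom` and `pbinom`. The following is a sketch that cross-checks the hand calculation; the variable names are chosen here for illustration.

```r
n <- 6      # shots fired
p <- 1/4    # probability of hitting the target on one shot

# John: probability of no hits, and of at least one hit
p_none          <- dbinom(0, size = n, prob = p)
p_at_least_once <- 1 - p_none

# Mary: probability of more than 4 hits, i.e. P(5) + P(6)
p_more_than_4 <- sum(dbinom(5:6, size = n, prob = p))

# The same tail probability from the cumulative distribution
p_upper_tail <- pbinom(4, size = n, prob = p, lower.tail = FALSE)

# The hand-written formula from the slide
p_by_hand <- 6 * ((1/4)^5) * ((3/4)^1) + (1/4)^6
```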
Tossing Dice in R
• The rep function repeats a value; here it
generates six copies of 1/6, the probability of a
fair die landing on any one of its faces
• die <- 1:6
• p.die <- rep(1/6,6)
• The total probability sums to 1.
• sum(p.die)
Tossing Dice in R
die <- 1:6
p.die <- rep(1/6,6)
s <- table(sample(die, size=1000, prob=p.die, replace=T))
barX <- barplot(s, ylim=c(0,200))
lbls = sprintf("%0.1f%%", s/sum(s)*100)
text(x=barX, y=s+10, label=lbls)
Copy the above code and run it in R several times.
Tossing Dice in R
Represent the die as a vector with values 1 to 6, each face with probability 1/6.
> die <- 1:6
> p.die <- rep(1/6,6)
Throw the die 10 times; note that sampling is with replacement.
> sample(die, size=10, prob=p.die, replace=T)
[1] 1 1 1 2 1 6 6 2 5 1
Calculate the expected value.
> sum(die*p.die)
[1] 3.5
If we sample twice we usually get distinct samples.
> sam1 <- sample(die, size=10, prob=p.die, replace=T)
> sam2 <- sample(die, size=10, prob=p.die, replace=T)
Tossing Dice in R
• R code to throw 1000 dice and make a bar chart
of their values.
s <- table(sample(die, size=1000,
prob=p.die, replace=T))
lbls = sprintf("%0.1f%%", s/sum(s)*100)
barX <- barplot(s, ylim=c(0,200))
text(x=barX, y=s+10, label=lbls)
Print s and sum(s).
> s

  1   2   3   4   5   6
160 155 170 173 164 178
> sum(s)
[1] 1000
Tossing Dice in R
• Expected value of a discrete random
variable X is the weighted average of the
values in the range of X.
• For a die it is:
• 1*(1/6)+2*(1/6)+3*(1/6)+4*(1/6)+5*(1/6)+6*(1/6) = 3.5
• Or more simply:
• (1+2+3+4+5+6)/6 = 3.5
Random Variable
• A random variable X on a finite sample
space S is a function from S to the real
numbers R. Its range (image) is written S’.
• Let S be the sample space of outcomes from
tossing two coins. Then mapping a is:
• S={HH,HT,TH,TT} (assume HT≠TH)
• Xa(HH)=1, Xa(HT)=2, Xa(TH)=3, Xa(TT)=4
• The range (image) of Xa is:
• S’={1,2,3,4}
Random Variable
• Let S be the sample space of outcomes from
tossing two coins, where we are interested
in the number of heads. Mapping b is:
• S={HH,HT,TH,TT}
• Xb(HH)=2, Xb(HT)=1, Xb(TH)=1, Xb(TT)=0
• The range (image) of Xb is:
• S’’={0,1,2}
Random Variable
• A random variable is a function that maps
a finite sample space to numeric
values. The set of values forms a finite
probability space of real numbers, where
probabilities are assigned to the new
space according to the following rule:
pi = P(xi) = sum of probabilities of points
in S whose image is xi.
Random Variable
• The function assigning pi to xi can be
given as a table called the distribution of
the random variable.
• In an equiprobable space,
$p_i = P(x_i) = \frac{\text{number of points in } S \text{ whose image is } x_i}{\text{number of points in } S}$
(i = 1,2,3...n) gives the distribution of X
Random Variable
• The equiprobable space generated by
tossing a pair of fair dice consists of 36
ordered pairs:
• S={(1,1),(1,2),(1,3)...(6,6)}
• Let X be the random variable which
assigns to each element of S the sum of
the two integers: 2,3,4,5,6,7,8,9,10,11,12
Random Variable
• Continuing with the sum of the two dice.
• There is only one point whose image is 2, giving
P(2)=1/36.
• There are two points whose image is 3, giving
P(3)=2/36. (<1,2>≠<2,1>, but their sums are equal.)
• Below is the distribution of X.
xi   2     3     4     5     6     7     8     9     10    11    12
pi   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The probabilities sum to 36/36 = 1.
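The distribution of the sum of two dice can be reproduced by enumerating all 36 ordered pairs. This is a minimal sketch using base R's `outer` and `table`; the variable names are chosen here for illustration.

```r
die <- 1:6

# 6 x 6 matrix holding the sum for every ordered pair (first die, second die)
sums <- outer(die, die, "+")

counts <- table(sums)            # number of points in S whose image is each x_i
dist   <- counts / length(sums)  # divide by the 36 points in S

counts
round(dist, 3)
```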
Example: Random Variable
• A box contains 9 good items and 3 defective items
(total 12 items). Three items are selected at
random from the box. Let X be the random variable
that counts the number of defective items in a
sample. X can have values 0-3.
$p_i = \binom{3}{x_i}\binom{9}{3-x_i} \Big/ \binom{12}{3}$, where $\binom{12}{3} = 220$
• Below is the distribution of X.
xi   0        1         2        3
pi   84/220   108/220   27/220   1/220
The probabilities sum to 220/220 = 1.
Example: Random Variable
• There are choose(12,3) = 220 different samples of size 3.
• There are choose(9,3) = 84 samples of size 3
with 0 defective.
• There are choose(9,2)*3 = 108 samples of size 3
with 1 defective.
• There are choose(3,2)*9 = 27 samples of size 3
with 2 defective.
• There is 1 sample of size 3 with 3 defective.
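These counts can be checked with `choose`. Sampling without replacement from good and defective items follows the hypergeometric distribution, so base R's `dhyper` gives the same probabilities. A sketch, with variable names chosen for illustration:

```r
x <- 0:3   # possible numbers of defective items in a sample of 3

# Ways to pick x defective items from 3 and (3 - x) good items from 9
counts <- choose(3, x) * choose(9, 3 - x)
p      <- counts / choose(12, 3)

# Cross-check: dhyper(x, m, n, k) with m = 3 defective, n = 9 good, k = 3 drawn
p_hyper <- dhyper(x, m = 3, n = 9, k = 3)

counts
p
```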
Functions of a Random Variable
• If X is a random variable then so is
Y=f(X).
• P(yk) = sum of the probabilities P(xi) over all xi such that
yk=f(xi)
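As a sketch of this rule, take X to be the value shown by a fair die and let Y = f(X) = (X - 3)^2. The function f is a hypothetical choice for illustration. Several values of X map to the same value of Y, so their probabilities are added.

```r
x <- 1:6
p_x <- rep(1/6, 6)

f <- function(x) (x - 3)^2   # hypothetical function of the random variable
y <- f(x)                    # 4 1 0 1 4 9

# P(y_k) = sum of P(x_i) over all x_i with f(x_i) = y_k
p_y <- tapply(p_x, y, sum)
p_y
```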
Expectation and variance of a
random variable
• Let X be a discrete random variable over sample
space S.
• X takes values x1,x2,x3,... xt with respective
probabilities p1,p2,p3,... pt
• An experiment which generates S is repeated n
times and the numbers x1,x2,x3,... xt occur
with frequency f1,f2,f3,... ft (∑fi = n)
• If n is large then one expects
$\frac{f_1}{n} \approx p_1, \frac{f_2}{n} \approx p_2, \dots, \frac{f_t}{n} \approx p_t$
Expectation of a random variable
• So $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ becomes
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_t x_t}{n} = \frac{f_1}{n}x_1 + \frac{f_2}{n}x_2 + \dots + \frac{f_t}{n}x_t \approx x_1 p_1 + x_2 p_2 + \dots + x_t p_t$
• The final formula is the population mean, expectation,
or expected value of X, denoted μ or E(X).
Variance of a random variable
• The variance of X is denoted as σ² or Var(X).
variance = $\frac{f_1(x_1-\bar{x})^2 + f_2(x_2-\bar{x})^2 + \dots + f_t(x_t-\bar{x})^2}{n}$
$= \frac{f_1}{n}(x_1-\bar{x})^2 + \frac{f_2}{n}(x_2-\bar{x})^2 + \dots + \frac{f_t}{n}(x_t-\bar{x})^2$
$\approx (x_1-\mu)^2 p_1 + (x_2-\mu)^2 p_2 + \dots + (x_t-\mu)^2 p_t = \mathrm{Var}(X)$
• The standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$
Expected value, Variance,
Standard Deviation
• E(X) = μ = μx = ∑xipi
• Var(X) = σ² = σx² = ∑(xi - μ)²pi
• SD(X) = σx = √Var(X)
Relation between population and
sample mean.
• If we select a sample of size N at random from
a population, then it is possible to show that
the expected value of the sample mean m
equals the population mean μ.
• This rule differs slightly for variance. The expected
value of the sample variance (computed with divisor N)
is (N-1)/N times the population variance; the factor
is almost 1 when N is large.
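Both statements can be verified exactly on a small case by listing every possible sample. The sketch below assumes the population is the six faces of a fair die and that samples of size N = 2 are drawn with replacement; these choices are illustrative and not from the lecture.

```r
die <- 1:6
pop_mean <- mean(die)
pop_var  <- mean((die - pop_mean)^2)   # population variance, divisor = population size

# All 36 equally likely samples of size N = 2, drawn with replacement
N <- 2
samples <- expand.grid(first = die, second = die)

sample_means <- rowMeans(samples)
# Sample variance with divisor N (not N - 1)
sample_vars <- apply(samples, 1, function(s) mean((s - mean(s))^2))

mean(sample_means)   # expected value of the sample mean
mean(sample_vars)    # expected value of the divisor-N sample variance
```

Averaging over all samples gives the population mean for the sample means, and (N-1)/N = 1/2 of the population variance for the sample variances.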
Example : Random Variable & Expected
Value
xi   0        1         2        3
pi   84/220   108/220   27/220   1/220
• μ is the expected number of defective items in a
sample of size 3.
• μ = E(X) =
0(84/220)+1(108/220)+2(27/220)+3(1/220) = 165/220 = ?
• Var(X) =
0²(84/220)+1²(108/220)+2²(27/220)+3²(1/220) - μ² = ?
• SD(X) = sqrt(Var(X)) = ?
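One way to work out the three quantities is to evaluate the formulas directly in R. A sketch using the distribution from the table, with variable names chosen for illustration:

```r
x <- 0:3
p <- c(84, 108, 27, 1) / 220

mu       <- sum(x * p)             # E(X)
ex2      <- sum(x^2 * p)           # E(X^2)
variance <- ex2 - mu^2             # Var(X) = E(X^2) - mu^2
std_dev  <- sqrt(variance)         # SD(X)

c(mean = mu, variance = variance, sd = std_dev)
```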
Fair Game1?
• If a prime number appears on a fair die the player
wins that value. If a non-prime appears the player
loses that value. Is the game fair? (A fair game has E(X)=0.)
• S={1,2,3,4,5,6}
xi   2     3     5     -1    -4    -6
pi   1/6   1/6   1/6   1/6   1/6   1/6
• E(X) = 2(1/6)+3(1/6)+5(1/6)+(-1)(1/6)+(-4)(1/6)+(-6)(1/6)= -1/6
• The game is unfavourable to the player because E(X)<0
• Note: 1 is not prime, 2 is prime
Fair Game2?
• A player tosses two fair coins. The player
wins €2 if two heads occur, and wins €1 if
one head occurs. The player loses €3 if no
heads occur. Find the expected value of the
game. How would you test whether or not the
game is fair? Is the game fair? Show the
sample space and distribution.
Fair Game2?
• Sample Space S = {HH,HT,TH,TT} each point
has probability ¼.
• X(HH) = 2, X(HT)=X(TH)=1, X(TT)= -3
• E(X) = 2(1/4)+1(2/4)-3(1/4) = 0.25
• Game is fair if E(X)=0
• Game favours player because E(X)>0
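The fairness test for both games amounts to computing the expected payoff and comparing it with zero. A minimal sketch; the helper name `expected_value` is made up for this example.

```r
# Expected value of a discrete random variable (hypothetical helper name)
expected_value <- function(x, p) sum(x * p)

# Fair Game1: win the value on a prime (2, 3, 5), lose it otherwise (1, 4, 6)
payoff1 <- c(2, 3, 5, -1, -4, -6)
e1 <- expected_value(payoff1, rep(1/6, 6))

# Fair Game2: payoffs for HH, HT, TH, TT, each with probability 1/4
payoff2 <- c(2, 1, 1, -3)
e2 <- expected_value(payoff2, rep(1/4, 4))

c(game1 = e1, game2 = e2)   # a fair game would have expected value 0
```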
Distribution Example
• Five cards are numbered 1 to 5. Two
cards are drawn at random. Let X denote
the sum of the numbers drawn. Find (a)
the distribution of X and (b) the mean,
variance, and standard deviation.
• There are choose(5,2) = 10 ways of
drawing two cards at random.
Distribution Example
• Ten equiprobable sample points with their
corresponding X-values are
points   1,2  1,3  1,4  1,5  2,3  2,4  2,5  3,4  3,5  4,5
xi       3    4    5    6    5    6    7    7    8    9
Distribution Example(3)
• The distribution is:
xi   3    4    5    6    7    8    9
pi   0.1  0.1  0.2  0.2  0.2  0.1  0.1
Distribution Example(4)
• The distribution is:
xi   3    4    5    6    7    8    9
pi   0.1  0.1  0.2  0.2  0.2  0.1  0.1
• The mean is: 3(0.1)+...+9(0.1) = 6
• E(X²) is 3²(0.1)+...+9²(0.1) = 39
• The variance is 39 – 6² = 3
• The SD is sqrt(3) ≈ 1.7
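The card example can be reproduced by enumerating the ten possible pairs with `combn`. This is a sketch; the variable names are chosen here for illustration.

```r
cards <- 1:5

# Sum of each of the choose(5,2) = 10 equally likely pairs
sums <- combn(cards, 2, FUN = sum)

tab <- table(sums)
x   <- as.numeric(names(tab))          # distinct values of X
p   <- as.vector(tab) / length(sums)   # their probabilities

mu       <- sum(x * p)
ex2      <- sum(x^2 * p)
variance <- ex2 - mu^2
std_dev  <- sqrt(variance)
```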
Identically Distributed Variables
Random variables with the same probability distribution
Binomial Distribution
• A random variable Xn is defined on a sample space
S. We count the number of successful outcomes of
n repeated trials of a success or failure type
experiment. The distribution of Xn is:
k      0       1                          2                            ...   n
P(k)   $q^n$   $\binom{n}{1} p q^{n-1}$   $\binom{n}{2} p^2 q^{n-2}$   ...   $p^n$
• Where probability of success in a trial is: p = 1 – q
Binomial Distribution
• E(Xn ) = np
• Var(Xn)=npq
• SD(Xn)=sqrt(Var(Xn))
Binomial Distribution
• If a fair die is tossed 180 times the expected
number of 6’s is:
μ=E(X)=np=180(1/6)=30
• The standard deviation is:
$\sigma = \sqrt{npq} = \sqrt{180(1/6)(5/6)} = 5$
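The closed-form mean and standard deviation can be compared with values computed directly from the binomial probabilities given by `dbinom`. A sketch, with variable names chosen for illustration:

```r
n <- 180
p <- 1/6
q <- 1 - p

mu    <- n * p            # E(X) = np
sigma <- sqrt(n * p * q)  # SD(X) = sqrt(npq)

# Cross-check from the definition, summing over every possible count k
k  <- 0:n
pk <- dbinom(k, size = n, prob = p)
mu_from_dist  <- sum(k * pk)
var_from_dist <- sum((k - mu_from_dist)^2 * pk)
```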
Normal Distribution
The expected value is the mean of
a sampling distribution of a statistic.
• The number of heads after a fair coin is
tossed 6 times.
• E(X) = (0x1.5%)+(1x9.3%)+(2x23.4%)+(3x31.2%)+
(4x23.4%)+(5x9.3%)+(6x1.5%) ≈ 3
L7: Review: Permutations &
Combinations
• The number of distinguishable
permutations of the word TITLE.
• Number of 2-permutations of the word
HOGS.
• List the 2-combinations of the word
HOGS.
Machine Learning
Correct and Incorrect Interpretations
Data and a Linear Model (see Lab1)
[Figure: moving the line to get a best fit]
[Figure: changing the slope of the line to get a best fit]
R can calculate the maximum likelihood estimate of
the intercept and slope giving: y = 4.8 + (0.6 * x)
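A fit of this kind can be sketched with R's `lm` function. The Lab1 data are not reproduced in these slides, so the x and y values below are hypothetical, constructed so that the fitted line is y = 4.8 + 0.6x. `lm` uses least squares, which coincides with the maximum likelihood estimate when the errors are assumed to be normally distributed.

```r
# Hypothetical data, not the Lab1 data
x <- 1:5
y <- c(5.5, 5.8, 6.7, 7.2, 7.8)

fit <- lm(y ~ x)   # least-squares fit of y = intercept + slope * x
coef(fit)          # estimated intercept and slope

plot(x, y)
abline(fit)        # draw the fitted line over the data
```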
Two types of data: categorical and continuous. The type of data determines the
types of statistics and graphs that can be used. The two main types of statistical variable are:
Categorical
Nominal: Mutually exclusive categories: male/female, dead/alive, smoker/nonsmoker, bus/car/train. Tends to be unordered, with no logical hierarchy.
Ordinal: Can be ranked in a meaningful order. The distance between values is not
meaningful: race positions (1st, 2nd, 3rd), grouped
amounts (1-5, 6-10, 11-15 per day). Unlike nominal data, ordinal values can be
compared with each other.
Continuous
Interval: Meaningful distance information. Intervals are equidistant e.g. Fahrenheit
scale, Celsius scale. Addition or subtraction allowed, but not multiplication or
division.
Ratio: Similar to interval data but has a true zero point: height, weight, speed, time,
Kelvin scale. Multiplication and division are allowed
There is a hierarchy of data “quality”. Ratio is the highest level of data, nominal is
the lowest.
Measurements, Observations,
Variables, Values
Measurement
- How we get our data
Observations
- Person or thing measured (rows)
Statistical Variables
- Characteristic being measured (columns)
Values
- Realised measurements / datum
ID   Gender   Height (cm)
1    2        168.7
2    1        172.0
3    1        176.5
4    1        160.5
5    2        174.0
6    1        168.6
7    2        160.0
8    2        163.0
9    1        175.0
10   2        161.4
Descriptive Statistics
• A good statistical model should…
- be simpler than the original data
- make the most of the data
- communicate accurately without distortion
• Mean is a measure of central tendency
• Median is the central value when values are sorted.
• Standard Deviation is a measure of dispersion.
• When the distribution of values is skewed, the mean can
be an unreliable measure of central tendency, and the
median becomes the preferred reporting method.
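These summaries can be computed in R for the ten heights in the earlier ID/Gender/Height table. This is a sketch; the data frame name `people` is made up here, and the slides do not say what the gender codes 1 and 2 stand for, so they are kept as codes.

```r
people <- data.frame(
  id     = 1:10,
  gender = factor(c(2, 1, 1, 1, 2, 1, 2, 2, 1, 2)),  # codes as given in the table
  height = c(168.7, 172.0, 176.5, 160.5, 174.0,
             168.6, 160.0, 163.0, 175.0, 161.4)
)

mean_height   <- mean(people$height)     # central tendency
median_height <- median(people$height)   # central value of the sorted heights
sd_height     <- sd(people$height)       # dispersion (sd() uses divisor n-1)

# Mean height within each gender code
group_means <- tapply(people$height, people$gender, mean)
```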
Descriptive Statistics
• The mean is sensitive to sample size.
Descriptive Statistics
[Figures: frequency plotted against values or normalized values; distribution plotted against values or normalized values]
Normal Distribution in R
• The height of one hundred people was
measured in centimetres, with
mean = 170, sd=8.
• We can program this in R:
• ht <- seq(150,190,0.1)
• #Note type is “l” for line
plot(ht,dnorm(ht,170,8), type="l",ylab="Probability density",xlab="height")
Normal Distribution in R
• > plot(ht,pnorm(ht,170,8), type="l",ylab="Cumulative Distribution Function",xlab="height")
• > plot(ht,dnorm(ht,170,8), type="l",ylab="Probability density",xlab="height")
Z
• What is the probability that a randomly
selected individual will be:
– Taller than a particular height
– Shorter than a particular height
– Between two heights
• We answer these questions using the R
pnorm function. We first convert a height y
to a z value, where $z = \frac{y - \bar{y}}{s}$,
with $\bar{y}$ the mean and s the standard deviation.
Z
Standard Normal Distribution
• Find the probability that someone is less
than 160cm
z = (160-170)/8 = -1.25, pnorm(-1.25) ≈ 0.106
• Find the probability that someone is
greater than 185cm
z = (185-170)/8 = 1.875, 1-pnorm(1.875) ≈ 0.03
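The same probabilities can be obtained without converting to z by passing the mean and standard deviation to `pnorm`. The sketch below also covers the "between two heights" case; the variable names are chosen for illustration.

```r
m <- 170   # mean height in cm
s <- 8     # standard deviation in cm

# Shorter than 160 cm: via the z value and directly
z_160      <- (160 - m) / s
p_less_160 <- pnorm(160, mean = m, sd = s)

# Taller than 185 cm: upper tail
p_greater_185 <- pnorm(185, mean = m, sd = s, lower.tail = FALSE)

# Between 160 cm and 185 cm: difference of two cumulative probabilities
p_between <- pnorm(185, mean = m, sd = s) - pnorm(160, mean = m, sd = s)
```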
T-Test
• The t-test assesses whether the means of two groups are
statistically different from each other.
• If, assuming the null hypothesis of no difference is true, there is a
less than 5% chance (p-value<0.05) of getting differences at least as
large as those observed, we reject the null hypothesis and
say we found a statistically significant difference between the two
groups.
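A two-sample t-test can be run with R's `t.test`. As a sketch, the example below reuses the ten heights from the earlier ID/Gender/Height table and compares the two gender codes. This pairing of data and test is an illustration, not an analysis from the lecture. By default `t.test` performs Welch's test, which does not assume equal variances.

```r
height <- c(168.7, 172.0, 176.5, 160.5, 174.0,
            168.6, 160.0, 163.0, 175.0, 161.4)
gender <- factor(c(2, 1, 1, 1, 2, 1, 2, 2, 1, 2))   # codes as given in the table

result <- t.test(height ~ gender)   # Welch two-sample t-test

result$estimate   # the two group means
result$p.value    # compare with 0.05

significant <- result$p.value < 0.05
```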
T-Test
Correlation
Correlation
The correlation coefficient is equal to the slope of the
regression line when both the X and Y variables have been
converted to z-scores, where z is the standardized score: $z = \frac{x - \bar{x}}{s}$
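This relationship can be checked numerically. The sketch below uses hypothetical x and y values, standardizes both with `scale`, and compares the regression slope with `cor`.

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.7)

# Convert both variables to z-scores: (value - mean) / standard deviation
zx <- as.vector(scale(x))
zy <- as.vector(scale(y))

slope <- coef(lm(zy ~ zx))[["zx"]]   # slope of the regression on z-scores
r     <- cor(x, y)                   # correlation coefficient

c(slope = slope, r = r)
```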
Confidence Intervals
• A range with limits below and above the sample mean
• Used to infer from a sample mean to the mean of
a wider population
• If a study were conducted 100 times, about 95 of
the resulting 95% confidence intervals would
contain the population mean
• Confidence intervals are wider if the
sample is small and if the data are more varied.
Confidence Intervals
• A survey was conducted on the rate of work-related
stress in a 12 month period (per 100,000
employed).
• The mean was 780 per 100,000 employed.
• The 95% confidence limits are 700 to 860 people per 100,000
• This shows that, if the survey were repeated many times,
about 95% of the intervals constructed this way would contain
the true mean number of people self-reporting work-related
stress in the 12 months
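The survey's raw data are not given, so as a sketch the calculation of a 95% confidence interval is shown on the ten heights from the earlier ID/Gender/Height table instead. It assumes a t-based interval for the mean, which is what `t.test` reports.

```r
height <- c(168.7, 172.0, 176.5, 160.5, 174.0,
            168.6, 160.0, 163.0, 175.0, 161.4)

n <- length(height)
m <- mean(height)
s <- sd(height)

# Half-width: t quantile times the standard error of the mean
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
ci_manual  <- c(m - half_width, m + half_width)

# The same interval as reported by t.test
ci_builtin <- as.vector(t.test(height, conf.level = 0.95)$conf.int)
```

A smaller n or a larger s increases `half_width`, which is why small or highly varied samples give wider intervals.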
Confidence Intervals
simpleR : Using R for Introductory
Statistics, by John Verzani
• Univariate Data
• Bivariate Data
• Linear regression
• Random Data
• Simulations
• Exploratory Data Analysis
• Confidence Interval Estimation
• Hypothesis Testing
• Two-sample tests
• Regression Analysis
• Multiple Linear Regression
• Analysis of Variance