Spatial Statistics (SGG 2413)
Descriptive Statistics
Assoc. Prof. Dr. Abdul Hamid b. Hj. Mar Iman
Director
Centre for Real Estate Studies
Faculty of Engineering and Geoinformation Science
Universiti Teknologi Malaysia
Skudai, Johor
Spatial Statistics: Topic 3
Learning Objectives
• Overall: To give students a basic understanding of descriptive statistics
• Specific: Students will be able to:
* understand the basic concept of descriptive statistics
* understand the concept of distribution
* calculate measures of central tendency and dispersion
* calculate measures of kurtosis and skewness
Contents
What is descriptive statistics
Central tendency, dispersion, kurtosis,
skewness
Distribution
Descriptive Statistics
• Use sample information to explain or make an abstraction of population “phenomena”.
• Common “phenomena”:
* Association (e.g. σ1,2.3 = 0.75)
* Tendency (left-skew, right-skew)
* Trend, pattern, location, dispersion, range
* Causal relationship (e.g. if X then Y)
• Emphasis on meaningful characterisation of data (e.g. central tendency, variability), graphics, and description
• Use non-parametric analysis (e.g. χ², t-test, 2-way ANOVA)
E.g. of Abstraction of phenomena
[Figure: Trends in property loan, shop house demand & supply. Series: loan to property sector (RM million), demand for shop houses (units), supply of shop houses (units). Horizontal axis: Year (1990 - 1997)]
[Figure: No. of houses by district (Batu Pahat, Johor Bahru, Kluang, Kota Tinggi, Mersing, Muar, Pontian, Segamat), 1991 and 2000]
[Figure: Proportion (%) by age category (years old)]
[Figure: Price (RM/sq.ft. built area) against demand (% sales success)]
Inferential Statistics
Using sample statistics to infer some “phenomena” of population parameters
Common “phenomena”: cause-and-effect
* One-way r/ship: Y = f(X)
* Feedback r/ship: Y1 = f(Y2, X, e1); Y2 = f(Y1, Z, e2)
* Recursive: Y1 = f(X, e1); Y2 = f(Y1, Z, e2)
Use parametric analysis (e.g. α and β) through regression analysis
Emphasis on hypothesis testing
Parametric statistics
Statistical analysis that attempts to explain
the population parameter using a sample
E.g. of statistical parameters: mean,
variance, std. dev., R2, t-value, F-ratio, xy,
etc.
It assumes that the distributions of the
variables being assessed belong to known
parameterised families of probability
distributions
Examples of parametric relationship
Dep = 9t – 215.8
Dep = 7t – 192.6
Coefficients (a)
Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
(Constant) | 1993.108 | 239.632 | | 8.317 | .000
Tanah | -4.472 | 1.199 | -.190 | -3.728 | .000
Bangunan | 6.938 | .619 | .705 | 11.209 | .000
Ansilari | 4.393 | 1.807 | .139 | 2.431 | .017
Umur | -27.893 | 6.108 | -.241 | -4.567 | .000
Flo_go | 34.895 | 89.440 | .020 | .390 | .697
a. Dependent Variable: Nilaism
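As a quick check on the coefficients table, each t value should equal the unstandardized coefficient B divided by its standard error. The sketch below is only illustrative: the figures are copied from the table, and the names `coefficients` and `t_values` are not part of the original output.

```python
# (B, Std. Error) pairs copied from the coefficients table.
coefficients = {
    "(Constant)": (1993.108, 239.632),
    "Tanah": (-4.472, 1.199),
    "Bangunan": (6.938, 0.619),
    "Ansilari": (4.393, 1.807),
    "Umur": (-27.893, 6.108),
    "Flo_go": (34.895, 89.440),
}

# t = B / Std. Error for every variable in the model.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<11} t = {t:8.3f}")
```

The computed ratios agree with the t column to about two decimal places; small differences come from the rounding of B and the standard errors in the printed table.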
Non-parametric statistics
• First used by Wolfowitz (1942)
• Statistical analysis that attempts to explain the population parameter using a sample without making assumptions about the frequency distribution of the assessed variable
• In other words, the variable being assessed is distribution-free
• E.g. of non-parametric statistics: histogram, stochastic kernel, non-parametric regression
Descriptive & Inferential Statistics (DS & IS)
• DS gathers information about a population characteristic (e.g. income) and describes it with a parameter of interest (e.g. mean)
• IS uses the parameter to test a hypothesis pertaining to that characteristic. E.g.
Ho: mean income = RM 4,000
H1: mean income < RM 4,000
• The result of hypothesis testing is used to make inferences about the characteristic of interest
(e.g. Malaysian  upper middle income)
Sample Statistics: Central Tendency
Mean (sum of all values ÷ no. of values)
Advantages:
• Best known average
• Exactly calculable
• Makes use of all data
• Useful for statistical analysis
Disadvantages:
• Affected by extreme values
• Can be absurd for discrete data (e.g. family size = 4.5 persons)
• Cannot be obtained graphically
Median (middle value)
Advantages:
• Not influenced by extreme values
• Obtainable even if data distribution unknown (e.g. group/aggregate data)
• Unaffected by irregular class width
• Unaffected by open-ended class
Disadvantages:
• Needs interpolation for group/aggregate data (cumulative frequency curve)
• May not be characteristic of group when: (1) items are only few; (2) distribution irregular
• Very limited statistical use
Mode (most frequent value)
Advantages:
• Unaffected by extreme values
• Easy to obtain from histogram
• Determinable from only values near the modal class
Disadvantages:
• Cannot be determined exactly in group data
• Very limited statistical use
Central Tendency – Mean
• For individual observations, $\bar{x} = \sum x_i / n$. E.g.
X = {3,5,7,7,8,8,8,9,9,10,10,12}
$\sum x_i$ = 96; n = 12
• Thus, $\bar{x}$ = 96/12 = 8
• The above observations can be organised into a frequency table and the mean calculated on the basis of frequencies:
x | 3 | 5 | 7 | 8 | 9 | 10 | 12
f | 1 | 1 | 2 | 3 | 2 | 2 | 1
fx | 3 | 5 | 14 | 24 | 18 | 20 | 12
$\sum fx$ = 96; $\sum f$ = 12
Thus, $\bar{x} = \sum fx / \sum f$ = 96/12 = 8
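The two routes to the mean (individual observations and the frequency table) can be sketched in a few lines of Python. This is a minimal illustration using the observations from this slide; the variable names are assumptions made for the example.

```python
# Individual observations from the slide.
values = [3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12]
mean_individual = sum(values) / len(values)

# The same data as a frequency table: value x -> frequency f.
freq = {3: 1, 5: 1, 7: 2, 8: 3, 9: 2, 10: 2, 12: 1}
sum_fx = sum(x * f for x, f in freq.items())   # sum of f*x
sum_f = sum(freq.values())                     # sum of f
mean_frequency = sum_fx / sum_f

print(mean_individual, mean_frequency)  # 8.0 8.0
```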
Central Tendency - Mean and Mid-point
• Let's say we have data like this:
Price (RM ‘000/unit) of Shop Houses in Skudai
Location | Min | Max
Town A | 228 | 450
Town B | 320 | 430
Can you calculate the mean?
Central Tendency - Mean and Mid-point
(contd.)
Let’s calculate: M = ½(Min + Max)
Town A: (228+450)/2 = 339
Town B: (320+430)/2 = 375
Are these figures means?
Central Tendency - Mean and Mid-point (contd.)
• Let’s say we have price data as follows:
Town A: 228, 295, 310, 420, 450
Town B: 320, 295, 310, 400, 430
• Calculate the means:
Town A: (228 + 295 + 310 + 420 + 450)/5 = 340.6
Town B: (320 + 295 + 310 + 400 + 430)/5 = 351.0
• Are the results the same as previously? No: the mid-points were 339 and 375.
• Be careful about mean and “mid-point”!
Central Tendency – Mean of Grouped Data
• House rentals or prices in the PMR are frequently tabulated as a range of values. E.g.
Rental (RM/month) | 135-140 | 140-145 | 145-150 | 150-155 | 155-160
Mid-point value (x) | 137.5 | 142.5 | 147.5 | 152.5 | 157.5
Number of Taman (f) | 5 | 9 | 6 | 2 | 1
fx | 687.5 | 1282.5 | 885.0 | 305.0 | 157.5
• What is the mean rental across the areas?
$\sum f$ = 23; $\sum fx$ = 3317.5
Thus, $\bar{x}$ = 3317.5/23 = 144.24
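The grouped-data calculation can be sketched as follows. The class limits and frequencies are taken from the rental table; representing each class as a `(lower, upper, f)` tuple is a choice made for this example only.

```python
# (lower limit, upper limit, number of Taman) for each rental class.
classes = [
    (135, 140, 5),
    (140, 145, 9),
    (145, 150, 6),
    (150, 155, 2),
    (155, 160, 1),
]

sum_f = sum(f for _, _, f in classes)
# Each class is represented by its mid-point (lower + upper) / 2.
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
grouped_mean = sum_fx / sum_f

print(sum_f, sum_fx, round(grouped_mean, 2))  # 23 3317.5 144.24
```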
Central Tendency – Median
• Let's say house rentals in a particular town are tabulated:
Rental (RM/month) | 130-135 | 135-140 | 140-145 | 145-150 | 150-155
Number of Taman (f) | 3 | 5 | 9 | 6 | 2
Rental (RM/month) | < 135 | < 140 | < 145 | < 150 | < 155
Cumulative frequency | 3 | 8 | 17 | 23 | 25
• Calculation of the “median” rental needs a graphical aid (the ogive):
1. Median position = (n+1)/2 = (25+1)/2 = 13th Taman
2. (i.e. between 10 – 15 points on the vertical axis of the ogive).
3. This corresponds to RM 140-145/month on the horizontal axis
4. There are (17-8) = 9 Taman in the range of RM 140-145/month
5. The 13th Taman is the 5th out of the 9 Taman
6. The rental interval width is 5
7. Therefore, the median rental can be calculated as:
140 + (5/9 x 5) = RM 142.8
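The interpolation steps for the median can be sketched as a small helper. This is an illustrative implementation under the assumption of equal class widths; the function name `interpolate_position` is hypothetical and not part of the lecture.

```python
def interpolate_position(lower_bounds, width, freqs, position):
    """Value at a given ordinal position in a grouped frequency table,
    found by linear interpolation inside the class containing it."""
    cumulative = 0
    for lower, f in zip(lower_bounds, freqs):
        if position <= cumulative + f:
            rank_in_class = position - cumulative   # e.g. 5th of 9 Taman
            return lower + (rank_in_class / f) * width
        cumulative += f
    raise ValueError("position exceeds total frequency")

lower_bounds = [130, 135, 140, 145, 150]   # class lower limits (RM/month)
freqs = [3, 5, 9, 6, 2]                    # number of Taman per class
n = sum(freqs)

median_position = (n + 1) / 2              # 13th Taman
median_rental = interpolate_position(lower_bounds, 5, freqs, median_position)
print(round(median_rental, 1))  # 142.8
```

The same helper works for any ordinal position, so passing (n+1)/4 or 3(n+1)/4 instead of (n+1)/2 gives interpolated quartiles.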
Central Tendency – Median (contd.)
Central Tendency – Quartiles (contd.)
Following the same process as in calculating the “median”:
Upper quartile = ¾(n+1) = 19.5th Taman
UQ = 145 + (2.5/6 x 5) = RM 147.1/month
Lower quartile = (n+1)/4 = 26/4 = 6.5th Taman
LQ = 135 + (3.5/5 x 5) = RM 138.5/month
Inter-quartile range = UQ – LQ = 147.1 – 138.5 = RM 8.6/month
IQ = 138.5 + (4/5 x 5) = RM
142.5/month
Variability
• Indicates dispersion, spread, variation, deviation
• For single population or sample data:
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$ ; $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where σ² and s² = population and sample variance respectively, xi = individual observations, μ = population mean, x̄ = sample mean, and n = total number of individual observations.
• The square roots are:
$\sigma = \sqrt{\sigma^2}$ (population standard deviation)
$s = \sqrt{s^2}$ (sample standard deviation)
Variability (contd.)
• Why is a “measure of dispersion” important?
• Consider yields of two plant species:
* Plant A (ton) = {1.8, 1.9, 2.0, 2.1, 3.6}
* Plant B (ton) = {1.0, 1.5, 2.0, 3.0, 3.9}
Mean A = mean B = 2.28 ton
But, different variability!
Var(A) = 0.557, Var(B) = 1.367
* Would you choose to grow plant A or B?
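The plant example can be reproduced with Python's standard `statistics` module. This is a minimal sketch; `statistics.variance` uses the sample formula with n - 1 in the denominator, which is what reproduces the variances quoted on the slide.

```python
import statistics

plant_a = [1.8, 1.9, 2.0, 2.1, 3.6]   # yields in tons
plant_b = [1.0, 1.5, 2.0, 3.0, 3.9]

results = {}
for name, yields in (("A", plant_a), ("B", plant_b)):
    mean = statistics.mean(yields)
    var = statistics.variance(yields)   # sample variance, divisor n - 1
    results[name] = (mean, var)
    print(f"Plant {name}: mean = {mean:.2f}, variance = {var:.3f}")
```

Both species have the same mean yield, but Plant B's variance is more than twice that of Plant A.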
Variability (contd.)
• Coefficient of variation – CV – std. deviation as % of the mean:
$CV = \frac{s}{\bar{x}} \times 100$
• A better measure compared to std. dev. in cases where samples have different means. E.g.
* Plant X (ton/ha) = {1.2, 1.4, 2.6, 2.7, 3.9}
* Plant Y (ton/ha) = {1.4, 1.5, 2.1, 3.2, 3.9}
Variability (cont.)
Yield (ton/ha)
Farm No. | Species X | Species Y
1 | 1.2 | 1.4
2 | 1.4 | 1.5
3 | 2.6 | 2.1
4 | 2.7 | 3.2
5 | 3.9 | 3.9
Mean | 2.36 | 2.42
Var. | 1.20 | 1.20
Std. dev. | 1.10 | 1.09
Calculate CV for both species.
CVx = (√1.203/2.36) x 100 ≈ 46.5%
CVy = (√1.197/2.42) x 100 ≈ 45.2%
• Species X is a little more variable than species Y
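A minimal sketch of the CV as defined earlier (standard deviation as a percentage of the mean), using the yields of species X and Y. The helper name `coefficient_of_variation` is an assumption for this example. Note that dividing the variance (1.20) rather than its square root by the mean would give roughly 51% and 49% instead.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

species_x = [1.2, 1.4, 2.6, 2.7, 3.9]   # ton/ha
species_y = [1.4, 1.5, 2.1, 3.2, 3.9]   # ton/ha

cv_x = coefficient_of_variation(species_x)
cv_y = coefficient_of_variation(species_y)
print(f"CVx = {cv_x:.1f}%, CVy = {cv_y:.1f}%")
```

Either way, species X comes out slightly more variable than species Y relative to its mean.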
Variability (cont.)
 Std. dev. of a frequency distribution
E.g. age distribution of second-home buyers (SHB):
Probability distribution
• Logical probability: If there are 20 lecturers, the probability that A becomes a professor is: p = 1/20 = 0.05
• Experiential probability: Out of 100 births, half of them were girls (p=0.5); as the number increased to 1,000, two-thirds were girls (p=0.67); but from a record of 10,000 new-born babies, three-quarters were girls (p=0.75)
• Subjective probability: The probability of a drug addict recovering from addiction is 50:50
• General rule:
Pr (event X) = (No. of times event X occurs) / (Total number of occurrences)
• The probability of a certain event X occurring has a specific form of distribution
Probability Distribution
Classical example of tossing two dice (each cell is the sum of Dice1 and Dice2):
Dice2 \ Dice1 | 1 | 2 | 3 | 4 | 5 | 6
1 | 2 | 3 | 4 | 5 | 6 | 7
2 | 3 | 4 | 5 | 6 | 7 | 8
3 | 4 | 5 | 6 | 7 | 8 | 9
4 | 5 | 6 | 7 | 8 | 9 | 10
5 | 6 | 7 | 8 | 9 | 10 | 11
6 | 7 | 8 | 9 | 10 | 11 | 12
What is the distribution of the sum of tosses?
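One way to answer the question is to enumerate all 36 equally likely outcomes and count how often each sum occurs. The sketch below assumes two fair six-sided dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 (Dice1, Dice2) pairs produce each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# Probability of each sum as an exact fraction of the 36 outcomes.
distribution = {total: Fraction(c, 36) for total, c in sorted(counts.items())}

for total, p in distribution.items():
    print(f"Pr(sum = {total:2d}) = {p}")
```

The probabilities rise from 1/36 at a sum of 2 to 1/6 at a sum of 7 and fall symmetrically back to 1/36 at 12, and they add up to 1.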
Probability Distribution (contd.)
Discrete variable
Values of x are discrete (discontinuous)
Sum of lengths of vertical bars: $\sum_{\text{all } x} p(X = x) = 1$
Probability Distribution (cont.)
Continuous variable
Age | Freq | Prob.
36 | 3 | 0.02
37 | 14 | 0.07
38 | 10 | 0.04
39 | 36 | 0.18
40 | 73 | 0.36
41 | 27 | 0.14
42 | 20 | 0.10
43 | 17 | 0.09
Total | 200 | 1.00
Mean = 39.5
Std. dev = 2.45
Pr (Area under curve) = 1
[Figure: Age distribution of second-home buyers in probability histogram]
Probability Distribution (cont.)
• Pr (Age ≤ 36) = 0.02
• Pr (Age ≤ 37) = Pr (Age ≤ 36) + Pr (Age = 37) = 0.02 + 0.07 = 0.09
• Pr (Age ≤ 38) = Pr (Age ≤ 37) + Pr (Age = 38) = 0.09 + 0.04 = 0.13
• Pr (Age ≤ 39) = Pr (Age ≤ 38) + Pr (Age = 39) = 0.13 + 0.18 = 0.31
• Pr (Age ≤ 40) = Pr (Age ≤ 39) + Pr (Age = 40) = 0.31 + 0.36 = 0.67
• Pr (Age ≤ 41) = Pr (Age ≤ 40) + Pr (Age = 41) = 0.67 + 0.14 = 0.81
• Pr (Age ≤ 42) = Pr (Age ≤ 41) + Pr (Age = 42) = 0.81 + 0.10 = 0.91
• Pr (Age ≤ 43) = Pr (Age ≤ 42) + Pr (Age = 43) = 0.91 + 0.09 = 1.00
Cumulative probability corresponds to the left tail of a distribution
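The running totals above are a cumulative sum of the individual probabilities. A minimal sketch using the probabilities from the second-home buyers table (rounded to two decimals to avoid floating-point noise):

```python
from itertools import accumulate

ages = [36, 37, 38, 39, 40, 41, 42, 43]
probs = [0.02, 0.07, 0.04, 0.18, 0.36, 0.14, 0.10, 0.09]

# Running total of the probabilities: Pr(Age <= a) for each age a.
cumulative = [round(c, 2) for c in accumulate(probs)]

for age, c in zip(ages, cumulative):
    print(f"Pr(Age <= {age}) = {c:.2f}")
```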
Probability Distribution (cont.)
• As larger and larger samples are drawn, the probability distribution gets smoother
• Tens of different types of probability distribution: Z, t, F, gamma, etc.
• Most important: normal distribution
[Figure: probability histograms for a larger sample and a very large sample]
Normal Distribution - ND
• Salient features of ND:
* Bell-shaped, symmetrical
* Total area under curve = 1
* Area under curve between any two points = prob. of values in that range (shaded area)
* Prob. of any exact value = 0
* Has a function of:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
μ = mean of variable x; σ = std. dev. of x; π = ratio of circumference of a circle to its diameter = 3.14; e = base of natural log = 2.71828.
Normal Distribution - ND
[Figure: normal curves for Population 1 and Population 2]
* μ determines location while σ determines shape of ND
* A larger population has narrower base (smaller variance)
Normal Distribution (cont.)
* Has a mean μ and a variance σ², i.e. X ~ N(μ, σ²)
* Has the following distribution of observation:
“Home-buyers example…”
Mean age = 39.3
Std. dev = 2.42
Standard Normal Distribution (SND)
• Since different populations have different μ and σ (thus, locations and shapes of distribution), they have to be standardised.
• Most common standardisation: standard normal distribution (SND), also called Z-distribution
• P(X=x) is given by area under curve
• Has no standard algebraic method of integration → Z ~ N(0,1)
• To transform f(x) into f(z):
$Z = \frac{x - \mu}{\sigma} \sim N(0, 1)$
Z-Distribution
• The probabilities are such that:
* Approx. 68% -1 < z < 1
* Approx. 95% -1.96 < z < 1.96
* Approx. 99% -2.58 < z < 2.58
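These central probabilities can be checked numerically without a Z-table. The sketch below uses the identity P(-z < Z < z) = erf(z/√2) for a standard normal variable; the helper name `central_probability` is an assumption for this example.

```python
import math

def central_probability(z):
    """P(-z < Z < z) for a standard normal variable Z."""
    return math.erf(z / math.sqrt(2))

for z in (1.0, 1.96, 2.58):
    print(f"P(-{z} < Z < {z}) = {central_probability(z):.4f}")
```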
Z-distribution (cont.)
• When X = μ, Z = 0
• When X = μ + σ, Z = 1
• When X = μ + 2σ, Z = 2
• When X = μ + 3σ, Z = 3 and so on.
• It can be proven that P(X1 < X < Xk) = P(Z1 < Z < Zk)
• SND shows the probability to the right of any particular value of Z.
Normal distribution…Questions
A study found that the mean age, A, of second-home buyers in Johor Bahru is 39.3 years old with a standard deviation of about 2.4 years. Assuming normality, how sure are you that the age is: (a) ≥ 40 years old; (b) 39 to 42 years old?
Answer (a): P(A ≥ 40)
= P[Z ≥ (40 – 39.3)/2.4]
= P(Z ≥ 0.2917 ≈ 0.3000)
= 0.3821
(b) P(39 ≤ A ≤ 42)
= P(A ≥ 39) – P(A ≥ 42)
= P[Z ≥ (39 – 39.3)/2.4] – P[Z ≥ (42 – 39.3)/2.4]
= P(Z ≥ –0.125) – P(Z ≥ 1.125)
= 0.54776 – 0.12924
= 0.4185
Use Z-table!
Always remember: to convert to SND, subtract the mean and divide by the std. dev.
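The two probabilities can also be computed directly instead of being read from a Z-table. This is a sketch under the assumptions used in the worked answer (mean 39.3, standard deviation 2.4); the helper `upper_tail` is a name chosen for this example. Note that P(A ≥ 39) corresponds to a negative z (-0.125), so it must be larger than 0.5.

```python
import math

def upper_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mean_age, std_dev = 39.3, 2.4   # values used in the worked answer

# Standardise: subtract the mean and divide by the std. dev.
z_39 = (39 - mean_age) / std_dev
z_40 = (40 - mean_age) / std_dev
z_42 = (42 - mean_age) / std_dev

p_a = upper_tail(z_40)                      # (a) P(A >= 40)
p_b = upper_tail(z_39) - upper_tail(z_42)   # (b) P(39 <= A <= 42)

print(f"(a) P(A >= 40) = {p_a:.4f}")
print(f"(b) P(39 <= A <= 42) = {p_b:.4f}")
```

The results are about 0.39 for (a) and 0.42 for (b). They differ slightly from table look-ups because a Z-table requires z to be rounded to two decimals first.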
“Student’s t-Distribution”
• Similar to Z-distribution (bell-shaped, symmetrical)
• Has a function of
$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
where Γ = gamma function; ν = n-1 = d.o.f; π = 3.142
• Flatter with thicker tails
• Distributed with t(0,σ) and -∞ < t < +∞
• As n→∞ t(0,σ) → N(0,1)
• Probability calculation requires information on d.o.f.
How Are t-dist. and Z-dist. Related?
• Using central limit theorem, N(μ, σ²/n) will become Z ~ N(0, 1) as n→∞
• For a large sample, t-dist. of a variable or a parameter is given by:
The interval of critical values for variable, x is:
Skewness, m3 & Kurtosis, m4
• Skewness, m3 measures degree of symmetry of distribution
• Kurtosis, m4 measures its degree of peakedness
• Both are useful when comparing sample distributions with different shapes
• Useful in data analysis
$m_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n\sigma^3}$ ; $m_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n\sigma^4}$
Xi = individual sample observation, X̄ = sample mean; σ = std. deviation; n = sample size
Skewness
[Figure: distribution shapes]
• Right (+ve) skew
• Left (-ve) skew
• Bimodal
• Uniform
• Perfectly normal (zero skew)
• J-shaped
Kurtosis
Leptokurtic (high peak) (+ve kurtosis)
Mesokurtic (normal) (zero kurtosis)
Platykurtic (low peak) (-ve kurtosis)
Mesokurtic distribution…kurtosis = 3
Leptokurtic distribution…kurtosis > 3
Platykurtic distribution…kurtosis < 3
Occurrence of ganoderma
X-coord. (000) | Y-coord. (000) | Trees with ganoderma
535.60 | 104.80 | 8
536.70 | 107.30 | 12
536.80 | 106.80 | 11
537.30 | 107.31 | 12
537.15 | 105.40 | 13
537.40 | 105.37 | 13
538.48 | 107.82 | 9
542.22 | 106.10 | 8
547.75 | 106.08 | 5
547.10 | 105.25 | 8
547.80 | 101.05 | 7
548.18 | 105.92 | 8
548.80 | 105.90 | 12
548.95 | 104.85 | 15
548.94 | 104.50 | 13
548.75 | 103.73 | 7
540.35 | 105.91 | 7
540.10 | 104.95 | 7
540.30 | 104.75 | 6
538.75 | 102.80 | 5
545.10 | 105.90 | 4
546.30 | 105.90 | 3
547.15 | 105.90 | 2
548.94 | 102.80 | 4
Aluminium residues in the soil
E.g. Al2++ + H2++O-- → Al2O + H2
Al p.p.m. | Freq.
0 | 0
250 | 7
500 | 13
750 | 25
1000 | 18
1250 | 13
1500 | 9
1750 | 7
2000 | 3
2250 | 4
2500 | 3
sum | 102.00
mean | 1073.53
σ | 553.05
σ² | 305867.94
σ³ | 169161266.28
σ⁴ | 93555193911.64
skew | 0.77
kurtosis | 13.44
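The mean, standard deviation and skewness in the aluminium table can be reproduced from the frequency data. The sketch below is illustrative only: it assumes moments are taken about the mean with divisor n (the total frequency), and it assumes the kurtosis is the fourth standardised moment, which may not be the formula behind the tabulated kurtosis figure.

```python
# Aluminium residue classes (p.p.m.) and their frequencies, from the table.
ppm = [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]
freq = [0, 7, 13, 25, 18, 13, 9, 7, 3, 4, 3]

n = sum(freq)
mean = sum(f * x for x, f in zip(ppm, freq)) / n

def central_moment(k):
    """k-th moment about the mean, weighted by frequency, divisor n."""
    return sum(f * (x - mean) ** k for x, f in zip(ppm, freq)) / n

sigma = central_moment(2) ** 0.5
skewness = central_moment(3) / sigma ** 3   # m3
kurtosis = central_moment(4) / sigma ** 4   # m4 (assumed definition)

print(f"n = {n}, mean = {mean:.2f}, sigma = {sigma:.2f}")
print(f"skewness = {skewness:.2f}, kurtosis = {kurtosis:.2f}")
```

Under these assumptions the sum, mean, σ and skewness match the table (102, 1073.53, 553.05 and 0.77); the positive skewness reflects the longer right tail of the residue distribution.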
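Measures of spatial separation
Weighted mean centre (X coord.) = $\bar{X}_w = \frac{\sum f_i x_i}{\sum f_i}$
Weighted mean centre (Y coord.) = $\bar{Y}_w = \frac{\sum f_i y_i}{\sum f_i}$
Distance between (x1,y1) and (x2,y2) = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
E.g. WCM = ((545.10-542.86)² + (105.90-105.48)²)^0.5
= (5.0176 + 0.1764)^0.5
= 2.28 (i.e. 2,280 m)
Standard distance = $\sqrt{\frac{\sum (X_w - \bar{X})^2 + \sum (Y_w - \bar{Y})^2}{\sum f}}$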
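A minimal sketch of the weighted mean centre and the point-to-point distance. The function names are hypothetical, the weights are the tree counts, and only the first three ganoderma points are used for the centre, purely for illustration; the distance example uses the coordinates from the worked example above.

```python
import math

def weighted_mean_centre(points):
    """points: list of (x, y, f) tuples; returns (sum(f*x)/sum(f), sum(f*y)/sum(f))."""
    total_f = sum(f for _, _, f in points)
    x_bar = sum(f * x for x, _, f in points) / total_f
    y_bar = sum(f * y for _, y, f in points) / total_f
    return x_bar, y_bar

def distance(p, q):
    """Straight-line distance between points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Illustration only: the first three points of the ganoderma table.
sample_points = [(535.60, 104.80, 8), (536.70, 107.30, 12), (536.80, 106.80, 11)]
centre = weighted_mean_centre(sample_points)
print(f"Weighted mean centre of the sample: ({centre[0]:.2f}, {centre[1]:.2f})")

# Distance from the point (545.10, 105.90) to the centre (542.86, 105.48).
wc_m = distance((545.10, 105.90), (542.86, 105.48))
print(f"Distance = {wc_m:.2f}")  # 2.28 in units of 000, i.e. about 2,280 m
```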
Spatial distribution – Occurrence of ganoderma
Sum: Σf = 191.00; ΣXw = 103687.00; ΣYw = 20147.40
Weighted mean centre: X = 542.86; Y = 105.48
Σ(Xw − X̄)² = 588.46; Σ(Yw − Ȳ)² = 55.50
Standard distance: 1.84
Point to point distance (e.g.): x-dist. = 5.00; y-dist. = 0.17; Distance Wc-M = 2.27
Spatial distribution – point data
Ethnic distribution of residence
Ethnic distribution of residence
x | f | fx | (x − x̄)²
0 | 81 | 0 | -0.49
1 | 50 | 50 | 0.51
2 | 9 | 18 | 1.51
Σ | 140 | 68 | 1.54
x̄ = 0.49; σ² = 0.01; CV = 0.02; σCV = 0.12
Test statistics: tc = -8.15
Ho: σ² = x̄ (pattern is random)
H1: σ² > x̄ (pattern is clustered) or σ² < x̄ (pattern is scattered)
Reject Ho…residence pattern is scattered
X = no. of observations per quadrat; f = frequency of quadrats; x̄ = Σ(fx)/Σf; σ² = Σ(x − x̄)²/(Σfx) − 1; CV = σ²/x̄; σCV = (2/(k−1))½; k = Σ(fx) − 1