Lecture-2
Some Basic Statistical Methods
Engr. Dr. Attaullah Shah
Distributions for Sample Data

Data distributions come in two basic varieties.

When the values that can be observed are
anything within some range, then the
distribution is said to be continuous.

If only certain particular values can be
observed, then the distribution is said to be
discrete.
Example of continuous data
Different levels of measurement: (1) nominal, (2) ordinal, (3)
interval or ratio scale
Measurements of Location
[Figure: larval lengths marked on a 0 to 4.0 mm scale with the mean indicated]
Mean = (sum of values)/n = (Σxᵢ)/n
e.g. length of 8 fish larvae at day 3 after hatching:
0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
mean length = (0.6+0.7+1.2+1.5+1.7+2.0+2.2+2.5)/8
= 1.55 mm
[Figure: the same larval lengths on a 0 to 4.0 mm scale with both the mean and the median marked]
Median: the middle value of the ordered data,
 at position (n + 1)/2 when n is an odd number;
 the average of the values at positions n/2 and n/2 + 1 when n is an even number.
[Figures: data sets with skew and with an outlier on the 0 to 4.0 mm scale, showing the mean pulled away from the median]
The median is often reported alongside the mean.
The mean is used much more frequently; however, the median is a better measure of central tendency for data with a skewed distribution or outliers (see the sketch below).
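Below is a minimal Python sketch (not part of the original lecture) using the larval-length data above; the 12.0 mm value is a hypothetical outlier added only to show how the mean shifts while the median barely moves.

# Compare mean and median for the larval-length data, with and without
# an artificial (hypothetical) outlier.
from statistics import mean, median

lengths = [0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5]   # mm, from the slide
print(mean(lengths), median(lengths))                 # 1.55 and 1.6

with_outlier = lengths + [12.0]                        # hypothetical outlier
print(mean(with_outlier), median(with_outlier))        # mean jumps to ~2.7, median only moves to 1.7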
Measurements of dispersion
Range
e.g. length of 8 fish larvae at day 3 after hatching:
0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
Range = 2.5 - 0.6 = 1.9 mm (or, expressed as an interval, from 0.6 to 2.5 mm)
Population Standard Deviation (σ)

An averaged measure of the deviations (xᵢ - x̄) from the mean.

e.g. five rainfall measurements, whose mean is 7:

Rainfall (mm)   xᵢ - x̄         (xᵢ - x̄)²
12              12 - 7 = 5      25
0               0 - 7 = -7      49
2               2 - 7 = -5      25
5               5 - 7 = -2      4
16              16 - 7 = 9      81
                                Sum = 184

Population variance: σ² = Σ(xᵢ - x̄)²/n = 184/5 = 36.8
Population SD: σ = √[Σ(xᵢ - x̄)²/n] = √36.8 ≈ 6.1
Sample SD (s)

s = √[Σ(xᵢ - x̄)² / (n - 1)]
s = √{[Σxᵢ² - (Σxᵢ)²/n] / (n - 1)}

Two modifications:

Dividing Σ(xᵢ - x̄)² by (n - 1) rather than n gives a less biased estimate of σ (s² is an unbiased estimate of σ²); however, as n increases, the difference between s and σ declines rapidly.

The sum of squared deviations can be calculated as Σxᵢ² - (Σxᵢ)²/n.
Sample SD (s)

e.g. the same five rainfall measurements, whose mean is 7:

Rainfall xᵢ (mm)   xᵢ²
12                 144
0                  0
2                  4
5                  25
16                 256
Σxᵢ = 35           Σxᵢ² = 429
(Σxᵢ)² = 1225

s² = [Σxᵢ² - (Σxᵢ)²/n] / (n - 1) = [429 - (1225/5)] / (5 - 1) = 46.0
s = √46.0 = 6.782
Frequency Distribution
e.g. The particle sizes (μm) of 37 grains from a sample of sediment from an estuary.
Define convenient classes (equal width) and class intervals, e.g. 1 μm.

8.2  6.3  6.8  6.4  8.1  6.3
5.3  7.0  6.8  7.2  7.2  7.1
5.2  5.3  5.4  6.3  5.5  6.0
5.5  5.1  4.5  4.2  4.3  5.1
4.3  5.8  4.3  5.7  4.4  4.1
4.2  4.8  3.8  3.8  4.1  4.0
e.g. Frequency distribution for the size of particles collected from the estuary

Particle size (μm)    Frequency
3.0 to under 4.0      2
4.0 to under 5.0      12
5.0 to under 6.0      10
6.0 to under 7.0      7
7.0 to under 8.0      4
8.0 to under 9.0      2
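A minimal Python sketch (not part of the original lecture) that tallies particle sizes into 1 μm classes like those above; the exact counts depend on the values entered.

sizes = [8.2, 6.3, 6.8, 6.4, 8.1, 6.3, 5.3, 7.0, 6.8, 7.2, 7.2, 7.1,
         5.2, 5.3, 5.4, 6.3, 5.5, 6.0, 5.5, 5.1, 4.5, 4.2, 4.3, 5.1,
         4.3, 5.8, 4.3, 5.7, 4.4, 4.1, 4.2, 4.8, 3.8, 3.8, 4.1, 4.0]

counts = {}
for s in sizes:
    lower = int(s)                       # class "lower to under lower + 1"
    counts[lower] = counts.get(lower, 0) + 1

for lower in sorted(counts):
    print(f"{lower}.0 to under {lower + 1}.0: {counts[lower]}")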
Frequency Histogram
[Figure: frequency histogram of particle size (μm), classes 3 to <4 through 8 to <9, frequency axis 0 to 15]
e.g. Frequency distribution of height of the students in a class (n = 52: 30 females & 22 males)
[Figure: frequency histogram of height (cm), classes >149-153 through >181-185, frequency axis 0 to 14]
Normal Distribution

There are many standard distributions for continuous data.
Here, only the normal distribution (also sometimes called the
Gaussian distribution) is considered.

This distribution is characterized as being bell-shaped, with
most values being near the center of the distribution. There are
two parameters to describe the distribution: the mean and the
standard deviation, which are often denoted by μ and σ,
respectively.

There is also a function to describe the distribution in terms of
these two parameters, which is referred to as a probability
density function (pdf).
Normal curve

f(x) = [1/(σ√(2π))] exp[-(x - μ)²/(2σ²)]

[Figure: frequency histogram of height (cm) with the fitted normal curve f(x), heights from about 145 to 180 cm]
Normal curve

f(x) = [1/(σ√(2π))] exp[-(x - μ)²/(2σ²)]

In general, for all normal distributions, about 68% of values will be in the range μ ± σ, about 95% will be in the range μ ± 2σ, and about 99.7% will be in the range μ ± 3σ.

The normal curve was first expressed on paper (for astronomy) by A. de Moivre in 1733.

The parameters μ and σ determine the position of the curve on the x-axis and its shape.

[Figure: probability density curves of height (cm) for males and females, from about 140 to 190 cm]
f(x) = [1/(2)]exp[(x  )2/(22)]
0.50
Probability density
0.40
N(10,1)
N(20,1)
0.30
0.20
N(20,2)
N(10,3)
0.10
0.00
0
10
20
X
• Normal distribution N(,)
• Probability density function: the area under the
curve is equal to 1.
30
The standard normal curve

μ = 0, σ = 1, and the total area under the curve = 1
Units along the x-axis are measured in σ units
Figures: (a) for ±1σ, area = 0.6826 (68.26%); (b) for ±2σ, area ≈ 95.44%; (c) the shaded area = 100% - 95.44%
Application of the Standard Normal Distribution

For example:
We have a large data set (e.g. n = 200) of
normally distributed suspended solids
determinations for a particular site on a
river:
x̄ = 18.3 ppm and s = 8.2 ppm.
We are asked to find the probability of a
random sample containing 30 ppm
suspended solids or more.
Application of the standard normal distribution

The standardized deviation from the mean (the Z value): Z = (xᵢ - μ)/σ
Z = (30 - 18.3)/8.2 = 1.43
From the Z table, the probability of a sample containing 30 ppm or more is 0.0764, or 7.64%
i.e. for n = 200, about 15 samples (200 × 0.0764 ≈ 15) would be expected to contain 30 ppm or more
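A minimal Python sketch of this calculation, assuming SciPy is available; norm.sf gives the upper-tail probability directly.

from scipy.stats import norm

xbar, s = 18.3, 8.2
z = (30 - xbar) / s                    # ~1.43
p_exceed = norm.sf(z)                  # upper-tail probability, ~0.076
print(z, p_exceed, 200 * p_exceed)     # about 15 of 200 samples expected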
Central Limit Theorem

As sample size (n) increases, the distribution of the means of samples (i.e. subsets or replicate groups) drawn from a population of any distribution will approach the normal distribution.

By taking the mean of each sample, we smooth out the extreme values within the samples, while the mean of the sample means stays approximately equal to the population mean.

As the number of samples increases, the standard deviation of the sample means is reduced and their frequency distribution comes very close to the normal distribution (see the simulation sketch below).
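A minimal simulation sketch (not from the lecture), assuming NumPy: sample means drawn from a strongly skewed exponential population become approximately normal, and less spread out, as n grows.

import numpy as np

rng = np.random.default_rng(0)

# Population: Exponential(1), strongly skewed, with population mean 1.0.
for n in (2, 10, 50):
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # The mean of the sample means stays near the population mean (1.0),
    # while their spread shrinks roughly like 1/sqrt(n).
    print(n, sample_means.mean().round(3), sample_means.std().round(3))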
Inferential statistics - testing the null hypothesis



Inferential = “that may be inferred.” Infer = conclude or
reach an opinion
The hypothesis under test, the null hypothesis, will be that
Z has been chosen at random from the population
represented by the curve.
Frequencies of Z values close to the mean (μ = 0) are high, while frequencies away from the mean decline.
e.g. two values of Z are shown: Z = ±1.96 and Z = ±2.58.
From the table, the corresponding one-tailed probabilities are 0.025 (2.5%) and 0.0049 (0.5%).
Testing of Hypothesis

Samples and Populations
How do we select?
[Diagram: a sample is drawn from the population; sample statistics (e.g. x̄) are used to make inferences about population parameters (e.g. μ of the population)]
Null and Research Hypotheses

Hypothesis
 Educated guess
 Reflects the research problem being
investigated
 Determines the techniques for testing the
research questions
 Should be grounded in theory
Hypotheses Contd.
Research Question → Research Hypothesis → Test
In research we NEVER prove a hypothesis!
Purposes of the Null Hypothesis

Acts as a starting point
 State of affairs accepted as true in the
absence of any other information
 Until a systematic difference is shown,
assume that any difference observed is
due to chance
 The researcher's job is to eliminate chance factors and evaluate other factors that may contribute to group differences
Null Hypothesis Purpose # 2

Provides a benchmark to measure actual
outcomes
 How likely is it that outcomes are due to
some other factor?
 Helps define range within which
observed differences can be reasonably
attributed to chance or something other
than chance
Null Hypotheses

Usually a statement of no differences or no
associations – an equality
 Sentence


There will be no difference in the
pollution level before and after the
construction activity.
Symbols
 Ho: μ before-const = μ after-const
 Ho: μ before-const – μ after-const = 0
Research/Alternative Hypotheses

A statement of a relationship between the variables – an inequality.
 May be non-directional (two-tailed)
 May be directional (one-tailed), which gives a more powerful test because the p-value for a given result is half that of the two-tailed test
Non-directional Alternative Hyp.

Reflects a difference between groups, but the direction of the difference is not specified
 Non-directional Sentence
There is a difference in pollution level before and after the construction activity
Non-directional Symbols
 Ha: μ before-const ≠ μ after-const
Directional Alternative Hyp.

Reflects a difference between groups, and the direction of the difference is specified
 Directional Sentence
Pollution level after construction will be higher than the pollution level before construction
 Directional Symbols
 Ha: μ before-const < μ after-const
What Makes a Good Hypothesis?
A good hypothesis:
 is stated in declarative
form and not as a
question.
 posits an expected
relationship between
variables.
What Makes a Good Hypothesis?
A good hypothesis:
 reflects the theory or
literature on which it is
based.
 should be brief and to the
point.
 is testable, which means that data can be gathered to evaluate the question it reflects.
Six Steps of Hypothesis Testing
1. State the null hypothesis.
2. State the alternative hypothesis.
3. Select a level of significance.
4. Collect and summarize the sample data.
5. Refer to a criterion for evaluating the sample evidence.
6. Make a decision to keep/reject the null.
1. State the Null Hypothesis
States that there is no relationship between
the variables.
 Refers to the population.

Examples of the Null Hypothesis
Written: There are no differences in the pre,
mid, and post construction pollution levels
due to new project
 Symbols: µpre = µmid = µpost

Step 2: State the Alternative
Hypothesis
Symbolically referred to as Ha
 States the opposite of the Ho

Examples of Alternative
Hypothesis
Written: There are differences within the
pre, mid, and post pollution levels.
 Symbols: μpre ≠ μmid ≠ μpost for at least one pair.
Step 3: Select a Level of
Significance
Most researchers select a small number such as 0.001, 0.01, or 0.05.
The most common choice is 0.05.
Otherwise known as the "alpha level": p = 0.05, α = 0.05.
The significance level serves as a scientific cutoff point that determines what decision will be made concerning the null hypothesis.
Type I and Type II Errors

Mistakes can occur:
1. Type I Error – designates the mistake of rejecting the Ho when the null is actually true. When the level of significance is set at 0.05, the chance of a Type I error is equal to 1 out of 20.
Type II Errors

Designates a mistake
made if Ho is not
rejected when the null
is actually false.
Statistical Errors in Hypothesis Testing

Consider court judgments where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e. Ho = innocent).

                              If the accused is innocent   If the accused is guilty
                              (Ho is true)                 (Ho is false)
Court's decision: Guilty      Wrong judgement              OK
Court's decision: Innocent    OK                           Wrong judgement
Statistical Errors in Hypothesis Testing

Similar to court judgments, in testing a null hypothesis in statistics we are exposed to the same kinds of errors:

                    If Ho is true    If Ho is false
If Ho is rejected   Type I error     No error
If Ho is accepted   No error         Type II error
Statistical Errors in Hypothesis Testing

e.g. Ho = responses of cancer patients to a new drug and a placebo are similar
• If Ho is indeed a true statement about a statistical population, it will be concluded (erroneously) to be false 5% of the time (when α = 0.05).
• Rejection of Ho when it is in fact true is a Type I error (also called an α error).
• If Ho is indeed false, our test may occasionally not detect this fact, and we accept the Ho.
• Acceptance of Ho when it is in fact false is a Type II error (also called a β error).
Step 4: Collection and Analysis
of Sample Data
The summary of the sample data will always lead to a single numerical value, which is referred to as the calculated value (e.g. r, t, or F).
The computer calculates the probability of obtaining this value, in the form p = ____.
Step 5: The Criterion for
Evaluating the Sample Evidence
Two Methods:
 Compare the calculated and critical values.
 Compare the data-based p-value against a preset point on the 0-1 scale (the level of significance).
Step 6: Make a Decision!

Reject the Null if the p-value is less than the established level of significance:
• a statistically significant difference was obtained (p < 0.05)

Fail to Reject the Null (retain the Null) if the p-value is greater than the established level of significance:
• H0 was tenable
• the null was retained
• no significant difference was found
• the result was not statistically significant
Inferential statistics - testing the null hypothesis

As the curve is symmetrical about the mean, the probability of obtaining a value of Z < -1.96 is also 2.5%, so the total probability of obtaining a value of Z between -1.96 and +1.96 is 95%.
Likewise, between Z = ±2.58, the total probability is 99%.
Then we can state a null hypothesis that a random observation from the population will have a value of Z between -1.96 and +1.96.
Inferential statistics - testing the null hypothesis

Alternatively, we can state the null hypothesis as: a random observation of Z will lie outside the limits -1.96 or +1.96.
There are 2 possibilities: either we have chosen an "unlikely" value of Z, or our hypothesis is incorrect.
Conventionally, when performing a significance test, we make the rule that if the Z value lies outside the range ±1.96, then the null hypothesis is rejected and the Z value is termed significant at the 5% level, or α = 0.05 (or p < 0.05); ±1.96 is the critical value of the statistic.
For Z = ±2.58, the value is termed significant at the 1% level.
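A minimal Python sketch, assuming SciPy, showing where these conventional critical values come from: the central 95% and 99% of the standard normal distribution lie within about ±1.96 and ±2.58.

from scipy.stats import norm

for alpha in (0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)     # two-tailed critical value
    print(alpha, round(z_crit, 2))        # 1.96 and 2.58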
Chi-square statistics

Widely used for the analysis of nominal scale data.
Introduced by Karl Pearson in 1900.
Its theory and application were expanded by him and R. A. Fisher.

The χ² test:
χ² = Σ (observed freq. - expected freq.)² / expected freq.

We obtain a sample of nominal scale data and infer whether the population from which it came conforms to a certain theoretical distribution.

Used to test the Ho that the observations (not the variables) are independent of each other for the population.

Based on the difference between the actual observed frequencies (not %) and the expected frequencies that would be obtained if the variables were truly independent.
The 2 test:
2 =  (observed freq. - expected freq.)2/ expected freq.

Used as a measure of how far a sample distribution
deviates from a theoretical distribution

Ho: no difference between the observed and
expected frequency (HA: they are different)

If Ho is true then both the difference and chi-square
value will be SMALL

If Ho is false then both measurements will be Large,
HA will be accepted
Example

In a questionnaire, 259 adults were asked what they thought about cutting air pollution by increasing the tax on vehicle fuel. 113 people agreed with this idea but the rest disagreed. Perform a chi-square test to determine the probability of the results being obtained by chance.

            Agree              Disagree
Observed    113                259 - 113 = 146
Expected    259/2 = 129.5      259/2 = 129.5

Ho: Observed = Expected
χ² = (113 - 129.5)²/129.5 + (146 - 129.5)²/129.5 = 2.102 + 2.102 = 4.204
df = k - 1 = 2 - 1 = 1
From the chi-square table:
Critical χ² (α = 0.05, df = 1) = 3.841 < calculated χ² = 4.204, so 0.025 < p < 0.05
Therefore, Ho is rejected. The probability of the results being obtained by chance is between 0.025 and 0.05.
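A minimal Python sketch of this test, assuming SciPy; scipy.stats.chisquare uses equal expected frequencies by default, matching the 129.5/129.5 split above.

from scipy.stats import chisquare

observed = [113, 146]
chi2, p = chisquare(observed)          # expected = [129.5, 129.5]
print(round(chi2, 3), round(p, 3))     # ~4.205 and p ~0.040 (0.025 < p < 0.05)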
Confidence Interval:

Confidence limits for a parameter of a distribution give a
range within which the parameter is expected to lie. For
example, a 90% confidence limit for a distribution mean
defines a range, which is called a confidence interval,
within which the mean is expected to lie 90% of the time,
in the sense that if many such intervals are calculated, then
about 90% of them will contain the true value of the
parameter.
When to use z and when to use t

The z and t distributions are used in confidence intervals. Which one to use is determined by the distribution of X̄.

USE z (with σ_x̄ = σ/√n) when:
• large n or sampling from a normal distribution
• σ is known

USE t (with s_x̄ = s/√n) when:
• large n or sampling from a normal distribution
• σ is unknown

General Form of Confidence Intervals

The general form of a confidence interval is:
(Point Estimate) ± (Margin of Error)
or
(Point Estimate) ± (z_α/2 or t_α/2) × (Appropriate Standard Error)
Example

The average cost of all the required tests for water analysis has gone up due to the number of tests required.
A sample of 41 sources was taken.
The average cost of tests for these 41 sources is $86.15.
Construct a 95% confidence interval for the average cost of these tests assuming:
1. The standard deviation is $22.
2. The standard deviation is unknown, but the sample standard deviation is $24.77.
Case 1

Because the sample size > 30, it is not necessary to assume that the costs follow a normal distribution to construct a confidence interval.
And because σ is assumed known (to be $22), this will be a z-interval.

x̄ ± z_α/2 · σ/√n
86.15 ± 1.96 × 22/√41
$86.15 ± $6.73
($79.42 to $92.88)
Case 2

Because the sample size > 30, it is not necessary to assume that the costs follow a normal distribution to construct a confidence interval.
Because σ is unknown, this will be a t-interval with 40 degrees of freedom and s = 24.77.

x̄ ± t_α/2,40 · s/√n
86.15 ± 2.021 × 24.77/√41
$86.15 ± $7.82
($78.33 to $93.97)
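A minimal Python sketch of both cases, assuming SciPy for the normal and t quantiles.

import math
from scipy.stats import norm, t

n, xbar = 41, 86.15

# Case 1: sigma known -> z-interval
sigma = 22.0
z = norm.ppf(0.975)                                  # ~1.96
half_z = z * sigma / math.sqrt(n)                    # ~6.73
print(xbar - half_z, xbar + half_z)                  # ~79.4 to ~92.9

# Case 2: sigma unknown -> t-interval with n - 1 = 40 df
s = 24.77
t_crit = t.ppf(0.975, df=n - 1)                      # ~2.021
half_t = t_crit * s / math.sqrt(n)                   # ~7.82
print(xbar - half_t, xbar + half_t)                  # ~78.3 to ~94.0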
How does one variable respond
to changes in another variable?




Lichen is sensitive to SO2
e.g. Growth of lichen vs. Air pollution
Growth determined by max length
Pollution indicated by the distance from a town
center (0-10 km)
[Figure: maximum thallus length (mm) of the lichen Evernia prunastri plotted against distance from the town center (0 to 10 km)]
• Decreasing thallus size as the town center is approached
• A gap in the data between 4 and 6 km
• Any outlier(s)?
• A statistical technique termed CORRELATION enables us to quantify the relationship between two variables
• Calculation of a correlation coefficient r (range from –1 to +1)
[Figure: the same scatter plot of maximum length (mm) against distance from the town center (0 to 10 km)]
• r ≈ +1 : positive correlation
• r ≈ 0 : no correlation
• r ≈ –1 : negative correlation
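A minimal Python sketch (not from the lecture) of computing r; the (distance, length) pairs below are hypothetical values resembling the scatter plot, not the actual lichen data.

import numpy as np

distance = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 10.0])       # km (hypothetical)
max_len  = np.array([2.0, 4.0, 7.0, 9.0, 12.0, 18.0, 20.0, 23.0, 26.0, 28.0])  # mm (hypothetical)

r = np.corrcoef(distance, max_len)[0, 1]
print(round(r, 3))        # close to +1: strong positive correlation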
[Figure: four example scatter plots (A, B, C, D) of a variable X2 against X1]
Significant negative correlation between number of trees
and number of sick people within individual regions
(r = -0.981, p < 0.001).
Is this conclusion right???
Regression Analysis

A simple mathematical expression to
provide an estimate of one variable from
another

It is possible to predict the likely outcome
of events given sufficient quantitative
knowledge of the processes involved
Regression

Model 1

"Controlled" parameter (independent variable, x) vs. measured parameter (dependent variable, y)

The independent variable (on the x-axis) must be measured with a high degree of accuracy and is not subject to random variation

Other influential factors must be kept constant

The dependent variable (on the y-axis) may vary randomly and its "error" should follow a normal distribution
Normally distributed populations of y values
[Figure: normal distributions of y values at several x values along the regression line]
• The population of y values at each x is normally distributed
• The variances of the different populations of y values corresponding to different individual x values are similar
Regression

Model 2

Both parameters are measured (x1 & x2, not x & y) and cannot be controlled

Both are subject to random variation and are called random-effects factors

Common in field studies where conditions are difficult to control

Correlation, rather than regression, is required for bivariately normal distributions

e.g. measurements of human arm and leg lengths
Example for model 1


Study the rate of disappearance of a pesticide in a
seawater sample

Time (independent) vs. Concentration (dependent)

Other factors such as pH, salinity must be kept
constant
Study the growth rate of fish at different fixed water temperatures

Temp (independent) vs. Growth rate (dependent)

Other factors such as diet, feeding frequency must
be kept constant
[Figure: straight-line model through a data point (x, y), with the rise c over the run d and the intercept a marked]
Model: y = a + bx
Slope: coefficient b = c/d
Intercept: coefficient a
[Figure: lines with negative slope (b = -ve), zero slope (b = 0) and positive slope (b = +ve)]
Wing lengths of 13 sparrows of various ages

Age (days), X    Wing length (cm), Y
3                1.4
4                1.5
5                2.2
6                2.4
8                3.1
9                3.2
10               3.2
11               3.9
12               4.1
14               4.7
15               4.5
16               5.2
17               5.0

[Figure: scatter plot of wing length (cm) against age (days) with fitted line y = 0.2702x + 0.7131, R² = 0.9733]
The concept of least squares

The sum of dᵢ² indicates the deviations of the points from the regression line.
The best-fit line is achieved with the minimum sum of squared deviations (Σdᵢ²).

[Figure: the same scatter plot with vertical deviations dᵢ from an alternative line y = 0.2695x + 0.7284, R² = 0.9705]
Age (days), X    Wing length (cm), Y    XY
3                1.4                    4.2
4                1.5                    6.0
5                2.2                    11.0
6                2.4                    14.4
8                3.1                    24.8
9                3.2                    28.8
10               3.2                    32.0
11               3.9                    42.9
12               4.1                    49.2
14               4.7                    65.8
15               4.5                    67.5
16               5.2                    83.2
17               5.0                    85.0

n = 13
means:  X̄ = 10.0,  Ȳ = 3.4
sums:   ΣX = 130.0,  ΣY = 44.4,  ΣXY = 514.8
        ΣX² = 1562.0,  ΣY² = 171.3
Calculation for a regression

y = a + bx
b = [Σxy – (Σx)(Σy)/n] / [Σx² – (Σx)²/n]
a = ȳ – b·x̄

b = [514.8 – (130)(44.4)/13] / [1562 – (130)²/13] = 70.8/262
b = 0.270 cm/day
a = 3.415 – (0.270)(10) = 0.715 cm
The simple linear regression equation is
Y = 0.715 + 0.270X
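A minimal Python sketch of the least-squares formulas above, using the sparrow data from the slides.

import numpy as np

age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])

n = len(age)
b = (np.sum(age * wing) - np.sum(age) * np.sum(wing) / n) / \
    (np.sum(age ** 2) - np.sum(age) ** 2 / n)
a = wing.mean() - b * age.mean()
print(round(b, 4), round(a, 4))      # ~0.2702 and ~0.7131, i.e. Y = 0.713 + 0.270X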
Normally distributed populations of y values
[Figure: normal distributions of y values at several x values along the regression line, as above]
• The population of y values at each x is normally distributed
• The variances of the different populations of y values corresponding to different individual x values are similar

Residual = y – ŷ = y – (a + bx)
[Figure: residual plots, with positive and negative residuals scattered around zero across the range of x]