DESCRIPTIVE STATISTICS: ONE VARIABLE, ONE SAMPLE
Frequency Distributions and Histograms
• A way to tabulate or graphically represent a collection of variable
data from a sample by lumping observations together into more
manageable units (classes, class intervals, or bins).
• The distribution is represented by counts (= frequency) or
proportions (= relative frequency) of observations within different
class intervals.
• The classes are often referred to as bins into which we lump
observations (data) in order to reduce the amount of data to
make it easier to analyze or visualize.
• Should be careful in choosing both the width and number of class
intervals -- all class interval widths must be the same and one
should have a reasonable number of intervals.
e.g. for the following data where n = 40
2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6
6 6 6 6 7 7 7 7

Class Interval   Frequency   Relative Freq.    Cumulative Rel. Freq.
1.5-2.5          4           4/40 = 0.100      0.100
2.5-3.5          8           8/40 = 0.200      0.300
3.5-4.5          9           9/40 = 0.225      0.525
4.5-5.5          8           8/40 = 0.200      0.725
5.5-6.5          7           7/40 = 0.175      0.900
6.5-7.5          4           4/40 = 0.100      1.000
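The tabulation can be reproduced with a short script. This is a minimal sketch using only Python's standard library; it assumes each class interval is one unit wide and centred on an integer, and the names `data`, `counts` and `table` are illustrative rather than part of the original notes.

```python
from collections import Counter

# The n = 40 observations from the example.
data = [2] * 4 + [3] * 8 + [4] * 9 + [5] * 8 + [6] * 7 + [7] * 4
n = len(data)

# Assumption: each class interval is 1 unit wide and centred on an integer,
# so the interval 1.5-2.5 holds every observation equal to 2, and so on.
counts = Counter(data)

table = []          # rows of (interval, frequency, relative, cumulative)
cumulative = 0.0
for value in sorted(counts):
    freq = counts[value]
    rel = freq / n              # relative frequency
    cumulative += rel           # running total of relative frequency
    table.append((f"{value - 0.5}-{value + 0.5}", freq, rel, cumulative))

for interval, freq, rel, cum in table:
    print(f"{interval:<10}{freq:>4}{rel:>8.3f}{cum:>8.3f}")
```

For continuous measurements the same idea applies, but each observation would first have to be assigned to the interval that contains it.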
[Figure: frequency histogram of the data above. Bars of height 4, 8, 9,
8, 7, 4 over class boundaries 1.5 to 7.5; frequency axis 0 to 10,
relative frequency axis 0 to 0.25.]
• sometimes use cumulative relative frequency
[Figure: cumulative relative frequency (0 to 1.0) plotted against class
boundaries 1.5 to 7.5.]
• commonly represent frequency histograms by smooth curve
(sometimes called frequency density)
[Figure: the frequency histogram (bars of height 4, 8, 9, 8, 7, 4;
frequency axis 0 to 10) and the corresponding smooth curve.]
Distributions come in all shapes
Measures of central tendency
• arithmetic mean for population: $\mu = \frac{\sum X}{N}$
• arithmetic mean for sample: $\bar{x} = \frac{\sum X}{n}$
• for positively skewed data whose logarithms have a symmetrical
frequency density, use the geometric mean
geometric mean = $\left(\prod x\right)^{1/n}$
[Figure: the mean marked on two frequency curves, one on an axis running
0, 50, 100 and one on an axis labeled 0, 10, 100.]
Other measures of central tendency ...
• median = middle value (equal number of items above and below).
If n is odd, it is the ((n+1)/2)th value; if n is even, take the
average of the two middle values: [(n/2)th value + ((n/2)+1)th
value]/2
• mode = most frequent value
[Figure: positions of the mode, median, and mean marked on two frequency
curves (axes 0 to 100).]
Can have more than one mode (e.g., bimodal or trimodal,
etc…)
Example:
Data: 23 24 27 29 29 30 33 33 34 38 45 60 60 88 126 221 256
n = 17
Median = (n+1)/2th value = (17+1)/2 = 9th value = 34
Arithmetic mean = $\frac{\sum x}{n} = \frac{1156}{17} = 68.0$
Geometric mean = $\left(\prod x\right)^{1/n} = \sqrt[17]{5.376 \times 10^{28}} = 48.98$
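The measures of central tendency in this example can be checked with Python's `statistics` module. This is a sketch only: it assumes the 17 observations are exactly those listed in the code and that Python 3.8 or later is available (for `multimode` and `geometric_mean`).

```python
import statistics

# Assumption: the 17 observations of the worked example, sorted.
data = [23, 24, 27, 29, 29, 30, 33, 33, 34, 38, 45,
        60, 60, 88, 126, 221, 256]

mean = statistics.mean(data)                 # sum(x) / n
median = statistics.median(data)             # middle (9th) value when n = 17
modes = statistics.multimode(data)           # every value tied for most frequent
geo_mean = statistics.geometric_mean(data)   # (product of x) ** (1 / n)

print(f"n = {len(data)}, mean = {mean:.1f}, median = {median}")
print(f"modes = {modes}, geometric mean = {geo_mean:.2f}")
```

Because the data are positively skewed by a few large values, the arithmetic mean lands well above the median, while the geometric mean falls between the two.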
Measures of Dispersion: variance and standard deviation.
[Figure: two frequency curves plotted against x, one with low variance
and one with high variance.]
Dispersion is often measured as distance from the mean …
• mean deviate = $x - \bar{x}$
• Summing up (∑) the squares of the mean deviates gives you the
sum of squares (SS), an important quantity in statistics
• SS = $\sum (x - \bar{x})^2$
• SS is used for determining the variance and standard deviation in
samples and populations.
• variance for a population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
• standard deviation for a population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
• variance for a sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• standard deviation for a sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
• beware of hand-held calculators .....
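The difference between the N and n-1 divisors is easy to see in code. The following is a minimal sketch that reuses the n = 40 data set from the frequency-distribution example; the variable names are illustrative, and the cross-check against `statistics.pvariance` and `statistics.variance` is there only to confirm the hand-written formulas.

```python
import math
import statistics

# The n = 40 observations from the frequency-distribution example.
data = [2] * 4 + [3] * 8 + [4] * 9 + [5] * 8 + [6] * 7 + [7] * 4
n = len(data)
mean = sum(data) / n

# Sum of squares: the squared mean deviates, summed.
ss = sum((x - mean) ** 2 for x in data)

pop_var = ss / n            # divide by N when the data are the whole population
samp_var = ss / (n - 1)     # divide by n - 1 when the data are a sample
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library's implementations.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print(f"SS = {ss:.2f}, sigma^2 = {pop_var:.4f}, s^2 = {samp_var:.4f}")
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.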
• some properties of σ in normally distributed data ...
- total area under curve represents 100%
- from -σ to +σ = 68.27% of area under curve
- from -2σ to +2σ = 95.45% of area under curve
- from -3σ to +3σ = 99.73% of area under curve
[Figure: normal curve with the horizontal axis marked from −4σ to +4σ.]
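The percentages listed for ±1σ, ±2σ and ±3σ can be recovered from the cumulative distribution function of the standard normal distribution. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the standard normal curve between -k and +k standard deviations.
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"-{k} sigma to +{k} sigma: {100 * area:.2f}% of the area")
```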
• Commonly recode the data as Z-scores to match the Standard
Normal Distribution which has a mean of 0 and a standard
deviation of 1
$Z_i = \frac{x_i - \bar{x}}{s}$
• Z-scores are in units of standard deviations and represent areas
(proportions) under a normally distributed curve.
• Can use them to find probability of outcomes by looking up area
(or probability) in statistical table
e.g.
Suppose we have examined the lengths of the entire
population of species A and determined that µ = 14.2 mm and
σ = 4.7 mm. What is the probability of finding by chance a
specimen shorter than 3 mm?
Step 1. Convert to Z-score
$Z = \frac{3.0 - 14.2}{4.7} \approx -2.4$
Step 2. Looking up the Z-score in a table, the probability is 0.0082
(0.82%), which is very small
[Figure: standard normal curve with 0.82% of the area below Z = -2.4
and 99.18% above.]
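The table lookup can be replaced by a call to the normal cumulative distribution function. This sketch assumes Python 3.8 or later; it computes the probability both for the Z-score rounded to one decimal place (as in a printed table) and for the unrounded Z-score, which gives a slightly different tail area.

```python
from statistics import NormalDist

mu, sigma = 14.2, 4.7      # population mean and standard deviation (mm)
x = 3.0                    # length of interest (mm)

z = (x - mu) / sigma       # about -2.38 before rounding
z_rounded = round(z, 1)    # -2.4, the value looked up in the table

# Lower-tail area under the standard normal curve.
p_table = NormalDist().cdf(z_rounded)   # matches the printed-table lookup
p_exact = NormalDist().cdf(z)           # uses the unrounded Z-score

print(f"Z = {z:.3f}, P(table) = {p_table:.4f}, P(unrounded) = {p_exact:.4f}")
```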
e.g.
Suppose we have examined the widths of the entire population
of species B and determined that µ = 6.20 cm and σ = 0.1 cm.
Find the width that will be exceeded by 30% of the
specimens.
Step 1. From the table, an upper-tail area of 30% corresponds to a
Z-score of 0.52
Step 2. Solve the Z equation for x
since $Z = \frac{x - \mu}{\sigma}$
then $0.52 = \frac{x - 6.20}{0.1}$
which reduces to $x = (0.52 \times 0.1) + 6.20 = 6.25$ cm
[Figure: normal curve with the 30% of area above Z = 0.52 marked.]
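Going from a proportion back to a measurement uses the inverse of the cumulative distribution function. A minimal sketch, assuming Python 3.8 or later and assuming µ and σ are both in centimetres:

```python
from statistics import NormalDist

mu, sigma = 6.20, 0.1          # population mean and standard deviation (cm)
upper_tail = 0.30              # proportion of specimens that exceed x

# Z-score with 30% of the area above it, i.e. 70% below it.
z = NormalDist().inv_cdf(1 - upper_tail)

# Solve Z = (x - mu) / sigma for x.
x = mu + z * sigma

print(f"Z = {z:.2f}, x = {x:.2f} cm")
```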
Confidence Intervals for Means
How close is our sample estimate of the mean to the true mean of the
population?
• One measure of the quality of our estimate is the standard error
(se), which is the standard deviation of the sample mean
$se = \frac{s}{\sqrt{n}}$
• To calculate confidence limits, use the standard error (se) of
mean and the area under the curve of distribution
• Because confidence in our sample mean is dependent upon the
ratio s/√n, whose true parameter σ is unknown, one must use a
distribution that is not dependent on σ. Use the t distribution
• t is dependent on sample size and has a fudge-factor called degrees
of freedom (df or sometimes ν) which, in this case, equals n-1
• t is also dependent on the confidence level you wish to use
(expressed in terms of the proportion α). Use α for finding the
area outside of the interval, where α = 1-(confidence level/100).
For example, a 95% interval would correspond to α = 1-0.95 =
0.05, and a 99% confidence interval would correspond to an α of
0.01
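[Figure: curve with total area = 1.0; for a 95% interval, 0.95 of the
area lies inside the interval and α = 0.05 outside.]
However, one needs to split α between the two tails (we want both
upper and lower boundaries).
[Figure: the same curve with α = 0.05 split into 0.025 in each tail and
0.95 between.]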
Compute confidence intervals as follows:
upper limit = $x_U = \bar{x} + t_{\alpha/2} \cdot se$
lower limit = $x_L = \bar{x} - t_{\alpha/2} \cdot se$
First look up values for t in a table with n-1 df for any given
confidence level (e.g. 90, 95, 99%).
For example, if you want 95% confidence intervals (α = 0.05
and α/2 = 0.025) with a sample size of 9 (df = 8) the
corresponding $t_{\alpha/2}$ value is 2.306.
e.g.
Suppose we have a sample of clams with the following lengths:
23.5 16.6 25.4 19.1 19.3 22.4 20.9 24.9
n = 8, $\bar{x}$ = 21.51, s = 3.08, s² = 9.51
What would the 95% confidence limits be on the mean of
21.51?
Step 1. Look up t
df of 7 (= n-1)
α/2 of 0.025 (α = 1-0.95 = 0.05)
from table: t = 2.365
Step 2. Plug t into the formula:
$x_U = \bar{x} + t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 21.51 + 2.365 \cdot \frac{3.08}{\sqrt{8}} = 24.09$
$x_L = \bar{x} - t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 21.51 - 2.365 \cdot \frac{3.08}{\sqrt{8}} = 18.93$
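The clam example can be worked through in a few lines. This is a sketch under one stated assumption: the critical t value is typed in from a printed table (df = 7, α/2 = 0.025), because Python's standard library has no t distribution to compute it from. The variable names are illustrative.

```python
import math
import statistics

lengths = [23.5, 16.6, 25.4, 19.1, 19.3, 22.4, 20.9, 24.9]
n = len(lengths)
mean = statistics.mean(lengths)
s = statistics.stdev(lengths)      # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)              # standard error of the mean

# Assumption: t is taken from a printed table (df = 7, alpha/2 = 0.025).
t_crit = 2.365

lower = mean - t_crit * se
upper = mean + t_crit * se

print(f"mean = {mean:.2f}, s = {s:.2f}, se = {se:.3f}")
print(f"95% confidence limits: {lower:.2f} to {upper:.2f}")
```

With a third-party statistics package the critical value could be computed rather than looked up, but the arithmetic for the limits stays the same.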
Other Common Distributions
Although many paleontologic and geologic data are distributed
normally (e.g. lengths of fossils in a living assemblage), beware:
not all distributions are normal . . .
• Lognormal Distribution. (e.g. permeability of some sediments,
sediment particle size distributions)
• Binomial Distribution. Common with discrete data (e.g. presence/
absence). More on this when we get to probability
• Poisson Distribution. Usually observed in data that are randomly
occurring events in space or time. Shape of distribution is a
function of the mean. (e.g. sizes of fossils in a "death"
assemblage)
and many others: Negative binomial, Hypergeometric, etc.
Given very big sample sizes, the distribution of sample means
approaches normal whatever the shape of the distribution being sampled
(provided its variance is finite). This is the "Central Limit Theorem".
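One way to see the Central Limit Theorem at work is to simulate it. The sketch below is illustrative only: the exponential population, the sample size of 50, the 2000 repetitions and the fixed random seed are all assumptions chosen for the demonstration, not values from these notes.

```python
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable

n = 50                              # observations per sample
n_samples = 2000                    # number of samples drawn

# Assumption: an exponential population with mean 1 stands in for
# strongly skewed, clearly non-normal data.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

center = statistics.fmean(sample_means)     # expected near the population mean, 1
spread = statistics.stdev(sample_means)     # expected near 1 / sqrt(n)

# Share of sample means within one standard deviation of their centre;
# a normal distribution would put about 68% there.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / n_samples

print(f"centre = {center:.3f}, spread = {spread:.3f}, "
      f"within 1 sd = {100 * within_one_sd:.1f}%")
```

Although the individual observations are heavily skewed, the sample means cluster symmetrically around the population mean with a spread close to the standard error.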