STATISTICS
Summarizing, Visualizing, and Understanding Data
I. Populations, Variables, and Data
Populations and Samples
To a statistician, the population is
the set or collection under
investigation. Individual members
of the population are not usually of
interest. Rather, investigators try to
infer with some degree of
confidence the general features of
the population.
Examples
- Students currently enrolled at a certain university.
- Registered voters in a certain Congressional district.
- The population of large-mouthed bass in a certain lake.
- The population of all decay times of a radioactive isotope.
Statistical Inference
- Drawing and quantifying the reliability of conclusions about a population from observations on a smaller subset of the population.
- Sample: the subset observed.
Variables and Data
- A population variable is a descriptive number or label associated with each member of a population.
- The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population.
- Values of variables that have been recorded for a population, or for a sample from a population, constitute data.
Types of Data
- Nominal variables are variables whose values are labels.
- Ordinal variables are variables whose values have a natural order.
- Interval variables have values represented by numbers referring to a scale of measurement.
- Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.
Guess the Type
- Age
- Questionnaire responses: 1 = "strongly agree", 2 = "agree", ..., 5 = "strongly disagree"
- Letter grades
- Reading comprehension scores
- Gender
- Zip codes
- Molecular velocities
II. Summarizing Data
Location Measures
(Measures of Central Tendency)
A location measure or measure of
central tendency for a variable is a
single value or number that is taken
as representing all the values of the
variable. Different location
measures are appropriate for
different types of data.
The Mean
- For an interval or ratio variable x
- N individuals in the sample or population
- x_i = value of x for the ith individual

  x̄ = (1/N)(x_1 + x_2 + ... + x_N)

- The mean of a population variable is denoted by μ (the Greek letter mu).
The Mean with Repeated Values
- Distinct values of x: x'_1, x'_2, ..., x'_M
- n_j = frequency of occurrence of x'_j

  x̄ = (1/N)(n_1 x'_1 + n_2 x'_2 + ... + n_M x'_M)

The Mean with Repeated Values (continued)
- Relative frequencies: f_j = n_j / N

  x̄ = f_1 x'_1 + f_2 x'_2 + ... + f_M x'_M
Example

  x'_j:  -2   1   3   4   6
  n_j:    2   1   3   5   3
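As a quick check of the frequency formulas above, here is a minimal Python sketch (variable names are illustrative only) that computes the mean of this example both from the frequencies and from the relative frequencies:

```python
# Mean with repeated values: x̄ = (1/N) * Σ n_j x'_j = Σ f_j x'_j
values = [-2, 1, 3, 4, 6]          # distinct values x'_j
counts = [2, 1, 3, 5, 3]           # frequencies n_j

N = sum(counts)                    # total number of observations
mean_from_counts = sum(n * x for n, x in zip(counts, values)) / N

# Equivalent computation using relative frequencies f_j = n_j / N
rel_freqs = [n / N for n in counts]
mean_from_freqs = sum(f * x for f, x in zip(rel_freqs, values))

print(N, mean_from_counts, mean_from_freqs)   # 14, then the same mean twice
```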
The Median
- Informally, the "middle" value when all the values are arranged in order.
- A number m is a median of x if at least half the individuals i in the population have x_i ≤ m and at least half of them have x_i ≥ m.
The Median – Example 1
- x: -2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions)
- median(x) = 2.2
The Median – Example 2
- x: -2.0, 1.5, 3.1, 3.1, 3.1
- median(x) = 3.1
The Median – Example 3
- x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1
- median(x) = any number in [3.1, 5.7]
- By convention, for an even number of individuals choose the midpoint between the smallest and largest medians, e.g.,

  m = (3.1 + 5.7)/2 = 4.4.
Example
- Change 7.1 to 71. What happens to the mean and the median?
- The mean changes from 3.55 to 14.2.
- No change in the median.
- The median is much less sensitive to outliers (which may be mistakes in recording data).
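A small Python sketch (illustrative only) that reproduces this comparison with the standard library's statistics module:

```python
import statistics

x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]
x_outlier = [-2.0, 1.5, 3.1, 5.7, 5.9, 71.0]   # 7.1 mistakenly recorded as 71

print(statistics.mean(x), statistics.median(x))                  # ≈ 3.55 and 4.4
print(statistics.mean(x_outlier), statistics.median(x_outlier))  # ≈ 14.2 and 4.4
```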
The Median for Ordered Categories

  Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
  Count:  8   5   10  18  18  15  14  6   4   1   1   0

N = 100. The median grade is B-.
The Mode
- The data value with the greatest frequency.
- Not useful for interval or ratio data recorded with high precision.
- The only useful location measure for strictly nominal data.
Example

  Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
  Count:  8   5   10  18  18  15  14  6   4   1   1   0

The modes are B and B-.
Cumulative Frequencies and Percentiles
- x is an interval or ratio variable.
- Ordered distinct values: x'_1 < x'_2 < ... < x'_M
- Relative frequencies: f_1, f_2, ..., f_M
Cumulative Frequencies and Percentiles (continued)

  Cumulative frequencies           Cumulative relative frequencies
  N_1 = n_1                        F_1 = f_1
  N_2 = n_1 + n_2                  F_2 = f_1 + f_2
  N_3 = n_1 + n_2 + n_3            F_3 = f_1 + f_2 + f_3
  ...                              ...
  N_M = n_1 + n_2 + ... + n_M      F_M = f_1 + f_2 + ... + f_M
The Weather Person's Prediction Errors x

  x'_j:   -2      1      3      4      6
  n_j:     2      1      3      5      3
  N_j:     2      3      6     11     14
  f_j:   .1429  .0714  .2143  .3571  .2143
  F_j:   .1429  .2143  .4286  .7857  1.000
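A minimal Python sketch (names are illustrative) that builds the N_j, f_j, and F_j columns of this table from the frequencies:

```python
from itertools import accumulate

values = [-2, 1, 3, 4, 6]     # distinct values x'_j
counts = [2, 1, 3, 5, 3]      # frequencies n_j

N = sum(counts)
cum_counts = list(accumulate(counts))          # N_j = n_1 + ... + n_j
rel_freqs = [n / N for n in counts]            # f_j = n_j / N
cum_rel_freqs = [Nj / N for Nj in cum_counts]  # F_j = f_1 + ... + f_j

for row in zip(values, counts, cum_counts, rel_freqs, cum_rel_freqs):
    print("x'=%3d  n=%d  N=%2d  f=%.4f  F=%.4f" % row)
```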
Exercise
From the table above, what fraction
of the data is less than 1? What
fraction is greater than 3? What
fraction is greater than or equal to
3?
Percentiles
- x: an interval or ratio variable.
- A number a is a pth percentile of x if at least p% of the values of x are less than or equal to a and at least (100 - p)% of the values of x are greater than or equal to a.
- The 25th percentile is called the first quartile of x, and the 75th percentile is the third quartile of x.
- The 50th percentile is the second quartile, or median.
Example
For the weather person’s errors, the
25th percentile is 3. The 50th
percentile and third quartile are
both 4.
Measures of Variability
Statisticians are not only interested
in describing the values of a
variable by a single measure of
location. They also want to
describe how much the values of
the variable are dispersed about
that location.
Population Variance and Standard Deviation
- x: an interval or ratio variable.
- N = number of individuals in the population.
- Variance of x:

  σ² = [(x_1 - μ)² + (x_2 - μ)² + ... + (x_N - μ)²] / N

- Standard deviation of x:

  σ = √σ²
Sample Variance and Standard Deviation
- n: the number of individuals in a sample from a population.
- Sample variance:

  s² = [(x_1 - x̄)² + (x_2 - x̄)² + ... + (x_n - x̄)²] / (n - 1)

- Sample standard deviation:

  s = √s²
Alternative Formulas for the Variance
- Using frequencies:

  σ² = [n_1(x'_1 - μ)² + n_2(x'_2 - μ)² + ... + n_M(x'_M - μ)²] / N

- Using relative frequencies:

  σ² = f_1(x'_1 - μ)² + f_2(x'_2 - μ)² + ... + f_M(x'_M - μ)²
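Continuing the weather-errors example, a short Python sketch (illustrative only) that computes the population variance and standard deviation from the frequency table using both formulas:

```python
import math

values = [-2, 1, 3, 4, 6]     # distinct values x'_j
counts = [2, 1, 3, 5, 3]      # frequencies n_j

N = sum(counts)
mu = sum(n * x for n, x in zip(counts, values)) / N        # population mean

# Using frequencies: σ² = Σ n_j (x'_j - μ)² / N
var_counts = sum(n * (x - mu) ** 2 for n, x in zip(counts, values)) / N

# Using relative frequencies: σ² = Σ f_j (x'_j - μ)²
var_freqs = sum((n / N) * (x - mu) ** 2 for n, x in zip(counts, values))

sigma = math.sqrt(var_counts)
print(mu, var_counts, var_freqs, sigma)    # the two variance values agree
```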
The Interquartile Range
- Q_1, Q_3: 1st and 3rd quartiles, respectively.
- Interquartile range: IQR = Q_3 - Q_1
- Not influenced by a few extremely large or small observations (outliers).
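A quick NumPy illustration (note that numpy.percentile interpolates between data values by default, so its quartiles can differ slightly from the "at least p%" definition used earlier):

```python
import numpy as np

data = np.array([-2.0, 1.5, 3.1, 5.7, 5.9, 7.1])   # the small data set from Example 3

q1, q3 = np.percentile(data, [25, 75])
print(q1, q3, q3 - q1)    # IQR = Q3 - Q1
```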
The Range
- The difference between the largest data value and the smallest.
- The range of sample values is not a reliable indicator of the range of a population variable.
III. Graphical Methods
Pie Charts (Circle Graphs)
[Pie chart of the world's telephones by region, 1961.]
Sources: AT&T (1961), The World's Telephones; R: A language and environment for statistical computing, the R Core Development Team.
Bar Charts (Bar Graphs)
Pros and Cons
- A bar chart has a scale of measurement, so it gives more precise information.
- A pie chart gives a more vivid impression of relative proportions, e.g., it is obvious at a glance that N. America had more than half the telephones in the world.
Stemplots (Stem and Leaf Diagrams)

  Stem | Leaves                  | Cumulative Frequency
   4   | 7                       |  1
   5   | 448889                  |  7
   6   | 34789                   | 12
   7   | 012234455666888889999   | 33
   8   | 0022234457799           | 46
   9   | 0457                    | 50

Grades of 50 students on a test.
Find the Median
(Stemplot repeated from the previous slide.)
The 25th and 26th leaves are circled; both are 8s on the 7 stem, so the median = 78.
Exercise
(Stemplot repeated from the previous slide.)
The 1st quartile is 70 and the 3rd quartile is 82.
Boxplots (Box and Whisker Diagrams)
Elements of a Boxplot
[Annotated boxplot labeling the largest value, an outlier, the box, the whiskers, the quartiles, and the median.]
Boxplot Shows Distribution Skewed to the Left
Histograms
- For interval or ratio data.
- Data are grouped into class intervals.
- Superficially like a bar chart.
Frequency Histogram
[Histogram in which the height of each bar is the bin frequency; each bar spans one class interval (bin).]
Source: R: A language and environment for statistical computing, the R Core Development Team.
Probability Histogram
[Histogram in which the area of each bar equals the relative bin frequency, e.g., a bar of height .011 over a bin of width 25 has area .011 × 25 = .275.]
Ogives (Cumulative Frequency Polygons)
- Related to probability histograms.
- Examples of cumulative distribution functions.
- Probability histograms are examples of density functions.
Example Ogive
Relationship Between Probability Histogram and Ogive
The height of the ogive at a point is the cumulative area under the histogram up to that point.
Estimating Percentiles from Ogives
- The horizontal line has height .75.
- The vertical line intersects the horizontal axis at 60.
- The estimated 3rd quartile is 60.
- The true 3rd quartile is 62.
Scatterplots (Scatter Diagrams)
- Used for jointly observed interval or ratio variables.
- Example: heights and weights of individuals.
- Example: state per capita spending on secondary education and state crime rate.
- Example: wind speed and ozone concentration.
Example Scatterplot (centroid marked)
Fitting a Line
- The relationship between variables x and y is approximately linear.
- Approximately, y = a + bx.
- Find a and b so that the data come closest to satisfying the equation.
- Least squares: a formal mathematical technique to be shown later.
Line Fitted by Least Squares
IV. Sampling
Why Sample?
- Because the population is too large to observe all its members.
- The population may be partly inaccessible.
- The population may even be hypothetical.
Statistical Inference
- Drawing conclusions about the population based on observations of a sample.
- The reliability of inferences must be quantifiable.
- Random sampling allows probability statements about the accuracy of inferences.
Sampling With Replacement
- The population has N members.
- n population members are chosen sequentially.
- Once chosen, a member of the population may be chosen again.
- At each stage, all members of the population are equally likely to be chosen.
- This is a random experiment with N^n possible, equally likely outcomes.
Sampling With Replacement (continued)
- x is a population variable.
- X_1 = value of x for the 1st sampled individual, X_2 = value of x for the 2nd sampled individual, etc.
- Each X_i is a random variable. The random variables X_1, X_2, ..., X_n are independent.
- The sequence X_1, X_2, ..., X_n is a random sample of values of x, or a random sample from the distribution of x.
Sampling Without Replacement
- The population has N individuals.
- n members are chosen sequentially.
- Once chosen, an individual may not be chosen again.
- At each stage, all of the remaining members are equally likely to be chosen next.
- This is a random experiment with N(N - 1)...(N - n + 1) possible, equally likely outcomes.
Sampling Without Replacement (continued)
- Sample without replacement.
- Ignore the order of the sequence of individuals in the sample.
- The result is a random experiment whose outcomes are subsets of size n.
- The experiment has

  C_{N,n} = N! / (n!(N - n)!)

  possible, equally likely outcomes.
- This is the common meaning of "random sample of size n".
Random Number Generators
- Calculators and spreadsheet programs can generate pseudorandom sequences.
- Press the random number key of your calculator several times.
- This simulates a random sample with replacement from the set of numbers between 0 and 1 (to high precision).
Generating a Sample with Replacement
- Number the individuals from 1 to N.
- Generate a pseudorandom number R.
- Include individual i in the sample if i - 1 ≤ RN < i.
- Repeat n times. Individuals may be included more than once. (See the sketch below.)
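A minimal Python sketch of this procedure (the roster size and sample size are simply the numbers used in the exercise below); it draws a sample with replacement using the rule above, and shows sampling without replacement for comparison:

```python
import random

N = 30   # number of individuals, labeled 1..N
n = 10   # sample size

# With replacement: include individual i when i - 1 <= R*N < i,
# i.e., i = floor(R*N) + 1 for a pseudorandom R in [0, 1).
with_replacement = [int(random.random() * N) + 1 for _ in range(n)]

# Without replacement: each individual can appear at most once.
without_replacement = random.sample(range(1, N + 1), n)

print(with_replacement)
print(without_replacement)
```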
Exercise
Suppose you have 30 students in
your class. Use the procedure just
described to obtain a sample of size
10 (a) with replacement, (b)
without replacement.
V. Estimation
The Sample Mean and Standard Deviation
- X_1, X_2, ..., X_n is a random sample from the distribution of a population variable x.
- The sample mean is

  X̄ = (1/n)(X_1 + X_2 + ... + X_n)

- The sample variance is

  S² = [(X_1 - X̄)² + (X_2 - X̄)² + ... + (X_n - X̄)²] / (n - 1)
The Sample Mean and Standard Deviation (continued)
- The sample standard deviation is S = √S².
- The sample mean, variance, and standard deviation are all random variables because they depend on the outcome of the random sampling experiment.
Estimators
- The sample mean, variance, and standard deviation have distributions derived from the distribution of values of the population variable x.
- They are estimators of the population mean μ, the population variance σ², and the population standard deviation σ of x.
Unbiased Estimators
- The theoretical expected values of the sample mean and sample variance are equal to their population counterparts, i.e.,

  E(X̄) = μ   and   E(S²) = σ²

- X̄ and S² are said to be unbiased estimators of μ and σ², respectively.
- S is biased: E(S) < σ.
The Distribution of the Random Variable X̄
- The mean of X̄ is μ, the same as the mean of the population variable x.
- The standard deviation of X̄ is σ/√n.
- These are the theoretical mean and standard deviation.
Density Functions
A density function is a nonnegative
function such that the total area between
the graph of the function and the
horizontal axis is 1.
A probability histogram is a density
function.
Other density functions are limits of
histograms as the number of data
elements grows without bound.
The Standard Normal Density Function
Percentiles of the Standard Normal Distribution
z_α is the 100(1 - α)th percentile of the distribution.
Symmetry About the Vertical Axis
Probabilities Related to the Standard
Normal Distribution
Other Normal Distributions
- Let Z be a random variable with the standard normal distribution.
- The mean of Z is 0 and the standard deviation of Z is 1.
- Let μ and σ be any numbers, with σ > 0.
- Let Y = σZ + μ.
- Then Y has the normal distribution with mean μ and standard deviation σ.
Other Normal Distributions – Example
μ = 1 and σ = 1.5
Standardizing: The Inverse Operation
- Let Y be normally distributed with mean μ and standard deviation σ.
- Let Z = (Y - μ)/σ. This is the z-score of Y.
- Then Z has the standard normal distribution and

  P[a ≤ Y ≤ b] = P[(a - μ)/σ ≤ Z ≤ (b - μ)/σ]
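A short Python sketch of this standardization, using the μ = 1, σ = 1.5 example above (SciPy is assumed to be available, and a and b are arbitrary illustrative endpoints); it computes P[a ≤ Y ≤ b] both directly and via z-scores:

```python
from scipy.stats import norm

mu, sigma = 1.0, 1.5      # Y is normal with mean mu and standard deviation sigma
a, b = 0.0, 2.5

# Directly from the distribution of Y
p_direct = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)

# Via the z-scores of a and b, using the standard normal distribution
z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
p_standardized = norm.cdf(z_b) - norm.cdf(z_a)

print(p_direct, p_standardized)   # the two probabilities agree
```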
The Central Limit Theorem
- Let X̄ be the sample average of a random sample of n values of a population variable x.
- The population variable x has mean μ and standard deviation σ.
- Standardize X̄ by subtracting its mean and dividing by its standard deviation:

  Z = (X̄ - μ)/(σ/√n) = √n(X̄ - μ)/σ
The Central Limit Theorem (continued)
Get Ready for the Central Limit Theorem!
The Central Limit Theorem (continued)
The Central Limit Theorem: As the sample size n grows without bound, the distribution of Z approaches the standard normal distribution. This is true no matter what the distribution of the population variable x is.
Another Statement of the CLT
For sufficiently large sample sizes n and for all numbers a and b,

  P[a ≤ X̄ ≤ b] ≈ P[√n(a - μ)/σ ≤ Z ≤ √n(b - μ)/σ]

In almost all applications, n ≥ 50 is large enough.
The CLT in Action
Sample n = 30 from the population variable COUNTS whose distribution is tabulated below. Calculate the sample average. Repeat this 500 times and construct a histogram of the z-scores of the 500 sample averages. Note: the distribution of COUNTS is very far from normal.

  x'_j:  0    1    2    3    4    5    6
  f_j:  .36  .33  .19  .08  .02  .01  .01
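A Python sketch of this demonstration (NumPy and Matplotlib are assumed to be available); it repeats the sampling experiment 500 times and plots a histogram of the z-scores of the sample averages:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.arange(7)                                   # possible values of COUNTS
probs = np.array([.36, .33, .19, .08, .02, .01, .01])   # their relative frequencies

mu = np.sum(values * probs)                             # population mean
sigma = np.sqrt(np.sum((values - mu) ** 2 * probs))     # population standard deviation

n, reps = 30, 500
rng = np.random.default_rng()
averages = np.array([rng.choice(values, size=n, p=probs).mean() for _ in range(reps)])
z_scores = (averages - mu) / (sigma / np.sqrt(n))       # standardize each sample average

plt.hist(z_scores, bins=20, density=True)
plt.title("z-scores of 500 sample averages (n = 30) from COUNTS")
plt.show()
```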
Distribution of COUNTS
Result: 500 Averages of 30 Samples from COUNTS
Estimating a Population Mean
- The sample mean X̄ is an unbiased estimator of the population mean μ.
- For "large" sample sizes n, X̄ has approximately a normal distribution with mean μ and standard deviation σ/√n.
- For large n, the sample mean is an accurate estimator of the population mean with high probability.
Example
- Suppose σ = 2 and we want X̄ to estimate μ with an error no greater than 0.05.
- Assume X̄ is exactly normally distributed. Standardizing,

  P[|X̄ - μ| ≤ 0.05] = P[|Z| ≤ 0.025√n]
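A quick SciPy sketch (assumed available) of how this probability grows with n, using P[|Z| ≤ 0.025√n] = 2Φ(0.025√n) - 1; the sample sizes shown are just illustrative:

```python
import numpy as np
from scipy.stats import norm

sigma, error = 2.0, 0.05
for n in [100, 1000, 5000, 10000]:
    z = error * np.sqrt(n) / sigma          # equals 0.025 * sqrt(n) when sigma = 2
    prob = 2 * norm.cdf(z) - 1              # P[|Z| <= z]
    print(n, round(prob, 3))
```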
Probabilities of 1-place Accuracy (σ = 2)
Confidence Intervals for the Population Mean – Review of z_{α/2}
100(1 - α)% Confidence Interval
- By the CLT,

  1 - α ≈ P[-z_{α/2} ≤ (X̄ - μ)/(σ/√n) ≤ z_{α/2}]

- Rearranging the inequalities,

  1 - α ≈ P[X̄ - z_{α/2} σ/√n ≤ μ ≤ X̄ + z_{α/2} σ/√n]
A Difficulty
σ is probably unknown, so the confidence interval

  X̄ ± z_{α/2} σ/√n

can't be used. What to do?
Enhanced Central Limit Theorem
- Define the modified z-score for X̄ as

  Z = (X̄ - μ)/(S/√n) = √n(X̄ - μ)/S

- As n grows without bound, the distribution of Z approaches the standard normal distribution.
A More Useful Confidence Interval
- By the enhanced CLT,

  1 - α ≈ P[X̄ - z_{α/2} S/√n ≤ μ ≤ X̄ + z_{α/2} S/√n]

- An approximate 100(1 - α)% confidence interval is

  X̄ ± z_{α/2} S/√n
Example
- n = 50 from COUNTS (μ = 1.14)
- X̄ = 1.32
- S = 1.39
- 1 - α = .95

  X̄ ± z_{α/2} S/√n = 1.32 ± 1.96 × 1.39/√50 = 1.32 ± 0.39

- 95% confidence interval: (0.93, 1.71)
- Don't say .95 = P[0.93 < μ < 1.71].
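A minimal Python sketch of this calculation (the sample mean and standard deviation are hard-coded from the slide; SciPy is assumed for the critical value):

```python
import math
from scipy.stats import norm

n, xbar, s = 50, 1.32, 1.39     # sample size, sample mean, sample standard deviation
alpha = 0.05                    # for a 95% confidence interval

z_crit = norm.ppf(1 - alpha / 2)          # z_{α/2} ≈ 1.96
margin = z_crit * s / math.sqrt(n)        # ≈ 0.39

print((round(xbar - margin, 2), round(xbar + margin, 2)))   # ≈ (0.93, 1.71)
```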
Confidence Intervals for Proportions
- x is a population variable with only two values, 0 and 1.
- This is a numerical code for two mutually exclusive categories, e.g., "male" and "female", or "approves" and "disapproves".
- p = relative frequency of x = 1.
- μ = p; σ² = p(1 - p)
Confidence Intervals for Proportions (continued)
- Sample n values of x, with replacement. The result is a sequence of 1s and 0s.
- The sample mean is the relative frequency of 1s in the sample, e.g., the relative frequency of females in the sample of individuals.
- Denote the sample mean by p̂, since it is an estimator of p.
Confidence Intervals for Proportions (continued)
- By the enhanced CLT,

  Z = √n(p̂ - p)/√(p̂(1 - p̂))

  is approximately standard normal.
- An approximate 100(1 - α)% confidence interval is

  p̂ ± z_{α/2} √(p̂(1 - p̂)/n)
Example
A public opinion research
organization polled 1000 randomly
selected state residents. Of these,
413 said they would vote for a 1¢
sales tax increase dedicated to
funding higher education. Find a
90% confidence interval for the
proportion of all voters who would
vote for such a proposal.
Solution
- n = 1000
- p̂ = 413/1000 = 0.413
- 1 - α = .90; z_{α/2} = z_.05 = 1.645

  p̂ ± z_{α/2} √(p̂(1 - p̂)/n) = 0.413 ± 1.645 √(0.413 × 0.587 / 1000) = 0.413 ± 0.026

- 90% confidence interval: (0.387, 0.439)
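The same computation as a Python sketch (SciPy is assumed for the critical value):

```python
import math
from scipy.stats import norm

n, successes = 1000, 413
p_hat = successes / n
alpha = 0.10                               # for a 90% confidence interval

z_crit = norm.ppf(1 - alpha / 2)           # z_{α/2} ≈ 1.645
margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)

print((round(p_hat - margin, 3), round(p_hat + margin, 3)))   # ≈ (0.387, 0.439)
```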
Linear Regression and Correlation
- x and y are jointly observed numeric variables, i.e., defined for the same population or arising from the same experiment.
- We have observations for n individuals or outcomes.
- Data: (x_1, y_1), ..., (x_n, y_n)
Examples
- (An observational study) Let x be the height and y the weight of individuals from a human population.
- (A designed experiment) Let x be the amount of fertilizer applied to a plot of cotton seedlings and let y be the weight of raw cotton harvested at maturity.
Data on Fertilizer and Cotton Yield

  x:  2    2    2    4    4    4    6    6    6    8    8    8
  y:  2.3  2.2  2.2  2.5  2.9  2.7  3.4  2.7  3.4  3.5  3.4  3.3
Scatterplot of Fertilizer vs. Yield
Assumptions of Linear Regression
- There is a population or distribution of values of y for any particular value of x.
- There are unknown constants a and b so that for any particular value of x, the mean of all the corresponding values of y is

  μ_y = a + bx

- The standard deviation σ of the values of y corresponding to a value of x is the same for all values of x.
The Method of Least Squares
- Estimate a and b by choosing them to minimize the sum of squared differences between the observed values y_i and their putative expected values a + bx_i.
- In symbols, minimize

  (y_1 - a - bx_1)² + (y_2 - a - bx_2)² + ... + (y_n - a - bx_n)²
The Least Squares Estimates
- Let x̄ and ȳ be the means of the observed x's and y's. Let s_x² be the sample variance of the x's.
- The covariance between the x's and the y's is

  s_xy = [(x_1 - x̄)(y_1 - ȳ) + ... + (x_n - x̄)(y_n - ȳ)] / (n - 1)

- The least squares estimate of the slope is

  b̂ = s_xy / s_x²

- The least squares estimate of the intercept is

  â = ȳ - b̂x̄
Least Squares Line for Cotton Yield
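A Python sketch (NumPy assumed) that computes the least squares line for the fertilizer/yield data using the formulas above:

```python
import numpy as np

x = np.array([2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8], dtype=float)
y = np.array([2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3])

n = len(x)
xbar, ybar = x.mean(), y.mean()
s_xx = np.sum((x - xbar) ** 2) / (n - 1)              # sample variance of the x's
s_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)      # sample covariance

b_hat = s_xy / s_xx                # least squares slope
a_hat = ybar - b_hat * xbar        # least squares intercept
print(a_hat, b_hat)                # ≈ 1.883 and 0.198 for these data
```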
Correlation
- The correlation between the x's and y's is

  r = s_xy / (s_x s_y)

- r is related to the slope b̂ of the least squares regression line by

  r = b̂ s_x / s_y

- r is always between -1 and 1. r measures how nearly linear the relationship between x and y is. If r = 0, then x and y are uncorrelated.
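A self-contained Python sketch (NumPy assumed) that computes the correlation for the cotton data and checks the relation r = b̂ s_x / s_y:

```python
import numpy as np

x = np.array([2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8], dtype=float)
y = np.array([2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3])

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
s_x, s_y = x.std(ddof=1), y.std(ddof=1)                    # sample standard deviations

r = s_xy / (s_x * s_y)             # correlation coefficient
b_hat = s_xy / s_x ** 2            # least squares slope
print(r, b_hat * s_x / s_y)        # both print the same value, confirming the relation
```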
Examples