Download 1.1

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Note that these are textbook
chapters, although Lecture
Notes may be referenced.
Chapter 1
Overview and Descriptive Statistics
1.1 - Populations, Samples and Processes
1.2 - Pictorial and Tabular Methods in
Descriptive Statistics
1.3 - Measures of Location
1.4 - Measures of Variability
1
What is “random variation” in the distribution of a population?
Examples: Toasting time, Temperature settings, etc. of a population of toasters…
POPULATION 1: Little to no variation (e.g., product manufacturing)
In engineering situations such as this, we
try to maintain “quality control”… i.e.,
“tight tolerance levels,” high precision,
low variability.
But what about a population of, say, people?
4
What is “random variation” in the distribution of a population?
Example: Body Temperature (F)
POPULATION 1: Little to no variation (e.g., clones)
Most individual values ≈ population mean value
Density
Very little variation
about the mean!
98.6 F
5
What is “random variation” in the distribution of a population?
Example:
Examples:Body
Gender,
Temperature
Race, Age,
(F)
Height, Annual Income,…
POPULATION 2: Much variation (more common)
Density
Much more
variation about the
mean!
6
Example
• Click on image
for full .pdf article
• Links in article
to access datasets
Study Question:
How can we estimate
“mean age at first birth”
of women in the U.S.?
POPULATION
Women in U.S. who
have given birth
“Random Variable”
X = Age at first birth
Suppose we know that X follows a “normal
distribution” (a.k.a. “bell curve”) in the population.
That is, the Population Distribution of X ~ N(, ).
 and  are
“population
characteristics”
i.e., “parameters”
(fixed, unknown)
How is this accomplished?
Hospital records, etc.
standard
deviation
σ
“sampling frame”
mean μ = ???
{x1, x2, x3, x4, … , x400}
Study Question:Other possible parameters:
How can we estimate
• standard POPULATION
deviation
“mean age at first birth”• median
Women in U.S. who
of women in the U.S.?
• minimum
have given birth
•
maximum
“Random Variable”
x = 25.6 is an example of a “sample
characteristic” = “statistic.”
(numerical info culled from a sample)
Suppose we know that X follows a “normal
This is called a “point estimate“ of 
distribution” (a.k.a. “bell curve”) in the population.
from the one sample.
That is, the Population Distribution of X ~ N(, ). Can it be improved, and if so, how?
• Choose a bigger sample, which
standard
should reduce “variability.”
 and  are
???
deviation
• Average the sample means of
“population
σ
many samples, not just one.
characteristics”
(introduces “sampling variability”)
i.e., “parameters”
“Sampling Distribution” ~ ???
(fixed, unknown)
X = Age at first birth
?????????
How big???
mean μ = ???
{x1, x2, x3, x4, … , x400}
FORMUL
A
mean x = 25.6
mean x = 25.6
Without knowing every value in the population, it is not possible
to determine the exact value of  with 100% “certainty.”
mean x = 25.6
Study Question:
How can we estimate
“mean age at first birth”
of women in the U.S.?
POPULATION
Women in U.S. who
have given birth
“Random Variable”
X = Age at first birth
Suppose we know that X follows a “normal
distribution” (a.k.a. “bell curve”) in the population.
That is, the Population Distribution of X ~ N(, ).
 and  are
“population
characteristics”
i.e., “parameters”
(fixed, unknown)
standard
deviation
σ
For concreteness,
suppose  = 1.5
mean μ = ???
{x1, x2, x3, x4, … , x400}
FORMUL
A
mean x = 25.6
95% CONFIDENCE INTERVAL FOR µ
25.453
mean x = 25.6
25.747
μ
Without knowing every value in the population, it is not possible
to determine the exact value of  with 100% “certainty.”
BASED ON OUR SAMPLE DATA, the true value of μ is between
25.453 and 25.747, with 95% “confidence” (…akin to “probability”).
This is called an
“interval estimate“ of
 from the sample.
Used in “Statistical Inference”
via “Hypothesis Testing”…
Study Question:
How can we estimate
“mean age at first birth”
of women in the U.S.?
• Arithmetic Mean
POPULATION
Women in U.S. who
have given birth
“Random Variable”
X = Age at first birth
“population
characteristics”
i.e., “parameters”
(fixed, unknown)
x1  x2 
n
 xn
• Geometric Mean
xG  n x1 x2
xn
• Harmonic Mean
Suppose we know that X follows a “normal
distribution” (a.k.a. “bell curve”) in the population.
That is, the Population Distribution of X ~ N(, ).
 and  are
xA 
standard
deviation
σ
xH 
1
x1
 x12
n

Each of these gives an
estimate of  for a particular
sample.
Any
general
sample
estimator for  is denoted by
the symbol ˆ .
Likewise for

and
mean μ = ???
{x1, x2, x3, x4, … , xn}
 x1n
FORMUL
A
mean x
ˆ .
Study Question:
How can we estimate
“mean age at first birth”
of women in the U.S.?
POPULATION
Women in U.S. who
have given birth
“Random Variable”
X = Age at first birth
Suppose we know that X follows a “normal
distribution” (a.k.a. “bell curve”) in the population.
That is, the Population Distribution of X ~ N(, ).
 and  are
“population
characteristics”
i.e., “parameters”
(fixed, unknown)
Extending these ideas to
other parameters of a
population gives rise to
the general theory of…
“PARAMETER
ESTIMATION”
standard
deviation
σ
mean μ = ???
{x1, x2, x3, x4, … , xn}
FORMUL
A
mean x
POPULATION
composed of “units” (people, rocks, toasters,...)
 To make certain calculations simpler, we
assume that populations are “arbitrarily large”
(or indeed, infinite).
What do we want to know
about this population?
How is…
“Random Variable” X
(age, income level, …)
… distributed?
Suppose we know
know that
that XX follows
follows aa “normal
known
but with parameters
distribution” distribution”
“probability
(a.k.a. “bell curve”)
in the population…
in the population.
That is, the Population Distribution of X ~ Dist(
N(, 1)., 2,…).
 and  are
“population
characteristics”
i.e., “parameters”
(fixed, unknown)
standard
deviation
heavily
skewed tail
σ
mean μ = ???
1 , 2 ,
unknown vals.
SAMPLE
For a particular , want
to define a corresponding
“parameter estimator” ˆ
Ideal properties…
• Unbiased estimator of 
• Minimum Variance among
all such unbiased estimators
i.e., “MVUE”
POPULATION
composed of “units” (people, rocks, toasters,...)
 To make certain calculations simpler, we
assume that populations are “arbitrarily large”
(or indeed, infinite).
“Random Variable”
X = any numerical value that can be
assigned to each unit of a population
“Random” refers to the notion that this
value is unknown until actually observed
(usually as part of an outcome of an
experiment to test a specific hypothesis).
Contrast this with the idea of a
“nonrandom” variable with no empirical
error, e.g., X = # cards in a deck = 52.
What do we want to know
about this population?
How is…
“Random Variable” X
(age, income level, …)
… distributed?
Quantitative [measurement]
 length
 mass
 temperature
 pulse rate
 # puppies
 shoe size
There are two general types.........
Quantitative and Qualitative
10
10½
11
16
POPULATION
composed of “units” (people, rocks, toasters,...)
 To make certain calculations simpler, we
assume that populations are “arbitrarily large”
(or indeed, infinite).
“Random Variable”
X = any numerical value that can be
assigned to each unit of a population
“Random” refers to the notion that this
value is unknown until actually observed
(usually as part of an outcome of an
experiment to test a specific hypothesis).
Contrast this with the idea of a
“nonrandom” variable with no empirical
error, e.g., X = # cards in a deck = 52.
What do we want to know
about this population?
How is…
“Random Variable” X
(age, income level, …)
… distributed?
Quantitative [measurement]
 length
 mass
 temperature
 pulse rate
 # puppies
 shoe size
CONTINUOUS
(can take their values at any
point in a continuous interval)
DISCRETE
(only take their values in
disconnected jumps)
There are two general types.........
Quantitative and Qualitative
17
POPULATION
composed of “units” (people, rocks, toasters,...)
 To make certain calculations simpler, we
assume that populations are “arbitrarily large”
(or indeed, infinite).
“Random Variable”
X = any numerical value that can be
assigned to each unit of a population
“Random” refers to the notion that this
value is unknown until actually observed
(usually as part of an outcome of an
experiment to test a specific hypothesis).
Contrast this with the idea of a
“nonrandom” variable with no empirical
error, e.g., X = # cards in a deck = 52.
There are two general types.........
Quantitative and Qualitative
What do we want to know
about this population?
How is…
“Random Variable” X
(age, income level, …)
… distributed?
Qualitative [categorical]
 video game levels (1, 2, 3,...)
1
2
3
 income level (low, mid, high)
 zip code
 PIN #
1
2
3
 color (Red, Green, Blue)
ORDINAL,
RANKED
(ordered
labels)
NOMINAL
(unordered
labels)
IMPORTANT SPECIAL CASE:
Binary (or Dichotomous)
1, "Success"
X 
• “Pregnant?” (Yes / No)
0, "Failure"
• Coin toss (Heads / Tails)
• Treatment (Drug / Placebo)
18
1, "Success"
Y 
0, "Failure"
POPULATION
Define a new parameter
 = P(Success)
Point estimator
Suppose we intend to
select a random sample of
size n from this population
of Success and Failures…
… in such a way that the
“Success or Failure” outcome
of any selected individual
conveys no information about
the “Success or Failure”
outcome of any other
selected individual.
That is, the “Success or Failure” outcomes
between any two individuals are independent.
(Think of tossing a coin n times.)
ˆ  ?
Random Variable
Let X = “Number of Successes
in the sample.” (0, 1, 2, …, n)
Then a natural estimator for  could be
the sample proportion of Success
X
ˆ 
n
Ex: n = 500 tosses, X= 285 Heads 
ˆ 
285
 0.57
500