Variables and their distributions
• Distribution: Describes what values a variable
takes and how frequently these values occur.
The distribution of a variable can be described
graphically and numerically in terms of “shape”,
“center” and “spread”.
• Mean is the “average value”.
• If there are n observations x1, x2, …, xn, then the
mean is x̄ = (x1 + x2 + … + xn)/n
• For example, if the data are 3, 2, 3, 6, 1, then
their mean (or average) is (3+2+3+6+1)/5 = 3.0
• Median is the “midpoint”
• 50% of observations are smaller than the median
and 50% are larger than the median
• If n is odd then the median is the center
observation in the ordered list
• If n is even then the median is the mean of the
two center observations in the ordered list
• For example if the data are: 3, 2, 3, 6, 1, we can
order them 1, 2, 3, 3, 6 and see that the median is 3
• Mode is the observation that occurs most
frequently; the mode may not be unique (there may
be more than one mode)
• For example if the data are: 3, 2, 3, 6, 1,
the mode is 3 because it occurs most frequently
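A minimal sketch in Python (using the example data above) that checks these three summaries with the standard library's statistics module:

```python
from statistics import mean, median, mode

data = [3, 2, 3, 6, 1]
print(mean(data))    # 3 -> (3 + 2 + 3 + 6 + 1) / 5
print(median(data))  # 3 -> middle of the ordered list 1, 2, 3, 3, 6
print(mode(data))    # 3 -> occurs twice, more often than any other value
```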
• Outliers usually demand investigation
• Often they are errors in the data (e.g. due to
instrument failure or errors in recording) but they
also may be very important
(e.g. a new scientific observation)
• If there is no reason to suspect they have been
wrongly recorded, we may want to use summaries
that are resistant to their influence (e.g., medians
rather than means)
• Outliers should not be discarded without good
reason
• A measure of spread conveys information
regarding variability – how dispersed the
distribution is
• Common numerical summaries of spread
• Variance (s^2)
• Standard Deviation (SD & s)
• Range (largest minus smallest observation)
and IQR
The concept of variance:
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation (xi) can be
measured by its distance from the center (e.g. the
mean), xi − x̄. Since we want this to always be a
positive number, we consider its square, (xi − x̄)²
• If we take the sum of such “squared deviations
from the mean” as a measure of variability, we
realize that we need to take its average
• Variance is the average (almost) of the squared
deviations from the mean; the units of variance are
squared units
• If there are n observations x1, x2, …, xn, then the
variance is
s² = [(x1 − x̄)² + (x2 − x̄)² + … + (xn − x̄)²] / (n − 1)
(the “almost”: we divide by n − 1 rather than n)
Standard Deviation
• The standard deviation (SD) is the square root of
the variance: s = √(s²)
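A short sketch (same example data) computing the sample variance and SD by hand and checking against the standard library:

```python
from statistics import variance, stdev

data = [3, 2, 3, 6, 1]
n = len(data)
xbar = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5

print(s2, variance(data))  # both: 3.5
print(s, stdev(data))      # both: sqrt(3.5) ≈ 1.8708
```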
Quartiles and the Interquartile Range
• The first quartile Q1 is the median of the
observations in the ordered list to the left of the
overall median (25% are smaller than Q1 and 75%
are larger)
• The third quartile Q3 is the median of the
observations in the ordered list to the right of the
overall median
• Interquartile Range, IQR = Q3 - Q1, is a
measure of variability of the distribution
(IQR contains middle 50% of the observations)
• Example: For the observations 1, 2, 3, 4, 5
Q1 = 1.5, Median = 3, Q3 = 4.5, and IQR = 3
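A sketch of one common convention, the one matching the definition above (lower/upper halves exclude the overall median when n is odd); note that software packages use several slightly different quartile rules:

```python
from statistics import median

def quartiles(data):
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]        # left of the overall median
    upper = xs[(n + 1) // 2 :]  # right of the overall median
    return median(lower), median(xs), median(upper)

q1, med, q3 = quartiles([1, 2, 3, 4, 5])
print(q1, med, q3, q3 - q1)  # 1.5 3 4.5 3.0  (IQR = 3)
```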
• Boxplot: graphically displays several important
features of a distribution, including the median,
quartiles and outliers. It is a tool for visualizing the
location (center) and variation of quantitative data,
and for illustrating differences between 2 or more
groups of data
Constructing a boxplot
• Draw a box whose ends are the lower and
upper quartiles Q1 and Q3 (length of box is
equal to the IQR)
• Mark the median by a line within the box
• Observations greater than Q3 + (1.5 x IQR)
or less than (Q1 – 1.5 x IQR) are
considered to be outliers and highlighted
• Draw lines from the quartiles to the most
extreme values that are not marked as
outliers (called whiskers)
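A minimal sketch with matplotlib (hypothetical data chosen so one point violates the 1.5 × IQR rule):

```python
import matplotlib.pyplot as plt

data = [1, 2, 3, 3, 4, 5, 6, 7, 20]  # 20 lies beyond Q3 + 1.5 * IQR

# whis=1.5 applies the outlier rule above; flagged points are drawn
# individually, and whiskers stop at the most extreme non-outliers
plt.boxplot(data, whis=1.5)
plt.title("Boxplot with the 1.5 x IQR outlier rule")
plt.show()
```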
• Bar graphs: used for categorical data. Display the
count/percentage of individuals in each category of
the categorical variable
• Histograms: used for quantitative data. Display
the count/percentage of individuals within intervals
of equal width; the number of intervals and the
choice of interval width are important
• Histograms emphasize the distribution of values,
as well as the center and spread of a distribution
• A histogram graphically summarizes the
distribution of one variable, showing:
1. Center (i.e., the location) of the data
2. Spread (i.e., the variation)
3. Skewness (departure from left-right symmetry)
4. Presence of outliers
5. Presence of multiple modes (high-frequency
values) in the data
• Some of these features (e.g., multiple modes) are
strengths of a histogram compared to a boxplot
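A sketch with matplotlib and synthetic data (the bins parameter sets the number of intervals, which is the important choice noted above):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)  # synthetic data

# Try several bin counts: the interval width changes the picture
plt.hist(values, bins=20, edgecolor="black")
plt.xlabel("value")
plt.ylabel("count")
plt.show()
```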
Density Curves
• It is often easier to conceptualize a population of
values with smooth curves rather than histograms
• The curve serves as a mathematical model for the
distribution
• [Figure (data from IPS, p 66): histogram of Iowa
test vocabulary scores for Gary, Indiana 7th graders
(n = 947), relative frequency on the vertical axis,
with an approximating normal density curve
overlaid]
Properties of a density curve
• The curve never goes below zero (it sits on or
above the horizontal axis)
• Total area under the curve is 1
• Areas under the curve and between two x values
give (an approximation to) the relative frequency
of values in the population between those x values
• Shapes follow those of histograms
The “normal distributions”
• The normal distributions are a family of density
curves indexed by their means (µ) and standard
deviations (σ)
• The curves are symmetric, unimodal, bell-shaped
Normal distributions – N(µ,σ)
• The family of “normal distributions” are
symmetric, bell-shaped density curves
• All normal distributions have the same shape, but
with possibly different means (µ) and standard
deviations (σ)
• Common notation for a normal distribution: N(µ,σ)
The Standard Normal Distribution Z ~ N(0,1)
• The standard normal distribution, called the Z
distribution, has µ = 0 and σ = 1, so we write
Z ~ N(0,1)
• All tables of the normal distribution are for Z ~
N(0,1)
• If Y ~ N(µ,σ) then we can standardize it by
Z = (Y − µ)/σ, so that Z ~ N(0,1)
The 68-95-99.7 rule
• All normal distributions follow the 68-95-99.7
rule
.. 68% of observations fall within σ of µ
.. 95% fall within 2σ of µ
.. 99.7% fall within 3σ of µ
• Conversely, if a distribution has this property
then it is normal or nearly normal
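A quick numerical check of the rule using scipy's standard normal CDF (any µ and σ give the same answer after standardizing):

```python
from scipy.stats import norm

# P(µ - k·σ < Y < µ + k·σ) equals P(-k < Z < k) for the standard normal
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p, 4))  # 0.6827, 0.9545, 0.9973
```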
Why is the “normal distribution” so common?
• The result became known as the “Central Limit
Theorem”: under general conditions, the
distribution of a sum (or average) of many random
quantities is close to a normal distribution
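An illustrative simulation with numpy (synthetic skewed data): averaging many skewed values yields an approximately normal distribution of averages:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 averages, each of 50 skewed (exponential) values
means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The averages cluster symmetrically around 1.0 with SD ≈ 1/sqrt(50)
print(means.mean(), means.std())  # ≈ 1.0, ≈ 0.141
```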
How to “standardize” an observation
• Subtract the mean from the observation
• Divide by the standard deviation
• if Y is N(µ,σ) distributed, then Z = (Y- µ)/σ has a
N(0,1) distribution
• The standardized value is often called a “z-score”
Standardizing
• Example: Grades of a previous STAT 104 final
exam
- the mean was 66 and the standard deviation was
12
• Student A scored 78 and student B scored 48
• Since 1 SD = 12 points, student A scored 1 SD
above the mean (Z = (78 − 66)/12 = 1)
• Student B scored 1.5 SDs below the mean
(Z = (48 − 66)/12 = −1.5)
• A standardized score takes into account the
spread of the data
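The same computation as a tiny sketch (exam mean and SD taken from the example above):

```python
mu, sigma = 66, 12  # exam mean and SD from the example

def z_score(y):
    """Standardize: subtract the mean, divide by the SD."""
    return (y - mu) / sigma

print(z_score(78))  # 1.0  -> one SD above the mean
print(z_score(48))  # -1.5 -> 1.5 SDs below the mean
```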
Normal distribution – finding the probability to
the right of a Z-score
• Since the normal tables give the probability
to the left of a Z-score, we use subtraction
and the fact that the total probability is 1
Example - SAT Verbal Scores
(IPS p 79)
• SAT verbal test scores have an approximately
normal distribution with µ = 505 and σ = 110 [X ~
N(505,110)]
What test score will place a student in the top
10%?
• So we want to find x0 such that Prob(X > x0) = 0.1
• Standardizing, this is the same as finding z0 such
that Prob(Z > z0) = 0.1, where Z = (X − µ)/σ ~
N(0,1)
• Because Table A only gives the area to the left, we
restate the problem: what z0 has area 0.9 to the
left?
Example - SAT Verbal Scores
• Prob (Z < 1.28) = 0.9 from Table A,
so Prob (Z > 1.28) = 0.1
To determine the SAT score, set z0 = 1.28 = (x0 –
505)/110 and solve for x0, so x0 = 505 +
(1.28)(110) = 645.8
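The same calculation with scipy's inverse normal CDF in place of Table A (ppf gives the z with a specified area to its left):

```python
from scipy.stats import norm

mu, sigma = 505, 110
z0 = norm.ppf(0.90)   # z0 with area 0.9 to the left; ≈ 1.2816
x0 = mu + z0 * sigma
print(round(z0, 2), round(x0, 1))  # 1.28 646.0 (the table value 1.28 gives 645.8)
```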
Some properties of normal distributions
• Think of Prob(Y < w) as the probability of an
event, where “Y < w” is the event [shortened to
P(Y < w)]
• When dealing with distributions P(Y < w) can
also be interpreted as a proportion or relative
frequency
• Table A gives us P(Z < z0) when Z ~ N(0,1)
• A plot that can be used to assess normality is
called a normal quantile plot (or normal probability
plot)
• A tool that will become useful later
• P(Z < −z0) = P(Z > z0), i.e. the tails are
symmetric
• P(z1 < Z < z2) = P(Z < z2) − P(Z < z1)
• P(Z ≤ z0) = P(Z < z0), since P(Z = z0) = 0
Normal quantile plots
1. Sort the n data values into increasing order
2. Assign each value a percentile (e.g., the i-th
smallest value is at roughly the i/(n+1) quantile)
3. Find the z-score corresponding to each percentile
under the standard normal distribution
4. Plot each data point y (vertical axis) against
the corresponding z (horizontal axis)
5. If the data distribution is close to the normal
distribution then the plotted points will lie
close to a straight line
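A sketch using scipy's probplot, which carries out steps 1-4 and overlays a reference line (the sample data are hypothetical):

```python
import matplotlib.pyplot as plt
from scipy import stats

data = [3, 2, 3, 6, 1, 4, 5, 3, 2, 4]
stats.probplot(data, dist="norm", plot=plt)  # normal quantile plot
plt.show()  # near-normal data -> points near the straight line
```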
Quantile plot vs histogram vs boxplot
• Unlike a histogram, a quantile plot does not
require an arbitrary definition of bins (the width of
the bars)
• A boxplot will show symmetry, but is not good at
indicating when the tails have “too many outliers”
for normality
Transforming data to “normality”
• Consider the elimination of outliers – with
caution
• If the data are positive and skewed, then consider
transforming the data using the natural logarithm
• Other possible transformations include the class
of power transformations X^k where k ≠ 0 (e.g. k =
½). Many methods described later in the course
are more reliable when the data are normally
distributed, or nearly so, but such transformations
do not always work
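A small numpy illustration (synthetic lognormal data, so the log transform works exactly): positive, right-skewed values become symmetric after taking natural logs:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # positive, right-skewed

y = np.log(x)  # natural log transform

print(np.mean(x), np.median(x))  # mean well above median: right skew
print(np.mean(y), np.median(y))  # mean ≈ median: roughly symmetric
```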
Relationships between variables
.. Categorical variables – a limited set of
outcomes
.. Quantitative variables – take on numerical
values (arithmetic operations are meaningful)
• Use boxplots to examine the relationship
between a categorical variable and a quantitative
variable
• Use scatterplots to look at the relationship
between two quantitative variables (measured on
the same individuals)
(a scatterplot is usually the first step when
studying the relationship)
Positive and negative associations
• Two variables measured on the same individuals
are called positively associated if increasing values
of one variable tend to occur with increasing
values of the other
• They are negatively associated if increasing
values of one variable occur with decreasing
values of the other
Response and explanatory variables
• Response variable, denoted as Y, measures the
outcome of a study. Y is the variable we want to
predict/explain (often called the dependent
variable)
• Explanatory variable, denoted as X, is a variable
that may predict/explain (but not necessarily cause)
the response variable (often called the predictor
variable); frequently there are many possible
explanatory variables
Linear relationships
• The relationship between two variables is said to
be linear if the points on the scatterplot lie
(approx.) on a straight line.
• A perfect linear relationship between a response
variable (Y) and an explanatory variable (X) is
Y = a + bX
• A positive linear relationship means b > 0
• A negative linear relationship means b < 0
• What if b = 0? Flat.
Correlation
• Correlation is a measure of the strength of the
linear relationship between two variables
• It is usually denoted by r with a range of -1 to 1
.. r = 1 means the relationship between the two
variables X and Y is exactly positive linear
.. r = −1 indicates the relationship is exactly
negative linear
.. r = 0 indicates a very weak (or no) linear
relationship
Correlation
• Definition: Suppose we have n pairs of
observations (x1,y1),…,(xn,yn) on two variables X
and Y. The correlation between X and Y is given by
r = [1/(n − 1)] Σ [(xi − x̄)/sx][(yi − ȳ)/sy]
where sx and sy are the SDs of X and Y
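A direct translation of this formula into Python (toy data; the two calls check the r = 1 and r = −1 cases described above):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = [1/(n-1)] * sum of standardized-x times standardized-y."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - xbar) / sx * (y - ybar) / sy
               for x, y in zip(xs, ys)) / (n - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (exact positive linear)
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (exact negative linear)
```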
Properties of r, the correlation coefficient
• r always between –1 and +1
• r is 1 or –1 only if points lie exactly on a straight
line
• sign of r indicates a positive or negative
association
• r is unaltered by changes in units of X or Y
• absolute value of r measures the strength of the
linear relationship
• r has no direct interpretation as a percent or
proportion (e.g., r = 0.8 is not twice as strong as r
= 0.4)
• Association does not imply causation
Least-squares regression
• Situation: 2 quantitative variables
• A regression line is a straight line that describes
how a response variable (Y) changes as an
explanatory variable (X) changes
• Unlike correlation, regression requires that we
have a response variable (Y) and an explanatory
variable (also called predictor variable) (X)
Least-squares regression – the formulas
Suppose we have n pairs of observations on X and
Y: (x1,y1), (x2,y2), (x3,y3), ... , (xn,yn)
We want to find the straight line that best “fits” the
data. This line has an equation of the form
ŷ = a + bx
where ŷ (“y hat”) is the predicted value of Y, a is
the y-intercept (the value of Y when X = 0), and b
is the slope of the line
Least-squares criterion
The “best-fitting” line is the line that makes the
sum of the squares of the vertical deviations from
the data points to the line as small as possible,
i.e. it minimizes the quantity
Σ (yi − (a + bxi))²
We want to solve for the a and b that make this
quantity as small as possible
Least-squares intercept (a) and slope (b)
The values of a and b that minimize this quantity
are
b = r (sy / sx) and a = ȳ − b x̄
where sx and sy are the standard deviations of
X and Y and r is the correlation coefficient
between X and Y
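A self-contained sketch of these formulas, using the equivalent form b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² (which equals r·sy/sx); the toy data lie exactly on a line, so the fit is exact:

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0  -> y-hat = 1 + 2x
```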
Interpreting the regression line
• The least-squares regression line for these data is:
ŷ = 64.93 + 0.635x
• For these data, b = 0.635 (the slope), so that
height increases on average by 0.635 centimeters
for each one-month increase in age
• For these data, a = 64.93 (the y-intercept), the
point where the line (if extended) crosses the y
axis at x = 0
The correlation between X and Y is the same as
between Y and X, but the least-squares regression
of Y on X is different from the regression of X on Y
Interpreting r² in regression
• r² is the fraction of the variation in the values of y
that is explained by the least-squares regression of
y on x. Thus, r² = (variance of ŷ) / (variance of y),
where ŷ are the predicted values (ŷ = a + bx) and
y are the observed values
Analysis of residuals
• Definition: residual = observed y − predicted y =
y − ŷ. So the data (y) are the predicted values (ŷ)
[the pattern] plus the residuals [deviations from the
pattern]
DATA = FIT + RESIDUALS
• Residual plot: a scatterplot of residuals versus x
values
• Ideally, the residuals:
.. are close to zero for all values of x
.. have no pattern when plotted against the
explanatory variable x (random scatter)
.. have a normal distribution with mean 0
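A short sketch (hypothetical data, reusing the least_squares function from the earlier sketch) showing the DATA = FIT + RESIDUALS decomposition:

```python
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

a, b = least_squares(xs, ys)  # least_squares as defined in the earlier sketch
fit = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fit)]

# Each observation decomposes as y = fit + residual
print(residuals, sum(residuals))  # residuals sum to ~0 by construction
```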
Outliers and Influential Points
• Outlier: in regression, a point that lies far from
the fitted line, often producing a large residual
• Influential Point: a point whose removal would
markedly change the position of the regression line
• Extrapolation: the use of the regression line for
prediction outside the range of the explanatory
variable. This can produce nonsensical results
• Aggregation: Associations based on averaged
data
• Problem: A scatterplot of just the averages hides
much of the variability in the data
• In general, regression with aggregate data
overstates the strength of the association (larger r2)
Lurking variables
• A variable not among the explanatory or response
variables that influences the interpretation
• Solution: plot the residuals against time and other
variables that may influence the results
Common relationships between X and Y
(a) Association between X and Y (partially) due to
“X causes Y”
(b) Association between X and Y (partially)
explained by a “lurking variable” (Z)
(c) Association between X and Y is mixed up with,
and cannot be distinguished from, the effect of an
additional variable (Z)
Establishing causation
The best (and only?) method of clearly establishing
causation is to conduct a carefully-designed
randomized experiment that changes X, the
explanatory variable, and controls for the effects of
possible lurking variables
Establishing causation – the backup plan
• The association is strong
• The association is consistent across many studies
• Higher doses are associated with stronger
responses
• The alleged cause precedes the effect in time
• There is a plausible causal relationship
The ecological fallacy
• Sociologists: “ecology” = study of groups
• Data on group behavior is called ecological data
• Ecological fallacy: concluding (perhaps
incorrectly) that relationships holding for groups
necessarily hold for individuals in those groups
• Aggregate data + lurking variables = ecological
fallacy
• Aggregate data may be easier to obtain than data
on individuals, but such inferences are only weakly
supported, at best
2 types of logarithms
• Logarithms were invented to reduce
multiplication and division calculations to addition
and subtraction calculations before the time of
electronic calculators
• Basic principles of logarithm use:
Log(A × B) = Log A + Log B
Log(A^b) = b × Log A
• Two major types of logarithms
• Common (base 10) (usually called “log”)
• Natural (base e = 2.718...) (usually called “ln”)
• We will now consider log (base 10)
transformations
• Later we will start using ln (base e) in formulas
• If log(A) = c, then A = 10^c
• If ln(A) = c, then A = e^c
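A quick check of these identities with Python's math module:

```python
import math

A, B = 100.0, 1000.0
print(math.log10(A * B), math.log10(A) + math.log10(B))  # 5.0 5.0
print(math.log10(A ** 3), 3 * math.log10(A))             # 6.0 6.0

c = math.log(A)     # natural log (base e)
print(math.e ** c)  # ≈ 100.0: if ln(A) = c then A = e^c
```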
Checking regression assumptions
We usually check:
• Relation between X and Y is linear
• Residuals have constant SD
• Residuals have a normal distribution
(e.g. examine plots of the residuals)
If assumptions are not met:
• Pretend they are (“ostrich”)
• Consider more complex models
• Transform data to conform to assumptions
Nonlinear transformations
• Earlier in the course we discussed linear
transformations (y = a + bx)
• Here we consider nonlinear transformations
Nonlinear transformations can
1) alter the shape of distributions (to make skewed
distributions more symmetric)
2) change the form of the relationship between two
variables (to make it linear)
3) alter the residuals (to make them normal with
consistent SD)
Transformations
• Logarithmic transformation works very well for
some financial data and some biological data
(makes “exponential growth” data linear)
• When the relationship between Y and X is not
linear, consider transformations of the form Y^k and
X^k where:
k = ... −3, −2, −1, −½, log, ½, 1, 2, 3 ...
“Ladder of power transformations”
• A specific experimental condition (intervention)
is called a treatment
• An experiment imposes some “treatment” on
individuals in order to observe their responses
• An experiment allows us to control lurking
variables
• In principle, randomized controlled experiments
are the “gold-standard” of evidence to support
“causation”
• Experiments may not always be ethical or
practical
Principles of experimental design
• 1st Control – directly compare two or more
treatments – helps control effects of lurking
variables
• 2nd Randomization – use randomization to assign
individuals (experimental units) to treatments
• 3rd Replication – replicate each treatment on
many
individuals to reduce effect of chance variation
(Also called repetition)
Control group
• Control - 1st principle of experimental design
• In a “controlled experiment”, two or more groups
of individuals (subjects, experimental units) are
compared
.. Treatment group: subjects receive a specific
intervention
.. Control group: subjects do not receive the
specific
intervention and are compared to the treatment
group
• Controlled comparisons allow us to eliminate (or
reduce) effects of specific treatment assignments,
selection of subjects, placebo effects and potential
biases (systematic favoring of a certain outcome)
• If studies are uncontrolled, results may be
meaningless
Assignment of treatments
• The 2nd principle of experimental design
concerns assignment of subjects to treatments
• We want the treatment groups to be alike as
much as possible in every way (except for the
treatment) for a fair comparison
• We could do it by matching (e.g. by subject’s
age, sex, smoking), but matching is not enough
(unknown lurking variables cannot be matched)
• Instead, use chance to decide - randomization
• Assignment of treatments using randomization
helps ensure balance of known and unknown
factors in the treatment groups
Replication
• Randomization produces treatment groups that
are similar in all respects except treatment received
• Therefore differences in the response must be due
to either the treatments or the play of chance
• Replication of the treatments on many subjects
(large sample size) reduces the role of chance
variation
• Replication gives the experiment the power to
detect differences between the treatments
• A treatment effect so large that it would rarely
occur by chance is said to be “statistically
significant”
Placebo effects
• A placebo is a medically inert substance, such as
a sugar pill, used to replace medication in a clinical
research trial
• The placebo effect is a measurable, observable, or
felt improvement not attributable to a treatment
Blinding
• Blinding: comparison of treatments can be
distorted if subjects or persons administering or
evaluating treatment know which treatment is
being allocated – especially for subjective
endpoints
• Blinding avoids many sources of unconscious
biases
• Single-blind: subjects do not know which
treatment they have received
• Double-blind: neither subjects nor experimenters
know which treatments have been received
Population and sample
• Population: entire group of individuals on which
we desire information
• Sample: part of population on which we actually
collect data
• Sampling design: method used to choose sample
from population
• Census: survey of an entire population
• Why sample, instead of taking a census?
Time, expense, and sometimes sampling units are
changed by their measurement
Simple random sample (SRS)
• In a SRS of size n:
1) each individual in the population has an
equal chance of being chosen
2) every set of n individuals has an equal
chance of being the sample chosen
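A minimal sketch of drawing an SRS with Python's random module (hypothetical population of 100 labeled units):

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 units

# random.sample draws without replacement: every set of n = 10 units
# has the same chance of being the chosen sample
srs = random.sample(population, 10)
print(sorted(srs))
```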
Hite Report and non-response bias
• Sampling frame: the “list” of individuals from
whom the sample is selected
• The Hite Report is a classic example of
non-response bias: only a small fraction of the
questionnaires sent out were returned, and those
who chose to respond may differ in important ways
from those who did not
Drawback of simple random sampling
• Weakness of SRS: it does not use relevant
information about the population - such as a small
group of people who are poorer than the others - to
ensure the proper balance that pure random
sampling may miss
• A sampling method that uses this type of
information is called stratified random sampling
.. Individuals are divided into groups called strata
.. Often (but not always), a SRS is taken within
each stratum
• National surveys can be even more complicated,
using multistage sampling
Stratified random samples
Basic idea: sample important groups separately,
then combine these samples
1) Divide population into groups of similar
individuals, called strata
2) Choose a separate simple random sample
within each stratum
3) Combine these simple random samples to form
the full sample (in the correct proportions)
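A sketch of these three steps (hypothetical strata; each stratum is sampled at the same 10% rate here, so the combined sample keeps the correct proportions):

```python
import random

# 1) Divide the population into strata (hypothetical groups)
strata = {"urban": list(range(0, 80)), "rural": list(range(80, 100))}

# 2) Take a separate SRS within each stratum; 3) combine them
full_sample = []
for name, members in strata.items():
    n = len(members) // 10  # 10% of each stratum
    full_sample.extend(random.sample(members, n))

print(len(full_sample), full_sample)  # 8 urban + 2 rural = 10 units
```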
Multistage samples
One way to take a nationwide multistage sample:
Stage 1: Take a sample from the 3000 counties in
the US
Stage 2: Take a sample of townships within each
county chosen
Stage 3: Take a sample of city blocks (or census
blocks) within each township chosen
Stage 4: Take a sample of households within each
block
At each stage, take a simple random sample
Data
• Data can be produced in many ways:
1. Anecdotal information
2. Available data
3. Observational studies
4. Controlled experiments
5. Randomized controlled experiments
• Major differences in quality of information
produced and ultimately the reliability of
conclusions that can be drawn (lower on list is
better)
• Randomized controlled experiments provide by
far the most reliable information
Blocking in experimental designs
• Blocking: a block is a group of individuals
known to
be similar in some way that is thought likely to
influence the response variable
• In a “randomized block design”, randomization is
carried out separately within each block
• Example: matched-pairs design
.. Blocks consisting of two units
matched as closely as possible,
e.g., using identical twins
Observational studies (e.g. sample surveys)
versus experiments
• An observational study collects information from
individuals making no attempt to influence the
responses
• An experiment imposes an intervention (e.g.
treatment)
on individuals in order to observe their responses
• Sample surveys are a type of observational study
Block designs
• The device of pairing observations is a special
case of blocking
• A block is a portion of the experimental material
(e.g., the 2 shoes of one boy) that is expected to be
more homogeneous than the aggregate (all shoes of
all boys)
• By confining treatment comparisons within such
blocks, greater precision can be obtained
• In the paired design the block size is two, and we
compare two treatments A and B
Design of studies
Methods for producing data are called designs
Major elements of study design
1) Who or what is the object of study
(individuals)?
2) Will study be observational or experimental?
(if experimental - how will treatments be
assigned?)
3) How will the individuals be selected?
4) How many individuals will be studied?
5) What variables will be measured?
Choice of blocks
• Blocks should be chosen on the basis of the most
important (known) unavoidable source of variation
among the individuals (experimental units)
• Randomization then averages out the remaining
sources of variability to allow “unbiased” (i.e.,
un-confounded) estimation of treatment effects
• Blocks allow greater precision, because a source
of systematic variation is removed (reduced
variability) from the experimental comparison
Sampling distribution
• What would happen if an experiment (or a
sample) were repeated many times? (a “thought
experiment”)
• Take repeated samples of the same size from the
same population:
– 1st sample, calculate the statistic of interest
– 2nd sample, calculate the statistic of interest, and
so on ...
• The statistic will vary from sample to sample
• The sampling distribution of a statistic is the
distribution of values taken by the statistic in all
possible samples of the same size from the same
population
• The sampling distribution often has a predictable
pattern
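The “thought experiment” as a simulation in numpy (hypothetical population; we repeat the sampling many times and look at the distribution of the sample mean):

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.uniform(0, 100, size=100_000)  # hypothetical population

# Repeated samples of the same size n = 25; each sample yields one statistic
sample_means = [rng.choice(population, size=25).mean() for _ in range(5000)]

# The statistic varies from sample to sample in a predictable pattern
print(np.mean(sample_means))  # ≈ population mean (50)
print(np.std(sample_means))   # ≈ population SD / sqrt(25)
```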
Some terminology and concepts 
• Experimental units (e.g., individual subjects) are
the objects of the study
The major concept of statistical inference
• A sampling distribution characterizes the
behavior of a statistic
Cautions for sample surveys
1) Selection bias: some groups in population are
over or under-represented in sample
2) Non-response bias: non-respondents may differ
in important ways from respondents
3) Response bias: e.g., wording of questions,
telescoping in the recall of events
Parameters and statistics
Parameter: number that describes the population
Statistic: number that describes a sample
Statistical inference: use information from a
sample (a statistic) to make an inference about a
population (a population parameter)
Sample → Population
• A sampling distribution is inherently
unobservable, because there will (in almost all
cases) be only one survey, one experiment, one
observational study ...
• Probability theory provides tools for calculating
the theoretical form of a sampling distribution
• Understanding the behavior of a statistic under
(hypothetical) repeated samplings (the sampling
distribution) helps understand the precision and
reliability of the statistic
Bias and variability
• Two measures of the reliability of a statistic
.. Bias – the distance of the center of the sampling
distribution from the true parameter
.. Variability – the variance of the sampling
distribution
• Bias is often thought of as a measure of validity
of a study (e.g. reduced by using random sampling)
• Variability captures the spread in the sampling
distribution (e.g. reduced by increasing sample
size)
• Survey results come with a “margin of error”
(e.g. ±3%)
• If bias = 0 and variability is small, the values of a
statistic will be tightly clustered around the “truth”
Size doesn’t matter
• Population size doesn’t matter
• The variability of a statistic from a random
sample doesn’t depend on the size of the
population (provided the population is
substantially larger than sample)
• Important consequences for surveys: A SRS of
2500 from the more than 210 million adults in US
gives results as precise as a SRS of 2500 from the
665,000 inhabitants of San Francisco