Stat 139 - Unit 01
8/30/16
Unit 1: Data Collection
Section 1.1 & 1.2 in the Text
Unit 1 Outline
• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)
In God we trust. All others must bring data.
– W. Edwards Deming
Variables and their Measurements
• Variable: any characteristic that takes different values for different individuals in a sample or population
• Two major types of variables: categorical and quantitative
• A categorical variable is a variable that can take on a few different values (categories) when measured. Sometimes called qualitative variables.
• A quantitative variable is a variable that is measured on a numerical scale covering a large range of values.
• Examples?
Categorical:
Quantitative:
Categorical Variables
• Two types: Nominal and Ordinal
• Nominal variable: a categorical variable in which the categories are unordered.
• Ordinal variable: a categorical variable in which the categories have an order or hierarchy (and can possibly be numeric), but there is “no defined distance between levels on the measurement scale”
• Examples?
Nominal:
Ordinal:
Quantitative Variables
• Two types: Discrete and Continuous.
• Both are measured on an interval scale. That is, there is a specific numerical distance between any two measurements.
• Discrete variable: a quantitative variable that can only take on specific numbers, like 0, 1, 2, …
• Continuous variable: a quantitative variable that can take an infinite number of possibilities within a range of numbers
• Examples:
Discrete:
Continuous:
Dummy Variables
• There is a special type of “nominal” categorical variable called a dummy variable or indicator variable.
• These variables take on only 2 possible values: 0 or 1. The one usually stands for success or yes, while the zero usually stands for failure or no.
• By convention, they are usually named after the category that is a success.
• Example: to represent sex/gender, we could define a dummy variable named female, which would be 1 for all women, and 0 for all men:
\[ \text{female} = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male} \end{cases} \]
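The dummy-variable convention can be sketched in a few lines of Python. This is only an illustration: the data in `sexes` are made up, and the helper name `make_dummy` is hypothetical rather than something from the course.

```python
def make_dummy(values, success):
    """Return a 0/1 indicator list: 1 where the value equals `success`, else 0."""
    return [1 if v == success else 0 for v in values]

# Hypothetical sample of a nominal variable (made-up data for illustration).
sexes = ["female", "male", "female", "female", "male"]

# By convention the dummy is named after the "success" category.
female = make_dummy(sexes, "female")
print(female)  # [1, 0, 1, 1, 0]

# The mean of a dummy variable is the sample proportion of successes.
print(sum(female) / len(female))  # 0.6
```

Coding the variable as 0/1 is what lets a category be summarized with ordinary arithmetic, such as the proportion computed on the last line.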
Summary of types of Variables:
Variables
• Categorical
    • Nominal (more common)
        • Dummy (special case)
    • Ordinal (less common)
• Quantitative
    • Discrete
    • Continuous
Note: in this class (and most of statistics), the most important difference is that between categorical and quantitative variables. That differentiation will typically determine the type of statistics and analysis used. Nominal and ordinal variables are often treated the same. Same for discrete and continuous variables.
Collecting Data
• Data can be collected in many ways:
1. Anecdotal information
2. Available data
3. Observational studies
4. Randomized experiments
The further down the list you go, the more reliable the information is. And the conclusions you can draw will typically then be stronger.
Anecdotal evidence
• Anecdotal evidence is based on haphazardly selected individual cases that often come to our attention because they are striking (probably not representative)
• Example: Politicians often cite the case of a single individual to invoke a public response consistent with the politicians’ desire (a sample of size n = 1)
• “Ask for averages, not testimonials”
Available data
• Available data are data that were produced in the past for some other purpose but may help answer a present question
• Many use available data because producing new data is expensive (nearly always the most costly part of research).
• There are lots of reliable available datasets on the web rich with information. Some examples:
• http://www.census.gov/#
• http://www3.norc.org/gss+website/
• http://www.hcup-us.ahrq.gov/nisoverview.jsp
Observational Studies
• An observational study is one in which data are collected by merely observing the measurements on the individuals in the sample. No attempt is made to influence or intervene with the subjects.
• It may be difficult to reach causal conclusions (that changing one variable causes another variable to change) since other variables may be muddling up this relationship (called confounding).
• Example: Does smoking cigarettes increase your risk of heart disease?
• Example: Let’s come up with a survey of Harvard:
Observational Studies
Pros:
• Usually cheap
• The only option when a randomized experiment is not feasible or is unethical
• Showing causation is not always necessary
• Risk factors for medical decisions, population statistics.
• Risk factor (common in medicine and epidemiology): a variable associated with an increased risk of disease or infection.
• Examples?
Observational Studies
Cons:
• Establishing causation may be impossible due to the presence of
confounding variables.
• Requires advanced statistical methods and unverifiable
assumptions.
• Confounding variable (or factor), sometimes referred to as a
confounder or a lurking variable, affects both the group
membership and the outcome (or dependent) variable.
• This third variable can make the two variables appear to be causally related when they are not.
Confounding Variables
• Name confounding variables that may induce the following
associations:
• The association between the amount of serious crime
committed and the amount of ice cream sold by street
vendors.
• Drink More Diet Soda, Gain More Weight?: Overweight
Risk Soars 41% With Each Daily Can of Diet Soft Drink.
• Negative correlation between the size of one’s palm and one’s life expectancy.
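A small simulation can make the ice cream and crime example concrete. The sketch below assumes, purely for illustration, that daily temperature is the confounder and that ice cream sales have no direct effect on crime. All coefficients and noise levels are invented, and `pearson` is a hand-written helper.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(139)  # fixed seed so the run is reproducible

# Hypothetical model: temperature (the confounder) raises both variables;
# ice cream sales have NO direct effect on crime.
temp = [rng.uniform(0, 35) for _ in range(1000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temp]
crime = [0.5 * t + rng.gauss(0, 5) for t in temp]

r_all = pearson(ice_cream, crime)

# Hold the confounder roughly fixed: keep only days between 20 and 22 degrees.
idx = [i for i, t in enumerate(temp) if 20 <= t <= 22]
r_fixed = pearson([ice_cream[i] for i in idx], [crime[i] for i in idx])

print(f"all days:            r = {r_all:.2f}")
print(f"similar temperature: r = {r_fixed:.2f}")
```

Under these assumptions the correlation across all days should be clearly positive, while among days with nearly the same temperature it should be much weaker, which is the signature of an association driven by a third variable.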
Experiments
• An experiment is a study in which an investigator imposes an
intervention (e.g. treatment) on individuals in order to
observe their response.
• Clinical trials are a type of experiment
• An Example: A comparison of different drugs for women
with breast cancer, often with as few as 100 people.
• The experimenter chooses which women in the study receive the different levels of the drug (new therapy vs. old therapy). The levels of the drug are called the treatment.
• The outcome of the study may be the measured amount of
disease-free survival for each woman
Experiments: a few details
• There always have to be at least two treatment groups to compare. The ‘default’ condition is often called the control group (standard-of-care in clinical trials).
• The control group may receive a placebo treatment. This is a treatment that looks like the active treatment (classic ex: a ‘sugar pill’)
• The subjects should be randomized to the treatment groups. That is, chance should decide which patients receive which treatments
• This ensures that, on average, all other variables are balanced across the treatment groups
• To achieve this balance in practice, the study needs enough replication (enough subjects in each group).
Experiments
• An experiment is the best (only?) way to determine if one variable (the treatment) causes another variable (the outcome) to vary.
• However, experiments are not always ethical or feasible. You cannot knowingly do harm to human subjects by forcing them to take a dangerous treatment (ex: forcing them to smoke)
• Experiments may not mimic real life (the conditions in which an experiment is run are often too ‘perfect’ or unrealistic). So there is often some loss of generalizability to the real world.
• They are also the most expensive way to collect data
Confounding Variables and Randomization
• Suppose we would like to compare two methods of teaching introductory statistics.
• At Harvard, one professor uses a standard lecturing set-up in his class and another professor uses an interactive clicker approach in her class.
• Students in the two classes are given achievement tests to see how well they learned the material.
• Confounding variables?
• Better experiment?
Population vs. Sample
• Population: entire group of individuals on which we desire information.
• Technicality: actual vs. conceptual populations
• For our Harvard study:
• Sample: a part of the population on which we actually collect data.
• For our Harvard study:
Parameters and Statistics
• Parameter (often called an estimand): a numerical summary of the population (like µ or p).
• For our Harvard study:
• Statistic: a numerical summary of the sample data.
• For our Harvard study:
• Estimator: a statistic used as a guess for the value of the estimand (x̄ or p̂).
• Estimate: a particular realization of the estimator (4/12 = 0.33).
Selection of Study Units
Study/experimental unit/subject: one member of a set of entities being studied.
Two extremes of a selection mechanism:
• Self-selection (volunteers, haphazard)
• Random sampling
Analysis, Estimates, & Inference
Purpose: describe population characteristics.
Parameter vs. Estimate
Parameter (also, estimand): proportion of childless households in the population.
Estimate: proportion of childless households in the sample.
[Figure legend: childless household; household with children under 18]
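The distinction between a parameter and an estimate can be illustrated with a short Python sketch. The population below is hypothetical: the 10,000 households and the 30% childless figure are invented for illustration, and the sample size of 12 simply echoes the 4/12 example.

```python
import random

rng = random.Random(2016)  # fixed seed for reproducibility

# Hypothetical population: 1 = childless household, 0 = household with children under 18.
population = [1] * 3000 + [0] * 7000

# Parameter (estimand): the proportion of childless households in the population.
p = sum(population) / len(population)

# Draw a simple random sample of n = 12 households without replacement.
sample = rng.sample(population, 12)

# Estimate: the realized value of the estimator p-hat on this particular sample.
p_hat = sum(sample) / len(sample)

print(f"parameter p    = {p:.2f}")
print(f"estimate p-hat = {p_hat:.2f}")
```

The parameter is fixed once the population is fixed, whereas the estimate changes from sample to sample. Changing the seed gives a different `p_hat` but the same `p`.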
Entire population = all possible units
Target Population: a collection of units a researcher is interested in; a group about which the researcher wishes to draw conclusions.
Census: sample everybody in target population
Census
Pros:
• In principle, no need to use statistical inference
Cons:
• Expensive
• Long and difficult
• In practice, never perfect: respondents are often not representative of target population!
Respondents: sampling units whose data were actually obtained.
Sampling Frame: collection of units that are potential members of the sample.
• Overcoverage
• Undercoverage
Sample: a [randomly selected] subset of a sampling frame
Sampling Steps
1. Population
2. Target population
3. Sampling frame
4. Sample
5. Respondents
Random Sampling
• Ensures that all subpopulations in the overall population are roughly represented in the sample.
• Simple Random Sampling (SRS) – every subset of n units has an equal chance of being selected
• Pick size n (may use power analysis)
• Enumerate all units
• Pick n numbers randomly
Selecting a Sample from a Sampling Frame
• What is the simplest way of collecting a random sample?
• Small example: selection of a 3-member advisory committee at random from the 11 faculty members of the Stat Dept.
• What is the population? What is the sample?
• What’s the chance that any one specific member is selected for the committee?
• (Stat 110 question): How many different 3-person committees can be formed?
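One way to check the counting in the committee example is with a few lines of Python. This is a sketch only: the faculty names are placeholders, and the membership probability is computed under the SRS assumption that every 3-person committee is equally likely.

```python
import math
import random

faculty = [f"Faculty {i}" for i in range(1, 12)]  # 11 placeholder names
n = 3

# Number of distinct 3-person committees: "11 choose 3".
n_committees = math.comb(len(faculty), n)
print(n_committees)  # 165

# A specific member sits on C(10, 2) of those committees, so under SRS
# the chance of being selected is C(10, 2) / C(11, 3) = 3 / 11.
p_member = math.comb(len(faculty) - 1, n - 1) / n_committees
print(round(p_member, 4))  # 0.2727

# One simple random sample of 3 members (unseeded, so it varies between runs).
print(random.sample(faculty, n))
```

`random.sample` draws without replacement, which matches picking a committee of distinct people.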
Random Sampling
• Systematic Random Sampling - select every kth unit from the ordered sampling frame, starting randomly from one of the first k positions.
• Easier to administer
• Requires well-mixed population
• Variable probability sampling – allow units to have unequal probabilities of being sampled.
• Requires more careful analysis that involves weighting.
• Example: Stratified Sampling – split the population into homogeneous subpopulations and use SRS (or another method) within a sampling frame of each subpopulation.
Simple random sample example
• If we were to draw a simple random sample of n = 60 students from all Harvard undergrads, we could:
1) Write out the sampling frame: the list of all individuals in the population.
2) Assign each of the N members a number from 1 to N.
3) Use a random numbers table or software to generate random numbers:
So if N is a 4-digit number, then we could just generate random 4-digit numbers and choose the individuals based on those numbers, skipping any number that matches no individual or has already been drawn.
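The simple random sample and the systematic sample can both be sketched in Python with software-generated random numbers. The frame size `N = 6700` is an assumed value chosen only for illustration, not the actual enrollment, and the frame is just the numbers 1 to N.

```python
import random

rng = random.Random(36)  # fixed seed so the example is reproducible

N = 6700   # assumed frame size (hypothetical)
n = 60     # sample size from the example

# Sampling frame: individuals numbered 1..N.
frame = list(range(1, N + 1))

# Simple random sample: every subset of n units is equally likely.
srs = sorted(rng.sample(frame, n))

# Systematic sample: random start among the first k positions, then every kth unit.
k = N // n                   # sampling interval
start = rng.randrange(k)     # 0-based position within the first k units
systematic = frame[start::k][:n]

print(srs[:5])
print(systematic[:5])
```

The systematic sample needs only one random number (the start), which is why it is easier to administer. However, it relies on the frame order being unrelated to the variable of interest.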
Stratified random samples
Basic idea: sample important groups separately, then combine these samples
1) Divide population into groups of similar individuals, called strata
2) Choose a separate simple random sample within each stratum
3) Combine the results of the simple random samples together to form the overall statistic, weighting each separate stratum correctly to mimic the population
Stratified Sample Example
• We could perform a stratified sample at Harvard by randomly selecting 5 individuals from each house, and then combining them into one sample of n = 60.
What’s an advantage of this stratified sample compared to the SRS?
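A stratified sample along these lines might be drawn as in the following Python sketch. The frame is hypothetical: 12 strata (implied by 5 per house and n = 60) of 400 placeholder students each. The helper `stratified_mean` and the numbers passed to it are likewise invented to illustrate the weighting step.

```python
import random

rng = random.Random(37)  # fixed seed for reproducibility

# Hypothetical frame: 12 strata ("houses") of 400 placeholder students each.
strata = {f"House {h}": [f"H{h}-S{i}" for i in range(400)] for h in range(1, 13)}
per_stratum = 5

# Steps 1 and 2: a separate simple random sample within each stratum.
sample = {house: rng.sample(students, per_stratum)
          for house, students in strata.items()}

# Step 3a: combine the stratum samples into one sample.
combined = [s for chosen in sample.values() for s in chosen]
print(len(combined))  # 60

# Step 3b: combine stratum means, weighting each by its share of the population.
def stratified_mean(stratum_means, stratum_sizes):
    total = sum(stratum_sizes)
    return sum(m * size / total for m, size in zip(stratum_means, stratum_sizes))

# Made-up stratum means and sizes, just to show the weighting.
print(stratified_mean([7.0, 6.0], [300, 100]))  # 6.75
```

When every stratum has the same size and the same number sampled, the weights are equal and the weighted mean reduces to the plain sample mean. The weighting matters once strata differ in size.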
Assignment to Groups
[Diagram: Random sampling, followed by Assignment to groups/treatments]
If the assignment mechanism is random, the expected proportion of childless couples in each (treatment) group is the same.
Two extremes of an assignment mechanism:
• Haphazard (or unknown)
• Random
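The claim about expected proportions under random assignment can be checked with a small simulation. In this Python sketch the study units are hypothetical (40 childless couples out of 100, an invented split) and `randomize` is an illustrative helper that splits the units into two equal groups completely at random.

```python
import random

rng = random.Random(40)  # fixed seed for reproducibility

# Hypothetical study units: 1 = childless couple, 0 = otherwise.
units = [1] * 40 + [0] * 60

def randomize(units, rng):
    """Completely randomize units into two equal-sized groups."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Repeat the random assignment many times and average the group proportions.
reps = 5000
avg_treat = avg_ctrl = 0.0
for _ in range(reps):
    treat, ctrl = randomize(units, rng)
    avg_treat += sum(treat) / len(treat) / reps
    avg_ctrl += sum(ctrl) / len(ctrl) / reps

print(f"average proportion, treatment: {avg_treat:.3f}")
print(f"average proportion, control:   {avg_ctrl:.3f}")
```

Any single assignment can be somewhat unbalanced. It is the average over repeated random assignments that should sit close to the overall proportion (0.40 here) in both groups.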
Assignment to Groups
• Complete Randomization (parallel to SRS).
• Stratified or clustered Randomization (parallel to stratified sampling)
• Ensures that representatives from all strata are present in each treatment group.
• How not to randomize:
Inferences Permitted by Study Design
[Diagram: inferences permitted if sampled randomly; if groups are randomly assigned; if sampled randomly AND groups are randomly assigned]
Study's generalizability
• Internal validity is the validity of (causal) inferences in a scientific study.
• Should be established first.
• It is low when there are
• unaccounted-for confounding factors;
• ignored missing data;
• noncompliance;
• unverified assumptions;
• suboptimal method of analysis.
• A study that readily allows its findings to generalize to the population at large has high external validity.
Observational studies
• Difficult to draw inferences about population
• Difficult to draw causal inferences
Concepts to review:
(some will be covered in Units 2 & 3)
• Outcomes, events, probability
• Random variables (r.v.)
• Probability Distribution of r.v.
• Indicator variables
• Bernoulli and Binomial Distribution
• Normal (Gaussian) distribution
• Mean, variance, standard deviation (SD)
• Histogram
• Sample mean, sample variance, sample SD
• Law of Large Numbers
• Central Limit Theorem