Welcome to Statistics 111
Alex Braunstein
The goal of this course is to develop basic tools for data analysis, probability,
and statistical methods. Key topics covered in the course include exploratory
data analysis, regression, probability, estimation, and hypothesis testing.
Stat 111 - Lecture 1 - Introduction
Syllabus notes: website
• All handouts will be available on the website:
http://stat.wharton.upenn.edu/~braunsf/stat111.html
• Website also contains my contact information
• Link on website for getting Wharton class account if
you are not a Wharton student
• Helpful if you want to use Wharton computer labs
Syllabus notes: Homeworks
• Homeworks will be handed out at the beginning of
every week
• ~ 5 homeworks in all
• Homeworks will be submitted at the beginning of
class on Mondays
• You are encouraged to work together on homework, but
homeworks are to be completed separately and handed in
individually.
• Do not copy from another person.
• No late homeworks will be accepted!!
• Late homeworks will get a score of zero, without exception
• Your lowest homework grade is not included in final grade
Syllabus Notes: Midterm Exam
• Midterm is held on following date:
Monday, June 15th (in class)
• No makeup midterm examination!
• A missing midterm exam counts as a zero score
• Consider taking this class in the fall or spring
if you can not attend the midterm!
Student Questionnaire
• Fill out a questionnaire and hand it in before
the break
• I will try to incorporate some of the subjects
that interest you into future lectures
Course Overview
• Collecting Data
• Exploring Data
• Probability Intro.
• Inference
– Comparing Variables: Means, Proportions
– Relationships between Variables: Regression, Contingency Tables
Out in public: You do statistics ?!?
• I hated that class in college!
• That was the most boring class ever!
• Lame.
Big Picture Ideas
• Statistics is all about uncertainty
• Focus as much on what we don’t know (or haven’t
observed) as on what we know
• Formulating the question that we want to
answer is often the most difficult part
• Statistics is part mathematics, part roll-up-your-sleeves-and-get-thinking.
Science and Skepticism
• We always need to be cautious about
conclusions based on data
• Possible sources of bias and confounding?
• How might things have gone wrong?
• A little bit of skepticism is a good thing!
Statistical Modeling
• Inference: using mathematical models of
uncertainty to answer questions
– Connect probability concepts to our data
– Cannot make claims without using models
and making assumptions
• Are the assumptions reasonable?
After the break
• Collecting Data: Design of Experiments
• Sections 3.1-3.2 in Moore, McCabe and Craig
• First couple of classes will not involve much
math at all, but we will get into lots of data
analysis after that!
Break!
• Hand in questionnaire
• 5 minutes
Outline for Second Half of Lecture
• Introduction to Experiments
• Sources of Bias in Experiments
• Techniques for Avoiding Bias
– Matching
– Randomization
– Block Designs
– Blinding and Double-Blinding
• Experiments vs. Observational Studies
• Association vs. Causation
Stat 111 - Lecture 2 - Experiments
Experiments
• Used to address a specific question
• Often used to examine causal effects
• E.g., medical trials, education interventions
[Diagram: experimental units are drawn from the population and split into a treatment group (treatment, then result) and a control group (no treatment, then result)]
• Can we just look at difference in results to get the
causal effect of the treatment?
• Depends on whether the experiment was done well
• many possible sources of bias in design of experiments
Sources of Bias
• An experiment or study is biased if it systematically
favors a particular outcome
1. Subjects are not representative of the population
2. Treatment and control groups are inherently
different on some lurking or confounding variable
3. Subjects are influenced by knowing they are in
treatment or control groups
4. Evaluator of outcomes is influenced by knowing
they are in treatment or control groups
[Diagram: the same schematic of population, experimental units, treatment group and control group as on the previous slide]
Bias 1: Non-representative units
• If your subjects are not representative of the
population, you won’t be able to generalize the results
even if the experiment is well done
• Here is an example:
• Treatment group: High Level NICUs
• Control Group: Low Level NICUs
• Problem: classification of NICU is different from
state to state, so a hospital that might qualify as a
high level NICU in one state might not in another
• Observed differences between the groups can not
be generalized from one state to another
Bias 2: Confounding/Lurking Variables
• Treatment group and control group are different on some
variable that also influences the outcome
• A confounding variable means that we can’t attribute
difference in outcomes to just the treatment
– Part of the difference may be due to the confounding variable not
the treatment
• Simple example: a breast cancer drug trial where only
women receive the treatment and only men receive the
control
• Gender becomes a confounding variable
• Are treatment vs control outcomes different due to the
treatment or gender differences between groups?
Bias 3: Subject knows treatment assignment
• A subject’s outcome is influenced by knowing that
he/she is in a treatment or control group
• E.g., drug trials: patients improve just because they think they are
receiving the drug
• Solution: blinded experiment with placebo
• Placebo appears to be the treatment, so all subjects
(treatment and control) don’t know their true treatment
assignment
• Subjects in the control group may still improve slightly; this is often
called “the placebo effect”
Bias 4: Evaluator knows treatment assignment
• Person evaluating outcome (e.g., doctor in drug trial) may
also be influenced by knowing who receives treatment
• Not a problem if outcome is something indisputable,
such as death!
• This is a problem for more subjective measures like pain
reduction or results from social programs
• Solution: double-blinded experiment where neither
subjects nor evaluators know treatment assignments
Association vs Causation
• In the presence of a confounding variable, we can only
conclude there is an association between treatment and
outcome, not causation
Examples: “Reporters are stupid”
• Children who watch many hours of TV get lower
grades in school on average than those who
watch less TV
• Does this mean that TV causes poor grades?
• What are potential confounding variables?
• People who use artificial sweeteners in place of
sugar tend to be heavier than people who use
sugar
• Does this mean that sweeteners cause weight gain?
• What is probably happening here?
One solution: Matching
• Make sure that treatment and control groups are very
similar on observed variables like race, gender, age etc.
• Block designs: divide subjects into blocks with similar observed
variables before dividing them into treatment vs control
• Special case: Matched Pairs
• Subjects are matched up into pairs, then one
member of each pair gets treatment and the
other gets control
• Example: Dandruff experiment
• treatment applied to one side and control
to other side of head
• No reason to expect difference
in sides except for treatment
Another Solution: Randomization
• Problem with matching is that you usually cannot match
on unobserved characteristics (e.g., genetics)
• E.g., cholesterol drug trial: can’t match treatment and control
groups on genetic predisposition for high cholesterol
• Randomly assign subjects to treatment or control
• Random assignment should lead to groups that are similar
or balanced on both observed and unobserved
confounding variables
• Example: student questionnaire earlier in class - each
form you filled out was randomly assigned either a 1 or 2
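As a minimal sketch of what random assignment looks like in practice, the Python snippet below shuffles a list of subjects and splits it in half. The function name `randomize`, the 20 subject IDs, and the seed are all hypothetical choices made for illustration; they are not taken from the in-class questionnaire.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into treatment and control groups.

    Illustrative helper (hypothetical name): shuffles a copy of the
    subject list and cuts it in half.
    """
    rng = random.Random(seed)      # seeded generator so the split is reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical example: 20 subjects labeled 1..20
treatment, control = randomize(range(1, 21), seed=111)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because the split depends only on the shuffle, no characteristic of a subject (observed or not) can influence which group they land in.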
Randomization of In-Class Survey
• Check to see if groups are balanced:
Variable                      Treatment    Control
Average Height
Average Shoe Size
Average Number of Siblings
• There are differences, but are they “significant”?
• Later on in the course, we will be able to answer questions like this
• Of course, we can’t check the balance for unobserved
variables…we just have to trust the randomization process
• This is why good science needs to be replicable
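A balance check like the one in the table above can be sketched in a few lines of Python. The data here are invented for illustration (the actual class averages are not in this transcript), and `balance_table` is a hypothetical helper name.

```python
from statistics import mean

def balance_table(treatment, control, variables):
    """Return {variable: (treatment mean, control mean, difference)}."""
    rows = {}
    for var in variables:
        t = mean(subject[var] for subject in treatment)
        c = mean(subject[var] for subject in control)
        rows[var] = (t, c, t - c)
    return rows

# Made-up survey responses, NOT the real in-class data
treatment = [{"height": 70, "siblings": 1}, {"height": 66, "siblings": 2}]
control = [{"height": 68, "siblings": 0}, {"height": 64, "siblings": 2}]

table = balance_table(treatment, control, ["height", "siblings"])
for var, (t, c, diff) in table.items():
    print(f"{var:10s} treatment={t:6.2f} control={c:6.2f} diff={diff:+.2f}")
```

Small differences between the columns are expected by chance; whether a given difference is "significant" is the question the course returns to later.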
Even Better: Randomization + Matching
• Randomization generally leads to treatment and control
groups that are evenly balanced but you can still get
unlucky and get unbalanced groups
• Example: randomly placing 20 people (10 males, 10
females) into treatment and control groups.
• How many males will end up in treatment group?
• Ideally, we would have 5 males in treatment group,
and 5 males in control group (balanced)
• However, there is a chance to get 9 males in treatment
and 1 male in control group (unbalanced)
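The chance of each split can be worked out by counting. The sketch below assumes the 20 people are divided into two groups of 10 (the slide does not state the group sizes), in which case the number of males in the treatment group follows a hypergeometric distribution. The function name is hypothetical.

```python
from math import comb

def prob_males_in_treatment(k, males=10, females=10, group_size=10):
    """P(exactly k males in the treatment group) under simple randomization.

    Assumes group_size people are picked at random for treatment
    from males + females people in total.
    """
    ways_with_k_males = comb(males, k) * comb(females, group_size - k)
    all_ways = comb(males + females, group_size)
    return ways_with_k_males / all_ways

p_balanced = prob_males_in_treatment(5)    # 63504 / 184756, about 0.344
p_unlucky = prob_males_in_treatment(9)     # 100 / 184756, about 0.0005
print(f"P(5 males in treatment) = {p_balanced:.4f}")
print(f"P(9 males in treatment) = {p_unlucky:.6f}")
```

Under this assumption the perfectly balanced split happens only about a third of the time, so some imbalance is the norm, while the extreme 9-to-1 split is rare but possible.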
Even Better: Randomization + Matching
• Randomized Blocks: randomize within blocks of
observed variables
• Example:
• Divide up subjects into males and females first, then randomly
assign treatment or control to subjects in each group separately
• Guarantees that equal number of males end up in treatment group
and control group (same with females)
• Randomized Matched Pairs: randomly decide which
member of each pair gets treatment vs. control
• Example:
• For each head in dandruff experiment, randomly assign which
side of head to get dandruff shampoo vs. control
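A randomized block design can be sketched as: group subjects by the blocking variable, then randomize within each group. The Python below is one possible illustration, with hypothetical names (`randomized_blocks`, `block_of`) and made-up subjects; it is not code from the course.

```python
import random
from collections import defaultdict

def randomized_blocks(subjects, block_of, seed=None):
    """Randomly assign treatment/control separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    treatment, control = [], []
    for members in blocks.values():
        rng.shuffle(members)            # randomize only within this block
        half = len(members) // 2
        treatment += members[:half]
        control += members[half:]
    return treatment, control

# Hypothetical subjects: 10 males and 10 females, blocked on gender
subjects = [("M", i) for i in range(10)] + [("F", i) for i in range(10)]
treatment, control = randomized_blocks(subjects, block_of=lambda s: s[0], seed=111)

males_in_treatment = sum(1 for s in treatment if s[0] == "M")
males_in_control = sum(1 for s in control if s[0] == "M")
print(males_in_treatment, males_in_control)
```

With even-sized blocks, every run gives exactly 5 males and 5 females per group, whatever the seed; only which individuals land in each group is random.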
Experiments vs. Observational Studies
• Often, we want the causal effect of some treatment,
but our data are from an observational study
• Observational studies examine effects of some variable but
without the advantages of a controlled experiment
• No treatment is assigned by the researcher in observational studies
• Example: health effects of smoking
• Unethical to randomly impose a treatment
• Could there be some confounding variable that explains
health differences between smokers and non-smokers?
• Very risky to make causal statements from
observational data, since we can not avoid bias!
Health Effects of Chocolate
• Report to European Society of Sexual
Medicine:
• 153 Italian women filled out sexual function
questionnaires
• “intriguing correlation”: sexual function/desire
significantly greater among chocolate-eaters
• Observational study: association does not imply
causation!
• Confounding: average age is 35 among frequent
chocolate-eaters, compared with 40.4 in non-chocolate group
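To see how an age difference alone could produce such a correlation, here is a small simulation sketch in Python. Everything in it is invented for illustration: the age range, the chocolate-eating probabilities, and the score formula are assumptions, not the study's data. Chocolate is given no effect on the score at all.

```python
import random
from statistics import mean

rng = random.Random(111)
people = []
for _ in range(10_000):
    age = rng.uniform(20, 60)
    # Assumption: younger people are more likely to eat chocolate
    eats_chocolate = rng.random() < (0.8 if age < 40 else 0.2)
    # Assumption: the score declines with age; chocolate plays no role
    score = 100 - age + rng.gauss(0, 5)
    people.append((age, eats_chocolate, score))

def gap(group):
    """Mean score of chocolate-eaters minus mean score of non-eaters."""
    eaters = [score for _, eats, score in group if eats]
    others = [score for _, eats, score in group if not eats]
    return mean(eaters) - mean(others)

overall_gap = gap(people)                                  # confounded by age
young_gap = gap([p for p in people if p[0] < 40])          # comparing within one age band
print(f"Overall gap:         {overall_gap:.2f}")
print(f"Gap among under-40s: {young_gap:.2f}")
```

In this setup the overall gap is large (around 12 points in expectation) even though chocolate does nothing, while the gap within a single age band is close to zero. That is the signature of a confounding variable.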
Next Class - Lecture 2
• Collecting Data:
– Surveys and Sampling
– Graphical summaries of a single variable
• Moore, McCabe and Craig: Sections 3.3
and 1.1