Stat 651
Lecture 4
Copyright (c) Bani Mallick
Topics in Lecture #4

Probability

The bell-shaped (normal) curve


Normal probability plots (the q-q plot) to
check for normality of continuous data
Use of Table 1 in the back of the book
Topics in Lecture #4

Normal probability calculations

Data Transformations

Sampling distributions: sample means are
random variables!

Standard error of the sample mean

Central Limit Theorem

A simple confidence interval
Book Sections Covered in Lecture #4

Chapter 4.10, in detail

Chapter 4.11 (read on your own)

Chapter 4.12, in detail

Chapter 5.1

Chapter 5.2
Lecture 3 Review

Box plots are probably the best way to
compare populations graphically

You can detect shifts and changes in variation

Also identifies outliers
Lecture 3 Review
q-q plots are a simple way to understand
whether the data are approximately bell-shaped
[Figure: Population Relative Frequency Histogram, a bell-shaped curve. Vertical axis: Normal Density; horizontal axis: X, from -4 to 4.]
Lecture 3 Review

q-q plots are a simple way to understand
whether the data are approximately bell-shaped

If the points fall roughly along a straight line,
then the population relative frequency histogram
is not too far from normal
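As a rough sketch of what a q-q plot computes, the code below pairs each sorted observation with the value a fitted normal curve would expect at that rank, then measures how straight the pairs are with a correlation. The plotting positions (i - 0.5)/n, the helper names, and the simulated data are assumptions made for illustration; statistical packages may use slightly different conventions.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (observed, expected normal value) pairs for a normal q-q plot."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5)/n are one common convention, assumed here.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, expected))

def straightness(points):
    """Pearson correlation between observed and expected values (1 = perfectly straight)."""
    obs = [p[0] for p in points]
    exp = [p[1] for p in points]
    mo, me = mean(obs), mean(exp)
    num = sum((o - mo) * (e - me) for o, e in points)
    den = (sum((o - mo) ** 2 for o in obs) * sum((e - me) ** 2 for e in exp)) ** 0.5
    return num / den

rng = random.Random(0)
bell = [rng.gauss(3.0, 0.5) for _ in range(200)]      # simulated bell-shaped data
skewed = [rng.expovariate(1.0) for _ in range(200)]   # simulated skewed data

r_bell = straightness(qq_points(bell))
r_skewed = straightness(qq_points(skewed))
print(f"bell-shaped data: r = {r_bell:.3f}")
print(f"skewed data:      r = {r_skewed:.3f}")
```

The bell-shaped sample should give a correlation very close to 1, while the skewed sample bends away from the line and gives a visibly smaller value.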
q-q plot for the healthy women
[Figure: Normal Q-Q Plot of Log(Saturated Fat). Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]
Lecture 3 Review

For bell-shaped populations, we have
empirical rules

Approximately 68% (90%) (95%) of the
population lies within 1 (1.645) (1.96)
population standard deviations σ of the
population mean μ
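The three percentages in the empirical rule can be checked directly from the standard normal curve. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `within` is made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Fraction of a normal population within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within(k) for k in (1, 1.645, 1.96)}
for k, frac in coverage.items():
    print(f"within {k} standard deviations: {frac:.4f}")
```

The same fractions hold for any normal population, because measuring distance in standard deviations from the mean is exactly what the z-score does.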
Lecture 3 Review


In many of our examples, we have seen that
there look to be differences among
populations. How can we tell if the
differences are real?
We will say that populations are different if
the differences we observe are more than can
be expected by sample-to-sample variability.
Lecture 3 Review



A random variable is any outcome
(qualitative or numerical) of an experiment
involving random sampling from a population
The idea of a model is to write down a
formula for the population histogram as a
function of 1-2 parameters which are
estimated from the data.
If you know the parameters of the model,
then you know everything about probabilities
in that population
Using the Normal Model




The entire point of the normal model is to
make probability statements
In practice, we estimate the population mean
μ by the sample mean
We estimate the population standard
deviation σ by the sample standard deviation
Then we estimate probabilities by pretending
that the sample quantities equal the population ones
Various Cases


Suppose we want to know what % of a
population lies below a specified value, c
We write this by asking: what is
Pr(X < c)


The value c is any arbitrary value, e.g., 6
X is any random variable with a population
mean μ and a population standard deviation σ
Pr(X < c) for Normal Populations

Compute the z-score: z = (c - μ)/σ

Look up value in Table 1, page 1091

(white board explanation)
Mechanics

NHANES: suppose healthy women’s ages are
normally distributed with mean μ = 40 and
standard deviation σ = 6

What is the chance that a randomly selected
person from this population is aged c = 43.3
or less?

We write this in symbols as pr(X < 43.3)
Mechanics

μ = 40, σ = 6

pr(X < 43.3) is what we want

z = (43.3 - μ)/σ = 0.55 = z-score

Look up in Table 1:

The value 0.55 is on page 1092: first column
is 0.5, first row is 0.05: add them to get
0.55, and look up the value

Pr(X < 43.3) = 0.7088
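The table lookup can be mimicked in code. This is a minimal sketch that assumes the normal model with the mean and standard deviation given on the slide; `NormalDist().cdf` plays the role of Table 1, and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 40, 6      # population mean and standard deviation from the slide
c = 43.3               # the age we are asking about

z = (c - mu) / sigma                        # z-score, 0.55 up to float rounding
p_table = NormalDist().cdf(round(z, 2))     # what Table 1 reports for z = 0.55
p_direct = NormalDist(mu, sigma).cdf(c)     # same probability without standardizing

print(f"z = {z:.2f}")
print(f"Pr(X < {c}) from the z-score: {p_table:.4f}")
print(f"Pr(X < {c}) computed directly: {p_direct:.4f}")
```

Standardizing first and asking the normal curve for X directly give the same answer; the z-score exists so that a single table can serve every normal population.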
Various Cases


Suppose we want to know what % of a
population lies above a specified value, c
We write this by asking: what is
Pr(X > c)


The value c is any arbitrary value, e.g., 6
X is any random variable with a population
mean μ and a population standard deviation σ
Pr(X > c) for Normal Populations

This is simply 1 – Pr(X <= c).

Compute the z-score (c - μ)/σ

Look up the value for z in Table 1

Subtract this value from 1.0
Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from
this population is aged 46 or more

pr(X > 46)

z = (46 - μ)/σ = 1

Look up in Table 1 for 1.00: get 0.8413

Because you are asking for > 46, subtract
from 1 to get pr(X > 46) = 1 – 0.8413 =
.1587
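The "subtract from 1" step for an upper-tail probability looks like this in code. It is a sketch under the same assumed normal model for ages, with `NormalDist().cdf` standing in for Table 1.

```python
from statistics import NormalDist

mu, sigma = 40, 6          # population values assumed on the slide
c = 46

z = (c - mu) / sigma                 # z-score, equals 1
p_below = NormalDist().cdf(z)        # Table 1 value, about 0.8413
p_above = 1 - p_below                # upper tail, about 0.1587

print(f"z = {z:.2f}, Pr(X <= {c}) = {p_below:.4f}, Pr(X > {c}) = {p_above:.4f}")
```

For a continuous variable such as age under the normal model, Pr(X > 46) and Pr(X >= 46) are the same number, so "46 or more" and "more than 46" need no separate treatment.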
Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from
this population is aged 46 or less

pr(X <= 46)

z = (46 - μ)/σ = 1

Look up in Table 1: chance is 84.13%
Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from
this population is aged 34 or less

pr(X <= 34)

z = (34 - μ)/σ = -1

Look up in Table 1: chance is 0.1587 =
15.87%
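A negative z-score needs no special handling, and the symmetry of the bell curve ties the last three calculations together. The sketch below again assumes the normal model for ages; the variable names are illustrative.

```python
from statistics import NormalDist

ages = NormalDist(mu=40, sigma=6)   # population values assumed on the slides

p_34_or_less = ages.cdf(34)                  # z = -1, lower tail
p_46_or_more = 1 - ages.cdf(46)              # z = +1, upper tail
p_between = ages.cdf(46) - ages.cdf(34)      # within one standard deviation

print(f"Pr(X <= 34)      = {p_34_or_less:.4f}")
print(f"Pr(X >= 46)      = {p_46_or_more:.4f}")
print(f"Pr(34 < X <= 46) = {p_between:.4f}")
```

The two tails match because 34 and 46 are each one standard deviation from the mean, and what is left in the middle is the 68% of the empirical rule.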
Aortic Stenosis Data

Two populations: healthy kids and kids with
aortic stenosis

Two outcomes: body surface area (BSA) and aortic
valve area (AVA)

Size-adjusted aortic valve area is the ratio of
aortic valve area to body surface area
[Figure: box plots of the AVA to BSA ratio by Health Status: Healthy (N = 70) and Stenotic (N = 56).]
Stenosis Data, AVA to BSA Ratio: Note the huge outlier in the stenotic kids.
He/she has a huge aortic valve area relative to his/her body size
Aortic Stenosis Data

Healthy kids and AVA/BSA Ratio

Sample mean = 1.38, s = 0.51

Let’s pretend the population has μ = 1.4, σ
= 0.5

As it turns out, the sample mean of stenotic
kids is 0.7

So, let’s ask: for healthy kids, what is
pr(X < 0.7)?
Aortic Stenosis Data

Healthy kids and AVA/BSA Ratio

μ = 1.4, σ = 0.5

For healthy kids, what is pr(X <= 0.7)?

z = (0.7 - μ)/σ = -1.4

look up in Table 1

You should get 0.0808
Aortic Stenosis Data

For healthy kids, pr(X <= 0.7) = 0.0808

Stenotic kids have a mean ava/bsa ratio of
0.7

Thus, the average stenotic kid has a lower
ava/bsa ratio than 91.92% of healthy kids

91.92% = 100% - 8.08%
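The stenosis comparison can be reproduced in a few lines. This sketch uses the pretend population values for healthy kids from the slides and treats the AVA/BSA ratio as normally distributed, which is an assumption of the model rather than something the data guarantee.

```python
from statistics import NormalDist

mu, sigma = 1.4, 0.5        # pretend population values for healthy kids
stenotic_mean = 0.7         # sample mean AVA/BSA ratio of the stenotic kids

z = (stenotic_mean - mu) / sigma        # about -1.4
p_below = NormalDist().cdf(z)           # fraction of healthy kids at or below 0.7
p_above = 1 - p_below                   # fraction of healthy kids above 0.7

print(f"z = {z:.2f}")
print(f"healthy kids at or below 0.7: {p_below:.2%}")
print(f"healthy kids above 0.7:       {p_above:.2%}")
```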
Not all Data are Normally Distributed

“Time to an event”, e.g., time to a heart
attack

Number of things that happen, e.g., number
of heart attacks

These typically have a skew shape
[Figure: a skewed density curve. Vertical axis: DENSITY; horizontal axis: X, from -1 to 6.]
Not all Data are Normally Distributed

These typically have a skew shape

Statisticians have special models to handle
this (Gamma, Poisson)

You will usually try to eliminate some of the
skewness by data transformation
[Figure: a skewed density curve. Vertical axis: DENSITY; horizontal axis: X, from -1 to 6.]
Not all Data are Normally Distributed

The standard data transformations are

Square root

Logarithm: but if you have zeros in the data
set, you have to add a small constant, since
log(0) = -∞
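To illustrate how these transformations reduce skewness, here is a small sketch on simulated right-skewed data that includes a few exact zeros. The data, the skewness helper, and the added constant of 0.1 are all assumptions chosen for the example; the slides do not say which constant to use.

```python
import math
import random

def skewness(xs):
    """Simple moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

rng = random.Random(1)
# Hypothetical right-skewed data; a few exact zeros are included on purpose.
data = [rng.lognormvariate(0, 1) for _ in range(500)] + [0.0] * 5

small_constant = 0.1   # assumed value, added so that zeros do not break the logarithm
sqrt_data = [math.sqrt(x) for x in data]
log_data = [math.log(x + small_constant) for x in data]

print(f"skewness, raw data:        {skewness(data):.2f}")
print(f"skewness, square root:     {skewness(sqrt_data):.2f}")
print(f"skewness, log(x + 0.1):    {skewness(log_data):.2f}")
```

Calling `math.log(0.0)` raises a `ValueError` in Python, which is the practical face of the log(0) problem. The choice of constant affects the result, so it is worth checking a q-q plot after transforming.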
Inference



The basic building blocks for inference are
statistics
Let’s start with the population mean μ, the
sample mean x̄ and the sample standard
deviation s
Standard error (of the mean) is s/√n
Inference


The sample mean x̄ is a random variable

This means that it varies from sample to
sample

Of course, if we were able to “sample” the
entire population, the sample mean would
equal the population mean μ
Inference


The sample mean x̄ is a random variable

Its own “population” mean is μ

Its standard deviation is σ/√n

Note how the standard deviation of the
sample mean becomes smaller as the
sample size becomes larger

Why does this make sense?
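A quick numerical sketch of the formula shows how fast the standard deviation of the sample mean shrinks. The population standard deviation of 6 is borrowed from the ages example purely for illustration, and the function name is made up.

```python
import math

def standard_error(sd, n):
    """Standard deviation of the sample mean for a sample of size n."""
    return sd / math.sqrt(n)

sigma = 6   # illustrative population standard deviation
errors = {n: standard_error(sigma, n) for n in (4, 16, 64, 256)}
for n, se in errors.items():
    print(f"n = {n:3d}: standard deviation of the sample mean = {se:.3f}")
```

Each fourfold increase in sample size halves the standard deviation of the sample mean: averaging more observations lets high and low values cancel, so the mean moves around less from sample to sample.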
Central Limit Theorem


The sample mean x̄ is a random variable

Its own “population” mean is μ

Its standard deviation is σ/√n

In “large enough” samples, the sample mean
is very nearly normally distributed, i.e., has a
bell-shaped histogram

What does this mean?
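One way to see what this means is to simulate it. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen only for illustration), computes the sample mean each time, and checks that the collection of sample means behaves like a normal population with mean μ and standard deviation σ/√n.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

def sample_mean(n):
    """Mean of one random sample of size n from a skewed (exponential) population."""
    return mean([rng.expovariate(1.0) for _ in range(n)])

n = 50                                        # size of each sample
means = [sample_mean(n) for _ in range(2000)] # 2000 repeated samples

center = mean(means)     # should be close to the population mean, 1
spread = stdev(means)    # should be close to sigma/sqrt(n) = 1/sqrt(50), about 0.141
# Fraction of sample means within 1.96 standard errors of the population mean
inside = sum(abs(m - 1.0) <= 1.96 / n ** 0.5 for m in means) / len(means)

print(f"mean of the sample means:      {center:.3f}")
print(f"std. dev. of the sample means: {spread:.3f}")
print(f"fraction within 1.96 SE:       {inside:.3f}")
```

Even though individual observations are far from bell-shaped, roughly 95% of the sample means land within 1.96 standard errors of the population mean, just as the empirical rule for normal populations predicts.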
Warning

It is incredibly easy to have difficulty
understanding that the sample mean is itself
a random variable

But it is the crucial concept

If I take repeated samples and compute the
sample mean each time, I will not get the
same number.

Thus, the sample mean is a random variable
Women’s Interview Survey of Health

Funny case-control study

Seemed to indicate that those women who
ate a lot of non-chocolate sweets were at
higher risk of breast cancer

271 women controls were interviewed about their
diets

They completed six 24-hour recalls
Women’s Interview Survey of Health

271 women controls were interviewed about their
diets and completed six 24-hour recalls

Hawthorne effect: the more you ask
people about their lives, the more they will
change

Does this happen here?

If so, we’d expect that their caloric intake
decreased the more they were asked about
their diet
Women’s Interview Survey of Health

To test the Hawthorne effect, we took the
average caloric intake from the first two
interviews, and subtracted it from the
average caloric intake from the last 2
interviews

X = (average of 5 & 6) – (average of 1 & 2)

Do you think the population mean of X is
positive or negative?
WOMEN’S INTERVIEW SURVEY OF
HEALTH (WISH)

My guess was that because of various factors
(societal pressure, awareness of diet,
Hawthorne effect), they will report fewer
calories at the second time period

My hypothesis is that the population mean of
X is < 0.
WISH: Change in Caloric Intake
[Figure: box plot of Change in mean Energy, N = 271; vertical axis from -3000 to 2000.]
Does it look like a big change?
WISH: Change in Calories
[Figure: Normal Q-Q Plot of Change in mean Energy. Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]
Does this look straight enough to be happy thinking that X is
approximately normally distributed?
WISH
What does an IQR of 838 mean?

Descriptives: Change in mean Energy (last 2 recalls minus first 2 recalls)

                                       Statistic    Std. Error
Mean                                   -180.1262       37.2202
95% Confidence Interval for Mean
    Lower Bound                        -253.4050
    Upper Bound                        -106.8474
5% Trimmed Mean                        -171.6543
Median                                 -128.2150
Variance                                375428.7
Std. Deviation                          612.7223
Minimum                                 -2235.00
Maximum                                  1567.96
Range                                    3802.96
Interquartile Range                     838.1900
Skewness                                   -.253          .148
Kurtosis                                    .608          .295
WISH

The sample size is n = 271

The sample mean change = -180 calories!

The sample standard deviation = 612

The sample standard error = 37

Empirical rule: the chance is 95% that the
population mean is within 1.96 × 37 ≈ 73 of -180, i.e., between about -253 and -107
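The interval arithmetic can be checked with the unrounded values from the descriptives table. This is a sketch of the simple empirical-rule interval (mean plus or minus 1.96 standard errors); the variable names are illustrative.

```python
from statistics import NormalDist

n = 271                    # sample size
mean_change = -180.1262    # sample mean change in calories
sd = 612.7223              # sample standard deviation

se = sd / n ** 0.5                        # standard error of the mean, about 37.22
z = NormalDist().inv_cdf(0.975)           # about 1.96, the 95% multiplier
lower = mean_change - z * se
upper = mean_change + z * se

print(f"standard error = {se:.4f}")
print(f"95% interval: ({lower:.1f}, {upper:.1f})")
```

The bounds in the descriptives table (-253.4050 and -106.8474) are slightly wider, presumably because the software uses a t-distribution multiplier a little larger than 1.96; with n = 271 the difference is well under one calorie. Either way the whole interval lies below zero.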
WISH

Empirical rule: the chance is 95% that the
population mean is between
about -253 and -107

What does this mean?

Is there a Hawthorne effect going on?

Can you attach a probability to this?