Statistics and SPSS Workshop
Statistics
Arsia Jamali
Medicine & MPH Student
Students’ Scientific Research Center
Tehran University of Medical Sciences
Outline
Descriptive Statistics
Analytical Statistics
Statistical Tests
6/23/2009
Arsia Jamali-Students' Scientific Research Center
3
Descriptive Statistics Outline
Overview
Variables
Data Presentation
Exercise
Types of Statistics
Descriptive statistics: used to describe the
characteristics of our sample
Inferential (analytical) statistics: used to
generalise from our sample to our population
Variables
Definition
Types:
Qualitative
Nominal: Blood Group
Ordinal: Staging
Quantitative
Interval: temperature in °C
Ratio: temperature in K
Presentation of Data
Frequency Tables
Graphical Techniques
Measures of Central Tendency
Measures of Spread (Variability)
Charts
[Figure: Distribution of Stage of Pancreatic Cancer in Patients, shown as a pie chart and as a bar chart (stages I to IV)]
Charts
[Figure: histogram and area chart]
Charts
[Figure: box plot and error bar chart]
Charts
[Figure: clustered bar chart, "Distribution of Stage of the Pancreatic Cancer According To Age In Patients" (x-axis: Stage, I to IV; y-axis: Number of the Patients; legend: Male, Female)]
[Figure: scatter plot (x-axis: Gestational Week; y-axis: Birth Weight)]
Charts
• Pie chart (Qualitative Variables)
• Bar Chart (Qualitative Variables)
• Histogram (Quantitative Variables)
• Area (Polygon) (Quantitative Variables)
• Clustered Bar (Two Variables)
• Box Plot (Two Variables)
• Error Bar (Two Variables)
• Scatter Plot (Two Variables)
Significance of Cluster Bar
[Figure: two separate bar charts, "Distribution of Stage in Pancreatic Cancer Patients" (Stage I to Stage IV) and "Distribution of Gender in Pancreatic Cancer Patients" (Male, Female); y-axes in percent]
Significance of Cluster Bar
[Figure: clustered bar chart of Stage I to Stage IV by gender (Male, Female); y-axis in percent]
Measures of Central Tendency
Mean (Average)
Median
Mode
Mean
• µ = ∑X / N
• Takes into account all values
• If describing a population, denoted as µ (“mu”)
• If describing a sample, denoted as “x-bar”
• Appropriate for describing measurement data
• Seriously affected by outliers
Median
• 50th percentile
• Appropriate for describing measurement data
• Is not affected much by unusual values
• Calculation: the middle value for an odd number of observations, the average of the two middle values for an even number
Mode
• Most frequent value
• One data set can have many modes
• Appropriate for all types of data, but most
useful for categorical data or discrete data
with only a few possible values
• Unaffected by extreme scores
• Not useful when there are several values that
occur equally often in a set
Exercise
4, 6, 3, 7, 5, 7, 8, 4, 5,10,10, 6, 8, 9, 3, 5, 6,
4, 11, 6
Mode = 6
Median = 6
Mean = 6.35
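The three measures for the exercise data can be checked with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative and not part of the workshop material.

```python
import statistics

# The 20 values listed in the exercise
data = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # average of the two middle values, since n is even
mean = statistics.mean(data)      # sum of the values divided by n

print(f"n = {len(data)}, mode = {mode}, median = {median}, mean = {mean}")
```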
Note
The mean is the preferred measure of central
tendency, except when there are:
Extreme scores or skewed distributions
Non-interval data
Discrete variables
Measures of Spread (Variability)
Range = Highest - Lowest
Variance
Standard Deviation
Coefficient of Variation (CV)
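As a rough illustration of these four measures, the sketch below computes them for a small made-up sample (the values are hypothetical, chosen only for the example). It uses the sample variance and standard deviation, which divide by n - 1; that choice is an assumption, since the slides do not say which form is meant.

```python
import statistics

# Hypothetical sample, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(sample) - min(sample)   # Range = Highest - Lowest
variance = statistics.variance(sample)    # sample variance (divides by n - 1)
sd = statistics.stdev(sample)             # standard deviation = square root of the variance
cv = sd / statistics.mean(sample) * 100   # coefficient of variation, in percent

print(f"range = {value_range}, variance = {variance:.2f}, SD = {sd:.2f}, CV = {cv:.1f}%")
```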
CV Example
Consider the following:
The mean body weight in our sample is 70 and
its standard deviation is 5.
The mean height in our sample is 165 and its
standard deviation is 8.
Which variable is less variable?
CV = (standard deviation / mean) × 100
CVweight = 7.1% and CVheight = 4.8%, so height is the less variable.
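The comparison can be reproduced with a few lines of Python. The helper name `coefficient_of_variation` is hypothetical; it simply applies CV = SD / mean × 100 to the figures given in the example.

```python
def coefficient_of_variation(mean, sd):
    """Return the coefficient of variation as a percentage: SD / mean * 100."""
    return sd / mean * 100

cv_weight = coefficient_of_variation(mean=70, sd=5)    # body weight
cv_height = coefficient_of_variation(mean=165, sd=8)   # height

# The variable with the smaller CV is the less variable one
less_variable = "height" if cv_height < cv_weight else "weight"

print(f"CV weight = {cv_weight:.1f}%, CV height = {cv_height:.1f}%")
print(f"Less variable: {less_variable}")
```

Because the CV is unit-free, it allows variables measured on different scales to be compared directly.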
Real Examples
• Article 1 (Table)
• Article 2 (Table & Graph )
• Your Turn!!!
Analytical Statistics
Frequency Distribution
[Figure: frequency distribution of heads and tails in 100 coin tosses]
Characteristics of A Normal
Distribution (1)
Bell Shaped
Symmetric
Unimodal
Mean = Mode = Median
Extends to +/- infinity
Area under the curve=1
Characteristics of A Normal
Distribution (2)
Can be completely specified by two
parameters:
Mean
Standard Deviation
The mean mu controls the center and sigma
controls the spread.
[Figures: two frequency plots (y-axis: Frequency)]
Standard Normal Distribution
• The standard normal distribution has mean = 0
and standard deviation sigma=1
Characteristics of A Normal
Distribution (3)
For any normal curve with mean mu and standard
deviation sigma:
• 68 percent of the observations fall within
one standard deviation of the mean (µ ± 1σ)
• 95 percent of the observations fall within
2 standard deviations (µ ± 2σ)
• 99.7 percent of the observations fall within
3 standard deviations (µ ± 3σ)
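These three percentages can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and computes the area under the standard normal curve within 1, 2 and 3 standard deviations of the mean; the dictionary name `areas` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# P(-k < Z < k): area under the curve within k standard deviations of the mean
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} SD: {area:.1%}")
```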
[Figure sequence: Z transformation (y-axis: Frequency; 0 marked on the x-axis)]
Z Transformation
Z = (Xi - µ) / σ
Application
Use of Z Table
Question
Ali scored 17 in math and 15 in science. Is he better at math or
science? Is this information enough?
Mean: math = 16, science = 14
SD: math = 1, science = 0.5
Suppose the mean systolic BP in humans is 120 mmHg and
the standard deviation of the population is 10 mmHg. What
percentage of people have a systolic BP below 134 mmHg?
That is, P(Z < 1.40) = ?
Answer = 0.5 + P(0 < Z < 1.40) = 0.5 + 0.4192 = 0.9192, about 92%
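Both questions come down to a Z transformation, which can be sketched in Python as follows. The function name `z_score` is hypothetical, and `statistics.NormalDist` stands in for the printed Z table.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Z = (x - mean) / sd: distance from the mean in standard deviations."""
    return (x - mean) / sd

# Ali's scores, each compared with the mean and SD for that subject
z_math = z_score(17, mean=16, sd=1)
z_science = z_score(15, mean=14, sd=0.5)
print(f"Z math = {z_math}, Z science = {z_science}")

# Systolic BP: proportion of people below 134 mmHg
z_bp = z_score(134, mean=120, sd=10)
p_below_134 = NormalDist().cdf(z_bp)   # P(Z < 1.40)
print(f"Z = {z_bp}, P(Z < {z_bp}) = {p_below_134:.4f}")
```

Although the raw math score is higher, the science score lies further above its mean in standard deviation units, so the raw scores alone are not enough to answer the first question.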
Distribution of Means
Central Limit Theorem
Regardless of the distribution of the population,
the distribution of the means of random samples
approaches a normal distribution for a large
sample size.
Z = (x-bar - μ) / (σ / √n)
SEM = σ / √n
Mean Distribution Chart
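A small simulation can make the theorem more concrete. The sketch below draws repeated samples from a uniform population (clearly not normal) and compares the standard deviation of the sample means with SEM = σ / √n. The population, the sample size of 30 and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Non-normal population: uniform on [0, 1), with mean 0.5 and SD sqrt(1/12)
sigma = (1 / 12) ** 0.5
n = 30            # size of each sample (hypothetical choice)
n_samples = 5000  # number of repeated samples (hypothetical choice)

# Mean of each repeated sample
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(n_samples)
]

sem_theory = sigma / n ** 0.5                   # SEM = sigma / sqrt(n)
sem_observed = statistics.stdev(sample_means)   # SD of the simulated means

print(f"mean of sample means = {statistics.fmean(sample_means):.3f}")
print(f"SEM (theory) = {sem_theory:.4f}, SEM (simulated) = {sem_observed:.4f}")
```

Plotting a histogram of `sample_means` would show the bell shape that the theorem predicts.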
Estimation
In many situations, conducting a census is very
difficult and expensive
In practice, it is not possible to repeat a study
many times
So what should we do?
We may estimate…
Confidence Interval (CI)
• An example: the mean height in a
sample is 175 cm and the standard error of the
mean is 5 cm. Can we estimate the mean of
the population?
• With 95% confidence, the mean of the
population lies in the interval 165 to 185 cm (175 ± 2 × 5).
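The interval in this example can be computed as mean ± z × SEM. The sketch below assumes a normal approximation and uses the exact two-sided critical value (about 1.96) rather than the rounded value of 2, so the limits come out slightly narrower than 165 to 185.

```python
from statistics import NormalDist

sample_mean = 175.0  # cm
sem = 5.0            # standard error of the mean, cm
confidence = 0.95

# Two-sided critical value of the standard normal distribution
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = sample_mean - z_crit * sem
upper = sample_mean + z_crit * sem

print(f"z = {z_crit:.2f}")
print(f"{confidence:.0%} CI: {lower:.1f} to {upper:.1f} cm")
```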
Confidence Interval (CI)
Parameters: interval limits and probability (confidence level)
In applied practice, confidence intervals are
typically stated at the 95% confidence level
Remember there is a 5% probability that the mean of
the population does not lie in this interval
Hypothesis
In each problem considered, the question of interest
is simplified into two competing claims (hypotheses)
between which we have a choice: the null hypothesis
and the alternative hypothesis.
Hypothesis Example
Suppose the goal of a study is to show that:
Our drug is more effective in reducing pain than
morphine
So the hypotheses would be:
H0: Our drug is not more effective than
morphine in relieving pain
H1: Our drug is more effective than morphine in
relieving pain
Testing The Hypothesis
Null hypothesis (H0):
The hypothesis to be tested
No difference is seen
Alternative hypothesis (H1=Ha):
The hypothesis to be considered as an
alternative to the null hypothesis.
Note: The alternative hypothesis is the one you believe
to be true, or what you are trying to show is true
Types of Hypothesis Testing (1)
• Two Tailed Test
• Example: Mean height of the boys and girls
does not differ.
H0: height(boys) = height(girls)
H1: height(boys) ≠ height(girls)
Types of Hypothesis Testing (2)
• Left-sided Test
• Example: Plasma albumin level in cirrhotic
patients is lower than 3.5 g/dl
H0: Albumin(patients) ≥ 3.5
H1: Albumin(patients) < 3.5
Types of Hypothesis Testing (3)
• Right-sided test
• Example: Systolic blood pressure of diabetic
patients is higher than 120mmHg
H0: SBP(patients) ≤ 120
H1: SBP(patients) > 120
Note: A hypothesis test is called a one-tailed
(directional) test if it is either right- or left-tailed
Results of Hypothesis Testing
Possible conclusions from hypothesis-testing
analysis are reject H0 or fail to reject H0
But we may make mistakes in the test:
Type I error: rejecting the null hypothesis when in fact
it is true; that is, H0 is wrongly rejected.
The probability of a type I error is denoted by α
Type II error: failing to reject the null hypothesis when it is
false.
The probability of a type II error is denoted by β
Types of Errors
Type I error Example
Example: in a clinical trial of a new drug, the null
hypothesis might be that the new drug is no
better, on average, than the current drug;
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that
the two drugs produced different effects when
in fact there was no difference between them.
Type I error
A type I error is often considered to be more serious, and
therefore more important to avoid, than a type II error.
The hypothesis test procedure is therefore adjusted so
that there is a guaranteed 'low' probability of rejecting
the null hypothesis wrongly; this probability is never 0.
This probability of a type I error can be precisely
computed as:
P (type I error) = significance level =α
Type II error
A type II error occurs when the null hypothesis H0, is not
rejected when it is in fact false.
For example, in a clinical trial of a new drug, the null
hypothesis might be that the new drug is no better, on
average, than the current drug. A type II error would occur
if it was concluded that the two drugs produced the same
effect, i.e. there is no difference between the two drugs on
average, when in fact they produced different ones.
P(type II error) = β
Type of Errors
• If we do not reject the null hypothesis, it may
still be false (a type II error), as the sample may
not be big enough to detect that the null hypothesis
is false (especially if the truth is
very close to the null hypothesis).
• For any given set of data, type I and type II errors are
inversely related; the smaller the risk of one, the
higher the risk of the other
Power
• Measures the test's ability to reject the null
hypothesis when it is actually false
• In other words, the power of a hypothesis test
is the probability of not committing a type II
error. It is calculated by subtracting the
probability of a type II error from 1
• Power = 1 - P(type II error) = 1-β
• The maximum power a test can have is 1, the
minimum is 0. Ideally we want to have high
power, close to 1.
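To illustrate Power = 1 - β, the sketch below computes the power of a right-sided z test. The scenario is an assumption made for the example: H0 mean 120, true mean 125, known σ = 10, α = 0.05, and sample sizes of 25 and 50. The function name `power_one_sided_z` is hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a right-sided z test of H0: mean = mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    sem = sigma / n ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when Z > z_crit
    shift = (mu_true - mu0) / sem            # true difference in SEM units
    beta = std_normal.cdf(z_crit - shift)    # P(type II error)
    return 1 - beta

power_n25 = power_one_sided_z(120, 125, sigma=10, n=25)
power_n50 = power_one_sided_z(120, 125, sigma=10, n=50)

print(f"power with n = 25: {power_n25:.2f}")
print(f"power with n = 50: {power_n50:.2f}")
```

With α and the true difference held fixed, the larger sample gives the higher power, which is the link between α, β and n.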
Interrelationship between α, β, and n
Sample Size
• How many individuals will I need to study?
• An appropriate sample size generally depends
on five study design parameters:
Minimum expected difference (also known as the effect size)
Measurement variability
Statistical power
Significance level
One- or two-tailed analysis
Ref: Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;227:309-13
Sample Sizes For Comparative
Studies (1)
N is the total sample size (the sum of the sizes of
both comparison groups),
σ is the assumed SD of each group (assumed to be
equal for both groups),
Zcrit is the standard normal deviate for the chosen significance criterion,
Zpwr is the standard normal deviate for the chosen statistical power,
D is the minimum expected difference between the
two means
Ref: Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;227:309-13
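The equation itself is not reproduced in the transcript. Assuming it is the usual formula for comparing two means with these symbols, N = 4σ²(Zcrit + Zpwr)² / D², a minimal Python sketch looks like this. The function name and the example inputs (σ = 10, D = 5, α = 0.05, power 80%) are hypothetical.

```python
import math
from statistics import NormalDist

def total_n_two_means(sigma, d, alpha=0.05, power=0.80, two_tailed=True):
    """Total N for two groups, assuming N = 4 * sigma^2 * (z_crit + z_pwr)^2 / D^2."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    z_crit = std_normal.inv_cdf(1 - tail)   # about 1.96 for two-tailed alpha = 0.05
    z_pwr = std_normal.inv_cdf(power)       # about 0.84 for 80% power
    n_total = 4 * sigma ** 2 * (z_crit + z_pwr) ** 2 / d ** 2
    return math.ceil(n_total)               # round up to whole participants

n_two_tailed = total_n_two_means(sigma=10, d=5)
n_one_tailed = total_n_two_means(sigma=10, d=5, two_tailed=False)

print(f"two-tailed: N = {n_two_tailed}, one-tailed: N = {n_one_tailed}")
```

Halving D quadruples the required N, which is why the minimum expected difference dominates the calculation.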
Sample Sizes For Comparative
Studies (2)
Ref: Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;227:309-13
Sample Sizes For Descriptive
Studies (1)
N is the sample size of the single study group,
σ is the assumed SD for the group,
Zcrit is the standard normal deviate for the desired confidence level,
D is the total width of the expected CI.
Note: this equation does not depend on statistical
power because this concept only applies to
statistical comparisons
Ref: Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;227:309-13
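The equation is again missing from the transcript. Assuming the usual form for estimating a single mean with these symbols, N = 4σ²Zcrit² / D², a hedged Python sketch follows. The function name and the inputs (σ = 10, total CI width D = 4, 95% confidence) are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_ci_width(sigma, d, confidence=0.95):
    """Sample size, assuming N = 4 * sigma^2 * z_crit^2 / D^2 with D the total CI width."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = 4 * sigma ** 2 * z_crit ** 2 / d ** 2
    return math.ceil(n)   # round up to whole participants

n_descriptive = n_for_ci_width(sigma=10, d=4)
print(f"N = {n_descriptive}")
```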
Sample Sizes For Descriptive
Studies (2)
Ref: Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;227:309-13
P value
• Suppose we want to investigate whether the
SBP of diabetic patients is higher than that of the general
population. We know the mean SBP in the
general population is 120 mmHg and its
standard deviation is 10 mmHg.
• In our study we find that the mean SBP of our 25
diabetic patients is 125 mmHg.
• What can we conclude?
P value
• Z = (125 - 120) / (10 / √25)
• Z = 5 / 2 = 2.5
• α = 0.05
• P value = 0.006
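The calculation can be reproduced in Python. This sketch treats the test as right-sided (H1: the SBP of diabetic patients is higher), which matches the research question, and uses `statistics.NormalDist` in place of a Z table.

```python
from statistics import NormalDist

mu0, sigma = 120, 10         # general population: mean and SD of SBP (mmHg)
n, sample_mean = 25, 125     # our sample of diabetic patients
alpha = 0.05

sem = sigma / n ** 0.5                 # 10 / sqrt(25) = 2
z = (sample_mean - mu0) / sem          # test statistic
p_value = 1 - NormalDist().cdf(z)      # right-sided: P(Z >= z) if H0 is true
reject_h0 = p_value < alpha

print(f"Z = {z}, P value = {p_value:.3f}, reject H0: {reject_h0}")
```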
P value
P value: the probability of getting a value of the
test statistic as extreme as or more extreme
than that observed by chance alone, if the null
hypothesis H0 is true.
A small P value means the observed result would be
unlikely if H0 were true.
The P value is compared with the chosen
significance level of our test (α) and, if it is smaller,
the result is significant.
P value
If the P value is 0.03,
that means that there is a 3% chance of observing
a difference as large as you observed even if the
two population means are identical (the null
hypothesis is true)
A P value does not show a causal association
P value
• Another example: in a study, the heights of 25
girls and 25 boys were measured. The mean height
of the girls was 170 cm and the mean height of the boys
was 175 cm. We want to see whether the
difference is significant or was seen by
chance.
P value
• The 95% CIs overlap, so H0 (the no-difference
hypothesis) is not rejected.
Remember that it is not accepted either.
P value
• The 95% CIs do not overlap, so H0 (the no-difference
hypothesis) is rejected. The difference is therefore
unlikely to be due to chance alone.
Measures for comparing rates
• Odds Ratio (OR)
• Relative Risk (RR)
OR
• Odds definition: probability of A divided by probability of not-A, P(A)/P(not A)
• OR: exposure odds in the disease group divided by exposure odds
in the non-disease group
• Case-Control Study
• Interpretation:
– OR=1
– OR>1
– OR<1

               Lung Cancer   Healthy   Total
Smoking                500      1800    2300
Not-Smoking            500      7200    7700
Total                 1000      9000   10000

Exposure odds in the lung cancer group: 500:500 = 1
Exposure odds in the healthy group: 1800:7200 = 1/4
OR = (500:500)/(1800:7200) = 4

               Disease+   Control
Exposure+         a          b
Exposure-         c          d
Exposure Odds    a/c        b/d
Odds Ratio = ad/bc
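The odds ratio for the smoking example can be checked with a short function. The name `odds_ratio` is hypothetical; the arguments follow the a, b, c, d layout of the 2×2 table (a and b are the exposed cases and controls, c and d the unexposed cases and controls).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: (a/c) / (b/d), which equals ad/bc."""
    exposure_odds_cases = a / c      # exposed : unexposed among cases
    exposure_odds_controls = b / d   # exposed : unexposed among controls
    return exposure_odds_cases / exposure_odds_controls

# Smoking and lung cancer: 500 and 1800 smokers, 500 and 7200 non-smokers
or_smoking = odds_ratio(a=500, b=1800, c=500, d=7200)
print(f"OR = {or_smoking}")
```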
RR
• Risk of the disease in the exposed group
divided by the risk of the disease in the non-exposed group
• Cohort Study
• Interpretation:
– RR=1
– RR>1
– RR<1

               Lung Cancer   Healthy   Total
Smoker                1000      4000    5000
Not Smoker             100      4900    5000
Total                 1100      8900   10000

Risk of cancer in smokers: 1000:5000 = 1/5
Risk of cancer in non-smokers: 100:5000 = 1/50
RR = (1000:5000)/(100:5000) = 10

               Disease+   Control   Risk
Exposure+         a          b      a/(a+b)
Exposure-         c          d      c/(c+d)
Relative Risk = {a/(a+b)}/{c/(c+d)}
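The relative risk for the smoking cohort can be checked in the same way. The function name `relative_risk` is hypothetical; the arguments follow the a, b, c, d layout of the 2×2 table.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table: {a/(a+b)} / {c/(c+d)}."""
    risk_exposed = a / (a + b)     # risk of disease in the exposed group
    risk_unexposed = c / (c + d)   # risk of disease in the non-exposed group
    return risk_exposed / risk_unexposed

# Smokers: 1000 with cancer, 4000 healthy; non-smokers: 100 with cancer, 4900 healthy
rr_smoking = relative_risk(a=1000, b=4000, c=100, d=4900)
print(f"RR = {rr_smoking:.1f}")
```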
Websites & Real Examples
Some Websites:
– http://www.hutchon.net/ConfidORselect.htm
– http://www.metanumerics.net/Samples/ContingencyCalculator.aspx
Example 1
The mom Song
Real Examples
• Your Turn!
• Female Driver
Statistical Tests
• Allow us to estimate the likelihood that the
apparent differences between groups are real
and not due to chance.
• Since there are two types of variables (qualitative and
quantitative), we need two types of statistical tests
T test
• Compares means of two groups
• T test prerequisites:
1) The two groups of samples should be independent
2) Samples should have a normal distribution
• Note: samples larger than 30 do not need to have a
normal distribution
3) Samples should have similar standard deviations
Types of Tests
1) Independent-samples:
Compares the means of two independent
groups
2) Paired-samples:
Compares the means of two linked samples, e.g.
means obtained in two conditions by a single group of
participants
Example: comparing the SBP of a group of patients before
and after administration of propranolol
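A paired-samples t test works on the within-patient differences. The sketch below uses invented SBP readings for five patients (purely hypothetical data) and computes t = mean difference / (SD of differences / √n) with the standard library only. The critical value 2.776 is the tabulated two-tailed value for α = 0.05 with 4 degrees of freedom.

```python
import statistics

# Hypothetical SBP readings (mmHg) for the same five patients
before = [150, 142, 160, 155, 148]
after = [140, 138, 150, 149, 143]

# Paired design: analyse the difference within each patient
differences = [b - a for b, a in zip(before, after)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)   # sample SD of the differences
se_diff = sd_diff / n ** 0.5              # standard error of the mean difference

t_stat = mean_diff / se_diff
df = n - 1
t_crit = 2.776   # t table: two-tailed alpha = 0.05, df = 4

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {abs(t_stat) > t_crit}")
```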
ANOVA
[Figure: illustration labelled Ali, Nasim, Omid, Vahid, Akram]
Chi square Test (1)
Used for qualitative data, odds, and risk
Chi-square test prerequisites:
1) The data must be actual counts, not
percentages or proportions
2) The frequency data must have a precise numerical
value and must be organized into categories or groups
3) The expected frequency in any one cell of the table
must be greater than 5
4) The total number of observations must be greater than
20
5) Observations must be independent
An Example of Chi square Test
• Consider the following study: we have
assessed depression and gender in 100
people. The results of the study are summarized
in the following table:

                Male   Female   Total
Depressed         10       20      30
Not Depressed     40       30      70
Total             50       50     100
Chi square Test (2)
χ² = ∑ (Obs – Exp)² / Exp
χ² = the value of chi square
Obs = the observed value
Exp = the expected value
∑ (Obs – Exp)² / Exp = each value of (Obs – Exp) squared, divided by Exp, then
added together
df = (R-1)(C-1), where R is the number of rows and C the number of columns
An Example of Chi square Test
• If there is no difference between the two genders
(in depressed/not depressed people), we
expect the cells to be filled as follows (expected values in parentheses):

                Male      Female    Total
Depressed       10 (15)   20 (15)      30
Not Depressed   40 (35)   30 (35)      70
Total           50        50          100
An Example of Chi square Test
• χ² = (10-15)²/15 + (20-15)²/15 + (40-35)²/35 + (30-35)²/35 = 4.76
• Critical χ² with 1 df (at p=.05) = 3.84
• Reject H0: depression and gender are NOT
independent; they are associated.
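The whole calculation, including the expected counts, can be reproduced from the observed table. This is a minimal sketch using plain Python; the variable names are illustrative, and the critical value 3.84 is the tabulated value for 1 df at p = .05.

```python
# Observed counts: rows = Depressed / Not Depressed, columns = Male / Female
observed = [[10, 20],
            [40, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if gender and depression were independent
        exp = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (R-1)(C-1)
critical = 3.84   # chi-square table: 1 df, p = .05

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {chi_square > critical}")
```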
[Flowchart for choosing a statistical test, branching YES/NO on three questions:]
Is your Dependent Variable (DV) continuous?
Is your Independent Variable (IV) continuous?
Do you have only two groups?
The road to happiness lies in two simple principles: find what it is that
interests you and that you can do well, and when you find it put your whole
soul into it – every bit of energy and ambition and natural ability you have.
John D. Rockefeller III