Welcome to Biostatistics 1
Course instructor: Dr. JMA Hannan
Class hours: Monday 6:00 pm – 9:00 pm
Cell: 01199248989
E-mail: [email protected]
How to do well in this class?
1. Forget about your previous failure.
2. Attend lectures and take notes.
3. * Effort = Result.
4. Read the syllabus.
5. Read exam questions carefully.
6. Answer all parts of a given question.
7. Turn assignments in on time
8. Ask if you have questions.
General Policy
Examination            Marks
Midterm 1              20%
Midterm 2              20%
Final exam             40%
Class tests            10%
Assignment             5%
Class participation    5%
Total marks            100
Grading Policy
Numerical Scores    Letter Grade
93 and above        A
90 – 92             A-
87 – 89             B+
83 – 86             B
80 – 82             B-
77 – 79             C+
73 – 76             C
70 – 72             C-
67 – 69             D+
60 – 66             D
<60                 F (Fail)
If you are absent from 3 consecutive classes, you will be given an “F”.
Topics
Lecture 1: Introduction to Biostatistics – scope of Biostatistics in biology
and medical sciences. Data & presentation; Mean, Median and Mode; Range,
Standard Deviation, Standard Error and Coefficient of Variation.
Lecture 2: Normal distribution , Test of hypothesis
Lecture 3 : z-test, t-test
Lecture 4 : One way ANOVA
1st Midterm Exam (July 10 - 15, 2008)
Lecture 5: Post Hoc tests (Bonferroni, Duncan, Dunnett, LSD, Tukey test),
Repeated measures ANOVA
Lecture 6: Mann-Whitney, Wilcoxon rank test & Kruskal-Wallis test (Tukey
test)
Lecture 7: Chi-square test, Relative risk, Odds ratio
Lecture 8: Simple Correlation & Rank Correlation
Lecture 9: Regression analysis.
Lecture 10-12: Introduction to SPSS and analysis of data using SPSS.
Lecture 13 - : Review class
Final Exam (September 10 – 15, 2008)
Objective of this course
• To develop an understanding of the fundamental concepts of
statistics.
• To become knowledgeable about different applications of
statistical methods in the MPH context.
• To enable students to conduct statistical analyses with a
user-friendly software package such as SPSS and to interpret
the output correctly.
• To be able to analyze simple data sets correctly and to
report the results in a precise and concise way.
Textbook and reference books
• Text Book of Medical & Pharmaceutical Statistics –
Dr JMA Hannan
• Biostatistics: A Foundation for Analysis in the Health
Sciences, by Wayne W. Daniel.
• Medical Statistics by Michael J. Campbell, David
Machin.
STATISTICS - HISTORICAL PERSPECTIVES
• The word statistics seems to be derived from the Latin word ‘status’, the Italian word ‘statista’, the German word ‘Statistik’ or the French word ‘statistique’, all of which refer to the ‘political state’.
• In ancient times the king used to collect information about the total population, land, wealth and soldiers of the country, and thus statistics served as an index of a country’s overall condition. In olden days, statistics was regarded as ‘the science of kings’.
• In the mid-17th century, the theoretical development of modern statistics came with the introduction of the ‘Theory of Probability’ and the ‘Theory of Games and Chances’.
• Gambling, in the form of games of chance, led to the theory of probability, which was originated by the French mathematician Pascal (1623-1662).
• Francis Galton (1822-1911) introduced the concept of the regression line.
• Galton and his friend Karl Pearson later introduced correlation analysis and the chi-square test, which play an important role in the modern theory of statistics.
• W.S. Gosset (1876-1937), a student of Karl Pearson, introduced the ‘Student’s t-test’, a basic tool of statistical analysis.
• Sir R.A. Fisher (1890-1962), known as the father of statistics, introduced a number of statistical procedures such as Analysis of Variance (ANOVA) and the design of experiments.
DEFINITION OF STATISTICS
Croxton and Cowden have given a very simple and concise definition of statistics.
Statistics is the science of
• collection
• organization
• presentation
• analysis and
• interpretation of data
DEFINITION OF BIOSTATISTICS
BIOSTATISTICS is derived from the Greek words Bios (Life) and Metron (Measure).
Thus biostatistics is the term used when the tools of statistics are applied to data derived from the biological and medical sciences.
Biostatistics is the science of
• collection
• organization
• presentation
• analysis and
• interpretation of data derived from biological sciences such as medicine.
Why use Statistics?
• Simplifies complexity
• Helps to compare
APPLICATION OF BIOSTATISTICS
The concepts of statistics may be applied in a number of fields, including public health, the pharmaceutical industry, business, psychology and agriculture.
In Medicine
In the field of medicine, statistical methods are used to evaluate the effectiveness of a new drug or method of treatment. When a drug is given to animals or humans, statistical methods are used to explore whether the changes produced are due to the action of the drug or to chance, and to compare the action of two or more different drugs or of different dosages of the same drug.
To find an association between a disease and a risk factor, such as myocardial infarction (MI) and alcohol intake, we need the help of statistics.
To define the normal range or limits of physiological and biochemical parameters. For example, the average systolic blood pressure is 120 mmHg and the average random blood glucose level is 6.7 mmol/l, but the limits on either side of the average within which a value may be considered normal must be established with an appropriate statistical technique.
In Community Medicine and Public Health
• In epidemiological studies, the role of causative factors is statistically tested. For example, iodine deficiency is confirmed as an important cause of goiter in a community only after comparing the incidence of goiter before and after giving iodized salt.
• To test the usefulness of vaccines in the field, the percentage of attacks or deaths among vaccinated subjects is compared with that among unvaccinated ones to find whether the observed difference is statistically significant.
• Statistics plays an important role in many decision-making processes in public health, for example: what factors increase the risk that an individual will develop coronary heart disease?
• To address these issues and others, we rely on the methods of biostatistics.
What is data?
• The raw material of statistics is data.
• We may define data as numbers or observations, usually obtained by some process of counting or measurement.
• Data are the outcome of
» facts (sex, occupation),
» events (birth, death, disease),
» measurements (height, weight)
recorded for many individuals. A single fact, event or measurement becomes data when it is recorded for a number of people, e.g.
» Sex: male/female
» Birth: live birth/still birth
» Death: cause/age/sex
» Occupation: teacher/physician/laborer etc.
Types of data
1. Qualitative data
» Nominal data
» Rank data
2. Numerical or Quantitative
» Discrete data
» Continuous data
Qualitative Data
Nominal Data
• Nominal data are data that one can name.
• They are not measured but simply counted.
• They often consist of unordered ‘either-or’ observations, for example: dead or alive; male or female; cured or not cured; pregnant or not pregnant.
Ranked Data
• If there are more than two categories of classification, it may be possible to order them in some way.
• For example, after treatment a patient may be improved, the same or worse; a woman may never have conceived, have conceived but spontaneously aborted, or have given birth to a live infant.
• In some situations we have a group of observations that are first arranged from highest to lowest according to magnitude and then assigned numbers corresponding to each observation’s place in the sequence. This type of data is known as ranked data.
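The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_descending` and the scores are made up, and the sketch assumes all observations are distinct (ties would need a rule such as average ranks).

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    # Map each value to its place in the highest-to-lowest sequence.
    position = {value: place for place, value in enumerate(ordered, start=1)}
    return [position[value] for value in values]

scores = [72, 95, 60, 88]        # hypothetical observations
ranks = rank_descending(scores)  # 95 is 1st, 88 is 2nd, 72 is 3rd, 60 is 4th
print(ranks)
```

The ranks are returned in the original order of the observations, so each observation can be paired with its place in the sequence.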
Numerical/Quantitative data
Discrete Data
• Such data consist of counts, which can take only isolated values.
• Example: the number of deaths in a hospital per year.
Continuous Data
• Such data are measurements that can, in theory at least, take any value within a given range.
• Example: diastolic blood pressure. Continuous data are often converted into categories, as when diastolic blood pressure is classified as hypertension or normotension.
Collection of data
1. Interviewing or enumeration
2. Questionnaire
3. Experiments
Data from physiology, pharmacology and clinical pathology laboratories, hospital wards, fundamental research etc.
4. Surveys
Data on the incidence or prevalence of health or disease in a community, such as the incidence of malaria or the prevalence of leprosy.
5. Records
Records are maintained routinely in registers or books over a long period of time for still births, deaths etc. Data are collected from these records.
Methods of presentation of data
Every study or experiment yields a set of data. Its
size can range from a few measurements to many
thousands of observations.
The principal object of data presentation, whether
tabular or graphical, is to convey the essential
features of the study to any reader of the final
publication.
1. Tabulation of data
2. Diagrammatic presentation
Tabulation of data
A statistical table is a systematic organisation of data in
columns and rows in accordance with some
characteristics. Tabulation is the process of presenting
data in tables.
Objectives of tabulation:
• To clarify the object of investigation.
• To simplify complex data.
• To facilitate comparison.
Rules for Tabulation of data
Construction of a good statistical table is a specialized art and requires great skill, experience and common sense.
• The table should be simple and compact.
• All titles, subtitles, captions etc. should be arranged in a systematic manner.
• The unit of measurement should be clearly defined in the table.
• A table should be complete and self-explanatory.
• A table should be attractive, to draw the attention of readers.
• Accurate statistical analysis should be done.
• Abbreviations should be avoided.
• If units of measurement are involved, such as mg/100 ml for serum cholesterol levels, they should be specified.
Parts of tabulation
• Table number
• Title
• Caption
• Headings of columns and rows
• Body of the table
• Foot-note
[Figure: layout of a statistical table, showing the number and title of the table, the row caption and column headings, the row sub-headings, the body and the total.]
The following table shows cigarette consumption per person among adolescent boys.
Year    Number of cigarettes consumed per adolescent boy
1996    654
1997    700
1998    900
1999    1200
2000    1500
2001    1350
Example: Tabulation of data
Table 2: Effects of propranolol on blood pressure in humans
                                  Blood pressure
Group                             Systolic BP    Diastolic BP    MAP
Control (n=10)                    210±40         100±24          74±29
Propranolol treated (n=20)        120±30         65±20           60±23
t/p value (unpaired t-test),
control vs propranolol            4.23/0.01      3.12/0.02       2.13/0.05
Data are presented as mean±SD. An unpaired t-test was used as the test of
significance. *p<0.05, **p<0.01.
Graphical presentation of data
A diagram is a visual form of data presentation. Complicated data can be understood easily through a diagram or graph. It is convincing to the eye and mind.
Importance of diagrams:
• They are attractive and impressive.
• They save the time and labour needed to understand the data.
• They make data simple.
• They make comparison easy.
• They provide more information than a table.
Types of diagram
• Line diagram
• Bar diagram (simple & multiple)
• Pie diagram
• Histogram
• Scatter diagram
Line Diagram
[Figure: line diagram of the number of cigarettes consumed per adolescent boy, by year, 1996 to 2001.]
[Figure: line diagram with series for glucose and diabetic food, measured at 0, 30, 60, 90 and 120 min. Fig: AUC and glycemic index of diabetic food in rats.]
Bar Diagram
[Figure: bar diagram showing the number of cigarettes consumed per adolescent boy, by year, 1996 to 2001.]
[Figure: bar diagram of AUC and glycemic index for glucose only and for diabetic food. Fig: AUC and glycemic index of diabetic food in rats.]
Pie Diagram for Current Contraceptive Use in Bangladesh (BDHS 2004)
[Figure: pie diagram with slices for pill, injectable, condom, sterilization, IUD and Norplant, and traditional methods. Slice labels in the figure: 45%, 19%, 17%, 10%, 7% and 2%.]
SUMMARIZING DATA: Measures of location
A measure of location or central tendency or average is
a single value used to represent a set of data.
Objectives of an average:
1. To get a single value that represents the entire data set.
2. To facilitate comparison between groups of data of a similar nature.
Important measures of central tendency are:
1. Mean
2. Median
3. Mode
Mean
Mean = sum of all the observation values ÷ number of observations
The mean $\bar{x}$ of n observations $x_1, x_2, x_3, \dots, x_n$ is given by
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x}{n}$$
where
x stands for an observed value.
n stands for the number of observations in the data set.
$\sum x$ stands for the sum of all observed x values.
$\bar{x}$ stands for the mean value of x.
Example: the mean of 10, 20, 30, 25, 15 is (10 + 20 + 30 + 25 + 15)/5 = 20.
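The worked example can be checked with a short Python sketch. The function name `mean` and the variable names are illustrative only; the standard library's `statistics.fmean` is used as a cross-check.

```python
from statistics import fmean

def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)

observations = [10, 20, 30, 25, 15]
result = mean(observations)           # (10 + 20 + 30 + 25 + 15) / 5
assert result == fmean(observations)  # agrees with the standard library
print(result)
```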
Mean
Merits:
• It is the most popular average, and it is easy to understand and easy to calculate.
• It takes all the observations into account.
• The mean is used in computing other statistics (such as the variance and standard deviation).
Limitation:
• The mean is affected by extremely high or low values.
• It is not a good measure of average when the distribution of observations is extremely asymmetric.
Median
Median = the middle value of a set of data.
When all the observations of a set of data are arranged in either
ascending or descending order, the middle observation is known
as the median. If the number of observations is even, the mean of the
two central values is taken as the median.
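A minimal sketch of this rule in Python follows. The function name `median` and the two small data sets are made up for illustration; they show the odd and even cases.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    if not values:
        raise ValueError("median requires at least one observation")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([23, 5, 19, 8, 12])  # sorted: 5, 8, 12, 19, 23
even_case = median([5, 7, 8, 12])      # two central values are 7 and 8
print(odd_case, even_case)
```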
Median for grouped data
Example: median of a grouped frequency distribution
Mark in test    Frequency    Cumulative frequency    Range of cumulative frequency
5 - 9           12           12                      < 12
9 - 13          8            20                      12 - 20
13 - 17         15           35                      20 - 35
17 - 21         19           54                      35 - 54
21 - 25         14           68                      54 - 68
25 - 29         7            75                      68 - 75
Total           75
Here n = 75, therefore n/2 = 75/2 = 37.5. Looking at the cumulative range column in the
table, we find that n/2 (37.5) falls in the range 35 - 54, which belongs to the class 17 - 21. This means that the median value
lies between 17 and 21.
Here L = 17, F = 35, f = 19, c = 4 (each class, e.g. 17 - 21, spans 4 marks).
$$\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times c = 17 + \frac{37.5 - 35}{19} \times 4 = 17.53$$
Where L = the lower limit of the median class (the median class is the class that contains the (n/2)th observation of the series).
n = total number of observations.
F = cumulative frequency of the class just preceding the median class.
f = frequency of the median class.
c = the class interval (width) of the median class.
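The grouped-median formula can be sketched in Python as below. The function name `grouped_median` and its parameters are illustrative. The sketch takes the class width c from the class limits in the table (each class, such as 17 to 21, spans 4 marks), which gives a median of about 17.53.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median = L + ((n/2 - F) / f) * c for a grouped frequency distribution."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequencies must contain at least one observation")
    half = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:
            # This is the median class: it contains the (n/2)th observation.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("lower_limits and frequencies must have the same length")

lower_limits = [5, 9, 13, 17, 21, 25]  # lower limit L of each class
frequencies = [12, 8, 15, 19, 14, 7]   # from the table above
result = grouped_median(lower_limits, 4, frequencies)
print(round(result, 2))
```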
Median
Merits:
• Median is easy to understand and easy to calculate.
• It is not affected by extremely high or low values.
Limitation:
• It is not based on all the observations. It is a positional
average and thus is not determined by each and every
observation.
• It is a less reliable average than the mean when the number of
observations is small.
Mode
The mode is the value in a set of data that occurs most
frequently. It is the typical or most commonly observed value,
the one that occurs the maximum number of times.
Example: the mode of the observations 3, 6, 7, 9, 6, 8, 6
is 6.
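Finding the mode amounts to counting how often each value occurs. The sketch below uses `collections.Counter`; the function name `modes` is illustrative, and it returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    if not values:
        raise ValueError("modes requires at least one observation")
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

observations = [3, 6, 7, 9, 6, 8, 6]
result = modes(observations)  # 6 occurs three times, every other value once
print(result)
```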
Mode
Merits:
• Mode is easy to understand and easy to calculate.
• Like median, mode is not at all affected by extremely high or
low values.
• When one value occurs with a high frequency in a distribution, the mode
is meaningful as an average.
Limitation:
• It is not based on all the observations.
• It is a less reliable average than the mean when the number of
observations is small.
Since an average is a single value representing a
group of values, it must be interpreted properly;
otherwise there is a risk of drawing wrong
conclusions.
COMPARISON OF MEAN, MEDIAN & MODE
• The mode is useful for non-numeric data. It provides little information about the rest of the values in the data.
• The mean can be seriously affected by the presence of outliers, but the median is not. (When an observation is very different from all other observations in a data set, i.e. a very small or very large value such as 200 below, it is called an outlier.)
5 7 8 8 12 15 19 21 23          median = 12, mean = 13.1
5 7 8 8 12 15 19 21 23 200      median = 13.5, mean = 31.8
• The median (a positional average) changes only slightly because it depends only on the values of the middle observations. The mean changes markedly, however, because it depends on the value of every observation. So, in the above example, as the last value increases, so too does the mean.
• Outliers can sometimes occur as a result of error or deliberate misinformation. In these cases, the outliers should be excluded from the measure of central tendency. At other times, outliers simply show how different one value is, and this can be a very useful piece of data.
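The effect of an outlier on the mean and the median can be reproduced with the standard library. This sketch assumes the two data sets in the comparison above are 5, 7, 8, 8, 12, 15, 19, 21, 23 and the same values with 200 added, which reproduces the medians and means quoted there.

```python
from statistics import mean, median

base = [5, 7, 8, 8, 12, 15, 19, 21, 23]
with_outlier = base + [200]  # same data plus one extreme value

# The median moves from one middle value to the mean of two middle values.
median_shift = median(with_outlier) - median(base)
# The mean is pulled strongly toward the outlier.
mean_shift = mean(with_outlier) - mean(base)

print(median(base), round(mean(base), 1))
print(median(with_outlier), round(mean(with_outlier), 1))
print(median_shift, round(mean_shift, 1))
```

The median shifts by 1.5 while the mean shifts by more than 18, which is why the median is usually preferred for data containing outliers.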
• Half of the data lie below the median and half above it. This will be approximately true for the mean when the data are symmetric. If the data are skewed, the median may differ considerably from the mean, and the median is usually preferred.
• By choosing the wrong measure of central tendency, one can mislead people with statistics. In fact, this is commonly done.
SUMMARIZING DATA: Measures of variation
A measure of dispersion (variation) measures
the extent to which individual values deviate from the
central value (average). It determines how
representative the central value is. Dispersion is
small if the values are closely bunched about their
mean and large if the values are scattered
widely about their mean.
The median and mean marks for both tests are 20, but data A are more spread
out than data B.
Important measures of dispersion are:
1. Range
2. Variance & standard deviation
3. Standard error of Mean
4. Co-efficient of variation.
Range
Range is the absolute difference between the highest
value and the lowest value in a series of observations.
Range = largest value - smallest value
Example: the weights of 10 students are:
25, 28, 33, 36, 40, 45, 49, 52, 55, 57.
Range is 57 – 25 = 32.
• The range is the simplest measure of dispersion.
• It is a rough measure of dispersion as its measure depends
upon the extreme items and not on all the items.
• It does not tell us anything about the distribution of values in
the series.
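The range calculation for the ten weights above is a one-line computation. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    if not values:
        raise ValueError("range requires at least one observation")
    return max(values) - min(values)

weights = [25, 28, 33, 36, 40, 45, 49, 52, 55, 57]
result = value_range(weights)  # 57 - 25
print(result)
```

Only the two extreme values enter the calculation, which is why the range says nothing about how the other eight weights are distributed.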
Range
Application:
• Range is used in medical science to define the normal limits of biological characteristics.
• Example: the normal ranges of systolic and diastolic blood pressure are 100 – 140 mmHg and 80 – 90 mmHg respectively. Ordinarily, observations falling within a particular range are considered normal and those falling outside it are considered abnormal.
• The range for a biological characteristic such as blood cholesterol, fasting blood sugar, hemoglobin or bilirubin is worked out after measuring the characteristic in a large number of healthy persons of the same age, sex, class etc.
Range
Merits:
• It is simple to compute and understand.
• It gives a rough but quick answer.
Limitation:
• It is not a satisfactory measure, as it is based only on the two
extreme values and ignores the distribution of all other
observations within the extremes. These extreme values vary
from study to study, depending upon the size and nature of the
sample and the type of study.
Variance & Standard deviation
• Karl Pearson introduced the concept of Standard Deviation in 1893.
• The standard deviation is a statistic that tells us how tightly all the
values are clustered around the mean in a set of data.
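The variance is a measure of spread: it is the sum of the squares of the deviations of every observation from
their mean, divided by n − 1 (dividing by n instead gives the mean of the squared deviations). The
standard deviation is the square root of the variance.
$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$\text{S.D.}\ (\sigma) = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
The standard deviation is computed as the square root of the averaged squared deviation of each number from the
mean. For example, for the numbers 1, 2 and 3 the mean is 2 and the squared deviations are 1, 0 and 1.
Dividing their sum by n − 1 = 2 gives a variance of 1, so SD = √1 = 1.
(Dividing by n = 3 instead gives a variance of 0.667 and SD = √0.667 ≈ 0.82.)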
Standard deviation
Merits:
• It is the most important and widely used measure of dispersion.
• It is based on all the observations, and the actual signs of the deviations are used.
• Standard deviation provides the unit of measurement for the normal distribution.
• It is the basis for measuring the coefficient of correlation, sampling and statistical inference.
Limitation:
• It is not easy to understand and is difficult to calculate.
• It is affected by the value of every item in the series.
Calculations of SD:
[Figure: worked calculation of SD for two groups, A and B, each with mean = 20.]
In these two groups the means are the same (20) but their variation (SD) is different (SD of A = 8.2, SD of B = 5.5).
Calculations of SD with alternative formulas:
[Figure: calculation of SD with alternative formulas.]
The greater the SD, the greater the variation of the observations.
The mean is presented with the SD as Mean±SD.
Standard Error of Mean
The standard error of a sample mean is just the sample standard deviation
divided by the square root of the sample size.
$$SE = \frac{SD}{\sqrt{n}}$$
• If we draw a series of samples from the same population and calculate the mean of
the observations in each, we have a series of means. The series of means, like
the series of observations in each sample, has a standard deviation. The SE of
the mean of one sample is an estimate of the SD that would be obtained from
the means of a large number of samples drawn from the population.
• If we draw random samples from the population, their means
will vary from one to another. This variation depends on the variation of the
population and the size of the samples. We do not know the variation of the population, so
we use the variation of the sample as an estimate of it. This is expressed as the SD,
and if we divide the SD by the square root of the number of observations in the sample
we have an estimate of the SE of the mean: SEM = SD/√n.
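A minimal sketch of the SEM calculation follows. It reuses the ten student weights from the Range example purely as illustrative data, and the function name `standard_error` is an assumption of this sketch. `statistics.stdev` computes the sample SD with the n − 1 divisor.

```python
import math
import statistics

def standard_error(values):
    """SEM = sample SD divided by the square root of the sample size."""
    if len(values) < 2:
        raise ValueError("standard error requires at least two observations")
    return statistics.stdev(values) / math.sqrt(len(values))

weights = [25, 28, 33, 36, 40, 45, 49, 52, 55, 57]  # illustrative sample
sd = statistics.stdev(weights)
sem = standard_error(weights)
print(round(sd, 2), round(sem, 2))
```

Because the SD is divided by √n, the SEM of these ten weights is roughly a third of their SD, and it would shrink further with a larger sample.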
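Uses of SE
• To test whether the difference between two means is significant, for example between a sample mean $\bar{x}$ and a population mean $\mu$:
$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
• To calculate the size of a sample, if the SD is known:
$$SE = \frac{\sigma}{\sqrt{n}}, \qquad \sqrt{n} = \frac{\sigma}{SE}$$
The greater the SE, the greater the variation of the sample means.
The mean is presented with the SE as Mean±SE.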
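Both uses of the SE can be sketched in Python. The numbers below (sample mean 125, population mean 120, SD 15, n = 36) are hypothetical and chosen only so that the arithmetic is easy to follow; the function names are illustrative.

```python
import math

def z_statistic(sample_mean, population_mean, sd, n):
    """z = (sample mean - population mean) / (SD / sqrt(n))."""
    return (sample_mean - population_mean) / (sd / math.sqrt(n))

def sample_size(sd, target_se):
    """Solve SE = SD / sqrt(n) for n, rounding up to a whole subject."""
    return math.ceil((sd / target_se) ** 2)

z = z_statistic(125, 120, 15, 36)  # SE = 15 / 6 = 2.5, so z = 5 / 2.5
n_needed = sample_size(15, 2.5)    # (15 / 2.5) squared
print(z, n_needed)
```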
Co-efficient of variation (C.V.)
A relative measure of variation is called the co-efficient of
variation (C.V.).
C.V. is defined as the S.D. divided by the mean, times
100.
$$C.V. = \frac{SD}{\text{Mean}} \times 100$$
It is useful in comparing distributions whose units or characters
may be different, e.g. height in cm in one and in inches in the
other.
Example: heights of adults and children are given in the table.
            Mean      SD       CV
Adult       160 cm    10 cm    6.25%
Children    60 cm     5 cm     8.33%
Although the SD is larger for adults, the C.V. shows that,
relative to the mean, height actually varies more in children.
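The two C.V. values in the example can be reproduced directly from the means and SDs. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """C.V. = SD / mean x 100, expressed as a percentage."""
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return sd / mean * 100

adult_cv = coefficient_of_variation(10, 160)   # SD 10 cm, mean 160 cm
children_cv = coefficient_of_variation(5, 60)  # SD 5 cm, mean 60 cm
print(round(adult_cv, 2), round(children_cv, 2))
```

Because the units cancel in the ratio, the same percentages would result if both heights were recorded in inches.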
Population & Sample
Population
• All possible values of a variable or all possible objects whose
characteristics are of interest in any particular investigation or
enquiry.
• If the income of the citizens of a country is of interest to us, the
aggregate of all relevant incomes will constitute the population.
Sample
• A sample is a part of a population.
• Although we are primarily interested in the properties of a
population or universe, it is often impracticable or even impossible to
study the entire universe.
• Thus inferences about a population are usually drawn on the basis of
a sample, which represents the population.
Normal Distribution
• The normal distribution was first introduced by the French mathematician
Laplace (1749-1827).
• It is highly useful in the field of statistics. The graph of this distribution is
called the normal curve or bell-shaped curve.
• In a normal distribution, observations cluster around the mean.
Half the observations lie above and half below the mean,
and the observations are symmetrically distributed on each side of the mean.
• The normal distribution is symmetrical around a single peak, so that the mean,
median and mode coincide. Because it is such a well-defined and simple shape, a
great deal is known about it. The mean and standard deviation are the only
two values we need to know to be able to describe a normal curve
completely.
Normal Distribution
Characteristics:
• The curve is symmetrical.
• It is a bell-shaped curve.
• It has its maximum value at the center and decreases toward zero symmetrically on each side.
• Mean, median and mode coincide: Mean = Median = Mode.
• It is determined by the mean and the standard deviation.
Mean ± 1 SD limits include about 68% of all observations.
Mean ± 2 SD limits include about 95% of all observations.
Mean ± 3 SD limits include about 99.7% of all observations.
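The percentages for the mean ± 1, 2 and 3 SD limits can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `coverage` is illustrative. The exact figure for ± 3 SD is about 99.7%.

```python
from statistics import NormalDist

def coverage(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {coverage(k) * 100:.1f}%")
```

The result does not depend on the particular mean and SD, because any normal distribution can be rescaled to the standard one.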
Normal Distribution
• Many widely used statistical tests (t-test, ANOVA etc.)
assume normal distributions. These tests work
very well even if the distribution is only
approximately normal.
• Some tests (Mann-Whitney U test, Wilcoxon W
test etc.) work well even with very wide deviations
from normality.
Normal Distribution
Group: One
» Distribution is normal (parametric tests): Mean±SD
» Distribution is not normal (nonparametric tests): Median (Range)
Group: Two
» Parametric tests: t-test (unpaired t-test, paired t-test)
» Nonparametric tests: nonparametric t-test or rank test (the Mann-Whitney U test, the Wilcoxon Matched-Pairs Signed-Ranks Test)
Group: Three or more
» Parametric tests: One Way ANOVA, the Repeated Measures ANOVA
» Nonparametric tests: the Kruskal-Wallis One-way ANOVA by Ranks, the Friedman’s Test
Relationship between two variables
» Parametric tests: the Correlation Coefficient, Simple Linear Regression
» Nonparametric tests: the Spearman Rank Correlation Coefficient, Nonparametric Regression Analysis