Mode
• Mode: value which occurs most frequently.
• If all values are different there is no mode and a set of values
may have more than one mode.
• Used for quick estimation and for identifying the most common
observation.
• Properties:
• Not unique
• Simple
• Not robust, less stable than the median and the mean.
Dr. Alkilany 2012
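The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `modes` and the three small data sets are hypothetical, and the convention of reporting "no mode" when every value occurs once follows the slide.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []                      # no data at all
    top = max(counts.values())
    if top == 1:
        return []                      # all values differ: no mode
    # More than one value can share the highest count (mode is not unique)
    return [value for value, count in counts.items() if count == top]

print(modes([2, 3, 3, 5, 7]))       # a single mode
print(modes([1, 1, 4, 6, 6, 9]))    # two modes
print(modes([1, 2, 3, 4]))          # no mode
```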
Mode
• The condition of sixty patients
with arthritis is recorded using
a global assessment variable.
• A positive score indicates an
improvement and a negative
one a deterioration in the
patient’s condition after
treatment.
• The mean is 0.77. Do you think the mean is the best
descriptive parameter for these data?
Mode
• A histogram of the above data shows that there are two distinct sub-populations.
• Slightly under half the patients have improved quality of life, but for the remainder, their lives
are actually made considerably worse.
Mode
• Neither the mean nor the median remotely describes the situation.
• The mean is particularly unhelpful, as it indicates a value that is highly atypical:
very few patients show changes close to zero.
• We need to describe the fact that, in this case, there are two distinct groups:
the data consist of values clustered around two central points.
Mode
• A data distribution can be ‘unimodal’ (a single cluster) or ‘polymodal’
(several clusters).
• If we want to be more precise, we use terms such as bimodal
or trimodal to describe the exact number of clusters.
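As a rough illustration of why the mean and median can mislead for bimodal data, the sketch below uses a small set of made-up scores (hypothetical values, not the sixty patients from the lecture) that cluster around −3 and +3.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical assessment scores: one cluster of deteriorations, one of improvements
scores = [-3, -3, -2, -3, -4, -3, 3, 4, 3, 3, 2, 3]

counts = Counter(scores)
top = max(counts.values())
peaks = sorted(value for value, count in counts.items() if count == top)

near_zero = [s for s in scores if abs(s) <= 1]

print("mean:", mean(scores))
print("median:", median(scores))
print("most frequent values:", peaks)
print("patients with a score close to zero:", len(near_zero))
```

Both the mean and the median land between the two clusters, at a value that no patient in this toy data set actually shows, while the two most frequent values identify the two groups.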
How are the mean, median, and mode related?
For symmetric distributions:
the mean and median are equal
For skewed distributions with a single mode, the three measures differ:
mean > median > mode (positively skewed distributions)
mean < median < mode (negatively skewed distributions)
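The ordering for a positively skewed distribution can be checked on the 15 tablet disintegration times used in this lecture's quartile example. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Disintegration times (minutes) from the quartile example in this lecture
times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

m = mean(times)      # pulled upward by the long right tail
med = median(times)  # middle (8th) value
mo = mode(times)     # most frequent value

print(f"mean = {m:.2f}, median = {med}, mode = {mo}")
print("mean > median > mode:", m > med > mo)
```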
Biostatistics
Lecture 4
Descriptive Statistics
“Indicators of dispersion”
Indicators of dispersion
• If all observations are the same, there is no variability.
If they are not all the same, then dispersion is present in the data.
• Variation is an inherent characteristic of experimental observations and has
several sources.
• It is always important to estimate how much individual observations tend to
differ from the central tendency.
Dispersion (variability)
• In any experiment, variation will depend on:
• The instrument used for analysis.
• The analyst performing the assay.
• The particular sample chosen.
• Unidentified error, commonly known as noise.
Indicators of dispersion
(why do we need them?)
[Figure: distributions of three data sets, A, B, and C]
1. A, B, and C have the same mean.
2. Because the means are the same, can we say the data sets are the
same?
3. What are the differences between
these data sets?
4. How can we describe these
differences?
Indicators of dispersion
• Standard deviation
• Coefficient of variation
• Quartiles
• Variance
• Box and whisker plot
Standard deviation
• Two tabletting machines, Alpha and Bravo, produce erythromycin
tablets with a nominal content of 250 mg.
• 500 tablets are randomly selected from each machine
and their erythromycin content is assayed.
[Figure: histograms of erythromycin content for the Alpha and Bravo machines; mean ≈ 250 mg/tablet for each]
Although tablets from both machines have the same mean, do you
think the two machines still differ? How?
Standard deviation
• The two machines are very similar in terms of average drug
content for the tablets, both producing tablets with a mean
very close to 250 mg. However, the two products clearly
differ.
• With the Alpha machine, there is a considerable proportion
of tablets with a content differing by more than 20 mg from
the nominal dose (i.e. below 230 mg or above 270 mg),
whereas with the Bravo machine, such outliers are much
rarer.
• An ‘indicator of dispersion’ is required in order to convey this
difference in variability and to decide which machine has the better
performance.
Standard deviation
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
• This is the standard deviation (SD) for a sample.
• For a population it is usually denoted σ.
• It has the same unit as the mean.
Standard deviation is a widely used measure of variability (dispersion around the center).
Let us go back to the tabletting machines (raw data).
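Standard deviation
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
[Table: raw tablet contents $X_i$ for each machine, the individual deviations $X_i - \bar{X}$, and the mean $\bar{X}$]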
Standard deviation
• The Alpha machine produces rather variable tablets and so
several of the tablets deviate considerably from the overall
mean.
• These relatively large figures then feed through the rest of the
calculation, producing a high final SD (8.72 mg).
• In contrast, the Bravo machine is more consistent and
individual tablets never have a drug content much above or
below the overall average.
• The figures in the column of individual deviations are small, leading
to a lower SD (3.78 mg).
Standard deviation
• Reporting the SD:
• The ± symbol is used in reporting the SD.
• The ± symbol can reasonably be interpreted as meaning ‘more or
less’.
• ± is used to indicate variability.
• With the tablets from our two machines, we would report
their drug contents as:
• Alpha machine: 248.7 ± 8.72 mg (mean ± SD)
• Bravo machine: 251.1 ± 3.78 mg (mean ± SD)
• These figures summarize the true situation. The
two machines produce tablets with almost identical mean
contents, but those from the Alpha machine are two to
three times more variable.
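A short sketch of the reporting convention, using the means and SDs quoted for the two machines. The helper name `report` and its formatting choices (one decimal for the mean, two for the SD) are assumptions made for illustration.

```python
def report(mean_value, sd, unit="mg"):
    """Format a result in the 'mean ± SD unit' style."""
    return f"{mean_value:.1f} ± {sd:.2f} {unit}"

alpha_mean, alpha_sd = 248.7, 8.72   # values quoted in the lecture
bravo_mean, bravo_sd = 251.1, 3.78

print("Alpha machine:", report(alpha_mean, alpha_sd))
print("Bravo machine:", report(bravo_mean, bravo_sd))

# How many times more variable is Alpha than Bravo?
sd_ratio = alpha_sd / bravo_sd
print(f"SD ratio (Alpha / Bravo): {sd_ratio:.2f}")
```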
Standard deviation and coefficient of
variation
- Elephant tail = 150 ± 10 cm
- Mouse tail = 7 ± 3 cm
With this in mind, which is more variable: the
elephant tail length results or those for the mouse?
CV = (SD / mean) × 100
• Elephant tail: CV = 10/150 × 100 = 6.7%
• Mouse tail: CV = 3/7 × 100 = 42.9%
• The coefficient of variation (CV) expresses variation
relative to the magnitude of the data.
• It is useful for comparing variation in two or more sets of
data with different mean values.
• CV has no unit (it is a ratio).
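The CV arithmetic for the two tails can be reproduced as follows. The function name `coefficient_of_variation` is a hypothetical helper; the inputs are the means and SDs given on the slide.

```python
def coefficient_of_variation(sd, mean_value):
    """CV as a percentage: (SD / mean) * 100. Unitless, since the units cancel."""
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean_value * 100

elephant_cv = coefficient_of_variation(sd=10, mean_value=150)  # cm / cm
mouse_cv = coefficient_of_variation(sd=3, mean_value=7)        # cm / cm

print(f"Elephant tail CV: {elephant_cv:.1f}%")
print(f"Mouse tail CV:    {mouse_cv:.1f}%")
```

Although the elephant's SD is larger in absolute terms, the mouse's tail length is far more variable relative to its mean.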
Variance
• The variance, σ² (population) or S² (sample), is a measure of
spread that is related to the deviations of the data values from
their mean.
• Variance = SD²
Sample: $$S^2 = SD^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Population: $$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
• Unit: same as the mean but squared. If the mean is in mg, the variance will
be in mg².
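The difference between the sample and population formulas is only the denominator. The sketch below, using made-up contents in mg (hypothetical values, not lecture data), computes both by hand and checks them against `statistics.variance` (n − 1) and `statistics.pvariance` (N).

```python
import math
from statistics import pvariance, stdev, variance

# Hypothetical tablet contents in mg
contents_mg = [248, 252, 250, 247, 253]

n = len(contents_mg)
mean_value = sum(contents_mg) / n
sum_of_squares = sum((x - mean_value) ** 2 for x in contents_mg)

sample_var = sum_of_squares / (n - 1)   # S^2, treating the data as a sample
population_var = sum_of_squares / n     # sigma^2, treating the data as the whole population

print(f"sample variance     = {sample_var} mg^2")
print(f"population variance = {population_var} mg^2")
print("variance equals SD squared:", math.isclose(sample_var, stdev(contents_mg) ** 2))
```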
Quartiles
[Figure: data divided at Q1, Q2 (median), and Q3]
• Quartiles: the three points that divide the data set into four equal groups,
each representing a fourth of the population being sampled
• The median = Q2
• First quartile (Q1): cuts off the lowest 25% of data; 25th percentile
• Second quartile (Q2): cuts the data set in half; 50th percentile
• Third quartile (Q3): cuts off the highest 25% of data, or the lowest 75%; 75th percentile
Quartiles
Interquartile range: difference between the upper and lower quartiles
IQR= (Q3 – Q1)
Finding Quartiles
• To find the quartiles for a set of data, do the following:
1. Arrange the data from smallest to largest (ordered array)
2. Locate the median (Q2)
3. For the half to the left of the median: locate its median (Q1)
4. For the half to the right of the median: locate its median (Q3)
[Figure: ordered array split at the median (Q2) into the half to the left, with median Q1, and the half to the right, with median Q3]
Finding Quartiles
• Example with odd n
Times needed for 15 tablets to disintegrate, in minutes:
5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60
1. The data are already ordered from smallest to largest
2. The median is the ((n+1)/2)th value = 8th value = 20 minutes
3. For the half to the left: n = 7, median = 4th value = 10 minutes
4. For the half to the right: n = 7, median = 4th value = 30 minutes
5. Q1 = 10 minutes; Q2 = 20 minutes; Q3 = 30 minutes. IQR = Q3 − Q1 = 20 minutes
6. This means that about 25% of tablets need 10 minutes or less to disintegrate, and about 50% need 20 minutes or less.
By 30 minutes, about 75% of all tablets have disintegrated. Only about 25% of the tablets need more than
30 minutes to disintegrate.
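The median-of-halves procedure used in this example can be sketched as a small function. `quartiles_by_halves` is a hypothetical name; the rule of leaving the median out of both halves when n is odd is the one the worked example follows. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of each half of the ordered data."""
    data = sorted(values)                 # step 1: ordered array
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    half = n // 2
    lower = data[:half]                   # half to the left of the median
    # For odd n the median itself belongs to neither half
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
q1, q2, q3 = quartiles_by_halves(times)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1} minutes")
```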
This question can also be presented as a frequency table:
Disintegration time (min)    Frequency
5                            1
10                           4
12                           1
15                           1
20                           2
25                           1
30                           2
40                           2
60                           1
Total                        15
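When the data arrive as a frequency table, the quartile positions can be located with cumulative frequencies instead of writing out every value. In this sketch the table is tallied from the 15 raw times listed in the example; `kth_value` is a hypothetical helper, and the position arithmetic assumes n is odd with an odd number of values in each half, as is the case here.

```python
from collections import Counter

times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
table = sorted(Counter(times).items())      # [(time, frequency), ...] in ascending order

def kth_value(freq_table, k):
    """Return the k-th smallest observation (1-based) from a sorted frequency table."""
    running = 0
    for value, freq in freq_table:
        running += freq                     # cumulative frequency so far
        if k <= running:
            return value
    raise IndexError("k exceeds the total frequency")

n = sum(freq for _, freq in table)          # total number of tablets
median_pos = (n + 1) // 2                   # position of Q2 for odd n
half_size = median_pos - 1                  # observations on each side of the median
q1_pos = (half_size + 1) // 2               # middle of the left half
q3_pos = median_pos + q1_pos                # middle of the right half

q1, q2, q3 = (kth_value(table, pos) for pos in (q1_pos, median_pos, q3_pos))
print(f"positions: {q1_pos}, {median_pos}, {q3_pos}")
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3} minutes")
```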
Finding Quartiles
• Example with even n
Times needed for 20 capsules to disintegrate, in minutes:
5, 10, 10, 15, 15, 15, 15, 20, 20, 20, 25, 30, 30, 40, 40, 45, 60, 60, 65, 85
1. The data are already ordered from smallest to largest
2. The median is the mean of the two middle values, the (n/2)th and ((n/2) + 1)th = 10th and
11th values = (20 + 25)/2 = 22.5 minutes
3. For the half to the left: n = 10, median = mean of 5th and 6th values = 15 minutes
4. For the half to the right: n = 10, median = mean of 5th and 6th values = 42.5 minutes
5. Q1 = 15 minutes; Q2 = 22.5 minutes; Q3 = 42.5 minutes. IQR = ?
Disintegration time (min)    Frequency
5                            1
10                           2
15                           4
20                           3
25                           1
30                           2
40                           2
45                           1
60                           2
65                           1
85                           1
Total                        20
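For an even number of observations, each median is the mean of two middle values. The sketch below walks through the capsule data with explicit indexing rather than a library call; `median_sorted` is a hypothetical helper and the split into halves assumes n is even. Running it also gives the IQR, which the example leaves as an exercise.

```python
def median_sorted(sorted_values):
    """Median of an already ordered list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]                            # odd: the single middle value
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # even: mean of the middle pair

capsules = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
            25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

data = sorted(capsules)
n = len(data)
lower, upper = data[:n // 2], data[n // 2:]   # even n: the two halves share no value

q1 = median_sorted(lower)
q2 = median_sorted(data)
q3 = median_sorted(upper)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"IQR = Q3 - Q1 = {iqr} minutes")
```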
Quartiles
• The elimination half-lives of two synthetic steroids have been
determined using two groups, each containing 15 volunteers.
• The results are shown in the following table, with the values ranked from
lowest to highest for each steroid.
Quartiles and IQR as a measurement of data spread
The IQR for the half-life of steroid 2 is only half that for steroid 1,
duly reflecting its less variable nature.
Just as the median is a robust indicator of central tendency, the
interquartile range is a robust indicator of dispersion. The
interquartile range is a more useful measure of spread than the range,
as it describes the middle 50% of the data values and is thus less
affected by outliers.
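The robustness claim can be illustrated by comparing the range and the IQR before and after introducing a single extreme value. This sketch reuses the 15 tablet disintegration times from the quartile example; the outlier (60 mistyped as 600) is a hypothetical addition, and `iqr_by_halves` is an illustrative helper using the median-of-halves rule.

```python
from statistics import median

def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

def iqr_by_halves(values):
    """IQR = Q3 - Q1, with quartiles taken as the medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(upper) - median(lower)

times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
with_outlier = times[:-1] + [600]   # hypothetical recording error

for label, data in (("original", times), ("with outlier", with_outlier)):
    print(f"{label:>12}: range = {value_range(data)}, IQR = {iqr_by_halves(data)}")
```

The single extreme value inflates the range roughly tenfold, while the IQR is unchanged because the outlier lies outside the middle 50% of the data.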
Box and whisker plot
• A box-and-whisker plot can be useful for handling many
data values.
• It shows only certain statistics rather than all the data.
• A box-and-whisker plot is a visual representation of the
five-number summary.
• The five-number summary consists of the median, the
quartiles, and the smallest and greatest values in the
distribution (not including outliers).
• A box-and-whisker plot shows at a glance the
center, the spread, and the overall range of the distribution.
Box and whisker plot
• The first step in constructing a box-and-whisker plot is to find the
median (Q2), the lower quartile (Q1), and the upper quartile (Q3) of a given set
of data.
• Example: The following are the weights of 10 patients in hospital (kg), as recorded and after ordering:
No.    Weight (kg)    Ordered weight (kg)
1      75             62
2      62             67
3      78             73
4      96             75
5      73             78
6      93             79
7      85             81
8      81             85
9      67             93
10     79             96
• Median (Q2) = 78.5 kg
[Figure: box-and-whisker plot marking the smallest value (S), Q1, the median (Q2), Q3, and the largest value (L)]
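The five numbers needed to draw the plot for the ten weights can be computed as below. This is a sketch under two assumptions: quartiles are taken as the medians of the two halves, as in the quartile examples earlier in the lecture, and no outlier screening is applied, so the whisker ends are simply the smallest and largest values. `five_number_summary` is a hypothetical helper name.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

# Patient weights in kg, in the order recorded
weights_kg = [75, 62, 78, 96, 73, 93, 85, 81, 67, 79]

smallest, q1, q2, q3, largest = five_number_summary(weights_kg)
print(f"S = {smallest}, Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, L = {largest}")
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend to S and L.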