STP226 Brief Class Notes Instructor: Ela Jackiewicz
Chapter 3 Descriptive Measures
Measures of Center (Central Tendency)
These measures tell us where the center of the data lies, that is, where the most typical value of a data set is located.
Mode – the value that occurs most frequently in the data set
To find the mode, obtain the frequency of each value.
1. If the greatest frequency is 1, then there is no mode.
2. If the greatest frequency is 2 or greater, then any value with that greatest frequency is a mode of the data set.
Example: 2, 3, 3, 3, 4, 4, 5
Mode = 3
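The two-step rule for the mode can be sketched in Python. This is only an illustration; the function name `modes` is a hypothetical helper, not something from the course materials.

```python
from collections import Counter

def modes(data):
    """Return all modes using the rule in these notes (empty list if no mode)."""
    if not data:
        return []
    counts = Counter(data)              # frequency of each value
    top = max(counts.values())          # greatest frequency
    if top == 1:                        # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

example_modes = modes([2, 3, 3, 3, 4, 4, 5])
```

For the example data the greatest frequency is 3 (the value 3 occurs three times), so the list contains the single mode 3.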
Median – divides the bottom 50% of the data from the top 50%
Arrange the data in increasing order.
1. If the number of observations is odd, the median is the observation exactly in the middle.
2. If the number of observations is even, the median is the mean of the two middle observations.
For n observations, the position of the median is the $\frac{n+1}{2}$-th position in the ordered distribution.
Ex. Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19
position = (6+1)/2 = 3.5 (the median is between observations #3 and #4),
Median = (10+11)/2 = 10.5 lb
If we add one more observation, 10 lb, the data become: 1 2 10 10 11 13 19
position = (7+1)/2 = 4 (the median is observation #4), Median = 10 lb
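The position rule can be checked with a short Python sketch. The helper name `median_with_position` is an assumption made for this illustration.

```python
def median_with_position(data):
    """Return ((n+1)/2, median) for a list of numbers."""
    ordered = sorted(data)                  # arrange in increasing order
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based position of the median
    mid = n // 2
    if n % 2 == 1:                          # odd n: the middle observation
        return position, ordered[mid]
    # even n: mean of the two middle observations
    return position, (ordered[mid - 1] + ordered[mid]) / 2

six_lambs = median_with_position([1, 2, 10, 11, 13, 19])
seven_lambs = median_with_position([1, 2, 10, 10, 11, 13, 19])
```

With six lambs the position is 3.5 and the median is 10.5 lb; with the extra 10 lb observation the position is 4 and the median is 10 lb.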
The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.
Mean (arithmetic mean) – sum of the observations divided by the number of observations:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where the $x_i$ are the observations in the sample.
In our example (the six lamb weights) $\bar{x} = 56/6 \approx 9.33$ lb.
The differences between each data point and the mean, $(x_i - \bar{x})$, are called deviations from the mean, and their sum is zero for any data set:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
In our example the sum of all deviations = (−8.33) + (−7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding of the individual deviations).
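A quick numerical check of this property, using the six lamb weights from the median example (a sketch; the variable names are chosen here for illustration):

```python
lambs = [1, 2, 10, 11, 13, 19]                 # weight gains in lb

mean = sum(lambs) / len(lambs)                 # 56 / 6
deviations = [x - mean for x in lambs]         # x_i minus the mean
total_deviation = sum(deviations)              # zero, up to floating-point error
```

Floating-point arithmetic may leave a tiny nonzero remainder, so the total should be compared with zero using a tolerance rather than `==`.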
The mean can be visualized as the balance point of a weightless seesaw with the data points (like children) sitting on it.
Unlike the median, the mean is not robust: it is influenced by any change in the data, and very strongly by extreme values. If the data have some extreme values, then the median is a better measure of center for those data.
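The difference in robustness can be seen with a small experiment. The altered data set below is hypothetical: it assumes the largest lamb weight, 19, was mistyped as 190.

```python
import statistics

lambs = [1, 2, 10, 11, 13, 19]        # weight gains from the notes (lb)
altered = [1, 2, 10, 11, 13, 190]     # hypothetical: one extreme value

mean_before = statistics.mean(lambs)
mean_after = statistics.mean(altered)
median_before = statistics.median(lambs)
median_after = statistics.median(altered)
```

The mean moves from about 9.33 to about 37.83, while the median stays at 10.5 in both cases.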
Mean vs Median
right-skewed distribution: Mean > Median
left-skewed distribution: Mean < Median
symmetric distribution: Mean = Median
Measures of dispersion (variability)
Range=Maximum-Minimum, gives overall spread of the data, easy to calculate, but very sensitive to
extreme data values.
Sample Standard Deviation
DEFINITION: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
s is based on an average of the squared deviations from the mean (dividing by n−1 rather than n). The square root is taken at the end, so the units of s are the same as the units of the data.
Properties: s≥0 , s=0 if all data points are the same
s has the same units as your data
larger s indicates more variability
$s^2$ is the sample variance.
We will abbreviate standard deviation as SD; s will be used in the formulas.
Ex. Experiment on chrysanthemums: a botanist measured stem elongation in 7 days (in mm).
Data: 76, 72, 65, 70, 82
n = 5, $\bar{x}$ = 365/5 = 73

$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
76        3                   9
72        -1                  1
65        -8                  64
70        -3                  9
82        9                   81
total     0                   164

$s = \sqrt{\frac{164}{4}} = 6.40$ mm
variance $s^2 = 41$ mm$^2$
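The table above can be reproduced step by step in Python. This is a sketch of the definitional formula; the variable names are chosen for this illustration.

```python
import math

stems = [76, 72, 65, 70, 82]                    # stem elongation in mm
n = len(stems)

mean = sum(stems) / n                           # 365 / 5
deviations = [x - mean for x in stems]          # middle column of the table
squared = [d ** 2 for d in deviations]          # right column of the table

variance = sum(squared) / (n - 1)               # divide by n - 1, not n
s = math.sqrt(variance)                         # back to the units of the data (mm)
```

The squared deviations total 164, the variance is 41 mm², and s rounds to 6.40 mm, matching the hand computation.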
s gives the typical distance of the observations from the mean; larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).
n−1 is the number of degrees of freedom of s. As an intuitive justification for using n−1 rather than n, consider n = 1: a single data point gives no information about variability, so the variability of one observation cannot be computed.
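Sample standard deviation: computational formula

$x_i$     $x_i^2$
76        5776
72        5184
65        4225
70        4900
82        6724
total     365       26809   (sum of the $x_i$ is 365, sum of the $x_i^2$ is 26809)

COMPUTATIONAL FORMULA: $s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$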
$s = \sqrt{\frac{26809 - \frac{(365)^2}{5}}{4}} = \sqrt{\frac{26809 - 26645}{4}} = \sqrt{\frac{164}{4}} = 6.40$ mm
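The computational formula needs only the sum of the observations and the sum of their squares. A minimal sketch for the chrysanthemum data (variable names are illustrative):

```python
import math

stems = [76, 72, 65, 70, 82]                    # stem elongation in mm
n = len(stems)

sum_x = sum(stems)                              # 365
sum_x2 = sum(x * x for x in stems)              # 26809

numerator = sum_x2 - sum_x ** 2 / n             # 26809 - 26645
s_computational = math.sqrt(numerator / (n - 1))
```

The numerator equals 164, the same total of squared deviations found with the definitional formula, so both formulas give s of about 6.40 mm.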
The more variation there is in a data set, the larger its standard deviation.
Like the mean, the standard deviation is not robust: it is influenced by any change in the data, and very strongly by extreme values.
Three Standard Deviations Rule:
Almost all of the observations in any data set lie within three (3) standard deviations to either side of
the mean.
More Precise Rules for any data set: (optional)
Chebychev’s rule: at least about 89% of the observations in any data set lie within three standard deviations to either side of the mean.
Chebychev’s rule (more precisely): For any data set and any number k > 1, at least $100(1 - 1/k^2)$% of the observations lie within k standard deviations to either side of the mean.
If the distribution is approximately bell-shaped, the Empirical Rule implies that about 99.7% of the observations lie within three standard deviations to either side of the mean. We will talk about “bell-shaped” distributions later.
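Chebychev's bound is easy to tabulate and to check against data. The sketch below uses the radish growth data that appears later in these notes; the function names are hypothetical helpers.

```python
import statistics

def chebyshev_bound(k):
    """Minimum percentage of observations within k SDs of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebychev's rule requires k > 1")
    return 100 * (1 - 1 / k ** 2)

def fraction_within(data, k):
    """Fraction of observations within k sample SDs of the sample mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - mean) <= k * s]
    return len(inside) / len(data)

radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]
bound_k2 = chebyshev_bound(2)            # 75.0
bound_k3 = chebyshev_bound(3)            # about 88.9
observed_k2 = fraction_within(radishes, 2)
```

For k = 2 the rule guarantees at least 75%; in the radish data 13 of the 14 observations (about 93%) fall within two standard deviations of the mean, so the bound holds with room to spare.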
Typical Percentages: The Empirical Rule
For a “nice” distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:
about 68% of all data points within the interval ($\bar{x}$ − SD, $\bar{x}$ + SD)
about 95% of all data points within the interval ($\bar{x}$ − 2SD, $\bar{x}$ + 2SD)
more than 99% of all data points within the interval ($\bar{x}$ − 3SD, $\bar{x}$ + 3SD)
Effect of Transformation of Variables
Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or transform very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.
A linear transformation is the simplest one: let X be the original variable with mean $\bar{x}$ and SD s; then $X' = aX + b$ is its linear transformation, and the mean and SD of $X'$ are $\bar{x}'$ and $s'$ respectively. This type of transformation does not change the essential shape of the distribution of X: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.
How does a linear transformation affect the mean and SD? Adding a positive or negative constant b to X changes only the mean, not s; multiplying X by a positive or negative constant a changes both the mean and the SD:
$\bar{x}' = a\bar{x} + b$ and $s' = |a| s$
Ex. Suppose X = summer temperature in some American city in 2013 in °F, $\bar{x}$ = 79.6 °F and s = 12.7 °F.
If we would like to change X to °C, the transformation is as follows:
$X' = (X - 32) \cdot \frac{5}{9} = \frac{5}{9}X - \frac{5}{9} \cdot 32$, so the new mean is $\bar{x}' = \frac{5}{9} \cdot 79.6 - \frac{5}{9} \cdot 32 = 26.44$ °C and
$s' = \frac{5}{9} \cdot 12.7 = 7.06$ °C
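The rules for the transformed mean and SD can be verified numerically. In the sketch below the summary values come from the notes, while the list `temps_f` is hypothetical data invented only to check the rules on raw observations.

```python
import math
import statistics

a, b = 5 / 9, -32 * 5 / 9          # X' = aX + b converts °F to °C

# Applying the rules to the summary values given in the notes
mean_c = a * 79.6 + b              # new mean in °C
sd_c = abs(a) * 12.7               # new SD in °C (the shift b does not matter)

# Hypothetical temperatures in °F, used only to check the rules on raw data
temps_f = [61.0, 70.5, 79.0, 88.5, 99.0]
temps_c = [a * t + b for t in temps_f]

mean_rule_holds = math.isclose(
    statistics.mean(temps_c), a * statistics.mean(temps_f) + b)
sd_rule_holds = math.isclose(
    statistics.stdev(temps_c), abs(a) * statistics.stdev(temps_f))
```

The summary values round to 26.44 °C and 7.06 °C, and transforming each observation first gives the same mean and SD as transforming the summaries.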
Nonlinear transformations, like the following examples:
$X' = \sqrt{X}$, $X' = \log X$, $X' = \frac{1}{X}$, $X' = X^2$,
can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right skewed, for example, and we wish to make it more symmetric, we can apply a square root transformation to pull in the right-hand tail and push out the left-hand tail. A logarithmic transformation delivers an even more drastic change in that regard (check out the histograms given at the end of this section).
The five-number summary; Boxplots
Median, Percentiles, Deciles, Quartiles, Interquartile Range are all resistant measures.
Percentiles – divide the distribution into 100 equal parts (P1, P2, …,P99)
P1 divides the bottom 1% of the data from the top 99%
P2 divides the bottom 2% of the data from the top 98%
Etc,
Median is the 50th percentile
Deciles – divide the distribution into 10 equal parts (D1, D2, …, D9)
D1 divides the bottom 10% of the data from the top 90%
D2 divides the bottom 20% of the data from the top 80%
Etc,
Median is D5
Quartiles – divide the distribution into 4 equal parts (Q1, Q2, Q3)
Q1 divides the bottom 25% of the data from the top 75%
Q2 divides the bottom 50% of the data from the top 50%
Q3 divides the bottom 75% of the data from the top 25%
Median is Q2
To find the Quartiles
Arrange the data in increasing order.
1. Q1 is the median of the part of the data set that lies at or below the median of the entire data set.
2. Q2 is the median of the entire data set.
3. Q3 is the median of the part of the data set that lies at or above the median of the entire data set.
Examples:
1. n=7 (odd) Data: 3, 4, 5, 6, 12, 13, 14
Q1 =(4+5)/2=4.5
Q2 = 6
Q3= (12+13)/2=12.5
When n is odd, your calculator and your book use slightly different ways to calculate quartiles:
Your book: the median is included in both the lower and the upper part of the data when computing quartiles.
Your calculator: the median is excluded from both parts, so you will get somewhat different values:
Q1 = 4
Q2 = 6
Q3= 13
2. n=10 (even) Data: 1, 3, 4, 5, 6, 12, 13, 14, 15, 18
Q1 = 4
Q2 = (6+12)/2=9
Q3 = 14 (the median of the upper half 12, 13, 14, 15, 18)
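Both quartile conventions can be written as one small function. This is a sketch: the name `quartiles` and the flag `include_median` are hypothetical, with `include_median=True` standing for the book's method and `False` for the calculator's.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data, include_median=True):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        # book method: the middle observation belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # even n, or calculator method: the halves exclude the middle observation
        lower, upper = ordered[:half], ordered[n - half:]
    return _median(lower), _median(ordered), _median(upper)

book = quartiles([3, 4, 5, 6, 12, 13, 14])                      # (4.5, 6, 12.5)
calculator = quartiles([3, 4, 5, 6, 12, 13, 14], include_median=False)  # (4, 6, 13)
```

When n is even the two conventions coincide, because no single observation sits at the median position.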
Interquartile Range (IQR) – difference between the first and third quartiles.
IQR = Q3 – Q1
IQR gives the range of the middle 50% of the observations (approximately)
The five-number summary of a data set consists of the minimum, maximum, and the quartiles in
increasing order. Min., Q1, Q2, Q3, Max.
Outliers – observations well outside of the overall pattern of the data
LL=Lower limit = Q1 – 1.5 (IQR)
UL=Upper limit = Q3 + 1.5 (IQR)
Potential outliers are observations outside of the Lower and Upper Limits.
Boxplot (box-and-whisker diagram) and the modified boxplot
To construct a boxplot
1. Determine the 5 number summary (Min, Q1, Q2, Q3, Max.)
2. Draw a horizontal axis on which the numbers obtained in step 1 can be located. Above this axis, mark the quartiles and the minimum and maximum with vertical lines.
3. Connect the quartiles to each other to make a box, and then connect the box to the minimum and maximum with lines.
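The following is the boxplot for Example 1 above (n=7, quartiles computed by the book's method):
[Boxplot figure: vertical marks at Min = 3, Q1 = 4.5, Q2 = 6, Q3 = 12.5, Max = 14]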
To construct a modified boxplot
1. Determine the quartiles.
2. Determine potential outliers and the adjacent values.
3. Draw a horizontal axis on which the numbers obtained in steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
4. Connect the quartiles to each other to make a box, and then connect the box to the adjacent values (the most extreme observations that still lie within the lower and upper limits).
5. Plot each potential outlier with an asterisk.
The two lines stretching out on both sides are the whiskers.
Example: Data represent systolic blood pressure (in mmHg) of 7 adult males:
151 124 132 170 146 124 113
We order the data first: 113 124 124 132 146 151 170
Min=113, Max=170, Median=132, Q1=124, Q3=151 (the median is excluded when we compute quartiles)
The boxplot connects all 5 numbers in the following way; the box represents the middle half of the data.
[Boxplot figure: horizontal axis from 110 to 170 mmHg]
Are there any outliers?
In our example: IQR = 151 − 124 = 27, 1.5(IQR) = 1.5 × 27 = 40.5
lower limit = 124 − 40.5 = 83.5, upper limit = 151 + 40.5 = 191.5. All observations are within the limits, so there are no outliers in our data set.
Example: Radish growth (in mm) in the light.
4 5 5 7 7 8 9 10 10 10 10 14 20 21
Min=4, Max=21, Q1=7, Median=(9+10)/2=9.5, Q3=10
IQR=3, lower limit=2.5, upper limit=14.5, so 20 and 21 are outliers. The modified boxplot exposes outliers.
[Modified boxplot figure: horizontal axis from 5 to 25 mm; the two outliers are plotted as asterisks]
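The outlier limits and adjacent values for both examples can be computed as follows. The quartiles are taken as given from the notes; the function name `outlier_report` is a hypothetical helper.

```python
def outlier_report(data, q1, q3):
    """Return (lower limit, upper limit, potential outliers, adjacent values)."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = sorted(x for x in data if x < lower or x > upper)
    inside = [x for x in data if lower <= x <= upper]
    adjacent = (min(inside), max(inside))   # ends of the whiskers
    return lower, upper, outliers, adjacent

pressure = [151, 124, 132, 170, 146, 124, 113]
radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]

pressure_report = outlier_report(pressure, q1=124, q3=151)
radish_report = outlier_report(radishes, q1=7, q3=10)
```

For the blood pressure data the limits are 83.5 and 191.5 with no outliers; for the radishes the limits are 2.5 and 14.5, the outliers are 20 and 21, and the whiskers end at the adjacent values 4 and 14.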
Descriptive Measures for Populations; Use of Samples
Statistical Inference is the process of drawing conclusions about the population based on the
observations in the sample.
Notation:
              Size    Mean         SD
Sample        n       $\bar{x}$    s
Population    N       μ            σ
Parameter – A descriptive measure for a population.
Example: μ, σ
Statistic – A descriptive measure for a sample.
Example: $\bar{x}$, s
The sample mean, $\bar{x}$, is used to estimate a population mean, μ.
The sample SD, s, is used to estimate the population SD, σ.
Population Mean μ (Mean of a Variable) – computed in the same manner as a sample mean
For a variable X, the mean of all possible observations for the entire population is called the population mean or mean of the variable X. It is denoted by $\mu_x$ or, when no confusion will arise, simply by μ. For a finite population, we have
$\mu = \frac{\sum x}{N}$
where N is the population size.
Population Standard Deviation σ (Standard Deviation of the Variable)
For a variable x, the standard deviation of all possible observations for the entire population is called the population standard deviation or standard deviation of the variable x. It is denoted by $\sigma_x$ or, when no confusion will arise, simply by σ. For a finite population, we have
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{\sum x^2}{N} - \mu^2}$
where N is the population size.
Population Variance: $\sigma^2$
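The two forms of the population formula can be compared numerically. The sketch below assumes, purely for illustration, that the five chrysanthemum measurements make up an entire population; in the notes they are a sample.

```python
import math
import statistics

# Assumption for illustration only: treat these five values as a whole population
population = [76, 72, 65, 70, 82]
N = len(population)

mu = sum(population) / N
sigma_definition = math.sqrt(sum((x - mu) ** 2 for x in population) / N)
sigma_shortcut = math.sqrt(sum(x * x for x in population) / N - mu ** 2)

s_sample = statistics.stdev(population)     # divides by N - 1 instead of N
```

Both forms give σ of about 5.73, which is smaller than the sample value s of about 6.40 because the population formula divides by N rather than N − 1.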
Standardized Variable – For a variable X, the variable $z = \frac{x - \mu}{\sigma}$ is called the standard score or z-score.
z is also called the standardized version of x or the standardized variable corresponding to the variable x.
The value of the z-score tells us how many standard deviations above or below the mean a particular value of x is.
Properties of z-scores:
z<0 if x is below the mean,
z>0 if x is above the mean
z=0 if x is equal to the mean.
z-scores have mean = 0 and SD = 1: $\mu_z = \frac{\sum z}{N} = 0$, $\sigma_z = \sqrt{\frac{\sum z^2}{N}} = 1$
z-scores have no units
Most of the z-scores are between -3 and 3 (3 SD Rule)
Example: Final test scores in all Mat119 classes last semester have mean μ=72 and SD σ=10
a) Jane scored 86 points; find and interpret her z-score.
$z = \frac{x - \mu}{\sigma}$, z = (86−72)/10 = 1.4. Her test grade is 1.4 standard deviations above the average.
b) Jack's z-score was −1.0; what was his test score?
x = μ + zσ, x = 72 − 1.0(10) = 62
c) True or false? Very few final test scores are below 42 points or above 102 points.
True: most of the scores are within 3 SDs of the mean, in the interval (42, 102), so very few are outside of that range.
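The three parts of this example can be reproduced with two small helper functions. The names `z_score` and `from_z` are hypothetical, and the last lines standardize a small hypothetical population to illustrate that z-scores have mean 0 and SD 1.

```python
import statistics

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Recover the original value from a z-score: x = mu + z * sigma."""
    return mu + z * sigma

mu, sigma = 72, 10
jane_z = z_score(86, mu, sigma)                               # part a
jack_score = from_z(-1.0, mu, sigma)                          # part b
three_sd = (from_z(-3, mu, sigma), from_z(3, mu, sigma))      # part c

# Hypothetical population, only to check the mean-0, SD-1 property
values = [76, 72, 65, 70, 82]
m, sd = statistics.mean(values), statistics.pstdev(values)
zs = [z_score(x, m, sd) for x in values]
```

Jane's z-score is 1.4, Jack's score is 62, and the three-SD interval is (42, 102). The standardized values in `zs` have mean 0 and population SD 1, up to floating-point error.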