Numerical Summaries: Measuring Center of the Data Set

Lecture 2. Descriptive Statistics: Measures of Center
Descriptive Statistics: summarize or describe the important characteristics of a known set of data.
Inferential Statistics: use sample data to make inferences (or generalizations) about a population.
Numerical Summaries: Measuring Center of the Data Set






Example. You want to buy a 3-4 bedroom house. You need information on real estate sales in Reno, say on 200 recently sold 3-4 bedroom houses.
Data: $325,300, $287,650, $589,900, $230,900, …, $455,800.
Q: What is the “average” selling price for a 3-4 bedroom house?
What does AVERAGE mean?
Most common? Most frequent? Mode
Dividing the selling prices in half, i.e. half are lower and half are higher than the “average”? Median
Arithmetic average of all selling prices? Mean
MEAN
POPULATION MEAN: μ
Sample: n = sample size
SAMPLE MEAN:
\[ \bar{x} = \frac{\text{sum of all observations}}{\text{total number of observations}} = \frac{\text{sum of all observations}}{\text{sample size}} = \frac{\sum x}{n}, \]
where ∑x is the sum of all n observations.
Example: Data with 5 observations of quiz scores: 8, 5, 7, 3, 7.
n=5. Mean score = (8+5+7+3+7)/5=6.
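The quiz-score calculation above can be reproduced in a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice, not something from the lecture.

```python
# Quiz scores from the example above
scores = [8, 5, 7, 3, 7]

def sample_mean(values):
    """Sum of all observations divided by the sample size n."""
    return sum(values) / len(values)

mean_score = sample_mean(scores)
print(mean_score)  # 6.0
```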
MEDIAN
SAMPLE MEDIAN is the “middle value” when the data is arranged in an
increasing (or decreasing) order. Equal numbers of observations are larger and
smaller than median.

SORT the data, then
Odd number of observations - the median is the middle observation.

Even number of observations - the median is the average of the two middle values.

Example. Quiz scores: 8, 5, 7, 3, 7. Find median of the quiz scores.
Step 1. Sort the data: 3, 5, 7, 7, 8.
Step 2. Even or odd n? Odd.
Step 3. Median is the middle observation =7.
Let’s add an observation: New quiz data: 8, 5, 7, 6, 3, 7. Find median of the quiz scores.
Step 1. Sort the data: 3, 5, 6, 7, 7, 8.
Step 2. Even or odd n? Even.
Step 3. Median is the average of the two middle observations. (6+7)/2=6.5=median.
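The three steps (sort, check whether n is odd or even, pick or average the middle values) can be sketched in Python as follows. The function name `sample_median` is a hypothetical helper chosen for illustration, and the sketch assumes a non-empty list.

```python
def sample_median(values):
    """Median via the sort / odd-or-even / middle-value steps."""
    ordered = sorted(values)          # Step 1: sort the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Step 2: odd n
        return ordered[mid]           # Step 3: the middle observation
    # Step 3 for even n: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

median_odd = sample_median([8, 5, 7, 3, 7])       # 7
median_even = sample_median([8, 5, 7, 6, 3, 7])   # 6.5
print(median_odd, median_even)
```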
MODE
SAMPLE MODE is the most frequent value in the data set.
Example. Quiz scores: 8, 5, 7, 3, 7. Find mode of the quiz scores.
Answer: Mode is 7 because 7 is most frequent.

Note that the mode may not be unique: two or more values can tie for the highest frequency.
Example: 1, 2, 4, 1, 5, 2, 7. Modes: 1 and 2 (bimodal data).

Note that the mode does not always exist.
Example: 1, 2, 5, 7, -1, 9, 3. No mode: all observations are different.
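A short Python sketch covering all three cases (one mode, two modes, no mode). It follows the lecture's convention that data with no repeated value has no mode; the helper name `sample_modes` is an assumption made for this example, and the input is assumed to be non-empty.

```python
from collections import Counter

def sample_modes(values):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # all observations differ: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(sample_modes([8, 5, 7, 3, 7]))          # [7]
print(sample_modes([1, 2, 4, 1, 5, 2, 7]))    # [1, 2]
print(sample_modes([1, 2, 5, 7, -1, 9, 3]))   # []
```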
Midrange
Midrange is the value midway between the maximum and minimum values in the original data set.
\[ \text{Midrange} = \frac{\text{maximum value} + \text{minimum value}}{2} \]
Example. Quiz scores: 8, 5, 7, 3, 7. Find midrange of the quiz scores.
Solution: Maximum score=8, minimum score=3,
midrange=(8+3)/2=11/2=5.5
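The same midrange calculation as a small Python sketch (the function name `midrange` is illustrative):

```python
def midrange(values):
    """Value midway between the maximum and the minimum."""
    return (max(values) + min(values)) / 2

quiz_midrange = midrange([8, 5, 7, 3, 7])
print(quiz_midrange)  # 5.5
```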
Round-off Rule for Measures of Center
Carry one more decimal place than is present in the original
set of values.


Example. Quiz scores: 8.3, 5, 7, 3, 7. reported with no
decimals. Find mean.
Solution: mean=(8.3 + 5 + 7+ 3 + 7 )/5=6.06 ≈6.1.
Mean from a Frequency Distribution
Assume that in each class all sample values are equal to the class midpoint, and use the class midpoints as the values of the variable x; f = frequency.
Example:
class      freq (f)   midpoint (x)   fx
1 – 10     5          6              30
11 – 20    3          16             48
Mean = (30+48)/(5+3) = 78/8 = 9.75
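The frequency-distribution mean is a weighted average of the class midpoints, with the frequencies as weights. The sketch below takes the midpoints exactly as listed in the example (6 and 16); note that computing them as (lower limit + upper limit)/2 would give 5.5 and 15.5 and a mean of 74/8 = 9.25 instead. The function name `grouped_mean` is a hypothetical helper.

```python
def grouped_mean(midpoints, frequencies):
    """Sum of f*x over all classes, divided by the sum of the frequencies."""
    total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    return total_fx / sum(frequencies)

# Midpoints and frequencies as listed in the example
mean_as_listed = grouped_mean([6, 16], [5, 3])
# Midpoints computed as (lower + upper) / 2 for classes 1-10 and 11-20
mean_exact_midpoints = grouped_mean([5.5, 15.5], [5, 3])

print(mean_as_listed)        # 9.75
print(mean_exact_midpoints)  # 9.25
```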
OUTLIERS
OUTLIERS – observations FAR outside the regular pattern of the data.
Example. Waiting times (in minutes) for a bus, 100 observations.
[Figure: histogram of the waiting times; x-axis: waiting time, min]
Where do outliers come from?
Natural order of things – they point to very important phenomena like floods, heat waves, hurricanes, etc. Should not be discarded but studied.
Errors of measurements or recording – in those cases, people tend to disregard them.
OUTLIERS AND MEASURES OF CENTER
Example. Take quiz scores 3, 5, 7, 7, 8. Suppose I made a recording error and recorded 88 instead of 8. New data: 3, 5, 7, 7, 88.
New median = 7 = old median: NO change.
New mode = 7 = old mode: NO change.
New mean = (88+5+7+3+7)/5 = 110/5 = 22: LARGE change (old mean = 6).
MEDIAN AND MODE are RESISTANT (ROBUST) TO OUTLIERS, i.e. they change little or not at all when an outlier is added to a data set.
MEAN IS SENSITIVE TO OUTLIERS, i.e. it can change substantially when an outlier is added to a data set.
Summary: If you do not want very large or very small observations (outliers) to affect the information you are getting about the “center” of the data, ask for the median rather than the mean.
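The recording-error example can be checked with Python's standard `statistics` module. This is just a quick sketch comparing the three measures before and after 8 is replaced by 88.

```python
import statistics

original = [3, 5, 7, 7, 8]
with_error = [3, 5, 7, 7, 88]   # 8 mistakenly recorded as 88

for label, data in (("original", original), ("with error", with_error)):
    print(label,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))

# Only the mean moves: 6 becomes 22, while median and mode stay at 7.
mean_shift = statistics.mean(with_error) - statistics.mean(original)
```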
Best Measure of Center
Symmetry and Skewness of a Distribution
Symmetric: distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.
Skewed: distribution of data is skewed if it is not symmetric and if it extends more to one side than the other.
[Figure: one symmetric histogram and two skewed histograms]
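Skewness
Symmetric distribution: MEAN = MEDIAN (approximately)
Distribution skewed to the left: typically MEAN < MEDIAN
Distribution skewed to the right: typically MEAN > MEDIAN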
MEASURES OF SPREAD
(VARIABILITY, VARIATION, DISPERSION)
Variability in Nature, life, and various processes we investigate is fundamental to the theory of Statistics.
[Figure: histograms of two data sets, A and B]
Measures of variability
Range and Standard Deviation
RANGE is the difference between the largest and the smallest observations.
Range = maximum value – minimum value
The more variability or spread there is in the data, the larger the difference between the min and max, and so the larger the range.
Example. For the two data sets summarized in the histograms:
A: min = 6.96, max = 13, range=13 - 6.96 = 6.04
B: min = -22.74, max = 40.93, range= 40.93-(-22.74)=63.67
Sample Standard Deviation
STANDARD DEVIATION – measures average deviation of the data from the mean.
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Sample Standard Deviation (Shortcut Formula)
\[ s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}} \]
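The definition and the shortcut formula give the same value. The sketch below checks this on the quiz scores 8, 5, 7, 3, 7 used earlier in the lecture; the function names are illustrative, and `math.isclose` is used for the comparison because floating-point rounding can make the two results differ in the last digits on other data.

```python
import math

def stdev_definition(values):
    """s = sqrt( sum (x - mean)^2 / (n - 1) )"""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def stdev_shortcut(values):
    """s = sqrt( (n * sum(x^2) - (sum x)^2) / (n * (n - 1)) )"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

scores = [8, 5, 7, 3, 7]
s_def = stdev_definition(scores)
s_short = stdev_shortcut(scores)
print(s_def, s_short)  # 2.0 2.0
```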
Computing standard deviation
Example. Data are the number of km to school for a sample of 18 kids:
2 5 3 4 7 7 8 5 4 3 7 8 9 11 2 3 3 1.
Mean = (sum of obs)/18 = 92/18 ≈ 5.1
Numerator for variance = (2-5.1)^2 + (5-5.1)^2 + … + (1-5.1)^2 ≈ 133.8
Denominator = n-1 = 17
Variance = 133.8/17 ≈ 7.87
Standard deviation = square root of 7.87 ≈ 2.81 km
New data set (the last observation, 1, is replaced by 30): 2 5 3 4 7 7 8 5 4 3 7 8 9 11 2 3 3 30
New variance and standard deviation: variance ≈ 40.57, st. deviation ≈ 6.37 km.
Big change in standard deviation, caused by the presence of an observation far from the rest of the data, an outlier.
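The figures in this example can be reproduced with the standard `statistics` module, whose `variance` and `stdev` functions use the n-1 denominator. A minimal sketch, with variable names chosen for illustration:

```python
import statistics

km = [2, 5, 3, 4, 7, 7, 8, 5, 4, 3, 7, 8, 9, 11, 2, 3, 3, 1]
km_outlier = km[:-1] + [30]     # last observation replaced by 30

var_km = statistics.variance(km)            # sample variance (n - 1)
sd_km = statistics.stdev(km)                # sample standard deviation
var_out = statistics.variance(km_outlier)
sd_out = statistics.stdev(km_outlier)

print(round(var_km, 2), round(sd_km, 2))    # 7.87 2.81
print(round(var_out, 2), round(sd_out, 2))  # 40.57 6.37
```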
Standard Deviation - Properties
• The standard deviation is a measure of average variation of all values from the mean.
• The value of the standard deviation s is usually positive, always nonnegative. Sample variance = 0 only if all deviations from the mean are zero, that is, all observations are the same.
• The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation s are the same as the units of the original data values.
Variance
• The variance of a set of values is a measure of variation equal to the square of the standard deviation.
• Sample variance: square of the sample standard deviation, s^2
• Population variance: square of the population standard deviation, σ^2
Round-off Rule for Measures of Variation
Carry one more decimal place than is present in the original
set of data.
If possible, round only the final answer, not values in the
middle of a calculation.
Empirical (68-95-99.7) Rule
For data sets having a distribution that is approximately bell shaped, the following properties apply:
• About 68% of all values fall within 1 standard deviation of the mean.
• About 95% of all values fall within 2 standard deviations of the mean.
• About 99.7% of all values fall within 3 standard deviations of the mean.
The Empirical Rule
Empirical Rule at work
Example: Exam scores follow an approximately bell shaped distribution with mean 70 and standard deviation 5. What is the approximate percentage of students scoring:
a. 2 st. dev. or higher above the mean?
b. At most 3 st. dev. below the mean, i.e. a score of 55 or less?
c. Between 2 st. dev. below and 1 st. dev. above the mean?
d. More than 4 st. dev. above the mean?
Solution.
a. 2.4% + 0.1% = 2.5%
b. 0.1%
c. 13.5% + 68% = 81.5%
d. ≈ 0%
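The answers to parts a to c follow from splitting the 68-95-99.7 percentages symmetrically around the mean. The sketch below does that arithmetic; the variable names are illustrative. The exact tail beyond 3 standard deviations is 0.15%, which the solution quotes as 0.1%.

```python
# Percentages from the Empirical Rule
WITHIN_1SD = 68.0
WITHIN_2SD = 95.0
WITHIN_3SD = 99.7

mean, sd = 70, 5

# a. 2 st. dev. or more above the mean (score >= 80): half of what lies outside 2 sd
above_2sd = (100 - WITHIN_2SD) / 2
# b. 3 st. dev. or more below the mean (score <= 55): half of what lies outside 3 sd
below_3sd = (100 - WITHIN_3SD) / 2
# c. between 2 sd below and 1 sd above (60 to 75): half of the 2-sd band plus half of the 1-sd band
between = WITHIN_2SD / 2 + WITHIN_1SD / 2

print(mean + 2 * sd, above_2sd)            # 80 2.5
print(mean - 3 * sd, round(below_3sd, 2))  # 55 0.15
print(between)                             # 81.5
```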
Definition
The coefficient of variation (or CV) for a set of sample or population data, expressed as a percent, describes the standard deviation relative to the mean.
Sample:
\[ CV = \frac{s}{\bar{x}} \cdot 100\% \]
Population:
\[ CV = \frac{\sigma}{\mu} \cdot 100\% \]
Example: For the distance to school data (no outlier), CV = (2.81/5.1) · 100% ≈ 55%.
With outlier: CV = (6.37/6.72) · 100% ≈ 95%.
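Following the definition, the CV is the standard deviation divided by the mean, times 100%. The sketch below computes it for the distance-to-school data from unrounded values, so the last digit can differ slightly from a hand calculation with rounded inputs. The function name `coefficient_of_variation` is a hypothetical helper.

```python
import statistics

def coefficient_of_variation(values):
    """Sample CV in percent: (s / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

km = [2, 5, 3, 4, 7, 7, 8, 5, 4, 3, 7, 8, 9, 11, 2, 3, 3, 1]
km_outlier = km[:-1] + [30]     # last observation replaced by 30

cv_km = coefficient_of_variation(km)
cv_outlier = coefficient_of_variation(km_outlier)
print(round(cv_km), round(cv_outlier))  # 55 95
```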