CHAPTER 4
NUMERICAL METHODS
FOR DESCRIBING DATA
What trends can be
determined from individual
data sets?
4.1 Describing the Center of a Data Set
• What is the center of a data set and how can it be found?
Center and Spread
• Two of the most critical descriptors of a data set
• Graphical methods such as those in the last chapter give a
general impression of both
• Numerical methods give precise values that can be compared in
detail
The three M’s
• Mean: also known as the average
• Median: also called the middle
• Mode: the most frequent value
Mean
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
formula for the sample mean
• x = each piece of data
• n = number of pieces of data in the data set
• x_i = the i-th piece of data; i indicates the position of the value within the
original data set
Always use more accuracy (more decimals) than any one piece of data
has.
µ is used for the population mean
Greek letters are always used for population values
Median
• The middle value in a list of ordered values
– Median has no symbol but is often abbreviated
» Med
– Middle number if n is odd
– Mean of the two middle numbers when n is even
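A minimal Python sketch of the mean and median definitions above. The data values are borrowed from the worked example in section 4.2, and the variable names are illustrative only.

```python
from statistics import mean, median

# Ten values borrowed from the section 4.2 worked example.
data = [20, 15, 12, 18, 17, 15, 17, 16, 18, 25]
n = len(data)

# Sample mean: add every value and divide by how many there are.
x_bar = sum(data) / n

# Median: order the data, then take the middle value (n odd)
# or the mean of the two middle values (n even).
ordered = sorted(data)
middle = n // 2
if n % 2 == 1:
    med = ordered[middle]
else:
    med = (ordered[middle - 1] + ordered[middle]) / 2

print(x_bar, med)                # hand-rolled results
print(mean(data), median(data))  # standard-library cross-check
```

The standard-library functions are included only as a cross-check on the hand-rolled versions.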
Compare and Contrast
of the Mean and Median
• Median divides the data into two equal parts
• 50% of the data is on either side of the median
• Mean is where the fulcrum would cause the “data scale” to
balance if the values had weight
• It is very sensitive to outliers
Balancing the “data scale”
Normal/Bell curve
mean
median
Skewed Left
Skewed Right
Dichotomy
$\hat{p} = \dfrac{\text{number of successes}}{n}$
Trimmed Mean
• Makes the mean less susceptible to outliers
• Order the data
• Remove the same number of pieces of data from each end
• Recalculate the mean
trim % × n = number of pieces to be removed from EACH end
A small to moderate trim is 5% to 25%
Trimmed Mean
• Example:
Find the 15% Trimmed mean of:
3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12
Order the numbers:
2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16, 17, 20, 36
20 items × 0.15 = 3 items to remove from each end
4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16
Trimmed mean = 136/14 ≈ 9.71
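The trimming steps in this example can be sketched in Python as follows. The function name `trimmed_mean` is illustrative, and rounding the trim count to a whole number is an assumption for cases where % × n is not an integer.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after dropping the same number of values from each end."""
    ordered = sorted(values)
    # Pieces to remove from EACH end; rounding to a whole number is an assumption.
    k = round(trim_fraction * len(ordered))
    if 2 * k >= len(ordered):
        raise ValueError("trim removes every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12]
result = trimmed_mean(data, 0.15)
print(round(result, 2))
```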
Weighted Mean
similar to an arithmetic mean (the most common
type of average), where instead of each of the
data points contributing equally to the final
average, some data points contribute more than
others.
Weighted Mean
Class         # of students   Class average
1st period    20              75
2nd period    35              79
weighted ave. = (20 × 75 + 35 × 79) / 55 ≈ 77.55
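As a rough sketch, the weighted average of the two class periods can be computed like this in Python. The helper name `weighted_mean` is illustrative; the class sizes act as the weights.

```python
def weighted_mean(values, weights):
    """Each value counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

class_averages = [75, 79]   # 1st period, 2nd period
class_sizes = [20, 35]      # number of students in each period

overall = weighted_mean(class_averages, class_sizes)
print(round(overall, 2))
```

With equal weights the function reduces to the ordinary arithmetic mean.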
Summarization
1. If Q1 = 20 and Q3 = 30, which of the
following must be true?
I. The median is 25
II. The mean is between 20 and 30
III. The standard deviation is at most 10
a) I only
b) II only
c) III only
d) All are true
e) None are true
2. Suppose the average score on a
national test is 500 with a standard
deviation of 100. If each score is
increased by 25%, what are the new
mean and standard deviation?
a) 500, 100
b) 500, 125
c) 625, 100
d) 625, 105
e) 625, 125
Answers: E, E
4.1 Homework
• Page 160 to 163
2, 5, 9, 12, 13, 14, 15, 16
4.2 Describing Variability in a Data Set
• What is data variability and how is it used to determine
standard deviation?
Measures of Variability
• Range = high – low
• Deviation from the mean = $x_i - \bar{x}$
• if positive then $x_i$ is larger than the mean
• if negative then $x_i$ is smaller than the mean
• Sample Variance
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Why divide by n − 1
• The data are a sample (not all the possible data
on a subject), and we know that $\sum (x_i - \bar{x}) = 0$.
Once all but one of the deviations $x_i - \bar{x}$ are known,
the remaining one can be found, so only n − 1 of them
are free to vary. This is why we divide by n − 1.
(It also has to do with the degrees of freedom
concept to be discussed later.)
• Sample Standard Deviation
• “average distance” the items vary from the
mean
• $s = \sqrt{s^2}$
– A small s or s2 indicates low variability
– A high s or s2 indicates large variability
• Population Variance (knowing all the data)
$\sigma^2 = \dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
• Population Standard Deviation
$\sigma = \sqrt{\sigma^2}$
compute to the same accuracy as the population
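A short Python sketch contrasting the sample and population formulas. It treats the ten values from the section 4.2 example first as a sample and then, purely for illustration, as if they were a whole population.

```python
import math
from statistics import variance, pvariance

data = [20, 15, 12, 18, 17, 15, 17, 16, 18, 25]
n = len(data)
center = sum(data) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - center) ** 2 for x in data)

# Sample versions divide by n - 1.
sample_var = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_var)

# Population versions divide by n (assumes the data are ALL the values).
pop_var = sum_sq_dev / n
pop_sd = math.sqrt(pop_var)

print(sample_var, sample_sd)
print(pop_var, pop_sd)
# Cross-check against the standard library.
print(variance(data), pvariance(data))
```

The sample variance is always the larger of the two, since the same sum is divided by a smaller number.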
• IQR
Interquartile Range
IQR = upper quartile (Q3) – lower quartile (Q1)
Lower quartile (Q1)—the median of the lower half
Upper quartile(Q3)—the median of the upper half
If n is odd, the exact median is excluded
from both halves when finding the quartiles
Used because it is resistant to outliers
There is no special name for the population IQR
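The quartile convention above (median of each half, with the exact median left out when n is odd) can be sketched in Python. The function names are illustrative, and other software may use a different quartile convention.

```python
def median_of(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the exact median when n is odd.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median_of(lower), median_of(ordered), median_of(upper)

data = [20, 15, 12, 18, 17, 15, 17, 16, 18, 25]
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(q1, med, q3, iqr)
```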
Uses of the IQR
• For roughly normal data, the standard deviation can be approximated by
» SD ≈ IQR/1.35
» If SD > IQR/1.35, it suggests heavier or longer tails than the
normal curve
Easy method of calculating the SD
Sxx means the sum of the squared deviations and can be
found with some manipulation by the following
formula.
$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \dfrac{\left(\sum x\right)^2}{n}$
$s^2 = \dfrac{S_{xx}}{n - 1}$
To avoid rounding errors, use 4 or 5 decimals past the accuracy of the data
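A small Python check that the shortcut formula gives the same Sxx as the definition. The data are the ten values from the example in this section; the variable names are illustrative.

```python
data = [20, 15, 12, 18, 17, 15, 17, 16, 18, 25]
n = len(data)
x_bar = sum(data) / n

# Definition: sum of squared deviations from the mean.
sxx_definition = sum((x - x_bar) ** 2 for x in data)

# Shortcut: sum of squares minus (sum squared) / n.
sum_x = sum(data)
sum_x_sq = sum(x * x for x in data)
sxx_shortcut = sum_x_sq - sum_x ** 2 / n

s_squared = sxx_shortcut / (n - 1)
s = s_squared ** 0.5
print(sxx_definition, sxx_shortcut, s_squared, s)
```

The shortcut needs only the running totals of x and x², which is why it is convenient for hand or calculator work.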
Example
• 20, 15, 12, 18, 17, 15, 17, 16, 18, 25
$\bar{x} = \dfrac{173}{10} = 17.3$
• Reorder
12, 15, 15, 16, 17, 17, 18, 18, 20, 25
Q1 = 15    Median = 17    Q3 = 18
range =
iqr =
continued
• Find the standard deviation
By hand by simplified rule
i        xi     xi − x̄     (xi − x̄)²     x²
1        12
2        15
3        15
4        16
5        17
6        17
7        18
8        18
9        20
10       25
totals
• By iqr
• By calculator
12, 15, 15, 16, 17, 17, 18, 18, 20, 25
Q1 = 15    Median = 17    Q3 = 18
Minitab
Given:
154, 142, 137, 133, 122, 126, 135,
135, 108, 120, 127, 134, 122
The Minitab output would be:
Descriptive Statistics
Variable    N     Mean      Median    TrMean    StDev    SE Mean
Motion      13    130.38    133.00    130.27    11.47    3.18
Variable    Minimum    Maximum    Q1        Q3
Motion      108.00     154.00     122.00    136.00
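Most of the listed output can be reproduced with a few lines of Python. This is only a sketch: it assumes SE Mean is the standard deviation divided by √n, and that TrMean trims 5% of the values from each end, rounded to a whole observation, which is consistent with the figures shown.

```python
import math
from statistics import mean, median, stdev

motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]
n = len(motion)

x_bar = mean(motion)
med = median(motion)
s = stdev(motion)             # sample standard deviation (divides by n - 1)
se_mean = s / math.sqrt(n)    # assumed definition of SE Mean

# Assumed trimming rule: drop 5% from each end, rounded to a whole count.
ordered = sorted(motion)
k = round(0.05 * n)
tr_mean = mean(ordered[k:n - k])

print(n, round(x_bar, 2), med, round(tr_mean, 2), round(s, 2), round(se_mean, 2))
print(min(motion), max(motion))
```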
Summarization
1. If a set of data has a standard deviation
of 0, you can conclude:
a) that there is no relationship between
the observations
b) that the average value is 0
c) that all observations are the same
value
d) that a mistake in arithmetic has been
made
e) none of the above
2. During the Great Depression, the
weekly average hours worked in
manufacturing jobs by eleven
people were 45, 43, 31, 39, 39,
35, 37, 40, 39, 36, 37. What is the
variance?
a) 8.1
b) 2.99
c) 0
d) There is none
e) Not enough information is given
Answers: C, A
4.2 Homework
• Page 169-171
17 (by hand), 20, 22, 23, 24, 26, 27, 28, 31
4.3 Summarizing a Data Set: Boxplots
• How can single variable data be summarized in graphical
format?
Boxplots
• Can be used for many types of summarizations
(Figure: boxplot with 25% of the data in each of its four sections)
• IQR = Q3 – Q1
• Outlier = data more than 1.5 × IQR from the ends of the box
• Extreme = data more than 3 × IQR from the ends of the box
Modified Boxplots
• Whiskers go to the last piece of data that is
not an outlier
• Outlier: closed circle
• Extreme outlier: open circle
example
• During the great depression, the weekly average hours worked in
manufacturing jobs by eleven people were 55, 43, 31, 39, 39, 35,
37, 40, 39, 36, 37. Create a box and whisker plot.
• Five number summary is: 31, 36, 39, 40, 55
IQR = 40 − 36 = 4
Outliers are 4 × 1.5 = 6 units from the ends of the box:
40 + 6 = 46 and 36 − 6 = 30
Extreme outliers are 4 × 3 = 12 units from the ends of the box:
40 + 12 = 52
(Figure: boxplot drawn on an axis from 30 to 55)
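The outlier checks for this example can be sketched in Python. Q1 and Q3 are taken from the five number summary above; the names `mild`, `extreme`, and `whiskers` are illustrative.

```python
hours = [55, 43, 31, 39, 39, 35, 37, 40, 39, 36, 37]
q1, q3 = 36, 40            # from the five number summary
iqr = q3 - q1

# Boundaries 1.5 and 3 IQRs beyond the ends of the box.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

extreme = [x for x in hours if x < outer_low or x > outer_high]
mild = [x for x in hours
        if x not in extreme and (x < inner_low or x > inner_high)]

# In a modified boxplot the whiskers stop at the last non-outlier.
inside = [x for x in hours if inner_low <= x <= inner_high]
whiskers = (min(inside), max(inside))

print(mild, extreme, whiskers)
```

Here 55 lies beyond 52, so it is an extreme outlier, and the upper whisker stops at 43.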
example
• Given the following data:
244 191 160 187 180 176 174 205 211 183 211 180 194 200
• Create a modified box and whisker plot.
Five number summary is: 160, 180, 189, 205, 244
(Figure: modified boxplot drawn on an axis from 160 to 245)
Boxplot
Video #3 18:00 to 20:40
Summarization
1. To which of the previous boxplots is the
histogram below likely to belong?
(Figure: histogram drawn on an axis from 0 to 60)
a) A
b) B
c) C
d) D
e) E
2. Suppose the average score on a
national test is 500 with a standard
deviation of 100. If each score is
increased by 25%, what are the new
mean and standard deviation?
a) 500, 100
b) 500, 125
c) 625, 100
d) 625, 105
e) 625, 125
Answers: C (remember: 25% in each part; the whiskers will not be that long and part is in
the box), E
4.3 Homework
• Page 176-178 32, 33, 36, 37
4.4 Interpreting Center and Variability:
Chebyshev’s Rule, Empirical Rule, z-scores
• What determinations can be made about the
center of the data set?
Chebyshev’s Rule
• One way of determining the percent of data within k standard deviations of
the mean (remember that includes above and below the
mean)
$100\left(1 - \dfrac{1}{k^2}\right)\%$
• Use “at least” terminology
• Tends to underestimate the percentage
• Applicable to any data set
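A brief Python sketch of the rule. The function name is illustrative, and the guard for k ≤ 1 reflects that the formula gives no useful bound there.

```python
def chebyshev_min_percent(k):
    """At least this percent of any data set lies within k SDs of the mean."""
    if k <= 1:
        raise ValueError("the rule is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (1.5, 2, 3):
    print(k, round(chebyshev_min_percent(k), 1))
```

For k = 2 the rule guarantees at least 75%, compared with the roughly 95% that the empirical rule gives for normal data, which illustrates how conservative the bound is.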
Empirical Rule
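Uses “approximately” for its terminology
(Figure: normal curve with the axis marked from −3σ to 3σ around the mean)
• Approximately 68% of the data is within 1σ of the mean
• Approximately 95% is within 2σ of the mean
• Approximately 99.7% is within 3σ of the mean
• Approximately 13.5% lies between 1σ and 2σ on each side
• Approximately 2.35% lies between 2σ and 3σ on each side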
Since the empirical rule refers to normal data sets,
the percentages can be divided in half for parts
above or below the mean
Z-Scores
• Measures the number of standard deviations a particular piece
of data is from the mean
• Often called the standardization formula
$z = \dfrac{x_i - \bar{x}}{s}$
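A minimal Python sketch of the standardization formula and its inverse. The numbers are borrowed from the dam-height question in this section's summarization; the helper names `z_score` and `from_z` are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

def from_z(z, mean, sd):
    """Recover the original value from a z-score."""
    return mean + z * sd

mean_height, sd_height = 206, 35   # meters

heights = {
    "Hoover": 221,
    "Grand Coulee": 168,
    "Nurek": from_z(2.69, mean_height, sd_height),
    "Charvak": from_z(-1.13, mean_height, sd_height),
}
z_scores = {name: z_score(h, mean_height, sd_height)
            for name, h in heights.items()}

# Ordering by height and ordering by z-score give the same result.
ascending = sorted(heights, key=heights.get)
print(ascending)
```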
Compare and Contrast
percent vs percentile
• Percent
$\dfrac{x}{\text{total possible}} \times 100$
• Percentile
• The percent that fall at or below the given value
$\dfrac{\text{position within the ordered data set}}{\text{total pieces of data}} \times 100$
• Use the position of the value farthest to the left for repeats
example
• Find the percent and percentile for each
– Sue scored 9 out of 10. There were 10 people in
the class. Eight people scored 10, Sue scored 9,
and one scored 0.
• Percent 9/10∙100= 90%
• Percentile 2/10∙100= 20th percentile
– If the Scores were 0, 5, 7, 7, 8, 8, 8, 9, 9, 10
– And Sue scored a 9
• Percent 9/10∙100=90 %
• Percentile 8/10∙100=80th percentile
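Both of Sue's cases can be checked with a short Python sketch. The function names are illustrative, and the percentile follows the slide's convention of using the leftmost position of a repeated value.

```python
def percent(score, total_possible):
    return 100 * score / total_possible

def percentile(value, data):
    """Leftmost position of value in the ordered data, as a percent of n."""
    ordered = sorted(data)
    position = ordered.index(value) + 1   # index() finds the leftmost match
    return 100 * position / len(ordered)

class_one = [10] * 8 + [9, 0]
class_two = [0, 5, 7, 7, 8, 8, 8, 9, 9, 10]

print(percent(9, 10), percentile(9, class_one))
print(percent(9, 10), percentile(9, class_two))
```

The same score gives the same percent in both classes but a very different percentile, because the percentile depends on how everyone else scored.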
example
• Given the following data:
Class       Freq    Rel freq    Cum rel freq
5 to 10     8
10 to 15    5
15 to 20    9
20 to 25    10
• Estimate the median and
the 60th percentile
example
• A sample of a certain type of concrete specimens is selected, and the
compressive strength of each one is determined. The mean and st. dev
are 3000 and 500 respectively. The sample box-plot appears relatively
normal.
• Approximately what percent of the sample falls below 2500?
• Approximately what percent falls between 2500 and 4300?
• Approximately what percent falls above 4500?
• What compressive strength falls at the 85th percentile?
Chart is on pg 810-811
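As an alternative to the printed chart, the concrete-strength questions can be approximated in Python. This sketch assumes the strengths follow a normal curve with mean 3000 and standard deviation 500, which the slide only describes as "relatively normal", so the results are approximations.

```python
from statistics import NormalDist

# Assumed model: normal curve with the sample mean and standard deviation.
strength = NormalDist(mu=3000, sigma=500)

below_2500 = strength.cdf(2500)                       # z = -1
between = strength.cdf(4300) - strength.cdf(2500)     # z from -1 to 2.6
above_4500 = 1 - strength.cdf(4500)                   # z = 3
p85 = strength.inv_cdf(0.85)                          # 85th percentile

print(round(100 * below_2500, 1), "% below 2500")
print(round(100 * between, 1), "% between 2500 and 4300")
print(round(100 * above_4500, 2), "% above 4500")
print(round(p85), "at the 85th percentile")
```

`cdf` plays the role of looking up a z-score in the chart, and `inv_cdf` plays the role of reading the chart backwards from a percentile to a value.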
Summarization
1. The 70 highest dams in the world have
an average height of 206 meters with a
standard deviation of 35 meters. The
Hoover and Grand Coulee dams have
heights of 221 and 168 meters,
respectively. The Russian dams the
Nurek and Charvak have heights with
z-scores of 2.69 and -1.13 respectively.
List the dams in order of ascending
size.
a) Charvak, Grand Coulee, Hoover, Nurek
b) Charvak, Grand Coulee, Nurek, Hoover
c) Grand Coulee, Charvak, Hoover, Nurek
d) Grand Coulee, Charvak, Nurek, Hoover
e) Grand Coulee, Hoover, Charvak, Nurek
2. The following eleven scores were earned
on a test worth 45 points: 45, 43,
31, 39, 39, 35, 37, 40, 39, 36, 37.
What is the difference in the values of
the percentage and percentile for 39?
a) 86.6
b) 54.5
c) 32.1
d) 6
e) None of the above
Answers: A, C
4.4 Homework
• Page 184-186
39, 41, 43, 48, 51
1st interval
0≤x<2
4.5 Interpreting the results of Statistical Analysis
• Read section 4.5 Pages 135 to 137
See next slide for review
Review
• Pages 190 to 195
53, 54, 58, 60, 61, 64, 66, 70, 73