Download chapter2 - Web4students

2.1 – Overview
►►►Read the book, complete the notes, read the chapter problem and all
examples in the chapter.
In this chapter we present a variety of basic tools that will help us
understand collected data. We will describe, explore, and compare data sets.
Two General Divisions of Statistics
Descriptive: to summarize or describe characteristics of a set of data pictorially,
numerically, or by tabulation.
Inferential: when we use sample data to make generalizations and/or
predictions about a population.
Examples of Descriptive Statistics
1) The average SAT score for a certain College is 513.5
2) The final exam grades for my statistics class in the Fall 2003 ranged from
23% to 99%
Examples of Statistical Inference
We might infer from appropriate samples that:
1) Between 20% and 25% of American college students are married.
2) High cholesterol levels are associated with increased risk of heart disease
The same number may be used for either describing a smaller distribution
or making inferences about a larger distribution:
1) Nielsen reports that 24.7% of those who were interviewed watched the
President’s news conference last Sunday night.
2) Probably about 24.7% of all television viewers watched the President’s
news conference last Sunday night.
3) The average age of students enrolled in this class is 19.7 years
4) The average age of students enrolled at this college is probably 19.7 years
Important Characteristics of Data (CVDOT)
1) Center: A representative or “average” value that indicates where the middle
of the data set is located.
2) Variation: A measure of how spread out the data values are.
3) Distribution: The shape of the spread of the data. A distribution could be
bell-shaped, uniform, skewed, etc.
4) Outliers: Sample values that lie very far away from the vast majority of the
other sample values. (Possibly due to errors or unusual circumstances.)
5) Time: Changing characteristics of the data over time.
2.2 – Frequency Distributions
We’ll use the table from the next page to introduce the vocabulary.
• Frequency distribution, classes, frequency
• Advantage of a frequency distribution: makes a list more intelligible
• Disadvantage of a frequency distribution: original data is lost
• Lower and upper Class Limits
• Class Boundaries
$\dfrac{\text{upper limit of one class} + \text{lower limit of the next class}}{2}$
• Class Midpoints
Within a class do:
$\dfrac{\text{lower class limit} + \text{upper class limit}}{2}$
• Class Width
Difference between 2 consecutive lower class limits
• Relative Frequency
It is a percentage or fraction.
$\text{relative frequency} = \dfrac{\text{class frequency}}{\text{sum of frequencies}}$ (usually expressed as %)
• Cumulative Frequency
Sum of frequencies at and below a given class.
►►►Frequency Distribution of the Systolic Blood Pressure of Women
(#2, page 44)
Systolic Blood Pressure of Women    Frequency    Relative Frequency
80-99                               9            ____
100-119                             24           ____
120-139                             5            ____
140-159                             1            ____
160-179                             0            ____
180-199                             1            ____
a) List the lower class limits
b) List the upper class limits
c) List the class boundaries
d) List the class midpoints
e) What is the class width?
f) Construct the relative frequency distribution (complete column on table above)
g) Construct the cumulative frequency distribution
Systolic Blood Pressure of Women    Cumulative Frequency
Less than ____                      ____
Less than ____                      ____
Less than ____                      ____
Less than ____                      ____
Less than ____                      ____
Less than ____                      ____
h) What is the number of women in the sample?
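The parts of this exercise can be checked with a short script. What follows is a minimal Python sketch, not part of the original notes. It assumes integer class limits (so neighboring classes are separated by a gap of 1), and all variable names are illustrative.

```python
# Class limits and frequencies copied from the blood pressure table.
lower_limits = [80, 100, 120, 140, 160, 180]
upper_limits = [99, 119, 139, 159, 179, 199]
frequencies = [9, 24, 5, 1, 0, 1]

# e) Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

# d) Class midpoints: (lower limit + upper limit) / 2 within each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# c) Class boundaries: halfway between one upper limit and the next lower limit.
# Assumption: the two outermost boundaries use the same half-gap of 0.5.
half_gap = (lower_limits[1] - upper_limits[0]) / 2
boundaries = [lower_limits[0] - half_gap] + [hi + half_gap for hi in upper_limits]

# h) Sample size, f) relative frequencies (in %), g) cumulative frequencies.
total = sum(frequencies)
relative = [100 * f / total for f in frequencies]
cumulative = [sum(frequencies[: i + 1]) for i in range(len(frequencies))]

for lo, hi, f, r, c in zip(lower_limits, upper_limits, frequencies, relative, cumulative):
    print(f"{lo}-{hi}: f={f:2d}  rel={r:5.1f}%  cum={c}")
```

The last cumulative frequency equals the sample size, which is a quick consistency check on the whole table.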
Guidelines for Constructing a Frequency Distribution
1. Be sure that classes are mutually exclusive.
2. Include all classes in between, even if the frequency is zero.
3. Use the same width for all classes. Sometimes open-ended intervals
are impossible to avoid for the first and/or last class.
4. Use between 5 and 20 classes. Usually 10 or fewer.
5. The sum of the class frequencies must equal the number of original data
values.
Constructing a Frequency Distribution
►►► Example -Days to Maturity for Short Term Investments
The following table displays the number of days to maturity for 40 short-term
investments. The data are from Barron’s National Business and Financial Weekly.
70  62  75  57  51  64  38  56  53  36
99  67  71  47  63  55  70  51  50  66
64  60  99  55  85  89  69  68  81  79
87  78  95  80  83  65  39  86  98  70
STEPS:
(1) Decide on the number of classes (5 to 20).
In this example we’ll use 7 classes.
(2) Find the class width = $\dfrac{(\text{highest value}) - (\text{lowest value})}{\text{number of classes}}$,
ROUNDED UP to a convenient number.
(3) Select the starting point or lower limit of the first class (lowest score or
convenient value lower than the lowest score).
In this example we’ll use the lowest score as the lower limit of the first
class.
(4) Proceed finding all other lower limits by adding the class width, and display
them vertically
Classes    Tally    Frequencies    Relative Frequencies    Cumulative Frequencies
(5) Find the upper class limits
(6) Tally each score in the appropriate class and find the total frequency for
each class.
►►►Now do each of the following:
a) List the class boundaries
b) List the class midpoints
c) Construct the relative frequency distribution (4th column above)
d) Construct the cumulative frequency distribution (last column above)
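The construction steps above can be sketched in Python. This is an illustration under stated assumptions rather than the notes' own method: the class width of 10 is assumed as the "convenient" rounded-up value because it reproduces the classes 36-45, 46-55, and so on that these notes use for the Days to Maturity data, and the variable names are hypothetical.

```python
days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

# Step 1: number of classes chosen in the notes.
num_classes = 7

# Step 2: raw width (highest - lowest) / number of classes.
raw_width = (max(days) - min(days)) / num_classes
# Assumption: round up to the convenient value 10 so that the last class
# still has room for the largest value.
class_width = 10

# Steps 3-5: lower limits start at the lowest score; upper limits follow.
lower_limits = [min(days) + i * class_width for i in range(num_classes)]
upper_limits = [lo + class_width - 1 for lo in lower_limits]

# Step 6: tally each score into its class.
frequencies = [sum(1 for d in days if lo <= d <= hi)
               for lo, hi in zip(lower_limits, upper_limits)]

relative = [100 * f / len(days) for f in frequencies]
cumulative = [sum(frequencies[: i + 1]) for i in range(num_classes)]

for lo, hi, f, r, c in zip(lower_limits, upper_limits, frequencies, relative, cumulative):
    print(f"{lo}-{hi}: f={f}  rel={r:.1f}%  cum={c}")
```

Note that the raw width here comes out to exactly 9, which would leave no class for the value 99 when starting at 36; that is one reason to move up to a convenient larger width.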
Interpreting Frequency Distributions
Using frequency distributions to describe, explore, and compare data sets
►►►Read the 3 examples on pages 42 and 43, and refer to tables 2-4 to 2-7 on
page 43.
2.3 – Visualizing Data
Histograms
• Horizontal axis: values of the data
Use class boundaries for marks along the horizontal axis.
• Vertical axis: frequencies. (the vertical height of the histogram should be
about three-fourths of the total width)
• The height of a bar represents the frequency of each class.
• Both axes should be clearly labeled.
• Note: We cannot reconstruct the original data set from a histogram and have
sacrificed some accuracy for convenience in displaying the data.
• Remember to Interpret the histogram referring to the characteristics of data
CVDOT from section 2.1
►►►Construct a histogram for the “Days to Maturity” data from section 2.2,
page 4 of notes. Here is the frequency table constructed on page 5 of notes:
Days to maturity    Number of Investments
36-45               3
46-55               7
56-65               8
66-75               9
76-85               6
86-95               4
96-105              3
Relative Frequency Histogram
• Use the relative frequencies along the vertical axis
• Shape should be the same as the regular histogram with the vertical axis
labeled differently.
►►►Construct a relative frequency histogram for the “Days to Maturity” data
Using the TI-83 to Sketch a Histogram
(1) Using Raw Data
►►►We’ll use the raw data “Days to Maturity” from section 2.2, page 4 of notes.
70  62  75  57  51  64  38  56  53  36
99  67  71  47  63  55  70  51  50  66
64  60  99  55  85  89  69  68  81  79
87  78  95  80  83  65  39  86  98  70
• Step 1:
Enter data in calculator into one of the L’s
SINCE WE’LL BE USING THE DATA FOR DIFFERENT
PROBLEMS WE ARE GOING TO CREATE A NEW LIST
Arrow up and to the right of L6, type the name of your list,
Call it DAYS, press ENTER, and type in the data
• Step 2:
Clear or inactivate any functions defined in Y=
• Step 3:
Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1 ENTER to turn plot on
Arrow down and right to histogram, ENTER to select it
Xlist: L1 (or other L used)
To select the created list DAYS, do the following
2nd STAT (to access the LIST menu)
now select DAYS by pressing ENTER next to the name
Frequency: 1 (To change to 1, do ALPHA 1)
• Step 4:
Let the calculator select a window by pressing ZOOM 9
Press TRACE and arrow to the right to read the classes and
frequencies
The classes are probably different from the ones produced in the notes
page 6.
To produce the same classes as your frequency table do the following after step 3:
• Step 5:
Select a window by pressing WINDOW
Xmin = lower limit of the first class
Xmax = lower limit of the next class beyond the data
(xmin + number of classes * class width)
Xscl = class width
Ymin = 0 which is the lowest possible frequency
(or –10 to provide space for writing as we trace)
Ymax = larger than the highest frequency
• Step 6:
Press GRAPH
To display information on the screen press TRACE and
arrow to the right.
• Step 7:
2nd QUIT will return to the Home Screen.
• Step 8:
To turn stat plot off:
2nd Y= [STAT PLOT] , ENTER on plot 1, ENTER on Off
or: Y=, up to Plot1, ENTER to de-select
(2) Using Grouped Data
►►►Use the frequency table for the data “Days to Maturity”
Days to maturity    Class Midpoints    Number of Investments
36-45               ____               3
46-55               ____               7
56-65               ____               8
66-75               ____               9
76-85               ____               6
86-95               ____               4
96-105              ____               3
• Step 1:
Clear Lists 2 and 3 (or any other Lists)
• Step 2:
Enter the class midpoints in L2.
• Step 3:
Enter class frequencies in L3.
• Step 4:
Clear or inactivate any functions defined in Y=
• Step 5:
Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1 ENTER to turn plot on
Arrow down and right to histogram, ENTER to select it
Xlist: L2 (or other L being used)
Frequency: L3 (or other L being used)
• Step 6:
Select a window by pressing WINDOW
Xmin = lower limit of the first class
Xmax = lower limit of the next class beyond the data
(xmin + number of classes * class width)
Xscl = class width
Ymin = -10 to provide space when tracing
Ymax = larger than the highest frequency
• Step 7:
GRAPH
To display information on the screen press TRACE and arrow
to the right. Make sure the info agrees with the frequency table.
• Step 8:
2nd QUIT will return to the Home Screen.
• Step 9:
To turn stat plot off:
2nd Y= [STAT PLOT] , ENTER on plot 1, ENTER on Off
or: Y=, up to Plot1, ENTER to deselect
Frequency Polygon Construction
• Plot the points with coordinates (class midpoint, class frequency)
• Connect points with line segments
• Extend the first and the last segments to the left and right so that the graph
begins and ends on the x-axis
►►►Construct a frequency polygon for the “Days to Maturity” data
Days to Maturity    Midpoints    Frequency
36-45               ____         3
46-55               ____         7
56-65               ____         8
66-75               ____         9
76-85               ____         6
86-95               ____         4
96-105              ____         3
Ogive Construction
• Plot the points with coordinates
(upper class boundaries, cumulative frequency)
• Connect points with line segments.
• The graph begins on the x-axis with the lower boundary of the first class and
ends with the upper boundary of the last class (must start at the 0% and end at
100%).
• Ogives are useful for determining the number of values below some particular
value.
►►►Construct an ogive for the “Days to Maturity” data
Upper Class Boundaries    Cumulative Frequency
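The points to plot for the frequency polygon and the ogive can be generated from the frequency table. The sketch below is illustrative only. It assumes a boundary half-gap of 0.5 (integer class limits), and it closes the polygon on the x-axis by adding the midpoints of one empty class on each side, which is a common convention but an assumption here.

```python
lower_limits = [36, 46, 56, 66, 76, 86, 96]
upper_limits = [45, 55, 65, 75, 85, 95, 105]
frequencies = [3, 7, 8, 9, 6, 4, 3]
class_width = lower_limits[1] - lower_limits[0]

# Frequency polygon: (class midpoint, class frequency).
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
polygon_points = list(zip(midpoints, frequencies))
# Assumed convention: extend to frequency 0 one class width beyond each end.
closed_polygon = ([(midpoints[0] - class_width, 0)]
                  + polygon_points
                  + [(midpoints[-1] + class_width, 0)])

# Ogive: (upper class boundary, cumulative frequency), starting at the
# lower boundary of the first class with cumulative frequency 0.
upper_boundaries = [hi + 0.5 for hi in upper_limits]
cumulative = [sum(frequencies[: i + 1]) for i in range(len(frequencies))]
ogive_points = [(lower_limits[0] - 0.5, 0)] + list(zip(upper_boundaries, cumulative))

print("Polygon:", closed_polygon)
print("Ogive:  ", ogive_points)
```

Reading the ogive at an upper boundary gives the number of values below it; for example, the point at 65.5 carries the count of investments maturing in 65 days or fewer.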
Dot Plots (see page 48, figure 2-5)
• Each data value is plotted as a point (or dot) along a scale of values.
• Numbers appear individually, not in categories as in a histogram.
• Stack the values vertically when values occur more than once.
• Similar to histograms because we can see the distribution of the data
• We do not lose the particular values
►►►Construct a dot plot for the “Days to Maturity” data
Stem and Leaf Plots (see page 49)
• Similar to histograms because we can see the distribution of the data
• We do not lose the particular data values
• STEM (consists of the leftmost digit(s))
• LEAF (consists of the rightmost digit)
• Viewed sideways, the plot looks like a histogram
• The number of stems should be kept between 5 and 20
• If there are too many values, expand, subdividing rows into:
digits from 0 to 4 and digits from 5 to 9
• If necessary, condense, that is reduce the number of rows
• Since it displays the data in order, it is a fast and easy procedure for
ranking data (arranging data in order)
►►►Construct a Stem-and-leaf plot for the “Days to Maturity” data
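As a cross-check for the hand-drawn plot, here is a small Python sketch (not from the notes) that groups each value by its tens digit (stem) and prints the ones digits (leaves) in order. It assumes two-digit data, as in this example.

```python
from collections import defaultdict

days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

# Stem = leftmost digit, leaf = rightmost digit (two-digit values assumed).
stems = defaultdict(list)
for value in sorted(days):
    stems[value // 10].append(value % 10)

# Print every stem in the range, including any stem with no leaves.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the data were sorted first, reading the leaves row by row gives the ranked data directly.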
Pareto Charts (see page 51)
• It is a bar graph for qualitative data
• Bars are arranged in descending size
• Vertical scale can represent frequencies or relative frequencies as in the
histogram
►►► Do problem # 21, on page 57
Pie Charts (see page 51)
• Used to display qualitative data in a more understandable way so that we see
what part of the total data is represented by each category.
• Make a table with a column with relative frequencies (%), and a column
for degrees (% of 360)
►►► Do problem # 23, on page 58
Scatter Diagrams (see page 51)
• Is a plot of paired (x,y) data with a horizontal x-axis, and a vertical y-axis.
• The pattern of the plotted points is often helpful in determining the presence
and form of some relationship between the two variables.
►►►We’ll do scatter diagrams with more detail later in chapter 9
Time-Series Graph (see figure 2-8, page 52)
• Time-series data are data that have been collected at different points in time.
►►► Do problem # 27, page 58
Enter YEAR in L1, Enter STOCK VALUE in L2, Use the 2nd graph in the
STAT PLOT menu to graph.
Other Graphs
►►►Discuss the graphs on pages 53, and 54
2.4 – Measures of Center
Measure of Center: Value representing some type of measure of the center or
middle of a data set
Mean (arithmetic mean)
• The sum of the scores divided by the number of scores
• The mean is affected by low or high values
Notation:
n: number of values in a data sample
N: number of values in a data population
Sample mean: $\bar{x} = \dfrac{\sum x}{n}$
Population mean: $\mu = \dfrac{\sum x}{N}$
►►►Do problem #3, on page 69
Median
• Middle value when data is arranged in ORDER
• If n is odd, the median is located exactly in the middle
• If n is even, it is the mean of two middle numbers
• The median is resistant: it is not affected by extremely large or small data values (it is “robust”)
►►►Do problem #3, on page 69
Mode
• The most frequent score or class
• Sometimes a data set can be bimodal, multimodal, or have no mode
►►►Do problem #3, on page 69
Midrange
• Value midway between highest and lowest data values:
$\text{midrange} = \dfrac{\text{highest value} + \text{lowest value}}{2}$
►►►Do problem #3, on page 69
Round-off Rule
• Carry one more decimal place than is present in the original set of data values
• Round off only on the final answer. Keep several more decimal places during
intermediate calculations.
Computing the Mean on the TI-83
Non-grouped data or RAW data
Step 1:
Clear List 1
Step 2:
Enter the data in L1.
Step 3:
2nd QUIT to go to the home screen
STAT
CALC 1:1-Var Stats L1
ENTER
Note: If data is in another L, you must specify which L. Otherwise the calculator
always assumes L1.
►►► Find the mean of the DAYS TO MATURITY DATA that you have in your
calculator in the list named DAYS
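The calculator result can be compared against a short script. The sketch below uses Python's standard `statistics` module on the Days to Maturity data; it is an illustration added for checking, and the variable names are not from the notes.

```python
from statistics import mean, median, multimode

days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

x_bar = mean(days)                       # sum of scores / number of scores
med = median(days)                       # n is even: mean of the two middle values
modes = multimode(days)                  # list of the most frequent value(s)
midrange = (max(days) + min(days)) / 2   # midway between highest and lowest

print("mean     =", x_bar)
print("median   =", med)
print("mode(s)  =", modes)
print("midrange =", midrange)
```

Following the round-off rule, a mean computed from whole-number data would be reported with one decimal place.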
Mean From a Frequency Distribution
To find the mean of data summarized in a frequency distribution we use the
formula:
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f}$,
where f denotes the frequency and x represents the class midpoint.
►►► Example: Use the frequency distribution from the “Days to Maturity”
example to find the mean of the days to maturity of those 40 short term
investments.
Days to maturity    Class Midpoints (x)    Number of Investments (f)
36-45               ____                   3
46-55               ____                   7
56-65               ____                   8
66-75               ____                   9
76-85               ____                   6
86-95               ____                   4
96-105              ____                   3
Computing the Mean on the TI-83
Grouped Data
Step 1:
Clear Lists 5 and 6
Step 2:
Enter the class midpoints in L5
Step 3:
Enter class frequencies in L6.
Step 4:
2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L5, L6
ENTER
►►► Use the calculator to find the mean for the grouped data of the “Days to
Maturity” example. Now compare with the answer that you got on page 14 when
using raw data. Explain the differences.
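The grouped-data formula can also be evaluated directly. This is a minimal Python sketch of the computation, assuming the class midpoints are taken as (lower limit + upper limit) / 2.

```python
lower_limits = [36, 46, 56, 66, 76, 86, 96]
upper_limits = [45, 55, 65, 75, 85, 95, 105]
frequencies = [3, 7, 8, 9, 6, 4, 3]

# x = class midpoint, f = class frequency.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_f = sum(frequencies)
grouped_mean = sum_fx / sum_f

print("sum of f*x   =", sum_fx)
print("sum of f     =", sum_f)
print("grouped mean =", grouped_mean)
```

The grouped mean differs slightly from the raw-data mean because every value in a class is treated as if it were equal to the class midpoint, so the original values are no longer used.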
Weighted Mean
Is the mean computed with the different scores assigned different weights.
$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$
►►►Find the grade-point average of a student who took five classes and got the
following grades: A (in a 3-credit course), B (in a 5-credit course), A(in a 2-credit
course), C (in a 4-credit course), and A (in a 3-credit course). Remember that A = 4,
B = 3, C = 2, D = 1, F = 0.
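The grade-point average is a weighted mean with the credits as weights. A short Python sketch of that calculation follows; the dictionary and list names are illustrative only.

```python
# Grade points as given in the notes.
points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# (grade, credits) for the five classes in the exercise.
courses = [("A", 3), ("B", 5), ("A", 2), ("C", 4), ("A", 3)]

# Weighted mean: sum(w * x) / sum(w), with w = credits and x = grade points.
sum_wx = sum(credits * points[grade] for grade, credits in courses)
sum_w = sum(credits for _, credits in courses)
gpa = sum_wx / sum_w

print(f"sum(w*x) = {sum_wx}, sum(w) = {sum_w}, GPA = {gpa:.2f}")
```

An unweighted mean of the five grades would give a different answer, because it would ignore that the B and the C were earned in the courses with the most credits.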
The Best Measure of Central Tendency
See table 2-10 on page 67.
Skewness (see page 68)
A non-symmetric distribution that extends more to one side than another
Skewed to the left (negatively skewed, lopsided to the right)
The histogram is much lower on the left side and the mean is left of the
median which is left of the mode.
Skewed to the right (positively skewed, lopsided to the left)
The histogram is much lower on the right side and the mean is right of the
median which is right of the mode.
Symmetric (zero skewness, data not lopsided)
The histogram is a mirror image about the center of the data, and the mean =
the median = the mode
2.5 Measures of Variation (Dispersion)
►►►Refer to problem # 9 on page 70. Complete the table given below and
construct a dot plot for each of the distributions (data is shown below)
The waiting times of customers (in minutes) are listed below for both banks.
Bank J (single waiting line):       6.5   6.6   6.7   6.8   7.1   7.3   7.4   7.7   7.7   7.7
Bank P (multiple waiting lines):    4.2   5.4   5.8   6.2   6.7   7.7   7.7   8.5   9.3   10.0
►►►Complete the following table:
          Mean    Median    Mode
Bank J    ____    ____      ____
Bank P    ____    ____      ____
►►► Show the dot plot for both banks here:
Range
Range = (Highest value) – (Lowest value)
Only affected by 2 numbers (does not represent the whole data set)
►►► Find the range of the two distributions shown above
Standard Deviation
Is a measure of the average variation of values about the mean
►►► In your opinion, considering the two distributions given in problem #9,
page 70, which data set has the smaller standard deviation?
Variance
Is the square of the standard deviation.
Notation
$s$ = Sample standard deviation
$s^2$ = Sample variance
$\sigma$ = Population standard deviation
$\sigma^2$ = Population variance
Formulas to find the Standard Deviation
Defining Formula
Sample: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Population: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Shortcut formula
Sample: $s = \sqrt{\dfrac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$
Finding the Standard Deviation
►►►Find the standard deviation for the waiting times at the Jefferson Valley
Bank. (problem #9, page 70)
►►►Follow the same procedure to verify that for the Bank of Providence,
s = 1.82 min. Display your work in a table:
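The table of work can be checked numerically. The sketch below implements both the defining formula and the shortcut formula for the sample standard deviation in plain Python; the function names are hypothetical, chosen for this illustration.

```python
import math

bank_j = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank_p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def sample_sd_defining(values):
    """s = sqrt( sum((x - mean)^2) / (n - 1) )"""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def sample_sd_shortcut(values):
    """s = sqrt( (n * sum(x^2) - (sum x)^2) / (n * (n - 1)) )"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

for name, data in (("Bank J", bank_j), ("Bank P", bank_p)):
    print(f"{name}: defining = {sample_sd_defining(data):.2f}, "
          f"shortcut = {sample_sd_shortcut(data):.2f}")
```

Both formulas give the same value up to rounding. Both banks have the same mean waiting time, yet the single-line bank shows much less variation.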
Grouped Data- Finding the Standard Deviation from a Frequency
Distribution
When dealing with grouped data we will use the calculator to find the standard
deviation instead of using the formula given on page 80.
Computing the Standard Deviation on the TI-83
Non-grouped data
Step 1:
Clear a list (L1)
Step 2:
Enter the data into the list (L1)
Step 3:
2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L1
ENTER
►►►
a) Check the answer obtained for the waiting times at the Jefferson Valley Bank.
Enter data in L1.
b) Check answer obtained for the waiting times at the Providence Bank. Enter data
in L2.
c) Find the standard deviation for the Days to Maturity data which is stored in the
list DAYS
Grouped Frequency Table Values
Step 1:
Clear Lists 2 and 3
Step 2:
Enter the class midpoints in L2
Step 3:
Enter class frequencies in L3.
Step 4:
2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L2, L3
ENTER
►►► Find the standard deviation for the grouped data of the Days to Maturity
example (refer to page 16 of notes)
Comparing Variation in Different Populations
The coefficient of variation (or CV) for a set of sample or population data,
expressed as a percent, describes the standard deviation relative to the mean.
• It is a measure of the importance of the data set’s variation.
Sample: $CV = \dfrac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \dfrac{\sigma}{\mu} \cdot 100\%$
►►► Read example on page 80
►►► Find the coefficient of variation for
a) Days to Maturity data
b) Waiting times at the Jefferson Bank
c) Waiting times at the Providence Bank
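A sketch of the coefficient of variation for the two banks, using the sample formula and Python's `statistics` module, is shown below. The helper name `coefficient_of_variation` is illustrative, not from the notes.

```python
from statistics import mean, stdev

bank_j = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank_p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def coefficient_of_variation(values):
    """Sample CV in percent: (s / x-bar) * 100."""
    return stdev(values) / mean(values) * 100

cv_j = coefficient_of_variation(bank_j)
cv_p = coefficient_of_variation(bank_p)

print(f"Bank J: CV = {cv_j:.1f}%")
print(f"Bank P: CV = {cv_p:.1f}%")
```

Because the CV has no units, the same function can be applied to the Days to Maturity data and the result compared directly with the waiting-time results.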
Interpreting and Understanding Standard Deviation
• It measures the average variation among scores of a data set
• A data set with many scores close together yields a small standard deviation
• A data set with scores spread farther apart yields a larger standard deviation
• It is a kind of yardstick by which we can compare one set of data with another
• Range rule of thumb:
range ~ 4 standard deviations (r = 4 s)
s ~ range / 4
• Values that are within 2 standard deviations from the mean are considered
“usual” values
• Minimum “usual” value ~ mean – 2 standard deviations
• Maximum “usual” value ~ mean + 2 standard deviations
• Most of the data is within the interval:
[(mean – 2 standard deviations)
, (mean + 2 standard deviations)]
• Values that are more than two standard deviations away from the mean are
considered “unusual” values.
►►►If the range of a certain distribution is 17, approximate the standard
deviation
►►►Do #24, page 90
►►►Identify usual and unusual values in
a) Days to Maturity data
b) Waiting times at the Jefferson Bank
c) Waiting times at the Providence Bank
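The range rule of thumb and the "usual values" interval translate into a few lines of Python. This is a sketch with hypothetical helper names; it uses the sample standard deviation from the `statistics` module.

```python
from statistics import mean, stdev

def estimate_sd_from_range(data_range):
    """Range rule of thumb: s is roughly range / 4."""
    return data_range / 4

def usual_interval(values):
    """(mean - 2s, mean + 2s): limits for 'usual' values."""
    x_bar, s = mean(values), stdev(values)
    return x_bar - 2 * s, x_bar + 2 * s

def unusual_values(values):
    """Values more than 2 standard deviations away from the mean."""
    low, high = usual_interval(values)
    return [v for v in values if v < low or v > high]

bank_j = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank_p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

print("s estimated from a range of 17:", estimate_sd_from_range(17))
for name, data in (("Bank J", bank_j), ("Bank P", bank_p)):
    low, high = usual_interval(data)
    print(f"{name}: usual values between {low:.2f} and {high:.2f}; "
          f"unusual: {unusual_values(data)}")
```

The rule of thumb gives only a rough estimate; with the actual data available, the computed standard deviation should be preferred.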
Empirical Rule (or 68-95-99.7 rule)
If a Distribution is approximately bell shaped, then
• About 68% of the scores fall within 1 standard deviation of the mean
• About 95% of the scores fall within 2 standard deviations of the mean
• About 99.7% of the scores fall within 3 standard deviations of the mean
►►►For the Days to Maturity data,
a) Find the values $\bar{x} - s$ and $\bar{x} + s$
How many of the 40 pieces of data have values within 1 standard deviation of the
mean? What percentage of the sample is this?
b) Find the values $\bar{x} - 2s$ and $\bar{x} + 2s$
How many of the 40 pieces of data have values within 2 standard deviations of the
mean? What percentage of the sample is this?
c) Find the values $\bar{x} - 3s$ and $\bar{x} + 3s$
How many of the 40 pieces of data have values within 3 standard deviations of the
mean? What percentage of the sample is this?
d) Compare to the results predicted by the empirical rule. Does the result suggest
an approximately normal distribution?
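The counts asked for in parts a) to c) can be verified with a short script. The following Python sketch is an added illustration; it uses the sample standard deviation and counts values that fall inside each interval, endpoints included.

```python
from statistics import mean, stdev

days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

x_bar = mean(days)
s = stdev(days)

counts = []
for k in (1, 2, 3):
    low, high = x_bar - k * s, x_bar + k * s
    inside = sum(1 for d in days if low <= d <= high)
    counts.append(inside)
    print(f"within {k} sd: [{low:.2f}, {high:.2f}] "
          f"-> {inside} of {len(days)} = {100 * inside / len(days):.1f}%")
```

The printed percentages can then be set beside 68%, 95%, and 99.7% for part d).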
Chebyshev’s Theorem
• It applies to any set of data, but its results are very approximate
The proportion of data lying within k standard deviations of the mean is at
least $1 - \dfrac{1}{k^2}$, where k is any number greater than 1 (it does not have to be an integer).
• At least 3/4 or 75% of all scores fall within 2 standard deviations of the mean:
$1 - \dfrac{1}{2^2} = \dfrac{3}{4}$
• At least 8/9 or 89% of all scores fall within 3 standard deviations of the mean:
$1 - \dfrac{1}{3^2} = \dfrac{8}{9}$
• At least ......... or .......... of all scores fall within 4 standard deviations of the
mean
• For typical data sets, it is unusual for a score to differ from the mean by more
than 2 or 3 standard deviations
►►►For the Days to Maturity data, using Chebyshev’s theorem, we can say that
a) At least 89% of the data are within..........................
b) At least 75% are within....................................
►►►According to Chebyshev’s theorem, “At least 75% of the data fall within two
standard deviations of the mean” which is equivalent to stating that “At most 25
percent will be more than two standard deviations away from the mean”
Answer the following:
a) At most what percent of a distribution will be three or more standard
deviations from the mean?
b) At most what percent of a distribution will be four or more standard
deviations from the mean?
►►►Do #28, pg. 90
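Chebyshev's bound is simple to tabulate. The sketch below evaluates $1 - 1/k^2$ exactly with Python's `fractions` module for a few integer values of k; the function name is illustrative, and integer k is assumed only so that the fractions stay exact.

```python
from fractions import Fraction

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (integer k > 1)."""
    return 1 - Fraction(1, k * k)

for k in (2, 3, 4):
    inside = chebyshev_bound(k)
    outside = 1 - inside
    print(f"k = {k}: at least {inside} ({float(inside):.1%}) within, "
          f"at most {outside} ({float(outside):.1%}) beyond")
```

The "at most" column is just the complement of the "at least" column, which is the equivalence stated in the exercise above.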
2.6 Measures of Relative Standing
z-Scores (or Standard Score)
• The z-score is the number of standard deviations that a given data value is
above or below the mean. A score with a positive z-score is above the mean and
a score with a negative z-score is below the mean.
• Z-scores enable us to standardize values so that they can be compared
►►►If you score 81 on your first exam, where $\bar{x}$ = 75 and s = 16, and you score
46 on your second exam, where $\bar{x}$ = 40 and s = 8, which score is relatively better?
First exam:
___|___________|___________________|___
   75          81                  91
   x̄           score               x̄ + s
Second exam:
___|_______________________|_______|____
   40                      46      48
   x̄                       score   x̄ + s
From the diagram we can say that
score = $\bar{x}$ + ? · s
The number of standard deviations above or below the mean is called the z-score.
It makes sense to use the formula:
$x = \bar{x} + z s$, then $z = \dfrac{x - \bar{x}}{s}$ (sample) or $z = \dfrac{x - \mu}{\sigma}$ (population)
Round z to 2 decimal places
Note: We are using the standard deviation as a “yard stick”.
►►► Find the z-scores on the problem given above.
►►► Do problem #5, on page 100
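For the two-exam comparison above, the z-scores follow directly from the formula. A minimal Python sketch, with an illustrative function name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_first = z_score(81, 75, 16)    # first exam: score 81, mean 75, s = 16
z_second = z_score(46, 40, 8)    # second exam: score 46, mean 40, s = 8

print(f"first exam:  z = {z_first:.2f}")
print(f"second exam: z = {z_second:.2f}")
print("relatively better:", "second" if z_second > z_first else "first")
```

Both scores are 6 points above their class mean, but the second exam has the smaller standard deviation, so the same 6 points represent a larger number of standard deviations.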
• Z-scores can be used to differentiate between ordinary values and unusual
values
• Values with z-scores within [-2,2] are considered "ordinary" or "usual"
• Values with z-scores greater than 2, or less than -2, are considered "unusual"
FIGURE 2-14
►►► Do problem #6, on page 100
Quartiles and Percentiles
• Quartiles: Divide ranked data into 4 equal parts. (Q1, Q2, Q3)
(Similar to median which divides into 2)
• Percentiles: Divide ranked data into 100 parts. (P1, P2,...P99)
A score in the 88th Percentile means: Student's score is higher than 88% of the
scores
Finding the percentile corresponding to a particular score
Percentile of score x = $\dfrac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
►►►From the table 2-13 on page 95: Find the percentile corresponding to 245.
Reverse Process - Finding the score corresponding to a particular
percentile
What score is at the kth percentile?
(1) Rank the data from lowest to highest
(2) Find the locator L, which is k% of the total number of values n:
$L = \dfrac{k}{100} \cdot n$
a) If L is not a whole number, round up and find the score in that
position
b) If L is a whole number, find the average of the scores in positions L
and L+1
►►►Using the data from table 2-13, page 95
Find each one of the following:
a) P40 =
b) Q1=
c) Q2 = P50 =
d) Q3 = P75 =
e) P27 =
Interquartile Range
A statistic that we will use in the next section is defined in terms of quartiles. It is
the Interquartile Range, or IQR. It measures the spread of the middle 50% of the
data.
IQR= Q3 – Q1
►►► What is the IQR for the data in Table 2-13, page 95?
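Table 2-13 is not reproduced in these notes, so the sketch below applies the same two procedures to the Days to Maturity data instead. It is an illustration in Python with hypothetical function names, and it assumes k is a whole number from 1 to 99.

```python
import math

days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

def percentile_of_score(values, x):
    """(number of values less than x) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

def score_at_percentile(values, k):
    """Score at the kth percentile using the locator L = (k / 100) * n."""
    ranked = sorted(values)
    n = len(ranked)
    if (k * n) % 100 == 0:
        # L is a whole number: average the scores in positions L and L + 1.
        L = k * n // 100
        return (ranked[L - 1] + ranked[L]) / 2
    # L is not a whole number: round up and take the score in that position.
    L = math.ceil(k * n / 100)
    return ranked[L - 1]

q1 = score_at_percentile(days, 25)
q3 = score_at_percentile(days, 75)
print("percentile of 70:", percentile_of_score(days, 70))
print("P27 =", score_at_percentile(days, 27))
print("P40 =", score_at_percentile(days, 40))
print("Q1 =", q1, " Q2 =", score_at_percentile(days, 50), " Q3 =", q3)
print("IQR =", q3 - q1)
```

Positions in the notes are counted from 1, while Python lists are indexed from 0, which is why the code subtracts 1 when looking up position L.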
2.7 Exploratory Data Analysis - EDA
Exploratory Data Analysis is the process of using statistical tools (such as
graphs, measures of center, and variation) to investigate data sets in order to
understand their important characteristics.
Outliers (extreme values)
Outliers are values that are very far away from almost all of the other values.
• An outlier can have a dramatic effect on the mean.
• An outlier can have a dramatic effect on the standard deviation.
• An outlier can have a dramatic effect on the scale of the histogram so that the
true nature of the distribution is totally obscured.
(See example page 102)
Box Plots
Graphs that reveal central tendency, the spread of the data, the
distribution of the data, and the presence of outliers (extreme scores)
• Do not show as much detailed information (as histograms or stem-and-leaf
plots)
• Very useful for comparing two or more data sets (use the same scale)
• Used to identify the approximate shape of the distribution of a large data set.
• For small data sets, boxplots can be unreliable in identifying distribution shape.
(using stem and leaf plots or dot plots is more appropriate in this case)
Steps for Constructing a Box Plot
(1) ARRANGE data in Ascending Order
(2) Find the 5-number summary
Minimum value
First quartile: Q1, which is the median of the observations which
are to the left of the overall median
Median: Q2
Third quartile: Q3, which is the median of the observations which
are to the right of the overall median
Maximum value
(3) Use these numbers to construct the box plot.
►►►Construct the box plot for the data set.
Here we have a sample of 15 incomes of college graduates, arranged in increasing
order. The incomes are given in thousands.
4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110
►►►Construct the box plot for the data set.
We are given below the earnings in 2001 of 16 randomly chosen people who have
high school diplomas but no college. For convenience we have arranged the
incomes in increasing order. The incomes are given in thousands.
5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67
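The 5-number summary for both income samples can be computed with the rule given in the box plot steps: Q1 and Q3 are the medians of the observations to the left and to the right of the overall median. The Python sketch below follows that rule; the function name is illustrative, and other quartile conventions (including the locator method and some software defaults) can give slightly different values.

```python
from statistics import median

college = [4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110]
high_school = [5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67]

def five_number_summary(values):
    """(min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]
    # When n is odd, the overall median itself is left out of both halves.
    upper = ranked[half + 1:] if n % 2 else ranked[half:]
    return ranked[0], median(lower), median(ranked), median(upper), ranked[-1]

print("College graduates:", five_number_summary(college))
print("High school only: ", five_number_summary(high_school))
```

Drawing both box plots on the same scale from these summaries makes the comparison between the two groups immediate.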
Constructing a Box Plot with the TI-83
There are two choices of box plots within a STAT PLOT
The 4th one is the modified box plot and shows outliers, if there are any.
Step 1:
Clear L1.
Step 2:
Enter data in L1. (or L1 and L2 if grouped data)
Step 3:
Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1 ENTER to turn plot on
Arrow down and right to the 4th graph icon, ENTER to select it.
Xlist: L1
Freq: 1 (or L2, if grouped data)
Step 4:
Select a window by pressing ZOOM, and selecting 9:ZoomStat
►►► The following table gives the number of public High School graduates, in
thousands, for 15 states in 1980 and 1990. Use the calculator to construct the
boxplots for the data. Compare the two data sets.
State    1980     1990
FL       87.3     88.9
PA       146.5    110.5
MI       124.3    93.8
OH       144.2    114.5
AL       45.2     40.5
WY       6.1      5.8
NJ       94.6     69.8
ME       15.4     13.8
LA       46.3     36.1
IL       135.6    108.1
WI       69.3     52
NY       204.1    143.3
AZ       38.6     32.1
TN       49.8     46.1
SD       10.7     7.7
Box Plots and Distributions
Figure 2-17 on page 105.
►►►Construct the box plot for the Days to Maturity data set.
Math 116 – Chapter 2 Highlights
Choosing an appropriate number to describe the data
Measuring the center of a distribution
• The mean cannot resist the influence of extreme observations. It is not a
resistant measure of the center.
• The median is a resistant measure of the center (not affected by extreme
observations).
• If the distribution is symmetric, the mean and median are the same.
• If the distribution is close to symmetric, the mean and median are very
close in value.
• In a skewed distribution, the mean is farther out in the direction of
skewness (long tail) than is the median.
• Reports about incomes and other strongly skewed distributions usually
give the median rather than the mean.
Example 1: Distributions of incomes are usually skewed to the ............... Which
measure of the center is more appropriate? Why?
Example 2: The mean and median selling price of existing single-family homes
sold in June 2002 were $163,900 and $210,900. Which of these numbers is the
mean and which is the median? Explain how you know.
Example 3: A class of 9th grade students takes a test designed for 6th graders.
The mean and median scores are 83% and 87%. What direction is the skewness
of the test scores, and which number is the mean?, median?
Measuring the spread of a distribution
• The minimum and maximum values show the full spread of the data (but
they may be outliers. Also, the spread of the in-between numbers is
ignored.)
• The quartiles mark the spread of the middle half of the data as well as the
spread of the upper and lower 25% of the data.
• Box Plots – Five Number Summary
- In a symmetric distribution, the first and third quartiles are equally
distant from the median
- In most distributions that are skewed to the right, the third quartile will
be farther above the median than the first quartile is below it.
• The standard deviation measures the spread of the data by looking at
how far the observations are from their mean and averaging those values.
Choosing measures of center and spread
• The five-number summary is usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with strong
outliers.
• Use the mean and standard deviation only for reasonably symmetric
distributions that are free of outliers.