Download Class One: 30 March 2005 - homepages.ohiodominican.edu

Class Two
Before Class Two
Chapter 8: 34, 36, 38, 44, 46
Chapter 9: 28, 48
Chapter 10: 32, 36
Read Chapters 1 & 2
For Class Three:
Chapter 1: 24, 30, 32, 36, 44
Chapter 2: 26, 28, 38, 42, 50
Complete Quiz #1
Read Chapters 3, 4 & 5
Objectives for Class Two
• Identify categorical and quantitative variables.
• Represent data graphically using:
– bar charts
– pie charts
– histograms
– stem plots
– box plots
– time plots
• Describe the distribution of a variable in terms of overall
pattern and identify potential exceptions or outliers
• Compute standard measures of the center and spread of a
distribution and interpret their values.
Answering the question: Will?
Making sense of what you’ve got.
• Now that you have designed and completed your study or
experiment you must make sense of all of the data that you
have collected before it can be interpreted. This process is
called: Exploratory Data Analysis.
• The purpose of this process is to organize the data to determine
what statistical tools we may use to make predictions or
decisions about the population the data was collected from and
to give us some basic insights into the content of the data.
• Organizing Data
– Group and count each variable
– Look for relationships between/among variables or groups
– Create simple graphs
– Create numerical summaries
Answering the question: Will?
continued
• Remember the two types of variables are:
– categorical (words)
– quantitative (numbers)
• Distribution: describes the value of a variable and the
frequency with which it appeared in the data set.
– For categorical variables the value of the variable will be the specific
word(s) you use to describe the individuals and the frequency is a count
of how many individuals are described by the word.
– For quantitative variables the value of the variable will be the
number(s) you collected on an individual and the frequency is a count
of how many individuals are described by the number.
– Sometimes it is necessary with quantitative variables to make closely
related groups of numerical values and then your frequency is a count
of how many individuals have a number in your group.
Answering the question: Will?
Working with categorical variables.
• identify/calculate the distribution of each categorical variable
– count: the number of individuals in that category
– percent: the ratio of the number of individuals in that category to the
total of all individuals in all the categories
– round-off error occurs when each category is rounded separately from
the total of all categories
• select a graphical display that will convey the importance of
the relationships
– pie charts give a good visual comparison of percentages but are poor
ways to communicate counts
– bar graphs provide a good visual comparison of counts
– make sure that all diagrams are clearly labeled so that the viewer easily
understands the information and relationships being displayed
– make sure to include an other category for pie charts if the categories
do not total to 100%
Data Table
Year        Count   Percent
Freshman     18      41.9%
Sophomore    10      23.3%
Junior        6      14.0%
Senior        9      20.9%
Total        43     100.1%
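The counts and percents in the data table can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative and not part of the course materials. Each percent is rounded separately, which is why the rounded percents total 100.1% rather than 100%.

```python
# Counts by year in school, taken from the data table.
counts = {"Freshman": 18, "Sophomore": 10, "Junior": 6, "Senior": 9}

total = sum(counts.values())  # total of all individuals in all categories

# Percent = count in the category / total, rounded to one decimal place.
percents = {year: round(100 * n / total, 1) for year, n in counts.items()}

# Adding the separately rounded percents exposes the round-off error.
percent_sum = round(sum(percents.values()), 1)

for year, n in counts.items():
    print(f"{year:<10} {n:>3} {percents[year]:>6.1f}%")
print(f"{'Total':<10} {total:>3} {percent_sum:>6.1f}%")
```

Computing the total percent from the unrounded ratios instead would give exactly 100%.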
Pie Chart
[Pie chart of percent by year in school: Freshman 41.9%, Sophomore 23.3%, Junior 14.0%, Senior 20.9%]
Bar Graph
[Bar graph of Percent (y-axis, 0.0% to 45.0%) by Year in School (x-axis): Freshman 41.9%, Sophomore 23.3%, Junior 14.0%, Senior 20.9%]
Answering the question: Will?
Working with quantitative variables
• Display distribution of quantitative variables with either a
histogram, a stem plot or a time plot.
– A histogram is like a bar graph only instead of the x-axis being labeled
with categories it is labeled with groups of closely related values. It is
used to diagram cross-sectional data from a fixed moment in time.
– A stem plot is a chart that saves time and space by writing numbers
with the same first digits, called stems, in rows followed by a list of the
last digits, called leaves.
• Often it is easy to convert a stem plot into a histogram by using the
stems as your groupings.
– A time plot allows the data to be tracked over time and reveal trends
that would not be evident if only a single moment were analyzed.
Creating a Histogram
• Choose classes: divide the range of the data into classes of
equal width
– as the eye scans the histogram it responds to the area of each rectangle
as a function of its height since all of the bases are of equal size.
– too few classes will give a “skyscraper” effect
– too many will give a “pancake” effect
• Count the individuals in each class
• Draw the histogram: note Microsoft Excel® will label the
classes on the x-axis differently than the histograms in your
text. Excel will center the bar over the value of your class,
grouping individuals based on whether they are below or equal
to the class value but greater than the next lower class.
Weight Data
192 152 135 110 128 180 260 170 165
150 110 120 185 165 212 119 165 210
186 100 195 170 120 185 175 203 185
123 139 106 180 130 155 220 140 157
150 172 175 133 170 130 101 180 187
148 106 180 127 124 215 125 194
Weight Data: Frequency Table
Weight Group   Count
100 - <120       7
120 - <140      12
140 - <160       7
160 - <180       9
180 - <200      12
200 - <220       4
220 - <240       1
240 - <260       0
260 - <280       1
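The "choose classes, then count" steps can be sketched in Python as follows. This is only an illustration: the names `LOW`, `WIDTH`, and `class_counts` are assumptions, and the class boundaries follow the frequency table (width 20, left endpoint included, right endpoint excluded). Counting the 53 listed weights this way puts 9 observations in the 160 - <180 class.

```python
# The 53 weights from the Weight Data slide.
weights = [
    192, 152, 135, 110, 128, 180, 260, 170, 165,
    150, 110, 120, 185, 165, 212, 119, 165, 210,
    186, 100, 195, 170, 120, 185, 175, 203, 185,
    123, 139, 106, 180, 130, 155, 220, 140, 157,
    150, 172, 175, 133, 170, 130, 101, 180, 187,
    148, 106, 180, 127, 124, 215, 125, 194,
]

LOW, WIDTH, N_CLASSES = 100, 20, 9  # classes 100-<120, 120-<140, ..., 260-<280

class_counts = [0] * N_CLASSES
for w in weights:
    # Integer division includes the left endpoint and excludes the right one.
    class_counts[(w - LOW) // WIDTH] += 1

# A text histogram: one star per individual in the class.
for i, n in enumerate(class_counts):
    lo = LOW + i * WIDTH
    print(f"{lo} - <{lo + WIDTH}: {n:>2} {'*' * n}")
```

Changing `WIDTH` is a quick way to see the "skyscraper" and "pancake" effects of too few or too many classes.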
Weight Data: Histogram
[Histogram: Frequency (number of students, 0 to 14) on the y-axis against Weight on the x-axis, in classes of width 20 from 100 to 280]
* Left endpoint is included in the group, right endpoint is not.
Interpreting Histograms
• Look for the overall pattern as well as any striking deviations
from the pattern.
• Overall pattern is described using words for:
– shape: give the number of peaks and whether it is skewed right (lots of
low bars on the right), skewed left (lots of low bars on the left),
symmetric (roughly bell shaped), or has clusters of bars each with their
own shape, center and spread.
– center: midpoint (middle) of the values, the category or group where
half of the observations are below and half are above
– spread: give the smallest and largest values usually excluding outliers
• Deviations are known as outliers because they lie outside the
overall pattern.
Shape: Symmetric Bell-Shaped
Shape: Symmetric Mound-Shaped
Shape: Symmetric Uniform
Shape: Asymmetric Skewed to the Left
Shape: Asymmetric Skewed to the Right
Creating a Stem Plot
• Separate each observation into a stem, consisting of all but the
final (rightmost) digit, and a leaf, the final digit. Stems may
have as many digits as needed, but each leaf contains only a
single digit.
• Write the stems in a vertical column with the smallest at the
top, and draw a vertical line at the right of this column. Do not
skip any stem values even if there is no data with that
particular stem.
• Write each leaf in a row to the right of its stem, in increasing
order out from the stem. If there are no leaves for a stem leave
the area next to it blank.
• Special Circumstances:
– rounding: if data have more than three digits sometimes it is better to
round numbers to three significant digits before creating the stem plot
– split stems: each stem can be split into two with leaves 0-4 appearing
on the first stem and leaves 5-9 appearing on the second stem
– back-to-back stems are helpful when comparing two distributions
Weight Data: Stemplot (Stem & Leaf Plot)
Key: 20|3 means 203 pounds
Stems = 10’s, Leaves = 1’s
First three observations placed: 192, 152, 135
10 |
11 |
12 |
13 | 5
14 |
15 | 2
16 |
17 |
18 |
19 | 2
20 |
21 |
22 |
23 |
24 |
25 |
26 |
Weight Data: Stemplot (Stem & Leaf Plot)
Key: 20|3 means 203 pounds
Stems = 10’s, Leaves = 1’s
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
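The stem plot rules (stem = all but the final digit, leaf = the final digit, no skipped stems, leaves in increasing order) can be sketched in Python. The function name `stem_plot` is hypothetical, and the sketch assumes whole-number data such as the weights.

```python
# The 53 weights from the Weight Data slide.
weights = [
    192, 152, 135, 110, 128, 180, 260, 170, 165,
    150, 110, 120, 185, 165, 212, 119, 165, 210,
    186, 100, 195, 170, 120, 185, 175, 203, 185,
    123, 139, 106, 180, 130, 155, 220, 140, 157,
    150, 172, 175, 133, 170, 130, 101, 180, 187,
    148, 106, 180, 127, 124, 215, 125, 194,
]


def stem_plot(data):
    """Map every stem, smallest to largest with none skipped, to its sorted leaves."""
    leaves = {}
    for x in sorted(data):
        stem, leaf = divmod(x, 10)  # all but the last digit, and the last digit
        leaves.setdefault(stem, []).append(str(leaf))
    stems = range(min(leaves), max(leaves) + 1)
    # Stems with no data stay in the plot with a blank row.
    return {stem: "".join(leaves.get(stem, [])) for stem in stems}


plot = stem_plot(weights)
for stem, row in plot.items():
    print(f"{stem:>2} | {row}")
```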
Creating a time plot
• Time plots are used for quantitative variables that are measured
at regular intervals over time.
• Time is always the variable plotted on the x-axis.
• Connecting the data points with line segments will often
emphasize the trend over time.
Class Make-up on First Day (Fall Semesters: 1985-1993)
[Time plot: Percent of Class That Are Freshmen (y-axis, 0% to 70%) by Year of Fall Semester (x-axis, 1985 to 1993)]
Average Tuition (Public vs. Private)
Numbers as a measure of center
• Mean ($\bar{x}$): an arithmetic average found by finding the sum of
all of the data and dividing by the number of data. It is NOT a
resistant measure of center, meaning outliers will pull the
mean towards themselves; therefore we only use the mean
with symmetric data.
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• Median (M): the midpoint of the data. It is a resistant
measure of center, meaning it is affected little by outliers;
therefore we use the median with skewed data.
– arrange the data in order by size from least to greatest
– if n is odd then M is the center of the ordered list, that is, (n+1)/2
observations from the beginning of the list
– if n is even then M is the mean of the two center observations
• Mode: the most frequent observation
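The three measures of center can be checked with Python's standard `statistics` module. This is a minimal sketch: the first part uses the 53 weights from the Weight Data slide, while the short lists `small` and `with_outlier` are hypothetical values made up only to show what "resistant" means.

```python
from statistics import mean, median, mode

# The 53 weights from the Weight Data slide.
weights = [
    192, 152, 135, 110, 128, 180, 260, 170, 165,
    150, 110, 120, 185, 165, 212, 119, 165, 210,
    186, 100, 195, 170, 120, 185, 175, 203, 185,
    123, 139, 106, 180, 130, 155, 220, 140, 157,
    150, 172, 175, 133, 170, 130, 101, 180, 187,
    148, 106, 180, 127, 124, 215, 125, 194,
]

weight_mean = mean(weights)      # sum of the data divided by n
weight_median = median(weights)  # n = 53 is odd, so the 27th ordered value
weight_mode = mode(weights)      # most frequent observation

# Hypothetical data: replacing the largest value with an outlier
# pulls the mean toward it but leaves the median where it was.
small = [10, 12, 14, 16, 18]
with_outlier = [10, 12, 14, 16, 180]

print(f"weights: mean={weight_mean:.1f} median={weight_median} mode={weight_mode}")
print(f"small:        mean={mean(small)} median={median(small)}")
print(f"with outlier: mean={mean(with_outlier)} median={median(with_outlier)}")
```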
Comparisons of Measures of Center
• Symmetric Distributions: the mean and the median will be
close together. If the distribution is perfectly symmetrical then
the mean and the median will have the exact same value.
• Skewed Distributions: the mean will be pulled along the tail
of the distribution towards any outliers.
Basic Measures of Spread
• Range: the difference between the maximum and minimum
observations (usually outliers are omitted)
• Quartiles: mark out the middle half of the data
– 1st quartile (Q1) is one-quarter of the way up the list or is larger than
25% of the list
– 2nd quartile (M) is the median, half of the way up the list or is larger
than 50% of the list
– 3rd quartile (Q3) is three-quarters of the way up the list or is larger than
75% of the list
• Interquartile Range (IQR): is the difference between the 1st
and 3rd quartiles or Q3 - Q1 = IQR
Weight Data: Sorted
100 101 106 106 110 110 119 120 120
123 124 125 127 128 130 130 133 135
139 140 148 150 150 152 155 157 165
165 165 170 170 170 172 175 175 180
180 180 180 185 185 185 186 187 192
194 195 203 210 212 215 220 260
Weight Data: Quartiles
10 | 0166
11 | 009
12 | 0034578      <- first quartile
13 | 00359
14 | 08
15 | 00257
16 | 555          <- median or second quartile
17 | 000255
18 | 000055567    <- third quartile
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Five-Number Summary
• minimum = 100
• Q1 = 127.5
• M = 165
• Q3 = 185
• maximum = 260
Interquartile Range (IQR) = Q3 - Q1 = 57.5
IQR gives spread of middle 50% of the data
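The five-number summary for the weight data can be reproduced with a short Python sketch. One assumption is labeled here because software packages differ: when n is odd, the median itself is left out of both halves before Q1 and Q3 are found as the medians of the halves. That convention matches the values on this slide. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

# The 53 weights from the Weight Data slide.
weights = [
    192, 152, 135, 110, 128, 180, 260, 170, 165,
    150, 110, 120, 185, 165, 212, 119, 165, 210,
    186, 100, 195, 170, 120, 185, 175, 203, 185,
    123, 139, 106, 180, 130, 155, 220, 140, 157,
    150, 172, 175, 133, 170, 130, 101, 180, 187,
    148, 106, 180, 127, 124, 215, 125, 194,
]


def five_number_summary(data):
    """Return (minimum, Q1, M, Q3, maximum) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]          # observations below the median position
    upper = xs[half + n % 2:]  # observations above it (median skipped if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


minimum, q1, m, q3, maximum = five_number_summary(weights)
iqr = q3 - q1  # spread of the middle 50% of the data
print(minimum, q1, m, q3, maximum, iqr)
```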
Diagramming the Basic Measures of Spread
• Five Number Summary: includes the minimum observation,
Q1, M, Q3, and the maximum observation
• The five number summary is diagrammed using a box plot
sometimes also known as a box and whisker plot.
– a central box (IQR) spans the quartiles Q1 and Q3
– a line marks the median
– lines (whiskers) extend from the box out to the minimum and
maximum values of the observations
– Any whisker that is longer than 1.5 times the IQR (the length of the box) indicates
the presence of outliers.
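The whisker rule can be applied directly to a five-number summary. The sketch below plugs in the weight-data values (minimum 100, Q1 127.5, M 165, Q3 185, maximum 260); the variable names are illustrative only.

```python
# Five-number summary of the weight data.
minimum, q1, m, q3, maximum = 100, 127.5, 165, 185, 260

iqr = q3 - q1       # length of the box
limit = 1.5 * iqr   # longest whisker allowed before suspecting outliers

lower_whisker = q1 - minimum  # distance from the box down to the minimum
upper_whisker = maximum - q3  # distance from the box up to the maximum

has_low_outliers = lower_whisker > limit
has_high_outliers = upper_whisker > limit

print(f"IQR={iqr}, limit={limit}")
print(f"lower whisker={lower_whisker}, flagged={has_low_outliers}")
print(f"upper whisker={upper_whisker}, flagged={has_high_outliers}")
```

With these values neither whisker exceeds the limit, so the rule by itself does not flag any weight as an outlier, including the maximum of 260.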
Weight Data: Boxplot
[Box plot of Weight on an axis from 100 to 275, with min, Q1, M, Q3 and max marked]
More Measures of Spread
• Variance ($s^2$): is the average of the squares of the deviations of
the observations from the mean
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
• Standard Deviation ($s$): is the square root of the variance
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• degrees of freedom: is equal to n – 1
Usefulness of Standard Deviation
• s measures the spread about the mean and should only be used
when the mean is chosen as the measure of center
• s = 0 only when there is no spread. This happens only when
all of the observations have the same value. Otherwise s > 0.
As the observations become more spread out about the mean, s
gets larger.
• s has the same units of measure as the original observations.
• s is NOT resistant. Strong skewness or a few outliers can
greatly increase s.
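The variance and standard deviation, with their n - 1 divisor, can be written out by hand and compared against Python's `statistics.stdev`, which uses the same divisor. This is a sketch: `sample_variance` and `sample_std` are hypothetical helper names, and the list `[5, 5, 5]` is made-up data used only to show the s = 0 case.

```python
from math import sqrt
from statistics import stdev

# The 53 weights from the Weight Data slide.
weights = [
    192, 152, 135, 110, 128, 180, 260, 170, 165,
    150, 110, 120, 185, 165, 212, 119, 165, 210,
    186, 100, 195, 170, 120, 185, 175, 203, 185,
    123, 139, 106, 180, 130, 155, 220, 140, 157,
    150, 172, 175, 133, 170, 130, 101, 180, 187,
    148, 106, 180, 127, 124, 215, 125, 194,
]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the variance; same units as the observations."""
    return sqrt(sample_variance(xs))


s = sample_std(weights)
print(f"s = {s:.1f} pounds (library value: {stdev(weights):.1f})")

# No spread at all: every observation has the same value, so s = 0.
print(sample_std([5, 5, 5]))
```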
Choosing Descriptive Statistics
• Use the five number summary and box plots for distributions
with skewness or outliers
• Use mean and standard deviation for distributions that are
symmetric
• Always plot your data; remember a picture is worth a
thousand words. Keep in mind that bar graphs and pie charts
are best for categorical variables and histograms, time plots
and stem plots are best for quantitative variables.
Objectives for Class Two
• Identify categorical and quantitative variables.
• Represent data graphically using:
– bar charts
– pie charts
– histograms
– stem plots
– box plots
– time plots
• Describe the distribution of a variable in terms of overall
pattern and identify potential exceptions or outliers
• Compute standard measures of the center and spread of a
distribution and interpret their values.
Next Week Class Three
To Be Completed Before Class Three:
Chapter 1: 24, 30, 32, 36, 44
Chapter 2: 26, 28, 38, 42, 50
Complete Quiz #1
Read Chapters 3, 4 & 5