Presentation 3
Describing Quantitative Variables
What is a quantitative variable?

Quantitative variables are recorded as numerical
values. They are measurements or counts taken
on each unit in the sample.
Consider the following examples:
1. Age of a person.
2. Number of times a person sees a dentist in a year.
3. Weight of a dog.
4. Number of credits a student takes in a semester.
Note: Quantitative variables can be either continuous or
discrete. Continuous variables can take on any numerical
value in a range. Discrete variables can take on only
separate, countable values, such as whole-number counts.
What is not a quantitative variable…

Numbers that represent categories are NOT
quantitative variables.

Your SSN# for example is a label, not a
measurement.
Helpful Hint: When deciding whether something is a
quantitative variable, consider whether an average of the
variable is meaningful. The average height in a sample would
certainly be of interest. The average SSN# would not.
The Four Features of Quantitative Data




Location: What is the center or average value?
Spread: What is the spread or variability of the
values? Do they fall closely around the center or
far apart?
Shape: What is the shape of the data? Bell-shaped or skewed? Symmetric?
Outliers: Are there any extreme or unusual
observations?
Tools to Describe Quantitative Data



Five Number Summary: Table that consists of the
minimum, first and third quartiles, median, and the
maximum values of a sample. Used to describe both the
center and spread of values.
Graphs: Dotplots, Histograms, and Boxplots are useful
to illustrate location, spread, and shape of data, as well
as identify outliers.
Numerical Summaries: All five members of the five
number summary, in addition to the sample mean and
standard deviation.
How to Construct a Five Number Summary:
Finding the Median

Consider the following data set: Exercise Hours per
week for 11 Men. Sort the values in increasing order.
Data: 0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25


Median is the middle value in the data, such that half the
observations are greater and half are less. The median
is the middle value for an odd number of observations,
or the average of the middle two values for an even
number of observations.
In this case there are 11 observations so the median is
the middle, or 6th number, which happens to be 7.
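The rule for finding the median can be checked with a short Python sketch. The function name `median` is our own choice for illustration, not something from the presentation; it sorts the values, then takes the middle one or averages the middle two.

```python
def median(values):
    """Middle value of the data (average of the middle two if n is even)."""
    ordered = sorted(values)            # sort in increasing order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd n: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle two

# Exercise hours per week for 11 men (the data set used on this slide)
exercise_hours = [0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25]
print(median(exercise_hours))   # 7, the 6th of the 11 sorted values
```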
How to Construct a Five Number Summary:
Finding the Quartiles


The median divides the data into halves, and the quartiles
further divide the data into quarters.
The first quartile (Q1) is the median of the lower half of
the data; the third quartile (Q3) is the median of the
upper half.
Data: 0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25

The lower half of the data set is 0, 1, 1, 2, 5 and the upper
half is 8, 10, 11, 14, 25. We ignore the median value 7 in our
calculations.
Q1 = 1
Q3 = 11
How to Construct a Five Number Summary:
Min and Max

The last part of the five number summary is the
minimum and maximum. It is easy to see that the min
value was 0, and the max was 25. The summary is
usually displayed in a table as follows:
Five Number Summary: Outline
          Median
    Q1              Q3
    Min             Max
Five Number Summary: Our Example
            7
    1               11
    0               25
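As a cross-check, here is a minimal Python sketch of the whole five number summary, using the "median of each half" rule described in this presentation. The function names are hypothetical, chosen only for this example.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # skip the median itself when n is odd
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }

exercise_hours = [0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25]
print(five_number_summary(exercise_hours))
# {'min': 0, 'q1': 1, 'median': 7, 'q3': 11, 'max': 25}
```

Statistical software often uses other conventions for quartiles, so a package may report slightly different values of Q1 and Q3 for the same data.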
Interpreting the Five Number Summary
From “handheight” Data Set in Text CD:
Sex      Height   Hand Span
Female   68       21.5
Male     71       23.5
Male     73       22.5
Female   64       18
Male     68       23.5
Female   59       20
Male     73       23
Male     75       24.5
Female   65       21
…        …        …
FNS: Hand Span for 89 Women
          Median
            20
    Q1              Q3
    18.5            20.5
    Min             Max
    16              23
25% of the values fall in each of the four intervals: Min to Q1, Q1 to Median, Median to Q3, and Q3 to Max.
Interpreting the Five Number Summary

50% of the sample falls below the median, and 50%
of the sample falls above the median.

50% of the sample falls between Q1 and Q3.

25% of the sample falls below Q__.

25% of the sample falls above Q__.

75% of the sample falls below Q__.

75% of the sample falls above Q__.
Example: What is the five number summary
for this data?

Number of hours spent on internet per week:
12, 4, 16, 18, 1, 6, 10, 8
Graphs for Quantitative Variables

There are 4 main graphs for quantitative
variables.
1. Stem-and-Leaf Plot
2. Dotplot
(Graphs 1 and 2 show individual data points. Okay for small data sets.)
3. Histogram
4. Boxplot
(Graphs 3 and 4 are better for large data sets. Most commonly used.)
Graphs for Quantitative Variables
Example of Stem-and-Leaf Plot and Dotplot Using
Hand Span Data for Women
Stem-and-Leaf Display: Hand Span Female
Stem-and-leaf of Handspan N = 89
16 0
16 5
17 0000
17 55
18 00
18 5555555555555
19 00000000000
19 5555555555
20 000000000000000000
20 55555555555
21 00000000
21 55555
22 0
22 5
23 0
[Figure: Dotplot for Hand Span Females. Horizontal axis: Hand Span Female, 16 to 23.]
Graphs for Quantitative Variables
Creating a Stem-and-Leaf Plot
1. Determine the stem values: All but the last of the
displayed digits of a number. It is reasonable to have
between 6 and 15 stems defining equally spaced
intervals.
2. Attach a “leaf” for each individual to the appropriate
stem. This is the last displayed digit of the number.
3. At each stem value, put leaves in increasing order.
Example: Create stem-and-leaf plots for the
following samples:
(a) 75, 84, 68, 95, 87, 93, 56, 87, 83, 82, 80, 62, 91, 84
|5| 6
|6| 2
|6| 8
|7|
|7| 5
|8| 02344
|8| 77
|9| 13
|9| 5
OR
|5| 6
|6| 2 8
|7| 5
|8| 0234477
|9| 135
(b) 470 257 163 188 223 245 399 680
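The three steps for creating a stem-and-leaf plot can be sketched in Python for sample (a). This is only an illustration under simple assumptions: the function name `stem_and_leaf` is ours, and it handles non-negative whole numbers with one row per stem (no split stems).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting puts the leaves in increasing order
        stem, leaf = divmod(v, 10)           # split off the last displayed digit
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # keep empty stems so intervals stay equally spaced
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"|{stem}| {digits}")
    return rows

scores = [75, 84, 68, 95, 87, 93, 56, 87, 83, 82, 80, 62, 91, 84]   # sample (a)
for row in stem_and_leaf(scores):
    print(row)
# |5| 6
# |6| 28
# |7| 5
# |8| 0234477
# |9| 135
```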
Graphs for Quantitative Variables
Histogram
Horizontal Axis: Determine equally spaced intervals to divide the data.
(5-15 intervals)
Vertical Axis: Frequencies or relative frequencies (percentages).
[Figure: Histogram of Hand Span Female. Vertical axis: Frequency (0 to 20). Horizontal axis: Hand Span Female, intervals from 15.5 to 23.5.]
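The counting behind a histogram can be sketched as follows. Because the full hand span data set is not listed here, this example uses the 14 scores of sample (a) from the stem-and-leaf example instead; the function name and the choice of 5 intervals of width 10 are assumptions made for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each of n_bins equal-width intervals [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)    # which interval the value belongs to
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

scores = [75, 84, 68, 95, 87, 93, 56, 87, 83, 82, 80, 62, 91, 84]
counts = histogram_counts(scores, start=50, width=10, n_bins=5)

for i, c in enumerate(counts):
    lo = 50 + 10 * i
    relative = c / len(scores)               # relative frequency for the vertical axis
    print(f"{lo}-{lo + 9}: {'*' * c} ({c}, {relative:.0%})")
# 50-59: * (1, 7%)
# 60-69: ** (2, 14%)
# 70-79: * (1, 7%)
# 80-89: ******* (7, 50%)
# 90-99: *** (3, 21%)
```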
Graphs for Quantitative Variables
How to Draw a Boxplot
Step 1: Label either a vertical axis or a horizontal axis
with numbers from min to max of the data.
Step 2: Draw box with lower end at Q1 and upper end at
Q3.
Step 3: Draw a line through the box at the median.
Step 4: Draw a line from Q1 end of box to smallest data
value that is not further than 1.5 × (Q3 - Q1) from Q1.
Draw a line from Q3 end of box to largest data value
that is not further than 1.5 × (Q3 - Q1) from Q3.
Step 5: Mark data points further than 1.5 × IQR from
either edge of the box with an asterisk. Points
represented with asterisks are considered to be
“outliers”.
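The numbers needed to draw a boxplot by these steps can be computed with a small Python sketch. The function names are hypothetical, the quartiles use the "median of each half" rule from this presentation, and the extra value 40 in the second call is a made-up observation added only to show an outlier being flagged.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_stats(values):
    """Box edges, whisker ends, and outliers using the 1.5 x IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),   # Step 4: most extreme values within the fences
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],   # Step 5
    }

exercise_hours = [0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25]
print(boxplot_stats(exercise_hours))           # whiskers (0, 25), no outliers
print(boxplot_stats(exercise_hours + [40]))    # hypothetical extra value 40 is flagged
```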
Graphs for Quantitative Variables
Boxplot
NOTE: Min = 16 is greater than Q1 - 1.5(Q3 - Q1) = 18.5 - 1.5(2) = 15.5, so stop at Min.
Max = 23 is less than Q3 + 1.5(Q3 - Q1) = 20.5 + 1.5(2) = 23.5, so stop at Max.
[Figure: Boxplot of Hand Span Female with Min, Q1, Median, Q3, and Max labeled. Axis: Hand Span Female, 16 to 23.]
Shape of Data

We can use graphs to look at the shape of the
distribution of a quantitative variable.
An example of a bell-shaped or normal distribution,
which appears often in nature:
[Figure: Histogram of Height Male (inches). Vertical axis: Frequency (0 to 30). Horizontal axis: 60 to 80.]
Skewed Distributions
Example: Exam Scores
Scores from an easy exam, skewed left.
[Figure: Histogram. Vertical axis: Frequency. Horizontal axis: Test Score, 40 to 100.]
Scores from a hard exam, skewed right.
[Figure: Histogram. Vertical axis: Frequency. Horizontal axis: Test Score, 0 to 100.]
Skewed data often occur when the variable is naturally
bounded in some way and a great many units fall close to the
boundary. For example, the variable number of pets cannot be
less than 0, and many units fall at or near that boundary.
Numerical Summaries: Location





Median: The middle value such that half the
observations are greater and half less.
Mean: The average value in the data set. The mean
equals the sum of all observations divided by the number
of observations. Symbol: x̄ (x-bar) = sample mean
If the distribution is symmetric, the mean will be
approximately equal to the median.
If the data is right skewed, the mean is ___________
than the median.
If the data is left skewed, the mean is ___________ than
the median.
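To compare the two measures of location on real numbers, here is a short Python sketch using the exercise-hours data from earlier in this presentation, a sample with a long right tail. The variable names are our own.

```python
from statistics import median

exercise_hours = [0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25]

# Mean: sum of all observations divided by the number of observations
x_bar = sum(exercise_hours) / len(exercise_hours)
# Median: the middle value of the sorted data
m = median(exercise_hours)

print(f"mean = {x_bar:.2f}, median = {m}")   # mean = 7.64, median = 7
```

Running the same comparison on other samples is a quick way to fill in the blanks above.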
Numerical Summaries: Spread


Range: The distance between the most extreme values
in the data set.
Range = Maximum – Minimum.
Interquartile Range (IQR): The distance between
the first and third quartiles.
IQR = Q3 – Q1

Standard Deviation: Approximately the average
distance a value falls from the mean.
Symbol: s = sample standard deviation
Here is the formula for the square of the standard deviation,
which is called the sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Example – Calculate Variance by hand
Suppose we ask 5 people how many high school friends they have and
plot their responses below. What is the sample variance?
1. Find difference between each data point and mean.
______, ______, ______, ______, ______
2. Square the differences, and add them up.
______+______+ ______+ ______+ ______=_______
3. Divide by one less than the number of data points and you will get
the result.
variance = _______/________ =_________
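The three steps can be traced in Python. The plotted responses are not reproduced in this transcript, so the five values below are hypothetical numbers chosen for easy arithmetic, not the ones from the slide.

```python
import math

# Hypothetical responses, NOT the values plotted on the original slide
friends = [2, 4, 4, 6, 9]

n = len(friends)
x_bar = sum(friends) / n                        # mean = 25 / 5 = 5.0

deviations = [x - x_bar for x in friends]       # Step 1: difference from the mean
squared_sum = sum(d ** 2 for d in deviations)   # Step 2: square the differences and add them up
variance = squared_sum / (n - 1)                # Step 3: divide by one less than n
std_dev = math.sqrt(variance)                   # s is the square root of the variance

print(deviations)          # [-3.0, -1.0, -1.0, 1.0, 4.0]
print(squared_sum)         # 28.0
print(variance)            # 7.0
print(f"{std_dev:.2f}")    # 2.65
```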
Outliers
Definition: An outlier is a data point that is not
consistent with the bulk of the data.
Possible Reasons for Outliers:
1. An error was made while taking the measurement or
entering it into the computer.
2. The individual belongs to a different group than the
bulk of individuals measured.
3. The outlier is a legitimate, though extreme data value.
Identifying Outliers
Graphs are one of the best methods to identify outliers. In the case of the
boxplot below the outlying observation is indicated by an asterisk.
[Figure: Boxplot of Height Male. Axis: 62 to 82. The outlier is marked with an asterisk.]
Boxplot Outlier Rule: Any observation that is more
than 1.5*IQR below Q1 or more than 1.5*IQR above
Q3 is considered an outlier and receives an asterisk.
Resistant Statistics

Resistant statistics are those that are “resistant”
to the influence of outliers.
Resistant: Median, IQR
Non-Resistant: Mean, Std. Deviation, and Range
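A quick experiment illustrates resistance. The sketch below takes the exercise-hours data and replaces the largest value with 250, a hypothetical data-entry error invented for this example, then compares each statistic before and after. The helper `iqr` is our own and uses the "median of each half" rule for the quartiles.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Q3 - Q1, with quartiles taken as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(upper) - median(lower)

original = [0, 1, 1, 2, 5, 7, 8, 10, 11, 14, 25]
with_error = original[:-1] + [250]      # hypothetical typo: 25 entered as 250

for label, data in (("original", original), ("with error", with_error)):
    print(
        f"{label:>10}: median={median(data)}, IQR={iqr(data)}, "
        f"mean={mean(data):.2f}, s={stdev(data):.2f}, range={max(data) - min(data)}"
    )
```

The median and IQR are the same in both rows, while the mean, standard deviation, and range all change.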
The most appropriate measure of variability depends on
the shape of the data’s distribution.


If data are symmetric, with no serious
outliers, use range and standard deviation.
If data are skewed, and/or have serious
outliers, use IQR.
The Empirical Rule



The Empirical Rule states that for any bell-shaped
curve, approximately
68% of the values fall within 1 standard deviation
of the mean in either direction
(i.e., x̄ plus or minus s).
95% of the values fall within 2 standard deviations
of the mean in either direction
(i.e., x̄ plus or minus 2s).
99.7% of the values fall within 3 standard
deviations of the mean in either direction
(i.e., x̄ plus or minus 3s).
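The rule can be checked approximately by simulation. The sketch below draws simulated bell-shaped data with Python's `random.gauss`; the mean of 70, the standard deviation of 3, and the sample size are arbitrary choices for illustration, not values from the presentation.

```python
import random
from statistics import mean, stdev

random.seed(0)                                   # fixed seed so the run is repeatable
# Simulated bell-shaped data (arbitrary mean 70 and standard deviation 3)
sample = [random.gauss(70, 3) for _ in range(20_000)]

x_bar = mean(sample)
s = stdev(sample)

def share_within(k):
    """Fraction of the sample within k standard deviations of the mean."""
    lo, hi = x_bar - k * s, x_bar + k * s
    return sum(lo <= x <= hi for x in sample) / len(sample)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The printed shares should come out close to 68%, 95%, and 99.7%, though not exactly, since the rule is an approximation and the data are random.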