Descriptive Statistics, Histograms, and
Normal Approximations
Math 1680
Overview







Obtaining Data Sets
Descriptive Statistics
Histograms
The Normal Curve
Standardization
Normal Approximation
Summary
Obtaining Data Sets

Before we can analyze a data set, we
need to have a data set



How far do you travel to get to class, in
miles?
How tall are you?
Today, numerical data is easily stored
and organized (and even analyzed) by
several computer programs
Obtaining Data Sets


Notice that in its raw form, the data is
difficult to deal with
By sorting the data, we can get a
better picture of its distribution, or
shape

We are often interested in…



Where the data are centered
How spread out they are
With what frequency numbers appear
Obtaining Data Sets

Usually, the entire data set is too large
to work with directly


We want ways to summarize the data
We have quantitative (numerical) and
pictorial descriptions available to us


Descriptive Statistics
Histograms
Descriptive Statistics


We can summarize the data set with a few
simple numbers, called descriptive (or
summary) statistics
The first and most often-used summary stat
is the average (or mean)

Represents the central tendency of the data set


Gives an idea of where the bulk of the points lie
To calculate the average, add up the values of
all of the points and divide by the total number of
points in the set
Descriptive Statistics

Calculate the average of the following data sets

60 60 60 60 60
(60 + 60 + 60 + 60 + 60)/5 = 300/5 = 60

18 59 60 63 100
(18 + 59 + 60 + 63 + 100)/5 = 300/5 = 60

18 35 60 87 100
(18 + 35 + 60 + 87 + 100)/5 = 300/5 = 60
Descriptive Statistics

Despite having the same average, the
three data sets are clearly different

The average alone usually does not
describe data sets uniquely
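
As a minimal sketch, the averages of the three data sets can be checked in Python. The function name `average` is only an illustrative choice, not something from the slides.

```python
def average(values):
    """Add up the values and divide by the number of points."""
    return sum(values) / len(values)

# The three data sets from the slides
data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

averages = [average(d) for d in data_sets]
for d, a in zip(data_sets, averages):
    print(d, "average =", a)
```

All three sets give the same average even though their values are spread out very differently.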
Descriptive Statistics

The median is another central
tendency measure

The median marks the point where
exactly half of the data are less than (or
equal to) the median


If there are an odd number of data points,
then the median is just the number in the
middle of the sorted set
Otherwise, the median is the average of the
two points in the middle of the sorted set
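
The odd/even rule can be sketched in Python as follows. The function name and the two short lists are made up for illustration only.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: the single middle value
        return ordered[mid]
    # Even number of points: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([3, 1, 2])       # illustrative list with 3 points
even_example = median([4, 1, 3, 2])   # illustrative list with 4 points
print(odd_example, even_example)
```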
Descriptive Statistics

Calculate the median of each data set

1 4 5 7 10 15 18
Median = 7

0 3 3 9 9 10 10 13 13 13 13 15 17 17
17 20 20 20 21 21 22 23 24 26 27 28 28 29
29 29 30 31 31 31 34 35 38 38 44 44 49 52
Median = (22 + 23)/2 = 22.5
Descriptive Statistics

The average is like a balance point



It represents the place where the data set
is equally “heavy” on both sides
If there are outliers on one side of the
data set, the average will be skewed
The median is more robust

What this means is that it is usually less
affected by outliers or data entry errors
Descriptive Statistics

In a certain class of 13 students, 10
showed up for the first exam, while 3 blew
it off

Here are the grades, in order:
0 0 0 55 68 78 79 81 84 87 93 94 98

Calculate the class median…


Including all students: 79
Not counting those who slept in: 82.5
Descriptive Statistics

In a certain class of 13 students, 10
showed up for the first exam, while 3 blew
it off

Here are the grades, in order:
0 0 0 55 68 78 79 81 84 87 93 94 98

Calculate the class average…


Including all students: 62.8
Not counting those who slept in: 81.7
Descriptive Statistics

Suppose the teacher mistyped the
grade of 55 as being a 15

Not counting the sleepers (the three 0s), compare the original and mistyped lists
Original: 0 0 0 55 68 78 79 81 84 87 93 94 98
Mistyped: 0 0 0 15 68 78 79 81 84 87 93 94 98

What is the new median?
82.5

What is the new average?
77.7
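
A short Python check of this example, using the standard library's `statistics` module. The variable names are illustrative, and the three 0s are left out, as on the slide.

```python
from statistics import mean, median

# Grades of the 10 students who took the exam
attended = [55, 68, 78, 79, 81, 84, 87, 93, 94, 98]
# Same grades with the 55 mistyped as 15
mistyped = [15 if g == 55 else g for g in attended]

for label, grades in [("original", attended), ("mistyped", mistyped)]:
    print(label, "median =", median(grades), "average =", round(mean(grades), 1))
```

The median stays at 82.5 while the average drops from 81.7 to 77.7, which is the sense in which the median is more robust.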
Descriptive Statistics



Earlier, we saw that the average did
not necessarily uniquely describe a
data set
We use the standard deviation (SD) to
measure spread in a data set
When paired, the average and SD are
highly effective summary statistics
Descriptive Statistics

The Root-Mean-Square (RMS) measures
the typical absolute value of data points in
a set

Calculated by reading its name backwards




Square all entries in the data set
Take their mean
Take the square root of that mean
Find the average and then the RMS size
of the numbers in the list 1 -3 5 -6 3
Average = 0
RMS = 4
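
The three steps can be sketched in Python. This assumes the list is 1, -3, 5, -6, 3, which is the choice of signs that gives the stated average of 0. The function name `rms` is illustrative.

```python
import math

def rms(values):
    """Root-mean-square: square the entries, take their mean, take the square root."""
    squares = [v * v for v in values]
    mean_square = sum(squares) / len(squares)
    return math.sqrt(mean_square)

numbers = [1, -3, 5, -6, 3]   # assumed signs, chosen so the average is 0
print("average =", sum(numbers) / len(numbers))
print("RMS =", rms(numbers))
```

The signs cancel in the average but not in the RMS, because squaring makes every entry non-negative.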
Descriptive Statistics

The SD embodies the same concept of
“typical” distance


Where the RMS measures typical
distance from 0, the SD measures typical
distance from the data set’s average
This is accomplished by subtracting the
average from every data point and then
taking the RMS of the differences (or
deviations from the mean)
Descriptive Statistics

1 4 5 7 10 15 has an average of 7

The deviations are then -6 -3 -2 0 3 8

Note how the subtraction process recenters the data set so that the average
is at 0
Descriptive Statistics

Taking the RMS of the deviations
gives the standard deviation


Normally, about two thirds to three
quarters of a data set should be within
one SD of the mean
1 4 5 7 10 15 has an average of 7 and
an SD of about 4.5
Descriptive Statistics
Data: 1 4 5 7 10 15
Average = (1+4+5+7+10+15)/6 = 7
Deviations: 1-7 4-7 5-7 7-7 10-7 15-7 = -6 -3 -2 0 3 8
Squared deviations: (-6)^2 (-3)^2 (-2)^2 0^2 3^2 8^2 = 36 9 4 0 9 64
Mean of the squares = (36+9+4+0+9+64)/6 = 122/6 ≈ 20.3
SD = √(20.3) ≈ 4.5
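
The same calculation as a minimal Python sketch. The function name `sd` is illustrative, and it divides by n, as in the worked example.

```python
import math

def sd(values):
    """SD as the RMS of the deviations from the average (divides by n)."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    mean_square = sum(d * d for d in deviations) / len(deviations)
    return math.sqrt(mean_square)

data = [1, 4, 5, 7, 10, 15]
print("SD =", round(sd(data), 1))
```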
Descriptive Statistics




What we had on the previous slide divides by n (the
number of points) and gives the SD of the data set itself.
However, if the goal is to use this sample to estimate the
SD of a larger population, we would divide by n-1 instead
of n and call the result the sample SD.
Many calculators report the sample SD by default.
In general, the higher a set’s SD, the more spread
out its points are
An SD of 0 indicates that every point in the data set
has the same value
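
For comparison, Python's standard `statistics` module provides both versions: `pstdev` divides by n and `stdev` divides by n-1. A small sketch using the data set from the earlier worked example:

```python
from statistics import pstdev, stdev

data = [1, 4, 5, 7, 10, 15]

sd_n = pstdev(data)          # divides by n, as in the worked example
sd_n_minus_1 = stdev(data)   # divides by n-1 (the sample SD)

print("divide by n:  ", round(sd_n, 2))
print("divide by n-1:", round(sd_n_minus_1, 2))
```

The n-1 version is always a little larger, and the gap shrinks as the number of points grows.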
Descriptive Statistics

Calculate the SDs of the data sets

60 60 60 60 60
SD = 0

18 59 60 63 100
SD ≈ 26.0

18 35 60 87 100
SD ≈ 30.7
Histograms


Often, we would prefer a pictorial
representation of a data set to a two-number summary
The most common way to graphically
represent a data set is to draw a
frequency histogram (or just
histogram)
Histograms

Histograms tend to look like city skylines


In a histogram, the area under the curve
between two points on the horizontal axis
represents the proportion of data points between
those two points
Continuing the city skyline analogy, the size of
the building determines how many people live
there

A long, low building can house as many people as a
thin skyscraper
Histograms

To draw a histogram, we first need to
organize our data into bins (or class
intervals)



Often, the bins are dictated to us
If we get to choose them, we try to pick the bins
so that they give a fair representation of the data
Then mark a horizontal axis with the bin
values, spacing them correctly
Histograms

Often, data is given in percentage form


If not, divide the number of points in the bin by
the number of points in the data set to get the
percentage
Draw a box for each bin so that the area of
the box is the percentage of the data in that
bin

To get the correct height of the box, divide the
percentage of the box by the width of the bin
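
A minimal Python sketch of the height rule. The bin boundaries and counts below are made up purely for illustration, and the function name is hypothetical.

```python
def histogram_heights(bin_edges, counts):
    """Height of each box = (percentage of data in the bin) / (width of the bin)."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = bin_edges[i + 1] - bin_edges[i]
        percent = 100 * count / total
        heights.append(percent / width)
    return heights

edges = [0, 10, 20, 50]   # hypothetical bin boundaries
counts = [20, 50, 30]     # hypothetical number of points in each bin
heights = histogram_heights(edges, counts)
print(heights)

# Area of each box (height x width) recovers the percentage in the bin
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
print(areas, "total =", sum(areas))
```

Note that the last bin holds more data than the first but gets a lower box, because it is three times as wide.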
Histograms

Note that the average and median can be visually
located on a histogram




If the histogram was balanced on a see-saw, the fulcrum
would meet the histogram at the average
If you draw a vertical line through the histogram so that it
splits the area in half, then the line passes through the
median
On a symmetric histogram, the average and median
tend to coincide
Asymmetric tails pull the average in the direction of
the tail
The Normal Curve

A great many data sets have similarly shaped histograms





SAT scores
Attendance at baseball games
Battery life
Cash flow of a bank
Heights of adult males/females
The Normal Curve

These histograms are similar to one
generated by a very special
distribution

It is called the normal distribution, and it
is identified by two parameters we are
already familiar with


average
standard deviation
The Normal Curve

This is the standard normal curve, where the
average is 0 and the SD is 1
The Normal Curve

Though the equation used to draw the
curve is not easy to work with, there is
a table of values for the standard
normal distribution


We will use this table to find areas under
the curve
The table is on page A-105 of your text
The Normal Curve

Properties of the standard normal curve

The curve is “bell-shaped” with its highest point
at 0

It is symmetric about a vertical line through 0

The curve approaches the horizontal axis, but
the curve and the horizontal axis never meet
The Normal Curve

Area underneath the standard normal curve




Half the area lies to the left of 0; half lies to the
right
Approximately 68% of the area lies between –1
and 1
Approximately 95% of the area lies between –2
and 2
Approximately 99.7% of the area lies between
–3 and 3
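
These percentages can be checked without a table. A sketch in Python, using the fact that the area between –z and z under the standard normal curve equals erf(z/√2). The function name `central_area` is illustrative.

```python
import math

def central_area(z):
    """Percent of the area under the standard normal curve between -z and z."""
    return 100 * math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(f"between -{z} and {z}: {central_area(z):.2f}%")
```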
Standardization


Most data sets do not have a mean of 0
and an SD of 1
To be able to use the standard normal
curve, we’ll need to standardize numbers
in the original data set


To standardize a number, subtract the data
set’s average and then divide the difference
by the data set’s SD
Standardizing is basically a change of scale

Like converting feet to miles
Standardization

Suppose there are two different sections of
the same course

The scores for the midterm in each section were
approximately normally distributed

In the first section, the average was 64 and the standard
deviation was 5


In the second section, the average was 72 and the
standard deviation was 10


Tina scored a 74 in the first section
Jack scored an 82 in the second section
Which of the two scores is more impressive,
relative to the students in his/her section?
Standardization

Convert the following scores in the first
section to standard units

Alice got a 50
-2.8

Bob got a 61
-0.6

Carol got a 64
0

Dan got a 77
2.6
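
The conversion to standard units can be sketched in Python. The function name `standardize` is illustrative, and the averages and SDs are the ones given for the two sections.

```python
def standardize(value, average, sd):
    """Standard units: subtract the average, then divide by the SD."""
    return (value - average) / sd

# First section: average 64, SD 5
scores = {"Alice": 50, "Bob": 61, "Carol": 64, "Dan": 77}
z_scores = {name: standardize(s, 64, 5) for name, s in scores.items()}
print(z_scores)

# Tina (first section) and Jack (second section: average 72, SD 10)
tina = standardize(74, 64, 5)
jack = standardize(82, 72, 10)
print("Tina:", tina, "Jack:", jack)
```

In standard units Tina is 2 SDs above her section's average and Jack is 1 SD above his, so Tina's score stands out more within her section.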
Standardization

In Jack’s section, students with grades
between 62 and 82 received a B

What percentage of students in this
section received Bs?
68.27%

Is this percentage exact?
No
Normal Approximation

According to the HANES study, the average height of
U.S. women was 63.5 inches with an SD of
2.5 inches
Normal Approximation

The normal curve is a smooth-curve
histogram for normally distributed data

We can estimate percentages within a
given range

Find the area under the curve between those
ranges using the standard normal table
Normal Approximation

Sometimes this will require cutting and
pasting different areas together


The standard normal table on page A-105 takes a standard score z
It returns the area under the curve
between –z and z
Normal Approximation

Find the area between –1.2 and 1.2
under the standard normal curve
76.99%
Normal Approximation

Find the area between 0 and 1.65
under the standard normal curve
45.055%
Normal Approximation

Find the area between 0 and 3.3 under
the standard normal curve
49.9515%
Normal Approximation

Find the area between –0.35 and 0.95
under the standard normal curve
46.58%
Normal Approximation

Find the area between 1.2 and 1.85
under the normal curve
8.29%
Normal Approximation

Find the area between –2.1 and –1.05
under the normal curve
12.9%
Normal Approximation

Find the area to the right of 1 under
the normal curve
15.865%
Normal Approximation

Find the area to the left of 0.85 under
the normal curve
80.235%
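
The cutting and pasting in these examples can be sketched in Python. Here `table` is a hypothetical stand-in for the printed table: it returns the area between –z and z, computed from `math.erf` rather than looked up, so results may differ from table-based answers in the last digit.

```python
import math

def table(z):
    """Stand-in for the table: percent of area between -z and z."""
    return 100 * math.erf(abs(z) / math.sqrt(2))

def signed_half(z):
    """Area between 0 and z, counted as negative when z is left of 0."""
    return math.copysign(table(z) / 2, z)

def area_between(a, b):
    """Percent of area between a and b (with a < b)."""
    return signed_half(b) - signed_half(a)

def area_left_of(z):
    """Half the curve (50%) plus or minus the area between 0 and z."""
    return 50 + signed_half(z)

def area_right_of(z):
    return 100 - area_left_of(z)

print(round(area_between(-0.35, 0.95), 2))
print(round(area_between(1.2, 1.85), 2))
print(round(area_between(-2.1, -1.05), 2))
print(round(area_right_of(1), 2))
print(round(area_left_of(0.85), 2))
```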
Normal Approximation


If a data set is approximately normal in
distribution, we can use the normal curve in
place of the data set’s histogram
If you want to estimate the percentage of the
data set between two numbers…



Standardize the numbers to get z scores
Look each z score up in the standard normal
table
Cut and paste the areas to match the region you
originally wanted

The percentage under the curve will be close to the
percentage in the data set
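
As a rough sketch of this procedure in Python, `statistics.NormalDist` from the standard library can play the role of the table. The function name `percent_between` is illustrative. The example uses the HANES figures quoted earlier (average 63.5 inches, SD 2.5 inches) and asks for the range within one SD of the average.

```python
from statistics import NormalDist

def percent_between(low, high, average, sd):
    """Normal approximation to the percent of a data set between low and high."""
    # Step 1: standardize both endpoints
    z_low = (low - average) / sd
    z_high = (high - average) / sd
    # Step 2: area under the standard normal curve between the two z scores
    standard = NormalDist()  # average 0, SD 1
    return 100 * (standard.cdf(z_high) - standard.cdf(z_low))

# Heights within one SD of the average: 61 to 66 inches
within_one_sd = percent_between(61, 66, 63.5, 2.5)
print(round(within_one_sd, 2))
```

The result is only as good as the assumption that the data set is approximately normal.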
Normal Approximation

It is generally helpful to sketch the
curve first and shade in the desired
area

This will remind you what the target area
is
Normal Approximation

According to the HANES study, the average
height of U.S. women was 63.5 inches
with an SD of 2.5 inches

What percentage of women have heights
between 60 and 68 inches?
In standard units, 60 is –1.4 and 68 is 1.8
88.33%
Normal Approximation

According to the HANES study, the average
height of U.S. women was 63.5 inches
with an SD of 2.5 inches

What percentage of women are taller
than 66 inches?
15.865%
Normal Approximation

Sometimes, you will be given the
percentage of the data set


Want to find score(s) which mark(s) off that
percentage
Adjust the area to “center” it


Look up the z score associated with that area in
the table
Unstandardize the z score by multiplying it by
the SD and adding the average to the product
Normal Approximation

For a certain population of high school
students, the SAT-M scores are normally
distributed with average 500 and SD=100


A certain engineering college will accept only
high school seniors with SAT-M scores in the
top 5%
What is the minimum SAT-M score for this
program?
665
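
Going from a percentage back to a score can be sketched in Python with `NormalDist.inv_cdf`, which takes an area to the left and returns the z score. The function name `cutoff_for_top` is illustrative.

```python
from statistics import NormalDist

def cutoff_for_top(percent, average, sd):
    """Score that marks off the top `percent` of a normal data set."""
    # z score with (100 - percent)% of the area to its left
    z = NormalDist().inv_cdf(1 - percent / 100)
    # Unstandardize: multiply by the SD, then add the average
    return average + z * sd

cutoff = cutoff_for_top(5, 500, 100)
print(round(cutoff, 1))
```

This gives a z score of about 1.645, so the cutoff comes out slightly below the 665 obtained from z = 1.65 in the table. The difference is only table rounding.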
Normal Approximation

One way to determine how large a
number is in the data set is to find its
percentile rank


The kth percentile is the value so that k
percent of the data set have values below
it
Percentile ranks can be calculated for
any data set
Normal Approximation

In one year, the 1600-point SAT
scores were approximately normal with
an average of 1030 and an SD of 190

If a student scores a 1460, what is her
percentile rank?
98th percentile
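
A sketch of the percentile-rank calculation under the normal approximation, again using `statistics.NormalDist`. The function name `percent_below` is illustrative.

```python
from statistics import NormalDist

def percent_below(score, average, sd):
    """Normal approximation to the percent of the data set below `score`."""
    z = (score - average) / sd
    return 100 * NormalDist().cdf(z)

below = percent_below(1460, 1030, 190)
print(round(below, 1))
```

The score is about 2.26 SDs above the average, which puts it above roughly 98.8% of the scores.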
Summary

It is often useful to describe a data set with
summary statistics

The average and median are central tendency statistics


The average is more sensitive to outliers
The standard deviation (SD) is the most common
summary statistic for describing a data set’s spread


The SD is calculated by taking the RMS of the deviations
from the mean of each data point in the set
Most of the points in most data sets will lie within one or
two SD’s from the average
Summary

We can represent a data set graphically by drawing
a histogram




The percentage of the data set in a bin is the area under
the histogram of that section
The height of each block in a histogram is the percentage
of the data in the corresponding bin divided by the width of
the bin
The total area under any histogram is 100%
The average of a data set is located at the balance
point of the histogram

Long tails pull the average in the direction of the tail
Summary

Using the average and SD, we can
standardize numbers in the data set


The standard score (z) of a number is its
distance from the average in terms of
SD’s
We can also take a standard score and
convert it back to a raw score
Summary

Many data sets are approximately normal

We can estimate the percentage of points in a
data set that fall between two numbers



Convert the numbers to standard units
Find the area under the standard normal curve by
using the normal table
If a data set is approximately normal, we can
use the normal table to estimate percentile ranks