WHAT IS STATISTICS?
Unit 6:
A Brief Look at the World of
Statistics
WEBSTER’S DEFINITION
statistics
1 numerical data assembled and
classified so as to present significant
information
2 the science of compiling such data
Another way to think about it:
Statistics is the science (and art)
of learning from data. Data are
numbers with a context.
According to our Math 2 book:
• Statistics are numerical
values used to summarize
and compare sets of data.
NEW TERMINOLOGY
Individuals are the objects described by a
set of data. They may be a single person, an
animal, group, or thing.
A variable is any characteristic of an
individual.
Categorical vs. Quantitative
DAY 1
CENTRAL TENDENCY
&
MEASURES OF DISPERSION
EXPLORATORY
DATA ANALYSIS
Using statistical tools and
ideas to examine data in
order to describe their
main features.
CENTER:
Measures of central tendency
• Mean
– the traditional “average” of a data set. This can be found
by adding up all of the values and dividing by the
number of values
• Median
– this is the value that would be in the middle of the data
set if all of the values were written in order.
• Mode
– this is the value in a data set that occurs the most
frequently.
CENTER:
MEAN
• Mean—the traditional “average” of a data set. This
can be found by adding up all of the values and
dividing by the number of values.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
$\bar{x} = \frac{90 + 70 + 96 + \cdots + 76}{22} = \frac{1743}{22} \approx 79.2273$
So the mean is 79.2273 (note the symbol used).
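The same arithmetic can be checked with a few lines of Python. This is only a sketch; the variable names (grades, mean) are illustrative and do not come from the slides.

```python
# Quiz grades from the example above.
grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

# Mean: add up all of the values and divide by the number of values.
mean = sum(grades) / len(grades)

print(sum(grades), len(grades))  # 1743 22
print(round(mean, 4))            # 79.2273
```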
CENTER:
MEDIAN
• Median—this is the value that would be in the middle of the
data set if all of the values were written in order.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
with 22 numbers, the median will be the average of the
two “middle” numbers. In this case, 76 and 78 are the 11th
and 12th terms. Therefore, the median is 77.
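A short Python sketch of the same steps (sort, then take the middle value or average the two middle values). The names ordered and median are illustrative, not from the slides.

```python
grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

ordered = sorted(grades)   # first, put them in order
n = len(ordered)

if n % 2 == 1:
    # Odd count: a single middle value.
    median = ordered[n // 2]
else:
    # Even count: average the two middle values (here the 11th and 12th).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered[n // 2 - 1], ordered[n // 2])  # 76 78
print(median)                                # 77.0
```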
CENTER:
MODE
• Mode—this is the value in a data set that occurs the most
frequently.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
Put them in order (this helps detect the mode(s)):
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
We can see that 70 occurs three times, while no other value occurs more than
twice. Therefore, 70 is the mode.
Note that the mode alone doesn’t tell us very much about the set of data as a
whole. Also note that a data set can have no mode or multiple modes.
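One way to count occurrences in Python is sketched below. The helper find_modes is hypothetical, and its rule for "no mode" (every value occurs equally often) is an assumption, since the slides do not spell out that case.

```python
from collections import Counter

def find_modes(data):
    """Return a sorted list of modes; an empty list means no mode."""
    counts = Counter(data)
    highest = max(counts.values())
    # Assumed convention: if several values all tie, we report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

print(find_modes(grades))           # [70]
print(find_modes([1, 2, 3]))        # []      (no mode)
print(find_modes([1, 1, 2, 2, 3]))  # [1, 2]  (multiple modes)
```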
SPREAD:
RANGE
• Range—this is a simplistic measure of spread that is
calculated as the difference between the greatest and least
data values.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
We can see that the lowest number is 53 and the highest is
97. Therefore, the range is 44 (found by 97-53).
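In Python the range needs no sorting, since min and max find the extremes directly. A minimal sketch; data_range is an illustrative name chosen to avoid shadowing Python's built-in range.

```python
grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

# Range: greatest value minus least value.
data_range = max(grades) - min(grades)

print(min(grades), max(grades))  # 53 97
print(data_range)                # 44
```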
Tonight’s Assignment
• You will need a SCIENTIFIC CALCULATOR
for this unit.
• Homework in Text Book:
– p. 261 #1-13 All
– Find the Min, Mean, Median, Mode, Max & Range
for ALL problems
– (You do not need to calculate the Standard Deviation).
DAY 2
MEASURES OF DISPERSION
SPREAD:
Measures of dispersion
• Range
– this is a simplistic measure of spread that is calculated as the
difference between the greatest and least data values.
• Mean Absolute Deviation
– you learned about this measure last year. It is the average of the
absolute deviations from the mean.
• Standard Deviation
– this is a more complex calculation that is the most commonly used
measure of spread in the practice of statistics.
• Interquartile Range (IQR)
– this is calculated as the difference between the 3rd and 1st quartiles.
It is often used to help identify outliers.
SPREAD:
MEAN ABSOLUTE DEVIATION
• Mean Absolute Deviation—you learned about this measure last year.
It is the average of the absolute deviations from the mean.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
• Recall, the mean is 79.2273, so to calculate the mean absolute
deviation we subtract the mean from each value, take the absolute
value, add up all such values, and divide by the number of values.
$\frac{|90 - 79.2273| + |70 - 79.2273| + |96 - 79.2273| + \cdots + |76 - 79.2273|}{22} \approx 8.7025$
• So the mean absolute deviation is 8.7025
SPREAD:
STANDARD DEVIATION
• Standard Deviation—this is a more complex calculation that
is the most commonly used measure of spread in the
practice of statistics.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
• Recall, the mean is 79.2273, so to calculate the standard
deviation we subtract the mean from each value, square this
value, add up all such values, and divide by the number of
values. Then take the square root.
$\sigma = \sqrt{\frac{(90 - 79.2273)^2 + (70 - 79.2273)^2 + (96 - 79.2273)^2 + \cdots + (76 - 79.2273)^2}{22}} \approx 10.4921$
Variance, what is it?
• Just so you are aware, variance = standard deviation
squared.
• So, variance = $\sigma^2$, while standard deviation = $\sigma$.
• Of course, that means you can also consider the
standard deviation to be the square root of the
variance.
• Our book doesn’t directly address variance, but you
may see it in some situations.
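For the quiz grades, the relationship can be checked with Python's statistics module. This sketch uses pvariance and pstdev, which divide by the number of values, to match the calculation used in this unit.

```python
import math
from statistics import pstdev, pvariance

grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

sd = pstdev(grades)
variance = pvariance(grades)

print(round(variance, 4))             # 110.0847
print(round(sd ** 2, 4))              # 110.0847 (standard deviation squared)
print(round(math.sqrt(variance), 4))  # 10.4921  (square root of the variance)
```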
SPREAD:
INTERQUARTILE RANGE (IQR)
• Interquartile Range (IQR)—this is calculated as the difference
between the 3rd and 1st quartiles. It is often used to help
identify outliers.
• Example:
The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78
72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
Then divide them into 4 equal sets
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
• Now there are 5.5 (22/4) values in each quarter of the data set.
SPREAD:
INTERQUARTILE RANGE (IQR) cont.
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
Q1 = 72 (6th value), M = 77 (between the 11th and 12th values), Q3 = 89 (17th value)
We had already determined that the Median was 77 (avg of 76 & 78).
That divided the set into two halves. To find Q1 and Q3, we simply
find the median of the first and second halves. Seen here, Q1 is 72,
the Median is 77 and Q3 is 89.
So, the IQR = Q3 – Q1 = 89 – 72 = 17
This measure tells us how spread out the middle 50% of the data is: a small
IQR means the middle half of the values sit close together.
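The "median of each half" method can be sketched in Python as below. Leaving the middle value out of both halves when the count is odd is an assumption here (the slides only show an even count), and other quartile conventions, including the defaults of some software, can give slightly different answers.

```python
from statistics import median

grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

ordered = sorted(grades)
n = len(ordered)

# Split into halves; for an odd count the middle value is left out of both.
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2 :]

q1 = median(lower_half)  # median of the first half
q3 = median(upper_half)  # median of the second half
iqr = q3 - q1

print(q1, q3)  # 72 89
print(iqr)     # 17
```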
5 Number Summary
• The 5 Number Summary is:
Minimum
Q1
Median
Q3
Maximum
• For our example, the minimum (lowest value) was
53 and the maximum (highest value) was 97.
• So our 5 Number Summary for this data set is:
53
72
77
89
97
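As a rough sketch, the five values can be gathered by one small Python function. The name five_number_summary is hypothetical, and the quartiles use the median-of-halves method from the IQR slides.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2 :]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

print(five_number_summary(grades))  # (53, 72, 77.0, 89, 97)
```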
OUTLIERS:
Deviations from the majority of the data
• When you look at a graph for a set of data, an outlier is
typically a visibly different point. It will not “fit” with the
rest of the data.
• There are multiple ways to define an outlier.
• The definition used here: an outlier is a data point that is more than two
standard deviations from the mean.
Outlier Example
• Assume that the mean is 75 and the standard
deviation is 11. Two standard deviations is 22, so we
would consider anything above 97 an outlier. Likewise,
we would consider anything below 53 an outlier.
• To determine if there are any outliers, we simply
look at the set of data to see if there are any values
more than 2 standard deviations away from the
mean.
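The two-standard-deviation check can be sketched in Python like this. The helper find_outliers and the list scores are hypothetical, made up for illustration; only the mean of 75, the standard deviation of 11, and the quiz grades come from the slides.

```python
def find_outliers(data, mean, sd, k=2):
    """Return the values that are more than k standard deviations from the mean."""
    return [x for x in data if abs(x - mean) > k * sd]

# Slide example: mean 75 and standard deviation 11 give cutoffs of 53 and 97.
print(75 - 2 * 11, 75 + 2 * 11)  # 53 97

# Hypothetical scores: 53 and 97 sit exactly on the cutoffs, so they are not
# "more than" two standard deviations away and are not flagged.
scores = [50, 53, 75, 97, 99]
print(find_outliers(scores, mean=75, sd=11))  # [50, 99]

# The quiz grades, using the mean and standard deviation found earlier.
grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]
print(find_outliers(grades, mean=79.2273, sd=10.4921))  # [53]
```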
Tonight’s Assignment
• Worksheet 1:
– Mean Absolute Deviation
– Standard Deviation
– Finding Outliers
DAY 3
Exploring Basic One Variable Graphs
DISTRIBUTION?
The distribution of a variable tells
us what values it takes and how
often it takes these values.
How do we display a distribution?
• bar graph
• pie chart
• histogram
• stem plots
• time plots
• dot plots
• box plot
Why do we graph?
We want to get the overall picture of
what is taking place before we start
looking at numerical summaries of
the data. The graphs will have
particular features worth discussing
that give us insight into the data.
What graph & when?
Categorical Variables
• bar graph
• pie chart: use with parts of a whole
Quantitative Variables
• histogram: when values are widely spread
• dot plot: when few values are taken
• stem plot: good with small data sets (<100)
• time plot: to display change over time
• box plot: to display the 5 number summary
Categorical variables
• Given Categorical
variables, we can use bar
charts and pie charts to
express them in a visual
manner.
• Ex. 1 – For the given data,
create a bar chart and a pie
chart to express them in a
clear and visual manner.
Favorite Music Genre    Count (thousands)    Percent
Classical               20                   6.5%
Rock                    100                  32.3%
Country                 40                   12.9%
Alternative             90                   29.0%
Heavy metal             60                   19.3%
BAR CHART: Favorite Music Genre (horizontal axis: Genre; vertical axis: Thousands of People, 0 to 120)
Pie Chart: Favorite Music Genre (slices: Classical 6%, Rock 33%, Country 13%, Alternative 29%, Heavy Metal 19%)
Graphing Focus
• In this class, we will not be creating graphs for
categorical variables.
• We will focus on graphing quantitative variables.
• By virtue of learning how to create these, you
should be more comfortable reading these graphs
when they are presented to you.
Constructing Histograms
A histogram is used to graph the distribution of a single
quantitative variable.
To construct a histogram first divide the
range of the data into classes of equal
width. Second, count the number of
observations in each class. Finally, draw
the histogram being sure to title and
label the graph appropriately.
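The first two steps (divide into classes, count observations) can be sketched in Python using the quiz grades from Day 1. The class width of 10 is an assumption chosen for illustration, and the rows of # marks are only a text stand-in for the drawn bars.

```python
from collections import Counter

grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]

class_width = 10  # assumed class width; pick one that suits the data

# Label each grade by the lower bound of its class (53 -> 50, 76 -> 70, ...).
counts = Counter((g // class_width) * class_width for g in grades)

# Print one row per class, including any empty classes in between.
for lower in range(min(counts), max(counts) + class_width, class_width):
    upper = lower + class_width - 1
    print(f"{lower}-{upper}: {'#' * counts[lower]} ({counts[lower]})")

# 50-59: # (1)
# 60-69: # (1)
# 70-79: ########## (10)
# 80-89: ##### (5)
# 90-99: ##### (5)
```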
Interpreting Histograms
(and other similar graphs)
• There are really three things for us to consider:
– CENTER
– SPREAD (or dispersion)
– OUTLIERS
• We have already spent some time exploring
measures of center and spread.
• We want to also consider outliers.
Tonight’s Assignment
• Worksheet 2: Histograms
DAY 4
Samples & Populations
How can we gather data about a
very large group?
Population vs. Sample
A population is a group of people or objects that you want information about.
Example: The height of 15-year-old girls in the U.S.
A sample is a subset of a population.
Example: A sample of 15-year-old girls in the U.S.
Types of Samples
Self-selected sample: People volunteer to participate
in the group.
Systematic sample: A rule is used to select members
of a population (people or data) to participate in the
group.
Convenience sample: The easiest members of a
population are selected (such as people sitting in the
1st row).
Random sample: Each member of the population has
an equal chance of being selected.
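Three of these sample types can be imitated in Python, as sketched below. The population of 30 student ID numbers, the sample size of 5, and the "every 6th member" rule are all made up for illustration. A self-selected sample depends on who volunteers, so it is not simulated here.

```python
import random

# Hypothetical population: student ID numbers 1 through 30.
population = list(range(1, 31))

# Random sample: every member has an equal chance of being selected.
rng = random.Random(6)  # fixed seed only so the sketch is repeatable
random_sample = rng.sample(population, 5)

# Systematic sample: a rule selects members, here "take every 6th member".
systematic_sample = population[5::6]

# Convenience sample: the easiest members to reach, here the first 5 listed.
convenience_sample = population[:5]

print(random_sample)
print(systematic_sample)   # [6, 12, 18, 24, 30]
print(convenience_sample)  # [1, 2, 3, 4, 5]
```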
What is good & bad about these sample types?
The Goal: Unbiased Sample
An unbiased sample accurately represents the
population.
A biased sample may over-represent some members
of the population, so it is less likely to represent the
entire population accurately.
How do we know if we have a
good sample?
Margin of Error
• We can calculate how closely a sample
measures the exact population by using the
Margin of Error.
• The Margin of Error gives a limit on how
much the sample data will vary from the
entire population data.
• It is calculated as: $\pm \frac{1}{\sqrt{n}}$, where n is the number of individuals in the sample.
Lunch Habits
• In a survey of 990 workers, 30% said that they eat lunch at
home during a typical work week.
• What is the Margin of Error for the survey?
– Margin of Error = $\pm \frac{1}{\sqrt{990}} \approx \pm 0.032 = \pm 3.2\%$
• What interval is likely to contain the exact percent of all workers
who eat lunch at home each week?
– Find the low end and high end of the population range:
30% - 3.2% = 26.8%
30% + 3.2% = 33.2%
– So, between 26.8% and 33.2% of all workers are likely to eat lunch at
home each week.
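The survey numbers can be checked with a short Python sketch. It uses the one-over-square-root-of-n margin of error from this unit, and the variable names are illustrative.

```python
import math

n = 990          # number of workers surveyed
sample_pct = 30  # percent who said they eat lunch at home

# Margin of error as a percent: 1 / sqrt(n), times 100.
margin_pct = 100 / math.sqrt(n)

low = sample_pct - margin_pct
high = sample_pct + margin_pct

print(round(margin_pct, 1))           # 3.2
print(round(low, 1), round(high, 1))  # 26.8 33.2
```

Because the margin depends on the square root of n, cutting the margin of error in half takes about four times as many people in the sample.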
Tonight’s Assignment
• Text Book:
– p. 270 # 1-25 ODD
– p. 275 # 5-9 ALL
• QUIZ TOMORROW on Central Tendency,
Samples & Populations!