Download PowerPoint

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Time series wikipedia , lookup

Categorical variable wikipedia , lookup

Transcript
So What Do We Know?
• Variables can be classified as
qualitative/categorical or quantitative.
• The context of the data we work with is
very important. Always think about the
“Five W’s”—Who, What, When, Where,
Why (and How)—when examining a set of
data.
The Three Rules of Data Analysis
• The three rules of data analysis won’t be difficult
to remember:
1.Make a picture—things may be revealed that are not
obvious in the raw data. These will be things to
think about.
2.Make a picture—important features of and patterns
in the data will show up.
3.Make a picture—the best way to tell others about
your data is with a well-chosen picture.
Qualitative Data :: Making Piles
• We can “pile” the data by counting the number
of data values in each category of interest.
• We can organize these counts into a frequency
table, which records the totals & category names.
• A relative frequency table is similar, but gives the
percentages (instead of counts) for each category.
What Do Frequency Tables Tell Us?
• Frequency tables and relative frequency tables describe the
distribution of a categorical variable because they name the
possible categories and tell how frequently each occurs.
• Graphs … Pie Charts & Bar Graphs (software)
• A contingency table allows us to
look at two qualitative variables together.
• Note the totals in the margins of the table.
Each set of totals gives us the marginal
distribution of the respective variable.
So What Do We Know?
• Qualitative variables can be summarized in
frequency or relative frequency tables.
• Categorical variables can be displayed
with bar graphs and/or pie charts.
• A contingency table summarizes two variables at a
time. From a contingency table we can find the
marginal distribution for each variable or the
conditional distribution for one variable
conditioned on the other variable.
Describing Quantitative Data
• describing a quantitative variable’s distn,
make sure to always tell about L.O.S.S. !!!
Location/Center/Typical Value
Outliers
Spread/Dispersion
Shape/Distribution
Displaying Quantitative Data
HISTOGRAMS
• First, slice up the entire span of values covered by the
quantitative variable into equal-width piles called bins.
• The bins and the counts in each bin
give the distribution of the quantitative variable.
• One graphical display of the distribution of a quantitative
variable is called a histogram, which plots the bin counts as
the heights of bars (like a bar graph).
• A relative frequency histogram displays
the percentage of cases in each bin instead of the count.
• Stem-and-leaf displays show the
distribution of a quantitative variable, like
histograms do, while preserving the
individual values.
• Stem-and-leaf displays contain all the
information found in a histogram.
• A dotplot is a simple display.
It just places a dot for each case in the data.
SHAPE
1. Symmetric
2. Skewed
3. Uniform or Rectangular
Symmetric
Skewed
Right-Tailed … Left-Tailed
Uniform/Rectangular
So What Do We Know?
• Quantitative variables can be displayed using
histograms, dotplots, and/or stem-and-leaf
displays. These displays help us to see the
distributions of the variables.
• Consider L.O.S.S. when looking at these displays!
• Distributions can be classified as symmetric or
skewed (look at how the tails behave with respect
to the rest of the distribution).
• The mean for quantitative data is obtained
by dividing the sum of all values by the
number of values in the data set.
x

x
n
The following are the ages of all eight
employees of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
x 362

x

 45.25 years
n
8
Thus, the mean age of all eight employees of
this company is 45.25 years,
or 45 years and 3 months.
The median is the value of the middle term
in a data set that has been ranked in
increasing order.
The calculation of the median consists of
the following two steps:
1.Sort/Arrange the data set in increasing order
2.Find the middle term in a data set with n values.
The value of this term is the median.
The following data give the weight lost
(in pounds) by a sample of five
members of a health club at the end of
two months of membership:
10 5 19 8 3
Find the median.
First, we rank the given data in increasing
order as follows:
3 5 8 10 19
Therefore, the median is 3 5 8 10 19
The median weight loss for this sample of
five members of this health club is 8
pounds.
The median gives the “center” with half the data
values to the left of the median and half to the
right of the median. The advantage of using the
median as a measure of central tendency is that
it is less influenced by outliers & skewness.
Consequently, the median is preferred over the
mean as a measure of central tendency for data
sets that contain outliers and/or skewness.
The mode is the value that occurs with the
highest frequency in a data set.
• Range = Largest value – Smallest Value
• The range, like the mean has the disadvantage
of being influenced by outliers.
• Its calculation is based on two values only:
the largest and the smallest.
• The standard deviation is
the most used measure of dispersion.
• The value of the standard deviation tells
how closely the values of a data set are
clustered around the mean.
x
82
95
67
92
deviation
82 – 84 = -2
95 – 84 = +11
67 – 84 = -17
92 – 84 = +8
∑(deviation) = 0
• standard deviation stdev =
sqrt (sum squared deviations divided by n-1)
Example :: sqrt[(4 + 121 + 289 + 64)/3]
sqrt(478/3) = sqrt(159.3) = 12.62
• A numerical measure such as the mean,
median, mode, range, variance, or standard
deviation calculated for a population data
set is called a population parameter, or
simply a parameter.
• A summary measure calculated for a sample
data set is called a sample statistic, or
simply a statistic.
For a bell shaped distribution
approximately
1. 68%
of the observations lie within
one standard deviation of the mean
2. 95% of the observations lie within
two standard deviations of the mean
3. 99.7% of the observations lie within
three standard deviations of the mean
The age distribution of a sample of 5000
persons is bell-shaped with a mean of 40
years and a standard deviation of 12 years.
Determine the approximate percentage of
people who are 16 to 64 years old.
Quartiles are three summary measures that
divide a ranked data set into four equal
parts. The second quartile is the same as the
median of a data set. The first quartile is the
value of the middle term among the
observations in the lower half, and the third
quartile is the value of the middle term
among the observations in the upper half.
Each of these portions contains 25% of the observations of a
data set arranged in increasing order
25%
25%
Q1
25%
Q2
25%
Calculating Interquartile Range
The difference between the third and first
quartiles gives the interquartile range; that
is,
IQR = Interquartile range = Q3 – Q1
The following are the ages of nine
employees of an insurance company:
47 28 39 51 33 37 59 24 33
a)Find the values of the three quartiles. Where
does the age of 28 fall in relation to the ages
of the employees?
b)Find the interquartile range.
The following data are the incomes
(in thousands of dollars)
for a sample of 12 households.
35 29 44 72 34 64 41 50 54 104 39 58
Construct a box-and-whisker plot for these data.
Step 1.
29 34 35 39 41 44 50 54 58 64 72 104
Median = (44 + 50) / 2 = 47
Q1 = (35 + 39) / 2 = 37
Q3 = (58 + 64) / 2 = 61
IQR = Q3 – Q1 = 61 – 37 = 24
Step 2.
1.5 x IQR = 1.5 x 24 = 36
Lower inner fence = Q1 – 36 = 37 – 36 = 1
Upper inner fence = Q3 + 36 = 61 + 36 = 97
Step 3.
Smallest value within the two inner fences = 29
Largest value within the two inner fences = 72
First
quartile
Third
quartile
Median
25
35
45
55
65
Income
75
85
95
105
Smallest value
within the two inner
fences
Third
First
quartile
quartile
Median
An
outlier
Largest value
within two inner
fences

25
35
45
55
65
Income
75
85
95
105
What Can Go Wrong?
• Do a reality check—
don’t let technology do your thinking for you.
• Don’t forget to sort
the values before finding the median … quartiles.
• Don’t compute numerical summaries of a
categorical variable.
• Watch out for multiple peaks—multiple peaks
might indicate multiple groups in your data.
What Can Go Wrong?
• Be aware of slightly different methods—
different statistics packages and calculators may
give you different answers for the same data.
• Beware of outliers.
• Make a picture (make a picture, make a picture).
• Be careful when comparing groups
that have very different spreads.
So What Do We Know?
• We describe distributions in terms of L.O.S.S.
• For symmetric distributions, it’s safe to use the
mean and standard deviation; for skewed
distributions, it’s better to use the median and
interquartile range.
• Always make a picture—don’t make judgments
about which measures of center and spread to use
by just looking at the data.