Introduction to Statistics
Topics 7 - 10
Nellie Hedrick
Topic 7 – Displaying and Describing Distributions
• Center – the center of a distribution is the most
important feature to examine when analyzing data.
• Spread (variability, consistency) – how widely the data are
distributed is the second important feature.
• Shape – the shape of the distribution is the third
important feature.
Symmetric and Skewed Distributions
[Figure: four example distribution shapes]
• Skewed to the left
• Symmetric – single peak
• Skewed to the right
• Symmetric – two peaks
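Graphical Representations of Data
Quantitative Variables
Stem plot (21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10)

Leaves in the order the data were recorded:
Stem | Leaf
1 | 980
2 | 102532
3 | 1
4 | 0

Leaves sorted from least to greatest:
Stem | Leaf
1 | 089
2 | 012235
3 | 1
4 | 0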
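The sorted stem plot above can be reproduced with a short script. This is a minimal sketch, assuming non-negative whole-number data with the tens digit as the stem and the ones digit as the leaf; the function name `stem_plot` is made up for illustration, and stems with no leaves are simply skipped.

```python
from collections import defaultdict

def stem_plot(values):
    """Return stem plot rows: tens digit as stem, sorted ones digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each row's leaves in order
        stem, leaf = divmod(v, 10)    # e.g. 25 -> stem 2, leaf 5
        leaves[stem].append(leaf)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves[stem])}"
            for stem in sorted(leaves)]

data = [21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10]
for row in stem_plot(data):
    print(row)
```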
• Activity 7-5
• Exercise 7-10
• Exercise 7-21
Definition
• Side-by-side stem plot – a common set of stems is placed in the
middle of the display, with leaves branching out to the left and
to the right. The convention is to order the leaves from the
middle out, from least to greatest.
• Histogram – a graphical display similar to a dot plot or stem plot.
A histogram is more practical for larger datasets.
• Divide the range of the data into subintervals (bins) of equal
length.
• Count the number (frequency) of observational units in
each subinterval.
• The bar heights represent the proportions (relative
frequencies) of observational units in each subinterval.
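The histogram steps above (equal-length bins, counts, relative frequencies) can be sketched as follows. This is only an illustration: the function name `relative_frequencies` and the choice of three bins are assumptions, the data are reused from the stem plot example, and the code assumes the values are not all identical.

```python
def relative_frequencies(values, n_bins):
    """Split the data range into n_bins equal-length bins and return
    (left edge, right edge, relative frequency) for each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c / len(values))
            for i, c in enumerate(counts)]

data = [21, 20, 40, 22, 31, 19, 25, 23, 22, 18, 10]
for left, right, rel in relative_frequencies(data, 3):
    print(f"{left:.0f} to {right:.0f}: {rel:.2f}")
```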
Wrap up, Watch out and in Brief
• The direction of skewness is indicated by the longer tail.
• Pay attention to the units of the stem plot.
• Pay attention to outliers – identify them and investigate
possible explanations for their occurrence. Make
sure they are not typing errors.
• Remember context! Your description of the data
should be clear enough for anyone to read.
• Remember to label your graphs.
• Examine different types of graphs to see which gives
the better representation.
• Anticipate features of the data by considering the
nature of the variable involved.
Topic 8 – Measures of Center
• Mean – the ordinary average. It is calculated by adding all the
values and dividing by the number of observational units.
• Median – the value of the middle observational unit when the
observational units are sorted from low to high.
• For an odd number of observational units, the median is at
location (n+1)/2.
• For an even number of observational units, the median is the
average of the middle two values.
• Resistant – a measure whose value is relatively unaffected by the
presence of outliers in a distribution. The median is resistant; the
mean is not.
• Mode – the value that appears most often in a distribution.
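To see what "resistant" means in practice, the sketch below changes one value of a small dataset into an extreme one and compares the two measures. The value 400 is a made-up outlier (as if 40 had been mistyped), not something from the slides.

```python
from statistics import mean, median

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
# Hypothetical typing error: 40 recorded as 400.
with_outlier = [400 if x == 40 else x for x in data]

print("mean:  ", round(mean(data), 2), "->", round(mean(with_outlier), 2))
print("median:", median(data), "->", median(with_outlier))
```

The mean moves a long way toward the outlier, while the median stays where it was.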
Describing Distributions with Numbers
Example: 20, 40, 22, 22, 21, 31, 19, 25, 23
• Mean – average
• Median – measure of center
• Mode – most repeated value
• Minimum – smallest value in the data set
• Maximum – largest value in the data set
Describing Distributions with Numbers
Example: 20, 40, 22, 22, 21, 31, 19, 25, 23
• Mean – average:
$\bar{x} = \frac{20 + 40 + 22 + 22 + 21 + 31 + 19 + 25 + 23}{9} = \frac{223}{9} \approx 24.78$
• Median – measure of center. Sort the data: 19 20 21 22 22 23 25 31 40.
There are 9 observations and (9 + 1)/2 = 5, so the median is the value
in the 5th location, which is 22.
• Minimum = 19
• Maximum = 40
• Mode = 22
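The location rule for the median can be written out directly. This is a minimal sketch; the helper name `median_by_position` is made up for illustration and follows the odd and even cases described in Topic 8.

```python
def median_by_position(values):
    """Median via the location rule: the (n+1)/2-th sorted value for odd n,
    the average of the two middle sorted values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # subtract 1: Python lists start at index 0
    mid = n // 2
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
print("mean:  ", round(sum(data) / len(data), 2))
print("median:", median_by_position(data))
print("min:   ", min(data), " max:", max(data))
```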
Describing Distributions with Numbers
Example: 20, 40, 22, 22, 21, 31, 19, 25, 23
• Mean - Average
• Median – Measuring Center
• Minimum
• Maximum
• Mode
TI-83: Press [STAT] and choose [1:Edit]. Enter all the data from the example into L1, pressing [ENTER] after each entry.
After completing data entry, press [2nd] [QUIT], then [STAT] [CALC] [1:1-Var Stats] [L1] [ENTER].
Use the up and down arrow keys to view all the information.
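Median and Mean of a Density Curve
[Figure: three density curves with the mean, median, and mode marked]
• Symmetric: mean, median, and mode coincide
• Skewed right: mode, then median, then mean (from left to right)
• Skewed left: mean, then median, then mode (from left to right)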
Wrap up and Warning
• Center is a property of a distribution. Mean and median are two
ways to measure center. Neither one is
synonymous with center. Each has its
own properties and strengths.
• Center is only one aspect of a distribution of data.
Measures of center do not tell the whole story.
Other important features are spread, shape, clusters
and outliers.
• Mode applies to categorical as well as
quantitative variables.
• The notion of center does not make sense for categorical
variables.
• Exercise 8-7 page 161
• Exercise 8-9 page 161
• Exercise 8-17 page 163
Topic 9 – Measures of Spread
• Range – difference between the maximum and the minimum
• Lower quartile – value at the ¼ (25%) location of the sorted data
• Upper quartile – value at the ¾ (75%) location of the sorted data
• Interquartile range (IQR) – difference between the upper
and lower quartiles
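The measures of spread listed above can be computed with Python's standard `statistics` module. This is a sketch using the example data from Topic 8; the choice of quartile method is an assumption, since different tools use different rules, and the two methods offered by `statistics.quantiles` give different quartiles for this small dataset.

```python
from statistics import quantiles

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]

data_range = max(data) - min(data)

# Default "exclusive" method: one common quartile convention.
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# "inclusive" method: a different convention, shown for comparison.
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")

print("range:", data_range)
print("exclusive: Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("inclusive: Q1 =", q1_inc, " Q3 =", q3_inc, " IQR =", q3_inc - q1_inc)
```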
Measuring the Spread
• The standard deviation (s) – the square root of the variance
$s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$
Standard deviation: a measure of the spread about the mean of a
distribution. The variance is the average of the squared deviations of
the observations from their mean (with n − 1 as the divisor); the
standard deviation is the square root of the variance.
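The formula can be followed step by step in code. This is a minimal sketch using the example data from Topic 8; the function name `sample_sd` is made up for illustration, and the result is compared with the standard library's `statistics.stdev`, which also divides by n − 1.

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Standard deviation s: square root of the sum of squared deviations
    from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    variance = sum((x - x_bar) ** 2 for x in values) / (n - 1)
    return sqrt(variance)

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
print(round(sample_sd(data), 2))   # by hand
print(round(stdev(data), 2))       # library check
```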
Describing Distributions with Numbers
• Be aware that various software packages and
calculators might use slightly different rules for
calculating quartiles
• It can be tempting to regard range and IQR as an
interval of values, but they should each be reported
as a single number that measures the spread of the
distribution
• Measures of spread apply only to quantitative
variables, not categorical ones.
Activity 9-5 page 182
Exercise 9-12 page 190
Exercise 9-22 page 193
Watch out
• Variability can be a tricky concept to grasp, but it is
absolutely fundamental to working with data.
• When looking at the distribution of a variable, make sure
to focus on variability in the horizontal values (the
variable) and not the heights (frequencies).
• A larger number of distinct values represented in a
histogram does not necessarily indicate greater
variability. Consider how far the values fall from the
center rather than the variety of their exact
numerical values.
Mound-Shaped Distribution – Empirical rule
About 68% of the data fall within one standard deviation of the mean
About 95% of the data fall within two standard deviations of the mean
About 99.7% of the data fall within three standard deviations of the mean
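For an exactly normal distribution, the three percentages in the empirical rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.1f}%")
```

The computed values are close to, but not exactly, 68, 95, and 99.7, which is why the rule is stated with "about".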
The 68-95-99.7 rule
Attendance at a university's basketball games follows a normal
distribution with mean µ = 8,000 and standard deviation σ =
1,000. Use the 68–95–99.7 rule and give your answer as a percent.
a. Estimate the percentage of games that have between 6,000 and 8,000 people in attendance.
b. Estimate the percentage of games that have more than 7,000 people in attendance.
c. Estimate the percentage of games that have less than 6,000 people in attendance.
d. Estimate the percentage of games that have less than 8,000 people in attendance.
e. Estimate the percentage of games that have less than 5,000 people in attendance.
f. Estimate the percentage of games that have more than 10,000 people in attendance.
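One way to organize these estimates is to split the 68–95–99.7 rule into bands one standard deviation wide (34%, 13.5%, and 2.35% on each side of the mean, with 0.15% beyond three standard deviations) and add up the bands that each question covers. The sketch below does that; the helper names are made up for illustration, and it only handles attendance figures that are a whole number of standard deviations from the mean, within three standard deviations.

```python
MU, SIGMA = 8000, 1000  # mean and standard deviation from the exercise

# Percent of data in each one-SD-wide band, keyed by the band's lower edge
# measured in standard deviations from the mean.
BANDS = {-3: 2.35, -2: 13.5, -1: 34.0, 0: 34.0, 1: 13.5, 2: 2.35}
TAIL = 0.15  # percent beyond three standard deviations on each side

def to_sd(x):
    """Whole number of standard deviations between x and the mean."""
    return (x - MU) // SIGMA

def pct_between(lo, hi):
    return round(sum(BANDS[z] for z in range(to_sd(lo), to_sd(hi))), 2)

def pct_below(x):
    return round(TAIL + sum(BANDS[z] for z in range(-3, to_sd(x))), 2)

def pct_above(x):
    return round(100 - pct_below(x), 2)

print("a.", pct_between(6000, 8000))
print("b.", pct_above(7000))
print("c.", pct_below(6000))
print("d.", pct_below(8000))
print("e.", pct_below(5000))
print("f.", pct_above(10000))
```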
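The 68-95-99.7 rule – areas by region
[Figure: normal curve divided at one, two, and three standard deviations from the mean]
• Within one standard deviation of the mean: 34% on each side (68% in total)
• Between one and two standard deviations from the mean: 13.5% on each side
• Between two and three standard deviations from the mean: 2.35% on each side
• Beyond three standard deviations from the mean: 0.15% on each side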
The Standard Normal Distribution
As the 68-95-99.7 rule suggests, all normal distributions share a
common property.
Z-score
The z-score is the result of standardization. If x is an
observation from a distribution that has mean μ and
standard deviation σ, the standardized value of x is
$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$
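Calculating Standard Normal Z
Example: Calculate the standard normal value (z-score) for x = 120, where the mean μ
= 170 and the standard deviation σ = 30.
$z = \frac{x - \mu}{\sigma} = \frac{120 - 170}{30} \approx -1.67$
[Figure: x = 120 on the normal curve with µ = 170 and σ = 30 corresponds to z = −1.67 on the standard normal curve with µ = 0 and σ = 1]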
Normal distribution
Same mean but different standard deviations (S2 < S1): a larger
standard deviation gives a larger spread.
[Figure: two normal curves with the same mean, labeled S2 and S1]
The length of human pregnancies from
conception to birth is known to be
normally distributed with a mean of 266
days and standard deviation of 16 days.
1. What proportion of pregnancies last
between 250 and 282 days?
2. What proportion of pregnancies last
between 232 and 282 days?
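The first interval is exactly one standard deviation on either side of the mean, so the 68–95–99.7 rule applies directly. The second interval starts at 232 days, which is not a whole number of standard deviations below 266, so a normal-curve calculation is needed. The sketch below uses `statistics.NormalDist` from the Python standard library as one way to check both answers; a z-table would serve equally well.

```python
from statistics import NormalDist

pregnancy = NormalDist(mu=266, sigma=16)  # days, as given in the problem

# Proportion between two values = difference of the cumulative proportions.
p1 = pregnancy.cdf(282) - pregnancy.cdf(250)
p2 = pregnancy.cdf(282) - pregnancy.cdf(232)

print(f"250 to 282 days: {p1:.3f}")
print(f"232 to 282 days: {p2:.3f}")
```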
Wrap up
• In the study of variability, you see that even if two
datasets have similar centers, the spread of the
values might differ substantially.
• The z-score is a useful tool when you are comparing two
or more datasets. It serves as a ruler for
measuring distances.
• Variability is a property of a distribution; standard
deviation and IQR are two ways to measure
variability.
• Standard deviation, like mean absolute deviation, can be loosely
interpreted as the typical deviation of an observation
from the mean.
Topic 10 – More Summary Measures and Graphs
• Five-number summary (FNS) – the FNS provides a
quick and convenient description of where the four
quarters of the data in a distribution fall.
• Median
• Quartiles (Q1, Q3)
• Extremes (min, max)
• Box plot – the FNS forms the basis for a graph called
a box plot. Box plots are especially useful for
comparing distributions of a quantitative variable
across two or three groups.
Measuring the Center and Spread
• Five-number summary
• Mean and standard deviation
Choosing a Summary
• Five-number summary – for a skewed distribution or one with outliers
• Mean and standard deviation – for a symmetric distribution
The Five-Number Summary
Box Plot
• Maximum
• Q3
• Median
• Q1
• Minimum
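The five values that a box plot is drawn from can be collected in one small function. This is a sketch using the example data from Topic 8; the function name `five_number_summary` is made up for illustration, and the quartiles come from `statistics.quantiles` with its default method, so other software may report slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4)  # default "exclusive" quartile rule
    return min(values), q1, med, q3, max(values)

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
print(five_number_summary(data))
```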
Modified box plot
• Modified box plot – conveys additional information by
treating outliers differently. On these graphs each
outlier is marked with a special symbol, and each
whisker extends only to the most extreme non-outlier.
• We call any observation falling more than 1.5 times
the IQR away from the nearer quartile an
outlier.
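The 1.5 × IQR rule can be applied mechanically. The sketch below flags outliers in the example data from Topic 8 and finds where the whiskers of a modified box plot would end; the function name `split_outliers` is made up for illustration, and the quartiles use the default method of `statistics.quantiles`, which is one of several conventions.

```python
from statistics import quantiles

def split_outliers(values):
    """Return (outliers, (low whisker end, high whisker end)) under the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    return outliers, (min(inside), max(inside))

data = [20, 40, 22, 22, 21, 31, 19, 25, 23]
outliers, whiskers = split_outliers(data)
print("outliers:", outliers)
print("whiskers end at:", whiskers)
```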
Activity 10-1 page
Exercise 10-22 page 217
Watch out and Wrap up
• Box plots can be tricky to read and interpret. A box plot only
shows that the data are divided into four pieces, each
containing about 25% of the data.
• Box plots and modified box plots are nice tools for
comparing groups. Make sure to use the same
scale.