SCIENTIFIC INQUIRY AND ANALYSIS
UNIT 2
STATISTICAL DATA ANALYSIS
SCIENTIFIC DATA ANALYSIS
OBJECTIVES:
The student will be able to:
• Create a frequency table from a set of data.
(CCCS.HSS.ID.A.1)
• Compute, interpret, and analyze the measures of central
tendency (mean, median, and mode) of a set of data.
(CCCS.HSS.ID.A.2)
• Compute measures of spread (variance, standard
deviation, quartiles, and interquartile range)
(CCCS.HSS.ID.2)
• Graph one variable by hand. (histogram, boxplot)
(CCCS.HSS.ID.A.1)
• Identify outliers informally and recognize their effect on a set of
data. (CCCS.HSS.ID.A.3)
• Define the characteristics of the Normal distribution by examining a
histogram. (CCCS.HSS.ID.4)
• Explain how a histogram, which is a discrete probability
distribution, is related to the Normal distribution curve, a continuous
probability distribution. (CCCS.HSS.ID.A.4)
• Determine if a given set of data is approximately Normal using the
empirical rule (68 - 95 - 99.7 rule). (CCCS.HSS.ID.A.4)
• Estimate areas under the Normal curve using the empirical rule.
• Graph two variables by hand (scatterplot).
(CCCS.HSS.ID.B.6)
• Describe a scatterplot in terms of form, direction, strength,
and the presence of outliers. (CCCS.HSS.ID.B.6)
• Find equations of lines of best fit by fitting a line by hand
and using technology (TI-84 regression function and/or
Excel). (CCCS.HSS.ID.B.6.A)
• Interpret the slope (rate of change) and the intercept
(constant term) of a linear model in the context of the data.
(CCCS.HSS.ID.C.7)
• Compute the correlation coefficient using technology
(TI-84 or Excel) and interpret it in the context of the
data. (CCCS.HSS.ID.C.8)
• Informally assess the fit of a function by plotting and
analyzing residuals. (CCCS.HSS.ID.B.6.B)
• Make predictions based upon analysis of data.
(5.2.12.A.3)
• Distinguish between correlation and causation.
(CCCS.HSS.ID.C.9)
• Statistics – collection of methods for planning
experiments, obtaining data, and then
organizing, summarizing, presenting,
analyzing, interpreting, and drawing
conclusions from the data.
• Measures of Central Tendency – a method to
describe the entire sample or population in a
single number known as an average (mode,
median and mean)
• Mode – the value that occurs most frequently
in data.
– Example 1: What is the mode of the following
data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
– Example 2: What is the mode of the following
data: (2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2,
5, 5)?
– Mode is not a stable average, but it gives the most common value in a distribution when that is the information desired.
– A data set can sometimes have more than one mode.
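The two mode examples can be checked with a short Python sketch. It assumes the standard-library function `statistics.multimode`, which returns every value tied for the highest count; the variable names are illustrative only.

```python
from statistics import multimode

example_1 = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
example_2 = [2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5]

# multimode returns all values that share the highest frequency,
# listed in the order they first appear in the data.
modes_1 = multimode(example_1)
modes_2 = multimode(example_2)

print("Example 1 mode(s):", modes_1)
print("Example 2 mode(s):", modes_2)
```

A list with more than one entry means several values tie for most frequent, which is the "more than one mode" case described above.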
• Median – the central value that occurs in an
ordered distribution of data.
– If there is an odd number of data values, it is the center value.
– If there is an even number of data values, there are two center values, therefore:
Median = (sum of the two middle values) / 2
– Example 1: What is the median of the following
data: (62, 3, 5, 28, 67, 33, 22, 2, 10)?
– Example 2: What is the median of the following
data: (62, 3, 5, 28, 67, 33, 22, 2, 10, 120)?
– Median is a more stable average than the mode,
but it does not indicate the range of values above
or below it.
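The odd and even cases of the median rule can be sketched in Python as follows. The helper name `median_by_hand` is hypothetical; it simply follows the two rules stated earlier (center value for an odd count, mean of the two center values for an even count).

```python
example_1 = [62, 3, 5, 28, 67, 33, 22, 2, 10]        # odd number of values
example_2 = [62, 3, 5, 28, 67, 33, 22, 2, 10, 120]   # even number of values

def median_by_hand(values):
    """Order the data, then pick the center value (or average the two centers)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(example_1))
print(median_by_hand(example_2))
```

Note how adding the single large value 120 shifts the median only slightly, which is why the median is called a more stable average.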
• Mean – adds all values of a distribution of data and divides by the number of data values.
Sample mean: $\bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all values}}{\text{number of values}}$
Population mean: $\mu = \frac{\sum x}{N}$
– Trimmed Mean: removes the highest and lowest values of a group of data before taking a mean. The typical trim amounts are either 5% or 10%.
– 5% Trimmed Mean: take 5% of the number of data points, round the result up to a whole number, remove that many values from both the top and the bottom of the ordered data, and then take the average.
– Example: Given the following data, find the 5% trimmed mean: 34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100.
• 5% of 16 values is 0.8, therefore round up to 1 and remove the top and bottom scores.
• Remove 34 and 100; the remaining 14 values add up to 1181, so the trimmed mean is 1181 / 14 = 84.4%.
• If no trimming is done, then the mean would be 82.2%.
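The trimmed-mean steps can be sketched in Python. The function name `trimmed_mean` is hypothetical, and the sketch assumes the trim count is always rounded up, as in the worked example.

```python
import math

scores = [34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100]

def trimmed_mean(values, trim_fraction):
    """Drop the same number of values from each end, then average the rest."""
    ordered = sorted(values)
    # Assumption: round the trim count up (0.8 becomes 1), as in the example.
    k = math.ceil(trim_fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

plain_mean = sum(scores) / len(scores)
five_percent = trimmed_mean(scores, 0.05)

print(round(plain_mean, 1), round(five_percent, 1))
```

Trimming removes the low score of 34, so the trimmed mean comes out higher than the plain mean.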
• Measures of Variation – describe how spread out the data are.
• Range – the difference between the largest and
smallest values of a distribution.
– Example 1: What is the range of the following
data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
– Range fails to tell how much values vary from one
another.
• Sample Standard Deviation – a measurement that gives you a better idea of how the data entries differ from the mean.
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Sample std. deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
– x = a value in the distribution
– $\bar{x}$ = the sample mean value of the distribution
– n = the total number of values in a sample distribution
• Population Standard Deviation – this is the same as the sample standard deviation, except that it includes the complete population that you are studying, not just a sample set. NOTE: the symbol is different and you divide by the size of the whole population (N).
Population std. deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
– x = a value in the distribution
– $\mu$ = the population mean value of the distribution
– N = the total number of values in the population
• Standard Deviation
– Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).
$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
1 | 1 – 6.5 = -5.5 | 30.25
2 | |
7 | |
9 | |
10 | |
10 | |
Mean = $\bar{x}$ =
$\sum (x - \bar{x})^2$ =
$s^2$ =
$s$ =
• Standard Deviation – the following is an alternate way to calculate the sample std. deviation.
$s = \sqrt{\frac{SS_x}{n - 1}}$ where $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
• Standard Deviation
– Previous example: Find the standard deviation of the following
values: (1, 2, 7, 9, 10, 10) using alternate method
$x$ | $x^2$
1 | 1
2 | 4
7 |
9 |
10 |
10 |
$\sum x$ = | $\sum x^2$ =
$SS_x$ =
$s$ =
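Both ways of finding the sample standard deviation can be compared in a short Python sketch. The variable names are illustrative; the two formulas are the definition and the $SS_x$ shortcut given above.

```python
import math

values = [1, 2, 7, 9, 10, 10]
n = len(values)

# Method 1: definition, sum of squared deviations from the mean.
mean = sum(values) / n
sum_sq_dev = sum((x - mean) ** 2 for x in values)
s_definition = math.sqrt(sum_sq_dev / (n - 1))

# Method 2: shortcut, SSx = sum(x^2) - (sum(x))^2 / n.
ss_x = sum(x ** 2 for x in values) - sum(values) ** 2 / n
s_shortcut = math.sqrt(ss_x / (n - 1))

print(round(s_definition, 2), round(s_shortcut, 2))
```

When no intermediate values are rounded, the two methods give the same result, because $SS_x$ is algebraically equal to $\sum (x - \bar{x})^2$.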
• Coefficient of Variation – while standard deviation gives a value that indicates the spread of the data around the mean, the coefficient of variation (CV) expresses that spread as a % of the mean.
CV for a sample: $CV = \frac{s}{\bar{x}} \times 100$
CV for a population: $CV = \frac{\sigma}{\mu} \times 100$
– s = sample standard deviation
– $\bar{x}$ = the sample mean value of the distribution
– $\sigma$ = population standard deviation
– $\mu$ = the population mean value of the distribution
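As a rough illustration, the two CV formulas can be applied to the values from the standard deviation example, (1, 2, 7, 9, 10, 10). This sketch assumes the standard-library functions `statistics.stdev` (divides by n - 1) and `statistics.pstdev` (divides by N); treating the same six values as both a sample and a population is only for comparison.

```python
from statistics import mean, stdev, pstdev

values = [1, 2, 7, 9, 10, 10]

# Sample CV uses s (n - 1 in the denominator).
cv_sample = stdev(values) / mean(values) * 100

# Population CV uses sigma (N in the denominator).
cv_population = pstdev(values) / mean(values) * 100

print(round(cv_sample, 1), round(cv_population, 1))
```

Because CV is a percentage of the mean, it can be compared between data sets measured in different units.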
• Histograms
– Sometimes it is difficult to see how data is distributed by just looking at the numbers. To see how data is distributed, a histogram is used.
– A histogram is a type of bar graph, except that all of the bars touch and the width of each bar represents a range (class) of data values.
[Figure: histogram titled "Probability Test". Vertical axis: # of students (0 to 10). Horizontal axis: Test Scores, with bars 59.5 – 65.5, 65.5 – 71.5, 71.5 – 77.5, 77.5 – 83.5, 83.5 – 89.5, 89.5 – 95.5, 95.5 – 101.5]
• Histograms Procedure
1. Decide how many classes (bars) you want. It will be given by the problem.
2. To figure out the width of the bars, divide the range by the # of bars and then round up to the next whole number. (NOTE: Always round up even if the decimal part is less than .5, i.e. 5.41 rounds up to 6.)
$\text{Bar Width} = \frac{\text{highest value} - \text{lowest value}}{\text{\# of bars}}$
3. Take the bar width and add it to the lowest value to get the range of the first bar, then add the bar width to the last value to get the range of the next bar. Keep going until you get all of your bar ranges. (i.e. if the lowest value was 60 and your bar width was 6, then the first bar would be 60 – 66, the second bar would be 66 – 72, etc.)
A problem occurs if a data point is exactly 66, as in the example: it sits on the edge of two bars. In order to alleviate this problem, a boundary is calculated for the bars.
4. Calculate the boundaries of each bar:
a. Find the interval of the data. Is the data given down to whole numbers, tenths, hundredths, etc? (Note: the data will always have the same interval)
b. Take the interval and divide by 2. This is the boundary adjustment. (i.e. whole numbers means intervals of 1, so 1 / 2 = 0.5)
c. For each bar range calculated previously in step 3, subtract the boundary adjustment value from both the lower and upper limit. These will be your new bar ranges or boundaries. (i.e. 60 – 0.5 = 59.5 and 66 – 0.5 = 65.5; first bar 59.5 – 65.5)
5. Calculate the midpoint of each bar:
a. Take the upper and lower limit of a bar, add them together and divide by 2. This will be the midpoint. (i.e. (59.5 + 65.5) / 2 = 62.5)
$\text{bar midpoint} = \frac{\text{bar upper limit} + \text{bar lower limit}}{2}$
b. Do this for all of the rest of the bars.
c. The midpoint is sometimes used instead of the boundaries to graph the bars.
6. Construct a frequency table by using tally marks, with one row for each class: 59.5 – 65.5, 65.5 – 71.5, 71.5 – 77.5, 77.5 – 83.5, 83.5 – 89.5, 89.5 – 95.5, 95.5 – 101.5.
[Table: tally marks for each class]
7. Graph the frequency table using a bar graph arrangement.
• Draw the histogram for the following data.
Put it into 5 classes. The data is the number of
passing touchdowns for the top 20 rated
quarterbacks in the 2011 season.
45, 46, 39, 31, 41, 15, 29, 29, 17, 21, 27, 16, 13, 18, 21, 18, 9, 20, 13, 20
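The histogram procedure (bar width, boundaries, frequency table) can be sketched in Python for the touchdown data. The function names are hypothetical. One assumption is made loudly here: the bar width is always bumped to the next whole number, even when the division comes out exact, so that the largest value still fits in the last bar.

```python
import math

touchdowns = [45, 46, 39, 31, 41, 15, 29, 29, 17, 21,
              27, 16, 13, 18, 21, 18, 9, 20, 13, 20]

def class_boundaries(data, num_classes, interval=1):
    """Return (lower, upper) boundaries for each bar."""
    low, high = min(data), max(data)
    # Assumption: always move up to the next whole number (7.4 -> 8).
    width = math.floor((high - low) / num_classes) + 1
    adjustment = interval / 2          # boundary adjustment (0.5 for whole numbers)
    start = low - adjustment
    return [(start + i * width, start + (i + 1) * width)
            for i in range(num_classes)]

def frequency_table(data, boundaries):
    """Count how many data values fall inside each pair of boundaries."""
    return [sum(1 for x in data if lower < x < upper)
            for lower, upper in boundaries]

boundaries = class_boundaries(touchdowns, num_classes=5)
counts = frequency_table(touchdowns, boundaries)

for (lower, upper), count in zip(boundaries, counts):
    print(f"{lower} - {upper}: {'|' * count}")
```

Because the boundaries sit halfway between possible data values, no data point can land on the edge of two bars.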
• Histograms
– If the midpoint of each class is plotted, the points can be connected with straight line segments.
– This straight-line graph of the midpoints is known as a Frequency Polygon.
– What did the histogram indicate?
– Histograms can be used as a means of predicting
outcome or probability. These are known as
probability distributions.
– One of the famous probability distributions is the
normal distribution, also known as the normal
curve or bell curve.
• Normal Distribution
– The graph to the right is an example of a normal distribution. Not only does it indicate the results of the scores, but it can also be used for probability or predictions.
• Normal Distribution Properties
– The curve is bell shaped with the highest point at
the mean value.
– It is symmetrical about a vertical line through the
mean value.
– The curve approaches the horizontal axis but never
touches it.
– The transition points (between cup down and cup
up) occur at (mean + standard deviation) and
(mean – standard deviation).
• Empirical Rule
– For a normal distribution the following can be said
about the data:
• 68.2% of the data will lie within 1 standard deviation
on either side of the mean
• 95.4% of the data will lie within 2 standard deviations
on either side of the mean.
• 99.7% of the data will lie within 3 standard deviations on either side of the mean.
• Normal Distribution Properties
– Area under the curve in each band, on each side of the mean:
• From the mean to 1𝜎 = 34.1%
• From 1𝜎 to 2𝜎 = 13.6%
• From 2𝜎 to 3𝜎 = 2.15%
• Beyond 3𝜎 = 0.15%
– These %'s are used to indicate probabilities.
• Example: Assume the heights of college women are normally distributed, with a mean of 65 inches and a SD of 2.5 inches.
– What % of women are taller than 65 inches? (In other words, if one woman is selected, what is the probability that she is taller than 65 inches?)
– Shorter than 65 inches?
– Between 62.5 and 67.5 inches?
– Between 60 and 70 inches?
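The empirical-rule estimates for this example can be checked against the exact Normal curve with a short Python sketch. It assumes the standard-library class `statistics.NormalDist` (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of college women: mean 65 inches, SD 2.5 inches.
heights = NormalDist(mu=65, sigma=2.5)

taller_than_65 = 1 - heights.cdf(65)
shorter_than_65 = heights.cdf(65)
between_1_sd = heights.cdf(67.5) - heights.cdf(62.5)   # within 1 SD of the mean
between_2_sd = heights.cdf(70) - heights.cdf(60)       # within 2 SD of the mean

for label, p in [("taller than 65", taller_than_65),
                 ("shorter than 65", shorter_than_65),
                 ("62.5 to 67.5", between_1_sd),
                 ("60 to 70", between_2_sd)]:
    print(f"{label}: {p * 100:.1f}%")
```

The exact areas are close to, but not identical to, the rounded percentages used with the empirical rule, which is expected for a rule of thumb.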
• Percentiles
– Sometimes it is more important to see the relative position of a piece of data rather than its exact value.
– Percentile refers to where data lies relative to the other data in the distribution. A data point at the nth percentile means n% of the data falls at or below that point and (100 – n)% falls at or above that point.
– Example: You scored in the 85th percentile therefore
85% of the people who took the test scored at or below
you while 15% scored at or above you. Note: this does
NOT mean you scored 85% on the test.
– The median is a type of percentile. It is the middle
data point in the distribution therefore it is at the
50th percentile.
– A special type of percentile known as the quartile
is also used to evaluate the position of data.
– Quartiles split data into fourths.
– The 1st quartile (Q1) is the 25th percentile, the 2nd
quartile (Q2) is the median, and the 3rd quartile
(Q3) is the 75th percentile.
• Quartiles
[Figure: data divided into fourths at Q1, Q2 and Q3]
– Interquartile Range (IQR) = Q3 – Q1
– Procedure to compute quartiles:
1. Order the data from smallest to largest.
2. Find the median; this is the 2nd quartile, Q2.
3. The first quartile Q1 is then the median of the lower
half of the data. It is the median of the data falling
below the Q2 and not including Q2.
4. The third quartile Q3 is then the median of the upper
half of the data. It is the median of the data falling
above the Q2 and not including Q2.
– Example (even # of data): Find Q1, Q2, Q3 and IQR for the following data: (3, 4, 9, 13, 20, 24)
1. Find Q2. Find the median of all of the data. There is no center data point, so take the mean of the two center data points: (9 + 13) / 2 = 11.
2. Find Q1. Find the median of the first half of the data, not including Q2: (3, 4, 9), so Q1 = 4.
3. Find Q3. Find the median of the second half of the data, not including Q2: (13, 20, 24), so Q3 = 20.
4. IQR = Q3 – Q1 = 20 – 4 = 16.
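The quartile procedure can be sketched in Python. The function name `quartiles` is hypothetical, and it follows the rule stated in the procedure (the median itself is left out of both halves when the number of data points is odd); other textbooks and software may use slightly different quartile rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so Q2 is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([3, 4, 9, 13, 20, 24])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

The same function can be applied to a data set with an odd number of values, where the middle value is Q2 and is excluded from both halves.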
– Example: A study of ice cream bars was done.
Twenty seven bars tested were rated as tasting
“fair.” The cost per bar is listed below. Find the
quartiles and the IQR.
0.99, 1.07, 1.00, 0.50, 0.37, 1.03, 1.07, 1.07, 0.97,
0.63, 0.33, 0.50, 0.97, 1.08, 0.47, 0.84, 1.23, 0.25,
0.50, 0.40, 0.33, 0.35, 0.17, 0.38, 0.20, 0.18, 0.16
– The set of Q1, Q2, Q3, the highest value and the lowest value of a table of data is known as a Five-Number Summary.
– In order to graphically represent the five-number summary, a Box-and-Whisker Plot will be used.
– Box-and-Whisker Plot (shown vertically but can be done horizontally as well)
[Figure: vertical box-and-whisker plot labeled, from top to bottom: Highest Value, Q3, Q2, Q1, Lowest Value]
– Procedure to make a Box-and-Whisker Plot:
• Draw a vertical scale to include the lowest and highest data values.
• To the right of the scale draw a box from Q1 to Q3.
• Include a solid line through the box at the median level.
• Draw solid lines called whiskers from Q1 to the lowest value and from Q3 to the highest value.
– EXAMPLE: Go back to the ice cream problem and create a box-and-whisker plot.
• Outliers
– Sometimes extreme data values can skew the average of a set of data.
– When a data value lies more than 1.5 times the interquartile range (Q3 – Q1) below Q1 or above Q3, then it may be considered an outlier.
– Outliers are sometimes removed from the data so that they do not skew the results.
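A minimal Python sketch of an outlier check is shown here. It assumes the common 1.5 × IQR rule (a value more than 1.5 × IQR below Q1 or above Q3 is flagged) and uses the median-of-halves quartile method; the function names are hypothetical, and the data is the set used earlier for the mode and range examples.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def find_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the box (assumed rule, k = 1.5)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
outliers = find_outliers(data)
print(outliers)
```

Comparing the mean with and without the flagged value shows how strongly a single outlier can pull the average.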
• Scatter Plots
– Remember from last unit that data can be plotted
as a series of x and y points known as a scatter
plot.
– We estimated a line of best fit. In doing this, we
were finding a linear correlation that exists
between the x and y points.
– We shall analyze the data of a scatter plot more
closely in the next couple of slides.
Time (seconds) | Position (meters)
0.7 | 3.8
1.8 | 3.2
2.6 | 2.8
3.4 | 2.2
3.8 | 1.8
4.1 | 1.4
4.9 | 0.8
6.0 | 0.2
6.5 | 0
• Scatter Plots
– The y-distance that a data point is away from the
line of best fit is known as a Residual.
– The optimal line of best fit occurs when the sum of the squares of all of the residual values is the smallest. This is known as finding the line of best fit through the Least Squares method.
• Least Squares Method
– Recall that the equation of a line is in the format:
$y = mx + b$
– This method will allow us to find the optimal slope
(m) and the y-intercept (b) based on the data.
– We will use a similar method here as we did for
calculating standard deviation.
$y = mx + b$
– To find the slope m, the following equation is used:
$m = \frac{SS_{xy}}{SS_x}$ where $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
– To find the y-intercept b, the following equation is used:
$b = \bar{y} - m\bar{x}$ where $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x
X-data: Time (seconds) | Y-data: Position (meters) | $x^2$ | $xy$
0.7 | 3.8 | |
1.8 | 3.2 | |
2.6 | 2.8 | |
3.4 | 2.2 | |
3.8 | 1.8 | |
4.1 | 1.4 | |
4.9 | 0.8 | |
6.0 | 0.2 | |
6.5 | 0 | |
$\sum x$ = | $\sum y$ = | $\sum x^2$ = | $\sum xy$ =
$\bar{x}$ = | $\bar{y}$ =
• Example:
1. From the example on the previous page find the slope:
$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} =$
$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} =$
$m = \frac{SS_{xy}}{SS_x} =$
2. From the example on the previous page find the y-intercept:
$b = \bar{y} - m\bar{x} =$
3. Write the equation for the line of least squares.
$y = mx + b$
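The least-squares steps can be sketched in Python using the time and position data from the table. The function name `least_squares` and the variable names are hypothetical; the formulas are the $SS_{xy}$, $SS_x$, slope and intercept equations given above.

```python
time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]

def least_squares(xs, ys):
    """Return slope m and y-intercept b of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    m = ss_xy / ss_x
    b = sum_y / n - m * (sum_x / n)   # b = y-bar - m * x-bar
    return m, b

m, b = least_squares(time_s, position_m)
print(f"y = {m:.4f}x + {b:.3f}")
```

The negative slope is the rate of change of position with time (meters per second), and the intercept is the predicted position at time zero.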
[Graph 1: Movement of a Car. Position (meters) vs. Time (seconds), with the least-squares line y = -0.6974x + 4.419, R² = 0.9886]
• Measuring the Spread of Data
– There are three methods for measuring the spread
of the data around the line of least squares:
• Standard Error of Estimate
• Coefficient of Correlation
• Coefficient of Determination
• Standard Error of Estimate
– In order to do this we look at how far each y data point is from the least squares line.
– This method will calculate a value that is
representative of spread of all of the data.
– We will use values that were already calculated in
figuring out the least squares line.
$S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}}$
– Why would it be n – 2? (In other words, why does n have to be > 2?)
– Use the same method as before to find m, SSxy and SSx:
$SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$
X-data: Time (seconds) | Y-data: Position (meters) | $y^2$
0.7 | 3.8 |
1.8 | 3.2 |
2.6 | 2.8 |
3.4 | 2.2 |
3.8 | 1.8 |
4.1 | 1.4 |
4.9 | 0.8 |
6.0 | 0.2 |
6.5 | 0 |
$\sum x$ = | $\sum y$ = | $\sum y^2$ =
Previously Calculated Data: $SS_{xy}$ = , m =
• Example:
1. From the example on the previous page find the following:
$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} =$
2. From the above calculation and previously calculated data find the standard error of estimate:
$S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}} =$
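The standard error of estimate can be sketched in Python for the same time and position data. The function name is hypothetical; the calculation follows the $S_e$ formula above.

```python
import math

time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]

def standard_error_of_estimate(xs, ys):
    """Se = sqrt((SSy - m * SSxy) / (n - 2)); needs more than 2 points."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    m = ss_xy / ss_x
    return math.sqrt((ss_y - m * ss_xy) / (n - 2))

s_e = standard_error_of_estimate(time_s, position_m)
print(round(s_e, 3), "meters")
```

The result carries the units of y (meters here), which is the limitation that motivates a unitless measure of fit.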
• Linear Correlation Coefficient, r
– So far, we have been able to figure the line of best fit by
using the line of least squares (which is also known as the
“least squares regression line of y on x”)
– We then wanted to determine the quality of our line by
using the standard error of estimate.
– The problem with the standard error of estimate is that it has units of y; therefore, when looking at two different sets of data, you cannot say that one graph is better than the other because the units may skew the result.
– The linear correlation coefficient helps to alleviate this
problem by calculating a number that is unitless and
therefore independent of the units.
$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$
– The value of r:
r | Indication
0 | There is no linear relationship of the data points
1 or -1 | There is a perfect linear relationship between the x and y data points; all points lie on the least-squares line.
Between 0 and 1 | The x and y data points have a positive correlation (+ slope)
Between 0 and -1 | The x and y data points have a negative correlation (- slope)
X-data: Time (seconds) | Y-data: Position (meters)
0.7 | 3.8
1.8 | 3.2
2.6 | 2.8
3.4 | 2.2
3.8 | 1.8
4.1 | 1.4
4.9 | 0.8
6.0 | 0.2
6.5 | 0
Previously Calculated Data: $SS_{xy}$ = , $SS_x$ = , $SS_y$ =
• Example:
1. From the example on the previous page find the following:
$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} =$
2. What does the value of r indicate about the correlation of the data points?
• Coefficient of Determination, r²
– Another way of looking at the quality of your data is to look at how far away some y-data point (y) is from the mean of the y-data ($\bar{y}$). This is simply the deviation, $y - \bar{y}$.
– The deviation is made up of two parts:
• The first part indicates how far away the least squares line ($y_p$) is from the mean of the y-data ($\bar{y}$). This is simply $y_p - \bar{y}$, and this is known as the explained portion of the deviation.
• The second part indicates how far away a particular y-data point (y) is from the least squares line ($y_p$). This is simply $y - y_p$, and this is known as the unexplained portion of the deviation.
– Recall that when the deviation is squared we get
the variance or variation. Based on the
explanation before the variance has two parts: the
explained variation and the unexplained variation.
– The Coefficient of Determination is a ratio of the
explained variation to the total variation and is
simply calculated by taking the Correlation
Coefficient (r) and squaring it.
– So what does r² indicate?
– Change r² into a %.
– The % indicates what % of the variation of the y data is explained by the variation of the x data if we use the least squares line.
– 100% − r² (as a %) indicates what % of the variation of the y data is due to random chance or some other variable besides x that may influence y.
• Example:
1. From the previous example find the Coefficient of Determination, r²:
$r^2 = \left(\frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}\right)^2 =$
2. What does the value of r² indicate about the explained and unexplained portions of the variation?
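The correlation coefficient and the coefficient of determination can be sketched in Python for the time and position data. The function name `correlation` is hypothetical; the formula is $r = SS_{xy} / \sqrt{SS_x \, SS_y}$ as given earlier.

```python
import math

time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]

def correlation(xs, ys):
    """Linear correlation coefficient r = SSxy / sqrt(SSx * SSy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

r = correlation(time_s, position_m)
r_squared = r ** 2

print(f"r = {r:.4f}")
print(f"explained variation:   {r_squared * 100:.1f}%")
print(f"unexplained variation: {(1 - r_squared) * 100:.1f}%")
```

The sign of r matches the sign of the slope, and r² should agree with the R² value that Excel or a TI-84 reports for the same data.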
• Correlation vs Causation
– Correlation refers to one variable changing as
another variable changes.
– Causation refers to one variable changing because
of another variable changing. (Cause & Effect)
– Just because there is a correlation between two
variables does not mean there is a causation.
DO NOW / HW Unit 2-1 Check
• Have out your homework and do the
following: Find the mode, median, mean and
standard deviation.
60% | 86% | 94% | 100%
63% | 89% | 94% | 100%
66% | 89% | 94% | 100%
74% | 91% | 94% | 100%
74% | 91% | 97% | 100%
77% | 94% | 97% | 100%
100%
HW Assignment 2-1 Check
• 10, 12, 14, 18, 36, 37, pg. 449 – 50
10. 8.33, 9, 9
12. 85.625, 85.5, 91
14. 2.77, 2.9, 2.9
18. 14, 4
36. $233,071.43, $142,000, none
37. $645,000, $213,242.66
DO NOW / HW Unit 2-2 Check
• Have out your homework and do the
following: Make a histogram of the following
data in 7 classes. These were the top 32
quarterback ratings in the NFL in 2012.
108.0, 99.1, 90.7, 87.4, 83.3, 81.2, 77.4, 72.6,
105.8, 98.7, 90.5, 87.2, 82.6, 79.8, 76.5, 72.2,
102.4, 97.0, 88.6, 86.2, 81.6, 79.1, 76.1, 66.9,
100.0, 96.3, 87.7, 85.3, 81.3, 78.1, 74.0, 66.7
RANGE: 41.3
CLASSES: 7
BAR WIDTH: 6
BAR STARTING POINT: 66.7
UPPER BAR RANGES: 72.7, 78.7, 84.7, 90.7, 96.7, 102.7, 108.7
INTERVAL: 0.1
BOUNDARY ADJUSTMENT: 0.05
BOUNDARIES STARTING POINT: 66.65
BOUNDARY RANGES: 72.65, 78.65, 84.65, 90.65, 96.65, 102.65, 108.65
Boundaries | # OF QB'S
66.65 - 72.65 | 4
72.65 - 78.65 | 5
78.65 - 84.65 | 7
84.65 - 90.65 | 7
90.65 - 96.65 | 2
96.65 - 102.65 | 5
102.65 - 108.65 | 2
HW Assignment 2-2
EXPERIMENTAL DESIGN
• Standard Deviation
– Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).
$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
1 | 1 – 6.5 = -5.5 | 30.25
2 | 2 – 6.5 = -4.5 | 20.25
7 | 7 – 6.5 = 0.5 | 0.25
9 | 9 – 6.5 = 2.5 | 6.25
10 | 10 – 6.5 = 3.5 | 12.25
10 | 10 – 6.5 = 3.5 | 12.25
Mean = $\bar{x}$ = 39/6 = 6.5
$\sum (x - \bar{x})^2$ = 81.5
– $s^2$ = 81.5 / 5 = 16.3
$s$ = 4.04
• Standard Deviation
– Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.
$x$ | $x^2$
1 | 1
2 | 4
7 | 49
9 | 81
10 | 100
10 | 100
$\sum x$ = 39 | $\sum x^2$ = 335
– $SS_x$ = 335 – 39²/6 = 81.5, so $s = \sqrt{81.5 / 5}$ = 4.04