6.1 What is Statistics?
• Definition: Statistics – science of collecting,
analyzing, and interpreting data in such a way
that the conclusions can be objectively
evaluated.
3 Phases:
1. Collecting data
2. Analyzing data
3. Interpreting data
• Descriptive Statistics – summarize and describe a
characteristic of a group
example: batting average
• Inferential Statistics – used to estimate, infer, or
conclude something about a larger group
example: polls
• Sample – subset of the group of data available for
analysis
• Population – the entire set
• Bias – favoring of certain outcomes over
others
• Census – collects data from all members of
the population
• Parameter – characteristic value of a
population
• Statistic – characteristic value of a sample
6.2 Organizing Data
• Stem and Leaf Diagram:
data – 35, 52, 37, 44, 51, 48, 45, 12
Stem | Leaves
5    | 1 2
4    | 4 5 8
3    | 5 7
2    |
1    | 2
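The diagram can be rebuilt from the data with a short sketch. It assumes two-digit whole numbers, so the tens digit is the stem and the ones digit is the leaf; the function name `stem_and_leaf` is only illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and ones digit (leaf)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # List stems from largest to smallest, keeping stems with no leaves.
    stems = range(max(leaves), min(leaves) - 1, -1)
    return {stem: leaves.get(stem, []) for stem in stems}

data = [35, 52, 37, 44, 51, 48, 45, 12]
diagram = stem_and_leaf(data)
for stem, leaf_list in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_list))
```

Stem 2 is printed with no leaves because no value in the data falls between 20 and 29.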
• Frequency Table:
data – 35, 52, 37, 44, 51, 48, 45, 12
Range | Frequency
50-59 | 2
40-49 | 3
30-39 | 2
20-29 | 0
10-19 | 1
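The same counts can be produced programmatically. This is a minimal sketch assuming whole-number data and classes of width 10; `frequency_table` and its `width` parameter are illustrative names, not part of the chapter.

```python
def frequency_table(values, width=10):
    """Count how many values fall in each class [start, start + width - 1]."""
    low = (min(values) // width) * width
    high = (max(values) // width) * width
    table = {}
    # Walk the classes from highest to lowest, as in the table above.
    for start in range(high, low - 1, -width):
        count = sum(1 for v in values if start <= v < start + width)
        table[(start, start + width - 1)] = count
    return table

data = [35, 52, 37, 44, 51, 48, 45, 12]
table = frequency_table(data)
for (lower, upper), count in table.items():
    print(f"{lower}-{upper}: {count}")
```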
6.3 Displaying Data
• Ways to display data:
– Frequency histogram
– Relative frequency histogram
– Multiple bar graph
– Stacked bar graph
– Line graph
– Pie chart
[Figure: Frequency Histogram. Vertical axis: frequency, 0 to 30; horizontal axis: classes 1 to 8.]
[Figure: Relative Frequency Histogram. Vertical axis: relative frequency, 0 to 0.3; horizontal axis: classes 1 to 8.]
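The heights in a relative frequency histogram are the class frequencies divided by the total number of observations. Since the values plotted in the figure are not recoverable, the sketch below reuses the frequency table from 6.2 as its input.

```python
# Class frequencies taken from the frequency table in section 6.2.
frequencies = {"50-59": 2, "40-49": 3, "30-39": 2, "20-29": 0, "10-19": 1}

total = sum(frequencies.values())
# Relative frequency = class frequency / total number of observations.
relative = {cls: count / total for cls, count in frequencies.items()}

for cls, share in relative.items():
    print(f"{cls}: {share:.3f}")
```

The relative frequencies always add up to 1, which makes histograms of different-sized data sets comparable.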
[Figure: Multiple Bar Graph. Vertical axis: 0 to 5000; categories: Arts, Nurs, Socsci, Edu, Natsci, Comm; series: lower, upper, graduate.]
[Figure: Stacked Bar Graph. Vertical axis: 0 to 8000; categories: Arts, Nurs, Socsci, Edu, Natsci, Comm; series: lower, upper, graduate.]
[Figure: Line Graph. Vertical axis: 0 to 5000; categories: Arts, Nurs, Socsci, Edu, Natsci, Comm; series: lower, upper, graduate.]
[Figure: Pie Chart. Slices: Comm, Edu, Natsci, Socsci, Nurs, Arts.]
6.4 Measures of Central Tendency
• Central Tendency – the propensity of data to
be located or clustered about some point.
• Arithmetic Mean – sum of the values of all the
observations divided by the total number of
observations
• For sample data, the mean is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• For population data, the mean is $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the number of values in the population
• Median – the median is the middle value
of a set of data when data is arranged in
ascending order
• Finding the median:
1. Arrange the data in increasing order or
decreasing order.
2. Determine if n is even or odd.
a. If n is odd, pick the middle value
b. If n is even, take the average of the two
middle values
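The mean formula and the median steps above can be sketched as follows. The data set is the one used in section 6.2; the function names are illustrative.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2a: n odd, pick the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 2b: n even

data = [35, 52, 37, 44, 51, 48, 45, 12]
print(mean(data))        # 40.5
print(median(data))      # 44.5 (n = 8, average of 44 and 45)
print(median(data[:7]))  # 45 (n = 7, middle of the ordered values)
```

The single low value 12 pulls the mean below the median, a first hint of how outliers affect the two measures differently.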
• Mode – is the value or values that occur most
frequently.
Note: If all values occur with the same frequency,
then there is no mode.
• Symmetric Distribution – the mean, median, and mode coincide.
• Distribution skewed to the left – from left to right: Mean, Median, Mode (mean < median < mode).
• Distribution skewed to the right – from left to right: Mode, Median, Mean (mode < median < mean).
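A minimal sketch of the mode, following the note that a data set whose values all occur equally often has no mode. The function name `modes` and the small data set `scores` are made up for illustration; the data set is chosen to be skewed to the right so the ordering of mode, median, and mean can be checked.

```python
import statistics
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if all values are equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if all(c == highest for c in counts.values()):
        return []  # every value occurs with the same frequency: no mode
    return sorted(v for v, c in counts.items() if c == highest)

scores = [1, 2, 2, 2, 3, 3, 4, 5, 9]   # hypothetical right-skewed data
print(modes(scores))                    # [2]
print(statistics.median(scores))        # 3
print(statistics.mean(scores))          # 31/9, about 3.44
print(modes([1, 2, 3]))                 # [] (no mode)
print(modes([1, 1, 2, 2, 3]))           # [1, 2] (two modes)
```

For `scores`, the long tail toward 9 gives mode < median < mean, the ordering listed for a distribution skewed to the right.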
6.5 Measures of Variability
• Definition: The range of a set of n
measurements, x1, x2, x3, … xn is the
difference between the largest and the
smallest amounts.
• Variance – $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Problem with the variance: the units are the original
units squared.
• Standard deviation – population standard
deviation is the square root of the population
variance.
• Sample variance – $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• s = square root of the sample variance
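The difference between the two variance formulas is only the divisor: N for a population, n - 1 for a sample. The sketch below applies the definitions directly (not the short cut formulas) to a small made-up data set.

```python
import math

def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean 5

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))                 # 32/7, about 4.571
print(math.sqrt(sample_variance(data)))      # about 2.138
```

Taking the square root returns the result to the original units, which addresses the problem with the variance noted above.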
• Short cut formulas for $s^2$ and $\sigma^2$ are given
on page 495 (provided with test).
• Short cut formula for frequency data is
given on page 499 (provided with test).
• Short cut formulas are genuinely easier to
calculate.
• Approximating the standard deviation:
$s \approx R/4$, where R is the range.
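To see how rough the R/4 approximation can be, the sketch below compares it with the exact sample standard deviation for a small made-up data set.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data

value_range = max(data) - min(data)  # R = largest - smallest
approx_s = value_range / 4           # quick estimate of s
exact_s = statistics.stdev(data)     # sample standard deviation (n - 1 divisor)

print(approx_s)            # 1.75
print(round(exact_s, 3))   # 2.138
```

The estimate is in the right neighborhood but not exact; it is a quick check, not a replacement for the formula.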
6.6 Measures of Relative Position
• pth percentile – for data arranged in increasing order,
the value such that p% of the data are less than it and
(100 – p)% of the data are greater than it.
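Textbooks differ on exactly how to locate a percentile in a small data set, so the sketch below only checks the defining property: what percentage of the data lies below a given value. The helper name `percent_below` is illustrative.

```python
def percent_below(values, x):
    """Percentage of the data strictly less than x."""
    return 100 * sum(1 for v in values if v < x) / len(values)

data = [12, 35, 37, 44, 45, 48, 51, 52]

print(percent_below(data, 45))  # 50.0 (4 of the 8 values are below 45)
print(percent_below(data, 52))  # 87.5 (7 of the 8 values are below 52)
```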
• Z-scores – The sample z-score for a
measure x is: $z = \frac{x - \bar{x}}{s}$
The population z-score for a measure x is: $z = \frac{x - \mu}{\sigma}$
z-score represents the # of standard
deviations away from the mean.
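Both z-score formulas can be sketched in a few lines. The data and the population values (mean 100, standard deviation 15) are made up for illustration.

```python
import statistics

def sample_z(x, values):
    """z = (x - sample mean) / sample standard deviation."""
    return (x - statistics.mean(values)) / statistics.stdev(values)

def population_z(x, mu, sigma):
    """z = (x - population mean) / population standard deviation."""
    return (x - mu) / sigma

data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical sample, mean 5

print(population_z(130, mu=100, sigma=15))  # 2.0
print(round(sample_z(9, data), 3))          # 1.871
```

A z-score of 2.0 says the measurement lies two standard deviations above the mean; a negative z-score would place it below the mean.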
6.7 Normal Distribution
• Definition: Standardizing – converting
data to z-scores.
• Some empirical rules:
1. About 68% of data is within one $\sigma$ of the
mean.
2. About 95% of data is within two $\sigma$ of the
mean.
3. About 99.7% of data is within three $\sigma$ of the
mean.
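The empirical rules can be checked against the normal distribution itself. This sketch uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8).

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The computed shares are about 0.6827, 0.9545, and 0.9973, which the rules round for easy recall.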
• The normal distribution looks like:
1. Bell-shaped
2. Symmetric
3. Mean = median = mode
• Definition: Standard normal distribution –
normal distribution with $\sigma = 1$ and $\mu = 0$.
The standard normal distribution table (page
511 or in appendix page 647) can be used to
determine probabilities for a range of z-values.
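When the printed table is not at hand, the same probabilities can be computed. The sketch below is an alternative to the table lookup, not the book's method; `probability_between` is an illustrative name.

```python
from statistics import NormalDist

def probability_between(z_low, z_high):
    """P(z_low < Z < z_high) for a standard normal variable Z."""
    standard = NormalDist(mu=0, sigma=1)
    return standard.cdf(z_high) - standard.cdf(z_low)

print(round(probability_between(0, 1), 4))         # 0.3413
print(round(probability_between(-1.96, 1.96), 4))  # 0.95
```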
6.8 Confidence Intervals
• Central Limit Theorem: For a large sample
size n, the sample mean $\bar{x}$ is approximately
normally distributed with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$, where $\mu$ is the
population mean of the x's and $\sigma$ is the
population standard deviation of the x's.
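A small simulation illustrates the theorem. It is only a sketch under chosen assumptions: the population is uniform on [0, 1) (mean 0.5, standard deviation sqrt(1/12)), the sample size is 36, and the seed is fixed so the run is repeatable.

```python
import math
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
n = 36           # size of each sample
trials = 2000    # number of samples drawn

# Draw many samples from a uniform population and record each sample mean.
sample_means = [
    statistics.mean([random.random() for _ in range(n)])
    for _ in range(trials)
]

observed_sd = statistics.stdev(sample_means)
predicted_sd = math.sqrt(1 / 12) / math.sqrt(n)   # sigma / sqrt(n)

print(round(statistics.mean(sample_means), 3))    # close to 0.5
print(round(observed_sd, 4), round(predicted_sd, 4))
```

Even though the population is flat rather than bell-shaped, the sample means cluster around 0.5 with a spread close to the predicted value of about 0.048.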
• Confidence interval: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
- $\sigma$ may be replaced by s
• Common levels of confidence ($n \ge 30$):
Level of Confidence (%) | $z_{\alpha/2}$
80 | 1.28
90 | 1.645
95 | 1.96
99 | 2.575
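A minimal sketch of the interval calculation. The critical values are computed with `NormalDist.inv_cdf` rather than read from the table, so they agree with the table only up to rounding; the sample values (mean 50, standard deviation 12, n = 36) are made up.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z value leaving alpha/2 in each tail, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """x_bar plus or minus z * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

for level in (0.80, 0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))

low, high = confidence_interval(x_bar=50, sigma=12, n=36, confidence=0.95)
print(round(low, 2), round(high, 2))   # 46.08 53.92
```

Raising the confidence level widens the interval, while a larger sample size narrows it.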
• Margin of Error: the margin of error of an
estimate of a sample proportion is given by:
$\frac{z_{\alpha/2}}{2\sqrt{n}}$
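The margin of error formula depends only on the confidence level and the sample size. The sketch below evaluates it for two made-up poll sizes; `margin_of_error` is an illustrative name.

```python
import math
from statistics import NormalDist

def margin_of_error(n, confidence=0.95):
    """z_{alpha/2} / (2 * sqrt(n)) for a sample proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z / (2 * math.sqrt(n))

print(round(margin_of_error(400), 3))    # 0.049, about 4.9 percentage points
print(round(margin_of_error(1000), 3))   # 0.031, about 3.1 percentage points
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.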
6.9 Regression and Correlation
• Scatter Plot – a plot of data consisting of 2
variables
• Linear Regression – modeling the data with the
line that “best fits” – usually a “least squares” line
or regression line
• Least Squares Line – is the line that minimizes the
sum of the squared errors for a set of data points
(formulas given on page 531 and shortcut
formulas are on page 532 – formulas to be
provided on test)
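The page-531 formulas are not reproduced in these notes, so the sketch below uses the standard least squares formulas (slope = sum of cross-deviations / sum of squared x-deviations, intercept = mean of y minus slope times mean of x), which may be written differently in the book. The data points are made up.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
print(round(slope, 2), round(intercept, 2))   # 0.6 2.2
```

The fitted line y = 2.2 + 0.6x always passes through the point of means, here (3, 4).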
• Correlation Coefficient r – is a measure of
the strength of the linear relationship
between the 2 random variables x and y.
Note: The closer the correlation is to 1 or –1,
the stronger the relationship between the
x and y variables. A correlation of zero
means there is no evidence of a linear
pattern.
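The notes do not give a formula for r, so this sketch assumes the usual Pearson definition: the sum of cross-deviations divided by the square root of the product of the squared deviations of x and y. The data points are made up.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]                              # hypothetical data
print(round(correlation(xs, [2, 4, 5, 4, 5]), 3)) # 0.775, fairly strong positive
print(round(correlation(xs, [3, 5, 7, 9, 11]), 3))  # 1.0, points on a rising line
print(round(correlation(xs, [9, 7, 5, 3, 1]), 3))   # -1.0, points on a falling line
```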